Specific Differential Entropy Rate Estimation for Continuous-Valued Time Series

We introduce a method for quantifying the inherent unpredictability of a continuous-valued time series via an extension of the differential Shannon entropy rate. Our extension, the specific entropy rate, quantifies the amount of predictive uncertainty associated with a specific state, rather than averaged over all states. We relate the specific entropy rate to popular `complexity' measures such as Approximate and Sample Entropies. We provide a data-driven approach for estimating the specific entropy rate of an observed time series. Finally, we consider three case studies of estimating specific entropy rate from synthetic and physiological data relevant to the analysis of heart rate variability.


Introduction
The analysis of time series resulting from complex systems must often be performed 'blind': in many cases, mechanistic or phenomenological models are not available because of the inherent difficulty in formulating accurate models for complex systems. In this case, a typical analysis may assume that the data are the model, and attempt to generalize from the data in hand to the system. For example, a common question to ask about a times series is how 'complex' it is, where we place complex in quotes to emphasize the lack of a satisfactory definition of complexity at present [60]. An answer is then sought that agrees with a particular intuition about what makes a system complex: for example, trajectories from periodic and entirely random systems appear simple, while trajectories from chaotic systems appear quite complicated. On a more practical level, it is common to compare two time series from similar systems, in which case one wants to meaningfully ask: is the phenomenon resulting from system A more or less complex than the phenomenon resulting from system B?
There are many possible definitions of the complexity of a time series. See [50,60] for comprehensive reviews. Some notable attempts at formal definitions include Kolmogorov complexity [42], stochastic complexity [54], forecast complexity [24], and Grassberger-Crutchfield-Young statistical complexity [16]. Perhaps the most well-developed theory of complexity, which incorporates and expands on many of these quantities in the special case of discrete-valued time series, is computational mechanics [59]. For example, see [29] for an elucidation of the amount of information, in a formal sense, stored in a single observation from a discrete-valued stochastic process.
Practical definitions of complexity for continuous-valued time series are much less well-developed. The most common definitions rely on some notion of the difficulty in predicting a time series. There are currently at least two schools of thought for the (un)predictability-based notions of complexity when applied to systems with continuous states: Kolmogorov-Sinai entropy [34,62] and Shannon entropy rate [61]. Approaches based on the former treat the data as a trajectory from a deterministic dynamical system, and seek to estimate the Kolmogorov-Sinai entropy based on the trajectory [33]. This school of thought goes back to some of the earliest work applying nonlinear dynamics to observational data [14]. Approaches based on the latter treat the data as a realization from a stochastic process, and focus on entropy rate from a statistical perspective [39]. While these approaches seem very similar, and are typically treated as such in much of the applied literature, they in fact give diverging answers to similar questions. In particular, the Kolmogorov-Sinai entropy of a stochastic dynamical system is infinite, while the differential Shannon entropy rate of a deterministic dynamical system is infinite [48]. These facts have been noted in some of the earliest work on estimating entropy rates from continuous-valued time series [21], but are largely ignored in the applied literature. Moreover, methods proposed to estimate Kolmogorov-Sinai entropy may in fact be estimating Shannon entropy rate, and vice versa. The situation may be further confused by the fact that the Kolmogorov-Sinai entropy does correspond to a Shannon entropy rate, in this case the supremum over the discrete Shannon entropy rates induced by finite partitions of the state space of a dynamical system [1].
In addition to the methodological divide between the two dominant approaches to entropy rate estimation, neither has been used to provide a specific entropy rate for the system as a function of its state. That is, estimates are typically reported as time averages which, under certain conditions, converge to state space averages. However, it may be desired to know the entropy rate associated with a system now, at the present state, rather than on average. It is difficult to define such a state-specific entropy rate in the Kolmogorov-Sinai framework. For stochastic dynamics, such a state-specific entropy rate can be defined over ensembles of the system starting at the specified state. Thus, one of the aims of this paper is to provide an estimator for such a specific entropy rate.
The contributions of this paper are threefold. First, we reemphasize the dependence of the short term predictability of a nonlinear dynamical systems on its current state, and propose an information theoretic quantity, the specific entropy rate, that captures this dependence. Second, we propose a statistically principled approach to estimating the specific entropy rate from a continuousvalued time series that takes advantage of recent advances in conditional density estimation. Finally, we demonstrate the new approach with both synthetic and real data to highlight its strengths and weaknesses, with a special emphasis on interevent interval data as found in heart rate variability analysis. Throughout, we also make connections to modern practices in entropy rate estimation, both of the Kolmogorov-Sinai and differential schools, and seek to highlight how our estimator fits into those frameworks.

Methodology
In the following sections, we define the specific entropy rate of a stochastic dynamical system and develop an approach for its estimation from data. In Section 2.1, we fix our notation and define a stochastic dynamical system. In Section 2.2, we review the entropy rate of a stochastic dynamical system, and define the specific entropy rate. In Sections 2.3 and 2.4, we propose a method for estimating the specific entropy rate from finite data. Finally, in Section 2.5 we make connections between the specific entropy rate and other commonly used entropy rate estimators.

Stochastic Dynamical System
Consider an observed scalar real-valued time series x 1 , x 2 , . . . , x T . We explicitly model the time series as a realization from an autonomous stochastic dynamical system [20,9]. That is, unlike for autonomous deterministic dynamical systems which assume that a deterministic update rule acts on the precisely known state of the system, we assume that the states are stochastic, and moreover that transitions from state to state occur according to a transition density. Thus, we view x 1 , x 2 , . . . , x T , as a realization from the system {X t } t∈Z , where we use the standard convention of using upper / lower case to denote a random variable / its realization. For n > m, let X n m = (X m , X m+1 , . . . , X n−1 , X n ) denote the n − m + 1 block of states for the dynamical system from time m to time n. Similarly, let X m −∞ = (. . . , X m−1 , X m ) denote the semi-infinite past until time m, and let X ∞ n = (X n , X n+1 , . . .) denote the semi-infinite future starting at time n. Then a general model [9] for how the state evolves assumes that the future state X t can be expressed as a random transformation of its past X t−1 −∞ , where t represents dynamical noise, that is, noise that influences the dynamics of the system, to be contrasted with observational noise which impacts the observations of the system but not its dynamics. Equivalently, (1) can be expressed explicitly in terms of the transition density f x | x t−1 −∞ as More typically, the dynamical noise is taken to be additive, in which case where typically { t } is taken to be independent and identically distributed and t is taken to be independent of previous values of X s , s < t. Finally, we note that we consider solely scalar time series in this paper. While much of the theory can be translated to the case of multivariate time series by replacing the scalar observable X t with a d-dimensional vector observable X t , the impact of this change on the computational and statistical burdens of an approach such as the one we develop here are less easily overcome.

Differential Entropy Rate and Its Estimation
Let {X t } t∈Z be a discrete-time, continuous-state stochastic dynamical system as defined in the previous section. Recall that for a continuous-valued random variable X with density f (x), the differential entropy [47] of X is given by We will always take the logarithm with base e, and thus all differential entropies are in nats. For the remainder of this paper, because our focus is on continuousstate systems, when we use the term entropy, we refer to differential entropy. For random variables (X, Y ) with joint density f (x, y), the joint entropy of X and Y is defined similarly as Applying (7) to a stochastic dynamical system {X t } t∈Z with a block-p joint distribution f t at time t, the block-p entropies at time t are given by There are two definitions of differential entropy rate which are equivalent for a strong-sense stationary stochastic process [13,28]. The first, which we denote ash 1 (X), defines the entropy rate in terms of the rate of growth of block-p entropies,h The second, which we denote ash 2 (X), defines entropy rate in terms of the entropy of a one-step-ahead future conditional on a sufficiently long past, While these are equivalent for strictly stationary stochastic processes, they need not be for an arbitrary process. Because we are interested in quantifying the predictability of a stochastic process over time, we take (10) as our definition of entropy rate,h(X) ≡h 2 (X). Clearly, care must be taken when interpreting the densities that appear in the definitions of entropies and entropy rates we have defined thus far, and this interpretation depends on the assumptions that the practitioner is willing or able to make about the system under consideration. In practice, the assumption is typically made that {X t } t∈Z is strong-sense stationary [25], or at least can be made so via transformations such as differencing or detrending. These assumptions are typically violated in practice. We make a less restrictive assumption on the system under consideration, namely that it is conditionally stationary [8].
A process is conditionally stationary if the conditional distribution function of X t+1 given (X t , . . . , X t−p+1 ) = x does not depend on t for some fixed p: that is, the statistical future of the process conditional on a past of sufficient length does not depend on when that past was observed. Strong-sense stationary processes and Markov processes are special cases of this type.
The value ofh(X) depends on h [X t+1 | X t 1 ] and thus on the conditional structure of the stochastic process. Consider the conditional entropy of X t given the block X t−1 t−p of length p. Under the assumption of conditional stationarity of order p, this conditional entropy can be rewritten as where going from (13) to (14) we have applied conditional stationarity. Thus, we see that the order p conditional entropy depends on two properties of the stochastic process: the state-specific entropy conditional on a particular past x p 1 , and the overall density of the pasts X p 1 . This decomposition motivates defining the state-specific entropy rate of order p at time t as = We will call h (p) t the specific entropy rate of order p, or simply the specific entropy rate where the order p is clear. We will specify a procedure for choosing p in Section 2.4. The specific entropy rate quantifies the unpredictability of the process conditional on the specific past x t−1 t−p observed immediately before time t. We see that (19) emphasizes the well-known fact that the difficulty in prediction can depend on the current state for both deterministic and stochastic nonlinear dynamics [76,75]. This is not the case for linear time series models, where the specific entropy rate is independent of the present state of the system. We note that our specific entropy rate is similar in spirit to the specific information of a stimulus [18] from computational neuroscience, local information measures from [44,43], and the Lyapunov-like index [76] from statistical nonlinear time series analysis. The specific information of a stimulus notes that the mutual information between two random variables R and S can be decomposed as where H denotes the discrete Shannon entropy. Thus, the specific information of a particular stimulus s for a response R is taken to be I , using a similar decomposition as (16). The local information measures go one further step back, defining the local information measures in terms of the argument of the expectation associated with the information measure. For example, the local entropy rate of order p at x p+1 in our definition. The Lyapunov-like index is defined in terms of divergences with respect to the past of conditional expectations of the future given the past, and thus measures uncertainty about the future given the past using solely the first moment of the predictive density.
In practice, the predictive density f (x p+1 | x p 1 ) is unknown and must be inferred from observations of the system under consideration. Thus, we consider the plug-in estimator for the specific entropy rate, namelŷ where we substitute an estimatorf (x p+1 | x p 1 ) for the unknown predictive density f (x p+1 | x p 1 ). Finally, if an estimator for the overall entropy rate (10) of the system is desired, we define the estimator a time-average of the specific entropy rates, using the empirical distribution over the pasts as an estimator for f t (x p 1 ) in (15). Before considering the problem of estimating the predictive densityf (x p+1 | x p 1 ), we note that we are really interested in the specific entropy of the predictive density and not the predictive density outright. Thus, the predictive density f (x p+1 | x p 1 ) is a nuisance parameter, and a difficult one to estimate especially in higher dimensions. Based on this insight, many information theoretic estimators has been proposed that directly estimate the quantity of interest without first estimating the underlying density. For example, many estimators have been proposed based on the statistics of k-nearest neighbors amongst the sample points [35,36,64,23,63,45]. In fact, many of these estimators correspond to plug-in estimators using variable bandwidth kernel density estimators [70], with the bandwidth varying with the evaluation point: the bandwidth is taken to be the distance to the k th nearest neighbor. A key aspect of our estimator, which we turn to in Section 2.4, is the use of model selection to directly learn which lags are relevant to prediction. A similar approach could be taken with the k th nearest neighbor-based estimators, letting k vary with each lag. We return to a discussion of this approach, and its relation to our method, in Section 4.

Conditional Density Estimation
The problem of estimating a conditional density goes back to the pioneering work of Rosenblatt [55]. We estimate the predictive density using the conditional kernel density estimator proposed in [26,27]. See [6] for additional theoretical results for density estimators for general stochastic processes. Consider a continuous-valued time series {X t } T t=1 for which we desire to estimate the predictive density f (x p+1 | x p 1 ). Recalling that the predictive density is given by we can estimate the predictive density by estimating the joint density f (x p 1 , x p+1 ) and the marginal density f (x p 1 ) and taking their ratio. We estimate the marginal and joint densities using the kernel density estimatorŝ andf respectively, where K k is the product kernel L kp+1 is the univariate kernel k 1 , . . . , k p+1 are the bandwidths, and K(·) is a kernel function, i.e. a positive, symmetric probability density with finite second moment. The estimator for the conditional densityf (x p+1 | x p 1 ) is then .
Note that the joint and marginal density estimators are coupled since they use the same bandwidths k 1 , . . . , k p for both the marginal and joint density estimators. This coupling is necessary to ensure that, for example, the conditional density integrates to one with respect to x p+1 . On a more practical level for time series, this coupling allows us to screen out the distant past. Consider, for example, the extreme case where the past is irrelevant to the future in terms of prediction. By this coupling, we can ignore the past by setting the bandwidths k 1 , . . . , k p to large values. This has the effect of givinĝ f (x p+1 | x p 1 ) ≈f (x p+1 ) and recovering the appropriate independence relationship. More generally, if q < p lags are sufficient to screen off the distant past, then by setting the bandwidths k 1 , . . . , k p−q sufficiently large we can recover . We discuss how to take advantage of this property of conditional kernel density estimators in more detail in the next section.

Bandwidth and Order Selection
The estimator of the conditional density function (28), and thus the estimator of the specific entropy rate (20), depends on the choice of the order p and bandwidths k 1 , . . . , k p+1 . We therefore require a principled and repeatable procedure for selecting them. For example, in the context of transfer entropy estimation, [30] noted how, depending on the choice of these parameters, the direction of causality can be reversed. Because our approach explicitly builds a statistical model for the dynamical system, we choose the order and bandwidths via l-block cross-validation [7] of the negative log-likelihood of the conditional density. (Note that [7] calls their method h-block cross-validation, which we rename in this manuscript to avoid confusion with differential entropy.) l-block cross-validation is an extension of leave-one-out cross-validation where instead of leaving out a single observation at each evaluation, we remove the observation and l observations on either side of that observation. That is, we seek the values of p and k = (k 1 , . . . , k p+1 ) that minimize wheref −t:l is the estimate of conditional density after removing the 2l + 1 observations about t. This accounts for a bias in 0-block cross-validated likelihood resulting from the dependence inherent in temporally nearby realizations of a time series. We immediately see that (29) takes the form of an entropy rate, so this cross-validation procedure can also be thought of as minimizing the entropy rate of the model. Thus, cross-validation provides a principled means for choosing the order of the entropy rate in analogy to common practices in the discrete-valued case. For example, when computing the entropy rate for discretevalued data, it is frequently recommended to choose the order of the entropy rate by searching for an asymptotic value for order-p entropy rate as a function of p [15]. Thus, our approach extends this heuristic to the continuous-valued case, with an additional penalty on p induced by the nature of cross-validation. Moreover, both theoretical and empirical work have shown that choosing the bandwidth via cross-validation can automatically 'smooth out' irrelevant predictors by setting their bandwidths very large [26,19]. This is clearly desirable in the time series case, since we expect to induce conditional independence between the distant past and the future after accounting for a sufficient portion of the recent past. By using cross-validation, we get this dimension reduction for free. Because of the computationally intensive nature of l-block cross-validation, we begin by fixing p and choosing the bandwidths (k 1 , . . . , k p+1 ) using 0-block cross-validation, which reduces to leave-one-out-cross-validation. Then, using these bandwidths, we choose p via l-block cross-validation. In all of the reported results, we use l = 50, thus leaving out 101 points about any evaluation in (29). In principle, the block size could be chosen using the autocorrelation time or lagged mutual information [33], or a data-driven approach [37]. We leave the exploration of these approaches for future work.

Relationship to Other Entropy Rate Estimators
In the nonlinear dynamics community, especially in applications to biological systems, two popular measures of the uncertainty associated with the dynamics of a system are Approximate Entropy [51] and Sample Entropy [53]. Despite their names, both of these quantities correspond to estimators of entropy rates rather than entropies. Approximate Entropy, as originally proposed by Pincus, was motivated by a finite-time, finite-resolution approximation to the Kolmogorov-Sinai Entropy of a deterministic dynamical system. The Sample Entropy was proposed as a modification to the Approximate Entropy that addressed several of its deficiencies. In [39], Lake elucidates the key connection between the Approximate and Sample Entropies and information theoretic entropy rates. In particular, Lake shows that the Approximate Entropy corresponds to a kernel density-based estimator of the Shannon differential entropy rate using uniform kernels and fixed bandwidths k 1 = k 2 = . . . = k p+1 , while the Sample Entropy corresponds to a kernel density-based estimator of the Rényi entropy rate with order α = 2, the so-called collision entropy, with a particular choice of definition for the conditional Rényi entropy. (Unlike conditional Shannon entropy, no standard definition of conditional Rényi entropy exists for arbitrary α [69].) In later work, recommendations were made for choosing the model order p [41], for setting the common bandwidth [40], and for incorporat-ing an adaptive bandwidth [38]. In the Appendix to this paper, we reproduce the derivation made in [39] connecting the Approximate Entropy statistic to kernel density-based estimators of the differential entropy rate.

Results
We consider entropy rate estimation in three examples of increasing realism. The first example, described in Section 3.1, applies the specific entropy rate estimator to a second-order Markov model. This example was designed to emphasize, in a simple way, the potential dependence of the specific entropy rate h t on the state of the system. In Section 3.2, we consider the entropy rate of interevent intervals resulting from an integrate-and-fire model driven by synthetic chaotic signals. This type of model is typically implicit in many of the analyses of biological signals ranging from heart rate variability to neural firing. This example demonstrates how our entropy rate estimator performs when the assumption of a stochastic dynamical system is violated. Finally, in Section 3.3, we demonstrate the specific entropy rate estimator using interbeat interval sequences resulting from a tilt table experiment.
Throughout these examples, we use the R package np [27] to estimate the conditional densities using second-order Gaussian kernels [74]. As recommended in the methodology section, for a particular model order p, we choose the bandwidths using leave-one-out cross-validation on the log-likelihood, and choose the model order p using l-block cross-validation with l = 50. We then estimate the specific entropy rate using (20).

A Second-Order Markov Process
Our first example is chosen to highlight the state-dependent nature of the specific entropy rate (19). We consider a stochastic dynamical system with three effective states. One of the states corresponds to a crossing event, when the system switches from positive to negative outputs or vice versa. This state has a high specific entropy rate. The other two states correspond to when the system settles into either a run of positive outputs or a run of negative outputs. In these states, the specific entropy rate is smaller. Explicitly, consider the second-order Markov process with the transition density where p + = p − = 0.1, and φ(x; µ, σ 2 ) is the probability density function for a normal random variable with mean µ and variance σ 2 , The transition densities for each effective state are shown in the left panel of Figure 1. The first effective state (red solid) corresponds to when the two previous observations were positive, the second effective state (blue dashed) corresponds to when the two previous observations were negative, and the third effective state (green dash-dotted) corresponds to when the two previous observations had opposite signs. The right panel of Figure 1 shows a scatter plot representation of the marginal density (X t , X t+1 ) with the quadrants colored by the effective states.  Figure 1: Left: The predictive densities associated with each of the effective states for the Markov process (30). Right: A scatter plot representation of the marginal density of (X t−2 , X t−1 ) with the effective states colored according to the convention in the left panel.
The top panel of Figure 2 shows an example realization with T = 1000 which we use to estimate the specific entropy rate. We can compute the specific entropy rate h t for each effective state exactly. By symmetry, the first two effective states have the same specific entropy rate, which we compute by evaluating (19) numerically: 1.744 nats per symbol. The third effective state's predictive density corresponds to a normal density with variance 9, and thus has specific entropy rate 1 2 log(2πe · 3 2 ) ≈ 2.518 nats per symbol. The bottom panel shows the specific entropy rate (dashed blue), along with the estimated specific entropy rate with p = 2 (solid red). From the specific entropy rate, we can clearly see when the system switches from one of the low specific entropy rate states to the high specific entropy rate state, and vice versa. Moreover, we see that the estimated entropy also displays these transitions, though not as cleanly.
To see the performance of the estimator as a function of the history, for each time point t we compute both the estimator of the specific entropy rateĥ t , as well as the empirical bias between the estimated and true value, Figure 3 displays the estimated specific entropy rate (left) and bias (right) as a function of the history (x t−2 , x t−1 ). As we saw in Figure 2, the estimator successfully distinguishes between the high entropy rate effective state (colored purple) and the low entropy rate effective states (colored yellow). Because the estimated specific entropy rate is always positive for this system, a positive bias indicates that the estimated entropy rate is larger (greater predictive uncertainty) than it should be, and a negative bias indicates that the entropy rate is smaller (lower predictive uncertainty) than it should be. We see that a large positive bias occurs for those pasts that belong to either the first (red) or second (blue) effective states, but lie near the border with the third (green) effective state. This occurs because of the discontinuous transition in the predictive density between each state. It is especially pronounced for those (rare) pasts near the origin, again because of the discontinuity. Finally, we demonstrate two snap shots of the system in Figure 4 to recall the intuition behind the specific entropy rate, and how it relates to the predictive density of the stochastic dynamical system. Each panel shows the state of  Figure 3: The estimated specific entropy rateĥ t (left) and its biasĥ t −h t (right) as a function of the history (X t−2 , X t−1 ) for the Markov model. Note that the estimator correctly identifies the high and low specific entropy rate histories, and its largest bias occurs near the transitions between quadrants.
the system (top) with the present state x t marked by a blue circle and the past (x t−2 , x t−1 ) marked by red circles, the estimated predictive densityf (· | x t−2 , x t−1 ) (middle), and the estimated specific entropy rate (bottom). The left panel corresponds to when the two past observations were positive, and thus the system is in one of the low entropy rate effective states. The right panel corresponds to when the two past observations were opposite in sign, and thus the system is in the high entropy rate effective state. We see that in both cases, the estimated predictive densities and estimated entropy rates agree with the effective states.

Interevent Intervals from an Integrate-and-Fire Model Driven by Chaotic Signals
For our second example, we consider interevent intervals resulting from an integrate-and-fire model driven by a chaotic signal. This model implicitly motivates many of the embedding-based analyses used with neural and heart rate variability data. For example, it is common to consider the times between heart beats (interbeat intervals or RR intervals) as if they are equispaced samples from a continuous time process, and then apply methods from nonlinear dynamics. There is not, a priori, any reason to assume that such an analysis of interevent interval data through this 'wrong' lens (e.g. treating the interevent times from a point process as the output from a map) should give rise to meaningful results. However, a surprising result by Sauer [57] demonstrates at least one scenario where this type of analysis does give rise to meaningful results. In particular, Sauer demonstrated that when the state of a chaotic dynamical system is  Figure 4: A demonstration at two adjacent time points of (top) a realization from the second order Markov model, (middle) the estimated predictive densitŷ f (x t | x t−2 , x t−1 ), and (bottom) the specific entropy rate for the second-order Markov process in low (left) and high (right) specific entropy rate states. In the top panels, the dashed vertical bar indicates the time t, the red points correspond to the specific pasts (x t−2 , x t−1 ), and the blue points correspond to the future values x t . mapped into an interevent interval sequence via an integrate-and-fire model, a one-to-one mapping exists between the full, unobserved state of the system and an embedding of the interevent interval sequence as long as the embedding is of dimension at least twice the box counting dimension of the underlying chaotic system. Thus, it is possible to recover the true state of the entire system by considering sufficiently long interevent interval sequences.
This fact poses a problem for the analysis of interevent interval data using quantities such as Approximate Entropy or Sample Entropy, since as we have noted those approximate differential entropy rates, and the differential entropy rate of a deterministic dynamical system is negative infinity. Thus, the quantity being used is at least potentially mis-specified for the phenomenon being studied. Nevertheless, it seems unlikely that the popularity of Approximate Entropy or Sample Entropy will abate in the near future [77], and thus it is interesting to consider how a more principled entropy rate estimator performs in the misspecified case. Moreover, in practice the deterministic dynamical system model is almost certainly mis-specified for complex systems. As noted in [21], there is hope that observational and dynamical noise might smooth out the infinities, thus resulting in useful estimates of entropy rates.
Consider a non-negative signal S(t) = g(x(t)) mapping the m-dimensional state x(t) ∈ R m of a chaotic dynamical system to a scalar value. The integrateand-fire model generates a series of discrete events based on when the integrated signal crosses a fixed threshold Θ. Setting T 0 = 0, for a fixed threshold value Θ, the threshold crossing events {T i } are defined recursively as Ti+1 Ti S(t) dt = Θ (33) and the interevent intervals are given by the time between event i − 1 and i, We consider signals generated by two classic chaotic systems, the Lorenz system evolving according toẋ with the canonical values σ = 10, β = 8/3, and ρ = 28, and the Rössler system evolving according toẋ with the canonical values of a = 0.1, b = 0.1, and c = 14. For both the Lorenz and Rössler systems, following [57], we take the signal to be S(t) = (x(t) + 2) 2 (36) and fix Θ = 60 and Θ = 125, respectively. Figure 5 demonstrates example realizations of the interevent intervals IEI i = T i −T i−1 by event index i (left) as well as a lag-lag plot of consecutive interevent intervals (right) for the Lorenz (top) and Rössler (bottom) systems. We see that the two systems give rise to very different time courses of interevent intervals, as we would expect from differing dynamics of the two systems. In particular, since both the x-and y-coordinates of the Rössler system evolve in a nearlylinear fashion, we see that the interevent intervals are relatively regular. By comparison, the interevent intervals for the Lorenz system are much more erratic. Thus, we might intuitively expect for the interevent intervals from the Lorenz system to give higher specific entropy rates than the interevent intervals from the Rössler system.
Next we turn to estimating the specific entropy rate for each of these systems. For each system, we generated interevent interval sequences of length T = 1000. We then chose the model order p and bandwidths (k 1 , . . . , k p+1 ) as described in Section 2.4. The 50-block cross-validated log-likelihood (29) as a function of p is shown in Figure 6. Based on the embedology [56] result from [57], an embedding of at least twice the box counting dimension of the underlying attractor is required. Both the Lorenz and Rössler attractors have box counting  dimensions between 2 and 3, thus we expect that a value of p around 6 should be sufficient for the predictive density. We see that the 50-block cross-validated log-likelihood chooses p = 9 and p = 8 for the Lorenz and Rössler systems.
As mentioned in Sections 2.3 and 2.4, using cross-validation to choose the bandwidths of the conditional kernel density estimator introduces a form of feature selection into the conditional density estimation process: lags that are not relevant, as measured by the cross-validation score, are smoothed out by setting their associated bandwidths to infinity (in practice, to a large value). We demonstrate this phenomenon now for the bandwidths estimated for the interevent intervals derived from the Lorenz and Rössler systems. For a fixed maximal lag p, Table 1 shows the bandwidths estimated for the Lorenz (top) and Rössler (bottom) systems. The first row indicates the bandwidths chosen by cross-validation for the future k 0 and past k −1 when we include only a single lag, the second row indicates the bandwidths chosen for the future k 0 and past (k −1 , k −2 ) when we include two lags, etc. A horizontal dash (-) indicates that cross-validation has set the bandwidth associated with that lag to a value of 5 or greater, which is large with respect to the scale of the dynamics, thus in effect ignoring the lag in the estimation of the predictive density. Note that these bandwidths are for Gaussian kernels, and thus are not immediately at the scale  of the data. A transformation from the Gaussian scale to the uniform scale could be performed using the concept of canonical kernels [46]. Comparing Table 1 to Figure 6, we see that for the interevent intervals generated by the Lorenz system, intervals 4 through 7 can be ignored. This agrees with the sharp drop in Figure 6 at p = 3. Then, the intervals 8 and 9 are included, but no others, thus giving the minimum at p = 9. A similar result holds for the bandwidths for the Rösslergoverned interevent intervals, where the bandwidths stabilize at p = 8, which also corresponds to the minima in the 50-block cross-validated log-likelihood. Beyond this automatic selection of relevant lags, we see that the magnitudes of the bandwidths are very different amongst the k = (k 0 , k −1 , . . . , k −p ): as one might expect, the bandwidths for the near past are smaller than the bandwidths for the distant past, i.e. we should pay more attention to the recent past for prediction. Compare this inherent dynamic range in the bandwidths across lags to the fixed bandwidths across lags used in other statistics such as Approximate Entropy, Sample Entropy, and Multiscale Entropy. If viewed as estimators of different differential entropy rates, these estimators would be severely biased by the fixed bandwidths. Now consider the specific entropy rate of two interevent interval sequences as a function of time, shown in Figure 7. Note that both the interevent intervals Table 1: The optimal bandwidths k = (k 0 , k −1 , . . . , k −p ) chosen using (29) with p fixed from 1 to 12 for the interevent intervals derived from the Lorenz (top) and Rössler (bottom) systems. A horizontal dash (-) indicates that crossvalidation set the bandwidth associated with that lag to a value of 5 or greater, in effect ignoring the lag in the estimation of the predictive density. The bold rows correspond to bandwidths selected for the minimal values of p as shown in Figure 6.  and specific entropy rates are shown as a function of the time rather than the event index. That is, for each interevent interval sequence, we show (T i , IEI i ) and (T i , h i ). The estimate of the time-averaged specific entropy rate (20) for the Lorenz and Rössler interevent interval sequences are −0.41 nats / event and −1.0 nat / event, respectively. In addition, we also show a moving windowed average of the specific entropy rate using a uniform kernel of width 60 au in red in the bottom panel of Figure 7. This can be thought of as a local (in time) version of (20), and allows us to determine if there are periods of time when the interevent intervals are more, or less, predictable. For example, we see a drop in the specific entropy rate for the Lorenz interevent intervals around 300 au, which corresponds to a run of relatively long and regular interevent intervals. We see from both (20) and its time-local version that the interbeat interval sequence derived from the Rössler system are more predictable, which matches our intuition as outlined above based on the near-linear dynamics of the xcoordinate of the Rössler system. The thresholds Θ were chosen such that each system has approximately equal mean interevent interval length: 0.90 au and 0.88 au for the Lorenz and Rössler systems, respectively. However, the pointwise standard deviations of the two interevent interval sequences are different: 0.39 au and 0.73 au for the Lorenz and Rössler systems, respectively. Recall that, unlike discrete entropy, differential entropy is not scale invariant. This motivates determining a scale invariant analog of the specific entropy rate that teases apart inherent unpredictability from the natural scale of the system. We will consider this point in the discussion section.
As a final example, we consider estimation of the specific entropy rate where the interevent interval sequence transitions from being generated by the Lorenz system to being generated by the Rössler system and back again. In this case, the interevent interval sequence is clearly non-stationary. However, conditional stationarity is only violated locally in time around the transitions. To generate this time series, we concatenate 500 interevent intervals each from the Lorenz, Rossler, and Lorenz systems, and thus T = 1500. This sequence is shown in the top panel of Figure 8. We estimate the autoregressive order p over the entire time series using (29). The 50-block cross-validated log-likelihood as a function of p is shown in Figure 9. The minima occurs at p = 11. Note that this is a higher order than chosen for either the Lorenz (p = 9) or Rössler (p = 8) systems when estimated in isolation. We see that additional information about the past is required when we need to distinguish between the two systems. Finally, Table 2 demonstrates the bandwidths chosen by cross-validation as a function of the maximal lag p. Again, we see that cross-validation provides both model selection and adaptive smoothing.
The bottom panel of Figure 8 shows the specific entropy rate as a function of time for the concatenated system. As before, the black line is the specific entropy rate, and the red line is a moving windowed average of the specific entropy   The optimal bandwidths k = (k 0 , k −1 , . . . , k −p ) chosen using (29) with p fixed from 1 to 12 for the interevent intervals derived from the concatenation of the Lorenz, then Rössler, then Lorenz systems. A horizontal dash (-) indicates that cross-validation set the bandwidth associated with that lag to a value of 5 or greater, in effect ignoring the lag in the estimation of the predictive density. The bold row correspond to bandwidths selected for the minimal value of p as shown in Figure 9. rate. Again we see that the specific entropy rate drops as the system transitions from the Lorenz interevent intervals to the Rössler interevent intervals and then increases after the transition back to the Lorenz interevent intervals. There is, however, a slight penalty to estimating the specific entropy rate for the concatenated interevent interval sequences all at once. During the Lorenz-governed interevent interval sequence, the time-averaged specific entropy rates are −0.30 nats / event and −0.28 nats / event, compared to −0.41 nats / event when estimated in isolation. Similarly, the time-averaged specific entropy rate for the Rössler-governed interevent interval sequence is −0.72 nats / event compared to −1.0 nats / event when estimated in isolation. In both cases, we see that the specific entropy rates have increased. This is largely due to the fact that the optimal bandwidths k 1 , . . . , k p+1 when estimating the predictive density for either system in isolation are not optimal for estimating the concatenation of the two systems. This will lead to larger bandwidths overall, and thus higher specific entropy rates. For this system, the difference in the dynamics is very large and the transition point relatively obvious, and thus a better approach might be to estimate the predictive densities separately for each segment. However, in those cases where such transitions are non-obvious or where manual transition detection is not desirable, we see that estimating the predictive density all at once still leads to discrimination between high and low specific entropy rates. Figure 10 demonstrates the interevent interval sequence (top), predictive density (middle), and specific entropy rate (bottom) for interevent interval sequence for two time instants during the Lorenz (left) and Rössler (right) portions. The time instant during the portion governed by the Lorenz system has a higher specific entropy rate, as we would expect given the multi-modal nature of the estimated predictive density in the middle panel. In contrast, the time instant during the portion governed by the Rössler system has a lower specific entropy rate, as we would expect from the uni-modal and narrow estimated predictive density. However, we see that in both cases, the specific entropy rate can vary widely depending on the state of the system. For example, during periods around the long interevent intervals, the interevent intervals generated by the Rössler system can have higher specific entropy rates than those governed by the Lorenz system (the peaks in the specific entropy rate).

Specific Entropy Rate from a Tilt Table Experiment
As a last example, we consider the specific entropy rates of interbeat interval sequences from subjects participating in a tilt table experiment. It is well known by anyone with a heart that the rate of their pulse, the average number of beats within a specified window of time, can vary widely based on environmental, physiological, and psychological factors. However, it was not until the 20th century that researchers came to realize that beat-to-beat variations in heart rate convey information about the health of individuals. The study of beat-tobeat variations in heart rate is typically referred to under the umbrella term of heart rate variability. See [52,4,5] for a historical perspective on heart rate variability. The nonlinear dynamics community has contributed a large number of methods for the analysis of interbeat intervals. See [73] for an extensive historical and methodological review.
In what follows, we use the term interbeat interval (IBI) to refer to the times between the R components of adjacent QRS complexes associated with heartbeats. Common statistics computed from heart rate variability data include the mean interbeat interval and the standard deviation of the interbeat intervals. In addition, it is common to interpolate the interbeat interval sequence to obtain an equi-spaced sequence for spectral analysis [17], from which the power of high frequency and low frequency components, and their ratio, are commonly reported. It is also very common to compute Approximate and / or Sample Entropies of interbeat interval sequences. Any, and sometimes all, of these statistics are referred to as heart rate variability (HRV), and thus we will refrain from using that term. Many of these quantities can be computed by off-the-shelf software tailored for heart rate variability analysis such as Kubios [68], though we recommend caution when using such software since many of the parameters involved in both pre-processing of the data and its analysis are set in an ad hoc fashion.
As before, our approach to analyzing an interbeat interval sequence is to view it as the realization of some conditionally stationary stochastic dynamical system. This perspective naturally handles the fact that heart beats occur as a point process in time, as we saw in the previous section. Thus, we can compute the specific entropy rate associated with the time until the next heart beat, conditional on the most recent interbeat intervals. That is, if we denote the time between the (i − 1) th and i th heart beat by IBI i , we consider the specific entropy rate as h IBI i | IBI i−1 i−p . We will investigate the specific entropy rate from the interbeat interval sequences of five subjects participating in a tilt table experiment. The population consisted of two males and three females between the ages of 27 and 44. In the experiment, the subject initially positioned him/herself in a prone position on the table and was secured to the table. The subject was then kept in the supine position for 5 minutes, then tilted upright for 5 minutes, and finally was returned to a supine position for 5 minutes. An ECG was continuously recorded throughout the experiment. The interbeat intervals were extracted using the AF1 algorithm from [22]. Specific entropy rates were computed for each subject using model orders p and bandwidths (k 1 , . . . , k p+1 ) chosen as described in Section 2.4. The interbeat interval sequences (top) and specific entropy rates (bottom) for each subject are shown in Figure 11. For each subject, we see the expected decrease in interbeat interval length (increase in heart rate) as they move from a supine to upright position. However, for subjects (a-d), this change in mean interbeat interval length is also associated with a change in the overall dynamics of the interbeat interval sequence, which results in a drop in the specific entropy rate during the upright time period. With the return to supine position, the interbeat interval lengths again increase (the heart rate decreases), and the specific entropy rates of subjects (a-d) return to the same level as the start of the experiment. Clearly, with only five subjects and a single session from each subject, we cannot say much about either typical or atypical evolution of specific entropy rates in a tilt table experiment. However, it is interesting to note that subject (e), the only outlier in terms of the evolution of their specific entropy rate over time, is also the only subject with a traumatic brain injury in their past. Head trauma has been associated with changes in both spectral and information theoretic properties of interbeat interval sequences at rest [65,49]. Our results corroborate these findings, and suggest that additional studies that include a physiological stressor, such as the tilt table, may be even more disclosing.

Discussion and Future Directions
An important consideration for any estimator relates to how it behaves under error or in the presence of noise. Care must be taken with respect to how one defines error, however. For example, does error refer to observational noise, model uncertainty / misspecification, or unobserved factors [15]? We have not considered the impact of observational noise, for example, because the measurements we have considered, namely interevent and interbeat intervals, can be treated as relatively noise free. However, if observational noise is a major concern, then the estimation of specific entropy rate must be carefully applied in this context, since direct estimation from the observed signal will combine dynamical and observational uncertainties. Possible solutions include the errorsin-variables model for density estimation [6] or more general nonlinear filtering approaches [67].
We have considered only fixed bandwidths for the conditional kernel density estimator in estimating the specific entropy rate: regardless of the past and future states of the system, we use the same bandwidths in estimating the predictive density. In Section 3.2, we saw a scenario where this estimation strategy may be problematic: the typical scale of the interevent intervals differed between the Lorenz-governed and Rössler-governed periods, and this led to suboptimal bandwidths. Alternative variable bandwidth density estimation schemes allow the bandwidths to vary with either the data used in estimation of the density or the point of evaluation [70]. For example, the estimator for the differential entropy of a random vector developed in [35] based on k-nearest neighbor statistics is equivalent to a plug-in estimator of the differential entropy using a kernel density estimator with a bandwidth that varies with the point of evaluation, in this case the distance to the k th nearest neighbor of the evaluation point, along with an additional bias correction term. Many other estimators, such as the popular Kraskov-Stögbauer-Grassberger mutual information estimator [36], fall into this category. Future work will explore the tradeoff between the resolution gained by variable bandwidth estimators of specific entropy rate and the statistical and computational burden imposed. One recent approach along these lines used a variable bandwidth kernel density estimator to estimate the transfer entropy for various simulated systems [78].
Another potential issue with scaling, as we again saw in Section 3.2, is that differential entropy, and thus differential entropy rate, is not invariant to scaling. For example, changing the units used to measure the system under consideration will result in a linear shift to the differential entropy. Depending on the application at hand, this may or may not be problematic. If comparing the entropy rate of multiple time series, all with the same units, then the lack of invariance to scale washes out. However, if one is analyzing a single time series that has large variations in its characteristic scale over time, then the dependence on scaling may be problematic. One potential alternative is to normalize the differential entropy rate using the typical scale of the system at any given instant. A good candidate for this is the negentropy [11] of a random variable, which normalizes the differential entropy by the differential entropy associated with a Gaussian density with the same variance. The negentropy, unlike the differential entropy, is invariant to affine transformations of a random variable. Thus, we might define a specific negentropy rate by normalizing the specific entropy rate by an instantaneous measure of the variance. This is analogous to the redundancy [15] of discrete-state stochastic processes, which normalizes the entropy rate of a stochastic process by the entropy rate of a uniformly distributed process with the same alphabet.
Any method that utilizes either Approximate or Sample Entropies could be modified to use our specific entropy rate estimator. For example, the multiscale entropy [12], which is defined as the Sample Entropy of a time series at varying levels of aggregation, could easily be modified by direct substitution with the specific entropy rate. This would allow for not only an analysis of the unpredictability across scales, but also across time. Similarly, the point process model of interbeat interval sequences introduced in [2, 10, 71] is a particular parametric form for the stochastic dynamical system (2). In a sequel [72], the authors propose using the filtered state from this model to estimate what they call the inhomogeneous point-process entropy. They estimate this quantity using either the Approximate or Sample Entropies, and thus based on the analysis from [39], we see that their estimator is for the unscaled Shannon or Rényi entropy rate of the filtered state. Thus, the specific entropy rate could be used on the filtered state.
Our approach to specific entropy rate estimation via conditional kernel density estimation can also be extended to any of the various other information theoretic measures gaining popularity including transfer entropy [58,30], causation entropy [66], and co-/multi-informations [3]. Many of these quantities would benefit from a data-driven approach to bandwidth selection, in addition to the automatic dimension reduction such approaches induce. However, we also note that with each additional probabilistic conditioning required by these measures, we increase both the statistical and computational burden for constructing the appropriate estimator. For example, the convergence rate of kernel density-based estimators for many information theoretic quantities scale exponentially in the reciprocal of the dimension of the random vector [32], while their time complexities scale exponentially in the dimension of the random vector [63].

Conclusions
Via a decomposition of the entropy rate of a discrete-time, continuous-valued stochastic dynamical system, we have proposed a measure of state-specific uncertainty: the specific entropy rate. We have shown how to estimate the specific entropy rate from finite data using kernel density estimators, and provided a data-driven method for choosing the free parameters in the kernel density estimation. Given the immense popularity of heuristic approaches to entropy rate estimation such as Approximate Entropy and Sample Entropy, it is our hope that a more principled approach to entropy rate estimation will be found useful by the larger research community.
All of the software used in this paper was developed in R via extensions to the np library for kernel density estimation. We plan to make this implementation available online. In an effort to match the naming convention applied to Approximate Entropy (ApEn) and Sample Entropy (SampEn), we call our R implementation spenra for sp(ecific) en(tropy) ra(te). The R implementation of spenra will be hosted at http://github.com/ddarmon/spenra.

Acknowledgements
The author thanks C. Wang, D. Keyser, C. Cellucci, and P. Rapp for valuable discussions, and D. Nathan for providing the data from the tilt table experiment.
given by Finally, we compute the average logarithm of (37) across all of the vectors, giving Iff were replaced with the true density f , then for large T , Φ (p) normalized (r) approximates the negative joint differential entropy h X (p) = −E log f X (p) by the Law of Large Numbers. However, because we evaluate the estimatorf at the same data used to estimate it, (46) is a biased estimator of the negative differential entropy −h X (p) . A simple modification of (46), due to [32,31], provides an estimator for the joint differential entropy with a fast rate of convergence in the IID case. In particular, letf −t be the kernel density estimator for the joint density formed by leaving out the t th vector X t . That is, we estimate the joint density using (45) with all of the vectors except X t . This gives the leave-one-out (LOO) estimator for the joint differential entropy, Thus we see that with the proper normalization, a modification of the Approximate Entropy gives an estimator for the finite-p differential entropy rate,