2.1. Entropy, Dimension and Excess Entropy
Our starting point is a (in general vector-valued) stationary time series $\vec{x}_t$. It is used to reconstruct the phase space of the underlying dynamical system by delay embedding: $\vec{X}_t = (\vec{x}_t, \vec{x}_{t-\tau}, \ldots, \vec{x}_{t-(m-1)\tau})$. If the original time series was $d$-dimensional, the reconstructed phase space will be of dimension $m\,d$. More general embeddings are also possible, e.g., involving different delays or different embedding dimensions for each component of the time series. In the following, we will only consider the case $d = 1$, i.e., scalar time series. The generalization to vector-valued time series is straightforward. Let us assume that we are able to reconstruct approximately the joint probability distribution of the state vectors in the reconstructed phase space. Then we can characterize its structure using measures from information theory. Information theoretic measures represent a general and powerful tool for the characterization of the structure of joint probability distributions [20,21]. The uncertainty about the outcome of a single measurement of the state, i.e., about $\vec{X}_t$, is given by its entropy. For a discrete-valued random variable X with values $x \in \mathcal{X}$ and probability distribution $p(x)$ it is defined as
$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x).$$
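To make the construction concrete, the following minimal Python sketch builds delay vectors from a scalar series and estimates the entropy of the ε-discretized state. The toy signal, the resolution eps = 0.1, and all helper names are illustrative choices, not code from the paper.

```python
import numpy as np

def delay_embed(x, m, tau):
    """Stack m delayed copies of the scalar series x (delay tau) into state vectors."""
    n = len(x) - (m - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(m)])

def binned_entropy(points, eps):
    """Shannon entropy (in nats) of the points partitioned into hypercubes of side eps."""
    cells = np.floor(points / eps).astype(np.int64)
    _, counts = np.unique(cells, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

# toy example: noisy oscillation, embedded with m = 3 and tau = 5, resolution eps = 0.1
rng = np.random.default_rng(1)
t = np.arange(20000)
x = np.sin(0.05 * t) + 0.05 * rng.standard_normal(len(t))
X = delay_embed(x, m=3, tau=5)
print(binned_entropy(X, eps=0.1))
```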
An alternative interpretation of the entropy is the average number of bits required to encode a new measurement. In our case, however, the observations are considered as continuous-valued observables (that are measured with a finite resolution). For continuous random variables with a probability density $p(x)$ one can also define an entropy, the differential entropy
$$h(X) = -\int p(x) \log p(x)\,\mathrm{d}x. \qquad\qquad (1)$$
However, it behaves differently than its discrete counterpart: It can become negative, and it even becomes minus infinity if the probability measure for X is not absolutely continuous with respect to the Lebesgue measure—for instance, in the case of the invariant measure of a deterministic system with an attractor dimension smaller than the phase space dimension. Therefore, when using information theoretic quantities for characterizing dynamical systems, researchers often prefer the entropy for discrete-valued random variables. In order to use it for dynamical systems with continuous variables, usually either partitions of the phase space or entropy-like quantities based on coverings are employed. These methods do not explicitly reconstruct the underlying invariant measure, but exploit the neighbor statistics directly. Alternatively, one could use direct reconstructions using kernel density estimators [22] or methods based on maximizing relative entropy [23,24] to obtain parametric estimates. These reconstructions, however, will always lead to probability densities and are therefore not suitable for reconstructing the fractal measures which appear as invariant measures of deterministic systems.
In this paper, we use estimators based on coverings, i.e., correlation entropies [25] and nearest-neighbor based methods [26], considered in Section 2.3 below. For the moment, let us consider a partition of the phase space into hypercubes with side-length ε. For a more general definition of an ε-partition see [27]. In principle one might consider scaling the different dimensions of the state vector differently, but for the moment we assume that the time series was measured using an appropriate rescaling. The entropy of the state vector observed with an ε-partition will be denoted in the following as $H(\varepsilon)$, with ε parameterizing the resolution.
How does the entropy change if we change ε? The uncertainty about the potential outcome of a measurement will increase if the resolution of the measurement is increased, because of the larger number of potential outcomes. If the state is an m-dimensional random variable distributed according to a probability density function, we have asymptotically for $\varepsilon \to 0$
$$H(\varepsilon) \approx h(\vec{X}) - m \log \varepsilon, \qquad\qquad (2)$$
with $h(\vec{X})$ the differential entropy (1); see [27] or [28] (ch. 8, p. 248, theorem 8.3.1). This is what we would expect for a stochastic system. However, if we observe a deterministic system, the behavior of an observable depends on how its dimension relates to the attractor dimension. If the embedding dimension is smaller than the attractor dimension, the deterministic character will not be resolved and Equation (2) still applies. However, if the embedding dimension is sufficiently high (larger than the attractor dimension [29]), then instead of a density function we have to deal with a D-dimensional measure and the entropy will behave as
$$H(\varepsilon) \approx \mathrm{const} - D \log \varepsilon. \qquad\qquad (3)$$
If a behavior such as in Equations (2) or (3) is observed for a range of ε values, we will call this range a stochastic or deterministic scaling range, respectively.
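The two scaling behaviors can be illustrated with synthetic data; the point sets, sample sizes, and resolutions below are arbitrary illustrative choices. For a two-dimensional Gaussian cloud the binned entropy grows by roughly $2\log 2$ per halving of ε (m = 2), whereas for points on a circle (a set with D = 1 embedded in two dimensions) it grows by only about $\log 2$.

```python
import numpy as np

def binned_entropy(points, eps):
    """Shannon entropy (nats) of points partitioned into hypercubes of side eps."""
    cells = np.floor(points / eps).astype(np.int64)
    _, counts = np.unique(cells, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
cloud = rng.standard_normal((50000, 2))                # density in 2-d: slope m = 2
phi = rng.uniform(0.0, 2.0 * np.pi, 50000)
circle = np.column_stack([np.cos(phi), np.sin(phi)])   # 1-d set in 2-d: slope D = 1

for eps in (0.4, 0.2, 0.1):
    print(f"eps={eps:4.2f}  H_cloud={binned_entropy(cloud, eps):5.2f}"
          f"  H_circle={binned_entropy(circle, eps):5.2f}")
```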
Conditional entropy and mutual information. Let us consider two discrete-valued random variables X and Y with values $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, respectively. Then the uncertainty of a measurement of X is quantified by $H(X)$. Now we might ask: what is the average remaining uncertainty about X if we have already seen Y? This is quantified by the conditional entropy
$$H(X \mid Y) = -\sum_{x,y} p(x,y) \log p(x \mid y).$$
The reduction of uncertainty about X knowing Y is the information that Y provides about X and is called the mutual information between X and Y,
$$I(X;Y) = H(X) - H(X \mid Y).$$
Having defined the ε-dependent state entropy $H(\varepsilon)$, we can now ask how much information the present state contains about the state of the system at the next time step. The answer is given by the mutual information between $\vec{X}_t$ and $\vec{X}_{t+1}$:
$$I(\vec{X}_{t+1};\vec{X}_t) = H(\vec{X}_{t+1}) - H(\vec{X}_{t+1} \mid \vec{X}_t).$$
Using Equation (2) one sees that for stochastic systems the mutual information remains finite in the limit $\varepsilon \to 0$ and can be expressed in terms of the differential entropies:
$$I(\vec{X}_{t+1};\vec{X}_t) = h(\vec{X}_{t+1}) + h(\vec{X}_t) - h(\vec{X}_{t+1},\vec{X}_t).$$
Note that this mutual information is invariant with respect to coordinate transformations of the system state, i.e., if $f$ is a continuous and invertible function, then $I(f(\vec{X}_{t+1}); f(\vec{X}_t)) = I(\vec{X}_{t+1};\vec{X}_t)$. However, in the case of a deterministic system, the mutual information diverges for $\varepsilon \to 0$. This is reasonable behavior because in principle the actual state contains an arbitrarily large amount of information about the future. In practice, however, the state is known only with a finite resolution determined by the measurement device or the noise level.
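A minimal histogram-based sketch of these quantities follows; the bin counts, the AR(1) toy process, and all names are illustrative assumptions. It estimates $I(x_t;x_{t+1}) = H(x_t) + H(x_{t+1}) - H(x_t,x_{t+1})$ from a two-dimensional histogram; for the stochastic process the value stays finite as the binning is refined.

```python
import numpy as np

def entropy_from_probs(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(x, y, bins=32):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) in nats, from a joint histogram (`bins` bins per axis)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    return (entropy_from_probs(pxy.sum(axis=1)) + entropy_from_probs(pxy.sum(axis=0))
            - entropy_from_probs(pxy.ravel()))

# AR(1) process: I(x_t; x_{t+1}) converges to a finite value as the resolution increases
rng = np.random.default_rng(0)
x = np.zeros(100000)
for t in range(1, len(x)):
    x[t] = 0.9 * x[t - 1] + rng.standard_normal()
for bins in (16, 32, 64):
    print(bins, mutual_information(x[:-1], x[1:], bins=bins))
```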
Predictive information, excess entropy and entropy rate. The unpredictability of a time series can be characterized by the conditional entropy of the next state given the previous states. In the following we will use an abbreviated notation for these conditional entropies and the involved block entropies: $h_m(\varepsilon) := H(x_{t+1} \mid x_t, \ldots, x_{t-m+1}) = H_{m+1}(\varepsilon) - H_m(\varepsilon)$, where $H_m(\varepsilon)$ denotes the entropy of the block $(x_t, \ldots, x_{t-m+1})$ observed with resolution ε. The entropy rate ([28] ch. 4.2) is this conditional entropy if we condition on the infinite past, $h(\varepsilon) := \lim_{m\to\infty} h_m(\varepsilon)$. In the following we assume stationarity, i.e., we have no explicit time dependence of the joint probabilities and therefore also none of the entropies. Moreover, if it is clear from the context which stochastic process is considered, we will write $h_m$ and $h$ instead of $h_m(\varepsilon)$ and $h(\varepsilon)$, respectively, and it holds that the $h_m$ decrease monotonically towards $h$. For deterministic systems the entropy rate converges in the limit $\varepsilon \to 0$ to the Kolmogorov–Sinai (KS) entropy [30,31], which is a dynamical invariant of the system in the sense that it is independent of the specific state space reconstruction. Moreover, already for finite m, $h_m(\varepsilon)$ will not depend on ε for sufficiently small ε because of Equation (3).
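The following sketch estimates the conditional entropies $h_m$ for a discretized (symbolic) sequence as differences of block entropies, $H_{m+1} - H_m$. The order-2 Markov chain and all helper names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def block_entropy(sym, m):
    """Shannon entropy (nats) of length-m blocks of a symbolic sequence."""
    blocks = np.column_stack([sym[i : len(sym) - m + 1 + i] for i in range(m)])
    _, counts = np.unique(blocks, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def conditional_entropies(sym, m_max):
    """h_m = H_{m+1} - H_m for m = 1..m_max; they decrease towards the entropy rate."""
    H = [block_entropy(sym, m) for m in range(1, m_max + 2)]
    return [H[m] - H[m - 1] for m in range(1, m_max + 1)]

# toy order-2 Markov chain: h_1 > h_2, and h_2, h_3, ... already equal the entropy rate
rng = np.random.default_rng(2)
s = [0, 1]
for _ in range(200000):
    p_one = 0.9 if s[-1] != s[-2] else 0.2
    s.append(int(rng.random() < p_one))
print(conditional_entropies(np.array(s), m_max=4))
```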
To quantify the amount of predictability in a state sequence one might consider subtracting the unpredictable part from the total entropy of the state sequence. By doing this one ends up with a well-known complexity measure for time series, the excess entropy [14,15] or effective measure complexity [16],
$$E(\varepsilon) = \lim_{m\to\infty}\bigl(H_m(\varepsilon) - m\, h(\varepsilon)\bigr),$$
with $H_m(\varepsilon)$ the block entropy and $h(\varepsilon)$ the entropy rate introduced above. The excess entropy provides a lower bound for the amount of information necessary for an optimal prediction. For deterministic systems, however, it will diverge, because $H_m(\varepsilon)$ behaves according to Equation (3) while $h_m(\varepsilon)$ becomes ε-independent for sufficiently large m and small ε, respectively:
$$E(\varepsilon) \approx \mathrm{const} - D \log \varepsilon, \qquad\qquad (16)$$
with D being the attractor dimension.
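A common finite-m approximation of the excess entropy of a symbolic sequence is $E_m = H_m - m\,(H_m - H_{m-1})$, i.e., the block entropy minus m times the current entropy-rate estimate. The sketch below, with its toy order-1 Markov chain, is an illustrative assumption and not the paper's estimator, which instead works on the ε-discretized continuous states.

```python
import numpy as np

def block_entropy(sym, m):
    """Shannon entropy (nats) of length-m blocks of a symbolic sequence."""
    blocks = np.column_stack([sym[i : len(sym) - m + 1 + i] for i in range(m)])
    _, counts = np.unique(blocks, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def excess_entropy(sym, m_max):
    """Finite-m approximation E_m = H_m - m * (H_m - H_{m-1})."""
    H_prev, H_cur = block_entropy(sym, m_max - 1), block_entropy(sym, m_max)
    return H_cur - m_max * (H_cur - H_prev)

rng = np.random.default_rng(2)
s = [0]
for _ in range(200000):
    p_one = 0.9 if s[-1] == 1 else 0.2      # order-1 Markov chain
    s.append(int(rng.random() < p_one))
print(excess_entropy(np.array(s), m_max=6))
```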
The predictive information [13] is the mutual information between the semi-infinite past and the future of the time series. If the limits in Equations (14) and (17) exist, the predictive information is equal to the excess entropy. For the finite-time variants the two quantities differ in general. However, if the system is Markov of order p, the conditional probabilities only depend on the previous p time steps; hence the conditional entropies $h_m$ no longer change for $m \ge p$ and the finite-time variants already attain their limiting values.
2.2. Decomposing the Excess Entropy for Continuous States
In the literature [13,15,16] both the excess entropy and the predictive information were studied only for a given partition—usually a generating partition. Thus, using the excess entropy as a complexity measure for continuous-valued time series has to deal with the fact that its value will be different for different partitions—even for different generating ones.
In Equation (16) we have seen that the excess entropy for deterministic systems becomes infinite in the limit $\varepsilon \to 0$. The same applies to the predictive information defined in Equation (18). Moreover, we have seen that the increase of these complexity measures with decreasing ε is controlled by the attractor dimension of the system. Does this mean that in the case of deterministic systems the excess entropy as a complexity measure reduces to the attractor dimension? Not completely. The constant in Equation (16) reflects not only the scale of the signal, but also statistical dependencies or memory effects in the signal, in the sense that it will be larger if the conditional entropies converge more slowly towards the KS-entropy.
How can we separate the different contributions? We start by rewriting Equation (15) as a sum. Using the conditional entropies (Equation (13)) we obtain Equation (19). Using the differences between successive conditional entropies, $d_m(\varepsilon) := h_{m-1}(\varepsilon) - h_m(\varepsilon)$, the excess entropy can be rewritten as the sum in Equation (20). Note that the difference $d_m$ is a conditional mutual information: it measures dependencies over m time steps that are not captured by dependencies over $m-1$ time steps. In other words, it quantifies how much the uncertainty about the next value can be reduced if, in addition to the past $m-1$ steps, also the m-th step into the past is taken into account. For a Markov process of order p the $d_m$ vanish for $m > p$. In this case the sum in Equation (20) contains only a finite number of terms. On the other hand, truncating the sum at a finite m can be interpreted as approximating the excess entropy by the excess entropy of an approximating Markov process. What can be said about the scale dependency of the $d_m$? From Equations (2) and (3) it follows that the $h_m$ still scale as $-\log\varepsilon$ as long as the embedding does not resolve the attractor, and become ε-independent for sufficiently large m. Using this and Equation (19) we have to distinguish four cases for deterministic systems (Equation (21)). Note that $\{D\}$ denotes the fractional part of the attractor dimension.
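Before turning to the decomposition, the $d_m$ themselves can be illustrated for a discretized sequence; the following assumption-laden toy (not the paper's procedure, which works on ε-partitions of the continuous states) computes $d_m = h_{m-1} - h_m$ from block entropies. For the order-2 Markov chain used here only $d_2$ is clearly non-zero.

```python
import numpy as np

def block_entropy(sym, m):
    """Shannon entropy (nats) of length-m blocks of a symbolic sequence."""
    blocks = np.column_stack([sym[i : len(sym) - m + 1 + i] for i in range(m)])
    _, counts = np.unique(blocks, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def d_terms(sym, m_max):
    """d_m = h_{m-1} - h_m (for m = 2..m_max), with h_m = H_{m+1} - H_m."""
    H = [block_entropy(sym, m) for m in range(1, m_max + 2)]
    h = np.diff(H)          # h_1, ..., h_{m_max}
    return -np.diff(h)      # d_2, ..., d_{m_max}

# order-2 Markov chain: only d_2 should be clearly non-zero (up to estimation error)
rng = np.random.default_rng(3)
s = [0, 1]
for _ in range(200000):
    p_one = 0.9 if s[-1] != s[-2] else 0.2
    s.append(int(rng.random() < p_one))
print(d_terms(np.array(s), m_max=5))
```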
Thus, the sum in Equation (20) can be decomposed into three parts (Equation (22)), with the ε dependence showing up only in the middle term (MT), Equation (23), which involves the length scale at which the deterministic scaling range starts. Therefore we have decomposed the excess entropy into three terms: two ideally ε-independent terms and one ε-dependent term. The first term will be called “state complexity” in the following because it is related to the information encoded in the state of the system. The constant c was added in Equation (23) in order to ensure that the ε-dependent term vanishes at the beginning of the deterministic scaling range. The second ε-independent term will be called “memory complexity” because it is related to the dependencies between the states at different time steps. What we call “state” in this context is related to the minimal embedding dimension required to see the deterministic character of the dynamics [29]. In order to obtain a one-to-one reconstruction of the attractor, a higher embedding dimension might be necessary [32]. Both ε-independent terms together will be called the “core complexity”.
So far we have only addressed the case of a deterministic scaling range. In the case of a noisy chaotic system we have to distinguish two ε regions: the deterministic scaling range described above and a noisy scaling range at scales below a crossover determined by the noise level. In the stochastic scaling range all $d_m$ become ε-independent and the decomposition of Equation (22) seems to become unnecessary. This is not completely the case. Let us assume that the crossover between the two regions happens at a length scale $\varepsilon_c$. Moreover, let us assume that for sufficiently large m the conditional entropies in the deterministic scaling range are already close to the KS-entropy (see Section 2.5 for an example). Then Equation (27) allows us to express the crossover scale $\varepsilon_c$ in terms of the KS-entropy and the continuous entropy related to the noise level, Equation (28). Moreover, the excess entropy in the deterministic scaling range behaves as in Equation (29). Evaluating this expression at the crossover length scale allows us to express the value of the excess entropy in the stochastic scaling range, Equation (30). In particular, this expression shows that decreasing the noise level, which shifts the crossover scale $\varepsilon_c$ to smaller values, will increase the asymptotic value of the excess entropy for noisy systems. Thus, an increased excess entropy or predictive information at a fixed length scale or partition can be achieved in many ways:
by increasing the dimension D of the dynamics
by decreasing the noise level
by increasing the amplitude of the signal
by increasing the state complexity
by increasing the correlations measured by the “memory complexity”, i.e., by increasing the predictability
by decreasing the entropy rate, i.e., by decreasing the unpredictability
Naturally, the effect of the noise level will be observed in the stochastic scaling range only. In practice there might be more than one deterministic and stochastic scaling range, or even no clear scaling range at all. How we deal with these cases is described below, where we introduce our algorithm.
2.3. Methods for Estimating the Information Theoretic Measures
Reliably estimating entropies and mutual information is very difficult in high-dimensional spaces due to the increasing bias of entropy estimates. Therefore we employ two different approaches. On the one hand, we use an algorithm for the estimation of the mutual information proposed by Kraskov et al. [26], based on nearest neighbor statistics, which reduces the bias by employing partitions of different sizes in spaces of different dimensions. On the other hand, we calculate a proxy for the excess entropy using correlation entropies [25] of order 2. These are related to the Rényi entropies of second order, and the correlation sum provides an unbiased estimator. Both methods do not require binning but differ substantially in what they compute.
Estimation via local densities from nearest neighbor statistics (KSG). The most common approach to estimating information quantities of continuous processes, such as the mutual information, is to calculate the differential entropies (1) directly from nearest neighbor statistics. The key idea is to use nearest neighbor distances [33,34,35] as proxies for the local probability density. This method corresponds, in a way, to an adaptive bin size for each data point. For the mutual information (required, e.g., to calculate the PI (Equation (18))), however, it is not recommended to calculate it naively from the individual entropies of X, Y and their joint distribution, because they may have very dissimilar scales, such that the adaptive binning leads to spurious results. For that reason a new method was proposed by Kraskov et al. [26], which we call KSG, that only uses the nearest neighbor statistics in the joint space. The resulting mutual information estimate depends on the number k of nearest neighbors used for the local estimation.
The length scale on which the mutual information is estimated by this algorithm depends on the available data. In the limit of an infinite amount of data the estimate is effectively evaluated at a vanishing length scale. However, in order to evaluate the quantity at a certain length scale (similar to ε above), and assuming the same underlying space for X and Y, noise of strength η drawn from a uniform distribution is added to the data. The idea of adding noise is to make the processes X and Y independent within neighborhood sizes below the length scale of the noise. In this way only the structures above the added noise level contribute to the mutual information. Note that for small η the actual scale (the k-neighborhood size) may be larger due to the sparsity of the available data.
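A compact sketch of the KSG estimator (algorithm 1 of [26]) with the optional addition of uniform noise of amplitude η is given below. The implementation details (max-norm trees, the tie-breaking offset, the parameter names, and the toy data) are our own illustrative choices and not code from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mutual_information(x, y, k=4, noise=0.0, rng=None):
    """KSG estimate of I(X;Y) in nats; optionally add uniform noise of amplitude `noise`."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float).reshape(len(x), -1).copy()
    y = np.asarray(y, dtype=float).reshape(len(y), -1).copy()
    if noise > 0:
        x += noise * rng.random(x.shape)
        y += noise * rng.random(y.shape)
    z = np.hstack([x, y])
    n = len(z)
    # distance to the k-th neighbor in the joint space (max-norm, excluding the point itself)
    d_joint = cKDTree(z).query(z, k=k + 1, p=np.inf)[0][:, -1]
    # count neighbors strictly within that distance in each marginal space
    nx = cKDTree(x).query_ball_point(x, d_joint - 1e-12, p=np.inf, return_length=True) - 1
    ny = cKDTree(y).query_ball_point(y, d_joint - 1e-12, p=np.inf, return_length=True) - 1
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))

rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
y = x + 0.5 * rng.standard_normal(5000)
print(ksg_mutual_information(x, y, k=4))             # roughly 0.8 nats for this Gaussian pair
print(ksg_mutual_information(x, y, k=4, noise=2.0))  # added noise washes out the dependence
```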
Estimation via correlation sum. The correlation sum is one of the standard tools in nonlinear time series analysis [36,37]. Normally it is used to estimate the attractor dimension. However, it can also be used to provide approximate estimates of entropies and of derived quantities such as the excess entropy. The correlation entropies of order q for a random variable $\vec{X}$ with measure $\mu$ are defined as [25]
$$H_q(\varepsilon) = \frac{1}{1-q} \log \int \mu\bigl(B_\varepsilon(\vec{x})\bigr)^{q-1}\, \mathrm{d}\mu(\vec{x}), \qquad\qquad (32)$$
where $B_\varepsilon(\vec{x})$ is the “ball” at $\vec{x}$ with radius ε. For $q = 2$, Equation (32) becomes $H_2(\varepsilon) = -\log \int \mu(B_\varepsilon(\vec{x}))\, \mathrm{d}\mu(\vec{x})$. The integral in this formula is also known as the “correlation integral”. For N data points $\vec{x}_i$ it can be estimated using the correlation sum, which is the averaged relative number of pairs in an ε-neighborhood [36,37],
$$C(\varepsilon) = \frac{2}{N(N-1)} \sum_{i<j} \Theta\bigl(\varepsilon - \|\vec{x}_i - \vec{x}_j\|\bigr),$$
where Θ denotes the Heaviside step function. The correlation entropy is then $H_2(\varepsilon) \approx -\log C(\varepsilon)$. For sufficiently small ε it behaves as $C(\varepsilon) \propto \varepsilon^{D_2}$, with $D_2$ being the correlation dimension of the system [38]. A scale-dependent correlation dimension can be defined as the difference quotient
$$D_2(\varepsilon) = \frac{\Delta \log C(\varepsilon)}{\Delta \log \varepsilon}. \qquad\qquad (34)$$
For a temporal sequence of states (or state vectors) we can now define block correlation entropies by applying Equation (32) to the m-dimensional delay vectors. From these we can also define the quantities corresponding to the conditional entropies and to the excess entropy (Equations (35) and (36)). We expect the same qualitative behavior regarding the ε-dependence of these quantities as for those based on Shannon entropies, see Equations (11) and (15). Quantitatively there might be differences, in particular for large ε and for strongly non-uniform measures.
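The following sketch computes the correlation sum with a k-d tree, the corresponding block correlation entropies of delay vectors, and from these the conditional entropies and a truncated excess-entropy proxy. The AR(1) toy series, the max-norm, and the truncation are illustrative assumptions rather than the TISEAN implementation used in the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def delay_embed(x, m, tau=1):
    n = len(x) - (m - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(m)])

def correlation_entropy(points, eps):
    """H2(eps) = -log C(eps); C(eps) is the fraction of distinct pairs closer than eps (max-norm)."""
    tree = cKDTree(points)
    n = len(points)
    pairs = tree.count_neighbors(tree, eps, p=np.inf)   # ordered pairs, including i == j
    return -np.log((pairs - n) / (n * (n - 1)))

rng = np.random.default_rng(0)
x = np.zeros(10000)
for t in range(1, len(x)):
    x[t] = 0.95 * x[t - 1] + 0.1 * rng.standard_normal()

eps = 0.1
H2 = np.array([correlation_entropy(delay_embed(x, m), eps) for m in range(1, 7)])
h = np.diff(H2)                                 # conditional block entropies h_m(eps)
E = (H2[0] - h[-1]) + np.sum(h[:-1] - h[-1])    # truncated excess-entropy proxy
print(h, E)
```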
A comparison of the two methods with analytical results is given in Appendix A, where we find good agreement, although the KSG method seems to underestimate the mutual information for larger embeddings (higher-dimensional state spaces). The correlation integral method uses a ball of diameter $2\varepsilon$, whereas the KSG method measures the size of the hypercube enclosing the k neighbors after the data have been subjected to additive uniform noise of amplitude η. Thus, comparable results are obtained when the two length scales are matched accordingly.
2.4. Algorithm for Excess Entropy Decomposition
We now describe the algorithm used to compute the decomposition of the excess entropy proposed in Section 2.2. The algorithm is composed of several steps: preprocessing, determining the middle term (MT) (Equation (23)), determining the constant in the MT, and calculating the decomposition and the quality measures.
Preprocessing: Ideally the $d_m(\varepsilon)$ curves are composed of straight lines in a log-linear representation, i.e., of the form $a - s \log\varepsilon$. We will refer to s as the slope (it is actually the slope with inverted sign). Thus we perform a fitting procedure that attempts to find segments of this form; details are provided in Appendix B. The data are then substituted by the fits in the intervals where the fits are valid. Since the $d_m$ become very noisy at very small scales, we extrapolate below the fit with the smallest scale. In addition, we calculate the derivative with respect to $\log\varepsilon$ at each point, either from the fits (the parameter s, where available) or from finite differences of the data (using 5-point averaging).
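A minimal sketch of such a log-linear fit follows; the window size, the names, and the synthetic curve are illustrative, and the actual segment-finding procedure is described in Appendix B.

```python
import numpy as np

def fit_log_linear(eps, d, i, window=5):
    """Least-squares fit d ≈ a - s*log(eps) on a window of scales around index i; returns (a, s)."""
    sl = slice(max(0, i - window // 2), min(len(eps), i + window // 2 + 1))
    coef = np.polyfit(np.log(eps[sl]), d[sl], 1)   # d = coef[0]*log(eps) + coef[1]
    return coef[1], -coef[0]

# toy usage with a synthetic curve d(eps) = 1.5 - 0.7*log(eps) plus noise
eps = np.logspace(-3, 0, 40)
d = 1.5 - 0.7 * np.log(eps) + 0.02 * np.random.default_rng(0).standard_normal(len(eps))
print(fit_log_linear(eps, d, i=20))
```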
Determining MT: In theory only two $d_m$ should have a non-zero slope at each scale ε, see Equation (21). However, in practice we often have more such terms, so that we need to find for each ε the maximal range of m for which the slope is larger than a given threshold. This is only valid for deterministic scaling ranges; in stochastic ranges all $d_m$ should have zero slope. We therefore introduce a measure of stochasticity, which is 0 for purely deterministic ranges and 1 for stochastic ones. The separation between state and memory complexity is then inherited from the next larger deterministic range, i.e., for a scale classified as stochastic we use the range of m determined at the smallest larger scale that is classified as deterministic. Note that the crossover scale defined algorithmically here is not necessarily equal to the one defined above Equation (28) for an ideal-typical noisy deterministic system.
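The range selection can be sketched as follows; the threshold value and, in particular, the stand-in formula for the stochasticity indicator are our own assumptions (the paper's exact definition of the indicator is not reproduced here).

```python
import numpy as np

def mt_range(slopes, s_min=0.1):
    """Given slopes[m] of the d_m curves at one scale eps, return the maximal index range
    (m1, m2) with slope above the threshold s_min, or None if all terms are flat."""
    idx = np.flatnonzero(np.asarray(slopes) > s_min)
    return (idx.min(), idx.max()) if idx.size else None

def stochasticity(slopes, s_min=0.1):
    """Illustrative stand-in for the stochasticity indicator: fraction of near-flat terms."""
    slopes = np.asarray(slopes)
    return float(np.mean(slopes <= s_min))

slopes_at_eps = [0.02, 0.65, 0.38, 0.04, 0.01]   # hypothetical slopes of d_1..d_5 at one eps
print(mt_range(slopes_at_eps), stochasticity(slopes_at_eps))
```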
Determining the constant in MT: In order to obtain the scale-invariant constant c of the MT, see Equation (23), we would have to fix a certain length scale at which the deterministic scaling range starts. Since this cannot be done robustly in practice (in particular because it may not be the same for each m), we resort to a different approach. The constant parts of the terms in the MT can be determined from plateaus at larger scales. Thus, for each term we take the constant from the smallest scale at which it has a near-zero slope; in case there is no such scale, the constant part is taken to be zero.
Decomposition and quality measures: The decomposition of the excess entropy follows Equations (24) and (25), with the determined range of m used for splitting the terms (Equation (37)). In addition, we compute several quality measures that indicate the reliability of the results, see Table 1.
Table 1. Quality measures for the decomposition algorithm. We use the Iverson bracket for Boolean expressions: $[\text{true}] = 1$ and $[\text{false}] = 0$. All measures are normalized to $[0,1]$, where typically 0 is the best score and 1 is the worst.
| Quantity | Definition | Description |
| --- | --- | --- |
| κ = stochastic(ε) | | 0: fully deterministic, 1: fully stochastic at ε |
| % negative(ε) | | percentage of negative $d_m$ at ε |
| % no fits(ε) | | percentage of $d_m$ for which no fit is available at ε |
| % extrap.(ε) | | percentage of $d_m$ that were extrapolated at ε |
2.5. Illustrative Example
To get an understanding of the quantities, let us first apply the methods to the Lorenz attractor, as a deterministic chaotic system, and to its noisy version as a stochastic system.
Deterministic system: Lorenz attractor. The Lorenz attractor is obtained as the solution of the following differential equations:
$$\dot{x} = \sigma (y - x), \qquad\qquad (38)$$
$$\dot{y} = x (\rho - z) - y, \qquad\qquad (39)$$
$$\dot{z} = x y - \beta z, \qquad\qquad (40)$$
integrated with the standard parameters $\sigma = 10$, $\rho = 28$, $\beta = 8/3$. Figure 1 displays the trajectory in phase space $(x, y, z)$ and a delay embedding of the x component. The equations were integrated with a fourth-order Runge–Kutta scheme with fixed step size. An alternative time delay may be obtained from the first minimum of the mutual information as a standard choice [39,40]; however, the delay used here yields clearer results for presentation.
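For reproducibility, a self-contained sketch of such a simulation is given below; the step size, the transient cut, and the embedding parameters (m = 3, τ = 5 samples) are illustrative choices, not necessarily those used in the paper.

```python
import numpy as np

def lorenz_rhs(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def integrate_rk4(f, state, dt, steps):
    """Fixed-step fourth-order Runge-Kutta integration."""
    traj = np.empty((steps, len(state)))
    for i in range(steps):
        k1 = f(state)
        k2 = f(state + 0.5 * dt * k1)
        k3 = f(state + 0.5 * dt * k2)
        k4 = f(state + dt * k3)
        state = state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        traj[i] = state
    return traj

traj = integrate_rk4(lorenz_rhs, np.array([1.0, 1.0, 1.0]), dt=0.01, steps=60000)
x = traj[5000:, 0]                    # discard the transient, keep the x component
tau, m = 5, 3
X = np.column_stack([x[i * tau : len(x) - (m - 1 - i) * tau] for i in range(m)])
```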
Figure 1. Lorenz attractor with standard parameters. (a) Trajectory in state space (x, y, z). (b) Delay embedding of the x component.
The result of the correlation sum method using TISEAN [41] is depicted in Figure 2. From the literature we know that the chaotic attractor has a fractal dimension slightly above two [38], which we can verify with the correlation dimension, see Figure 2a for large m and small ε. The conditional block entropies (d) become constant and identical for larger m. The excess entropy (g) approaches the scaling behavior predicted by Equation (16), with the slope determined by the attractor dimension. The predictive information (j) shows the same scaling behavior in the coarse range, but with a smaller slope, and saturates at small ε.
Stochastic system: noisy Lorenz attractor. In order to illustrate the fundamental differences between deterministic and stochastic systems when analyzed with information theoretic quantities, we now consider the Lorenz system with dynamic noise (additive noise applied to the state in Equations (38)–(40) before each integration step), as provided by the TISEAN package [41]. The dimension of a stochastic system is infinite, i.e., for embedding dimension m the correlation integral yields the full embedding dimension, as shown in Figure 2b,c for small ε.
On large scales the dimensionality of the underlying deterministic system can be revealed to some precision, either from the correlation dimension or from the slope of the excess entropy or the PI. Thus, we can identify a deterministic and a stochastic scaling range with qualitatively different scaling behavior of the quantities. In contrast to the deterministic case, the excess entropy converges in the stochastic scaling range and no further contributions from higher embeddings are observed, as shown in Figure 2h,i. By measuring the ε-dependence (slope) we can actually differentiate between deterministic and stochastic systems and scaling ranges, which is performed by the algorithm of Section 2.4 and presented in Figure 3. The predictive information in Figure 2k,l again yields a lower slope, meaning a lower dimension estimate, but is otherwise consistent. However, it does not allow us to safely distinguish deterministic and stochastic systems, because it always saturates due to the finite amount of data.
Figure 2. Correlation dimension (Equation (34)) (a)–(c), conditional block entropies (Equation (35)) (d)–(f), excess entropy (Equation (36)) (g)–(i), and predictive information (Equation (18)) (j)–(l) of the Lorenz attractor, estimated with the correlation sum method (a)–(i) and KSG (j)–(l), for the deterministic system (first column) and with absolute dynamic noise of two different strengths (second and third column, respectively). The error bars in (j)–(l) show the values calculated on half of the data. All quantities are given in nats (natural unit of information with base e) and in dependence on the scale (ε, respectively η, in the space of x (Equation (38))) for a range of embeddings m, see color code. In (d)–(f) the fits allow determining the quantities entering Equations (27)–(30). Parameters: delay embedding of x.
Figure 3. Excess entropy decomposition for the Lorenz attractor into state complexity (blue shading), memory complexity (red shading), and the ε-dependent part (beige shading), all in nats. Columns as in Figure 2: deterministic system (a), and with absolute dynamic noise of two different strengths (b,c). A set of quality measures and additional information is displayed at the top, using the right axis; the labels refer to the terms in Equation (37). Scaling ranges identified as stochastic are shaded in gray (stochastic indicator). In manually chosen ranges, marked with vertical black lines, we evaluate the mean and standard deviation of the memory and state complexity.
Let us now consider the decomposition of the excess entropy as discussed in Section 2.2 and Section 2.4 to see whether the method is consistent. Figure 3 shows the scale-dependent decomposition together with the determined stochastic and deterministic scaling ranges. The resulting values for the deterministic scaling range are given in Table 2 and reveal that the estimated complexities decrease with increasing noise level, similarly to the reduction of the constant in Equation (16). This is a consistency check, because the intrinsic scales (amplitudes) of the deterministic and the noisy Lorenz system are almost identical. We also checked the decomposition for a scaled version of the data, where we found the same values for the state, memory, and core complexity. We also see that the state complexity vanishes for larger noise, because the estimated dimension drops significantly below 2, in which case there are no summands in the state complexity and its constant part is also 0.
Table 2. Decomposition of the excess entropy for the Lorenz system. Presented are the unrefined constant and dimension of the excess entropy (Equation (16)) and the state, memory, and core complexity (in nats) determined in the deterministic scaling range, see Figure 3.
| | | Determ. | Noise | Noise | Noise |
| --- | --- | --- | --- | --- | --- |
| Equation (16) | D | 2.11 | 1.95 | 1.68 | 1.59 |
| Equation (16) | const | 4.62 | 4.55 | 4.37 | 4.21 |
| Equation (27) | | 0 | 0.12 | 0.23 | 0.45 |
| Equation (24) | state complexity | 0.68 ± 0 | 0.68 ± 0 | 0 | 0 |
| Equation (25) | memory complexity | 1.86 ± 0.06 | 1.27 ± 0.02 | 1.98 ± 0.11 | 1.2 ± 0.1 |
| Equation (26) | core complexity | 2.54 ± 0.06 | 1.95 ± 0.02 | 1.98 ± 0.11 | 1.2 ± 0.1 |
To summarize, in deterministic systems with a stochastic component, we can identify different ε-ranges (scaling ranges) within which stereotypical behaviors of the entropy quantities are found. This allows us to determine the dimensionality of the deterministic attractor and decompose the excess entropy in order to obtain two ε-independent complexity measures.