Selection of Embedding Dimension and Delay Time in Phase Space Reconstruction via Symbolic Dynamics

The modeling and prediction of chaotic time series require proper reconstruction of the state space from the available data in order to successfully estimate invariant properties of the embedded attractor. Thus, one must choose appropriate time delay τ∗ and embedding dimension p for phase space reconstruction. The value of τ∗ can be estimated from the Mutual Information, but this method is rather cumbersome computationally. Additionally, some researchers have recommended that τ∗ should be chosen to be dependent on the embedding dimension p by means of an appropriate value for the time delay τw=(p−1)τ∗, which is the optimal time delay for independence of the time series. The C-C method, based on Correlation Integral, is a method simpler than Mutual Information and has been proposed to select optimally τw and τ∗. In this paper, we suggest a simple method for estimating τ∗ and τw based on symbolic analysis and symbolic entropy. As in the C-C method, τ∗ is estimated as the first local optimal time delay and τw as the time delay for independence of the time series. The method is applied to several chaotic time series that are the base of comparison for several techniques. The numerical simulations for these systems verify that the proposed symbolic-based method is useful for practitioners and, according to the studied models, has a better performance than the C-C method for the choice of the time delay and embedding dimension. In addition, the method is applied to EEG data in order to study and compare some dynamic characteristics of brain activity under epileptic episodes


Introduction
For the theory of state space reconstruction suggested by Packard, Takens et al. [1,2] is the base for data-driven analysis and prediction of chaotic systems. It can be proved through Taken's theorem [2] that the strange attractor of the chaotic systems could be properly recovered from only one projection of the dynamic system. The fundamental theorem of reconstruction of Takens establishes a sufficient condition (but not necessary) given by p ≥ 2d + 1, where d is the fractal dimension of the underlying chaotic attractor, and p stands for the embedding dimension used for phase space reconstruction. Nevertheless, no condition is given regarding the time delay.
A popular method for state space reconstruction is the method of delays. It consists of embedding the observed scalar time series {X t } t∈I in one p-dimensional space X τ p (t) = (X t , X t+τ , . . . , X t+(p−1)τ ) for t ∈ I, where τ is the time delay for the reconstruction, p is the embedding dimension, and I is a set of time indexes of cardinality T. Notice that the number of points inserted in the p-dimensional space is M = T − (p − 1)τ and all dynamic properties, such as dependencies, periodicity, and complexity changes, can be extracted from it. That is, there is a differentiable homomorphism from the orbits of the chaotic attractor in the reconstructed space R p to the original system.
The selection of the parameters p and τ * is a challenge. An improper choice can result in a spurious indication of a nonlinear complex structure when the system is linear. Albeit specialized literature provides different methods to select the parameters for state space reconstruction, none of them that are superior to the remaining ones in all aspects. In general, the optimal strategy for parameter selection will depend on the time series and a complexity measure (e.g., Lyapunov Exponents or Correlation Dimension). There are two different approaches to the selection of the parameters p and τ * . The first approach considers that p and τ * are selected independently from each other. For example, the G-P algorithm for the selection of p proposed by Albano et al. [3] and different proposals for the selection of the time delay τ * based on Mutual Information [4], autocorrelation and highorder correlations [5], filling factor [6], wavering product [7], average displacement AD [8], and multiple autocorrelation [9]. The second approach considers that the parameters p and τ * are closely related when the time series under consideration is noisy and has finite length. A great number of experiments indicate that p and τ * are related with the time delay for independence of the time series through τ w = (p − 1)τ * . Therefore, a bad selection of the parameters will directly impact the equivalence between the original system and the reconstructed phase space. Thus, some authors are in favor of jointly selecting p and τ * as, for example, the small-window solution [10], C-C method [11], and automated embedding [12].
Many researchers consider that the second approach (joint selection) is more reasonable than the first one (independent selection) in the engineering practice. They consider that the estimation of mutual information is rather cumbersome computationally, whereas the autocorrelation function only accounts for the linear dependence and therefore does not properly treat the presence of nonlinearities. On the other hand, the C-C method suggested by Kim et al. [11] is the most popular, which provides the delay time τ * and embedding dimension p simultaneously by using correlation integral, and it has the advantage of low complexity and robustness in finite samples [13].
In the present paper, we propose a new method for selection p and τ * based on symbolic dynamics and Information Theory. Symbolic Dynamics studies dynamical systems on the basis of the symbol sequences obtained for a suitable partition of the state space. The basic idea behind symbolic dynamics is to divide the phase space into finite number of regions and label each region by an alphabetical letter. In this regard, symbolic dynamics is a coarse-grained description of dynamics. Even though coarse-grained methods lose a certain amount of detailed information, some essential features of the dynamics may be kept, e.g., periodicities and dependencies, among others. Symbolic dynamics has been used for investigation of nonlinear dynamical systems (central references for the interested reader are [14][15][16][17][18]; for an overview, see Hao and Zheng [19]). In general terms, there is a broad agreement in that symbolization can increase the efficiency of finding and even quantifying information for characterizing and recognizing temporal patterns (see [20] for a review on experimental data). The process of symbolizing a time series is based upon the method of delay time coordinates, introduced by Takens, in order to carry out the phase space reconstruction.
Since the methods of state space reconstruction are based to some extent on detection of delays for which there is some sort of dependence (linear or nonlinear), and Symbolic Dynamics has been used as a statistical tool to detect the presence of dependence in time series [21]; symbolic dynamics is a suitable tool to select the optimal state space reconstruction parameters of chaotic time series.
Thus, we will select p and τ * by translating the problem into symbolic dynamics and then we use a entropy measure associated with the symbols space (symbolic entropy) as a tool for the parameter selection. On the one hand, we have compared the performance of the proposed method with other available methods. Results seem to be in favor of this proposal. On the other hand, and from an empirical point of view, we have applied it to EEG data, which allows for understanding some dynamic characteristics of brain activity under epileptic episodes.
The rest of the paper is structured as follows: in Section 2, we introduce the basic concepts of symbolic analysis, and we also provide a symbolization procedure that works for estimation of the parameters for Phase Space reconstruction. In Section 3, we show the performance of the symbolic method to estimate phase space reconstruction parameters, and we compare it with the well known Mutual Information based methods and C-C method. In Section 4, the new techniques presented in this paper are applied to a real EEG database obtained from the University of Bonn and studied well for understanding epileptic phenomena. Finally, Section 5 presents conclusions.

Definitions and Symbolization Procedure
In this section, we will introduce some definitions and basic notations referred to symbolic dynamics.
Let {X t } t∈I be a real valued time series. We will use symbolic analysis to study the state space reconstruction parameters associated with it. Symbolic analysis in our context is a coarse-grained approach to the study of time series, consisting of embedding the time series in a p-dimensional space, then constructing a partition of this p-dimensional space and labeling each set of the partition with a symbol. Therefore, all the p-dimensional vectors belonging to the same set of the partition are labelled with the same symbol. Afterwards, with information theory based measures, we will study the distribution of the symbols that will help us in the estimation of the parameters for state space reconstruction.
More concretely, in mathematical terms, given a positive integer p ≥ 2, and a time delay τ, we consider that the time series is embedded in an p-dimensional space as follows: The parameter p is usually known as embedding dimension and X τ p (t) p, τ-history. Next, given a positive real number and in order to provide a partition of R p , we define for any element v = (v 1 , v 2 , . . . , v p ) ∈ R p the following indicator function: That is, δ ij (v) = 1 if and only if its entries v i and v j satisfy that |v i | and |v j | are both either smaller or greater than . Let the set Γ p be the set of cardinality 2 p−1 formed by the vectors of length p − 1 with entries in the set {0, 1}. Then, we can define a map f : .
The map f defines an equivalence relation in R p such that v 1 v 2 if and only if f (v 1 ) = f (v 2 ). Therefore, this equivalence relation provides a partition of R p in 2 p−1 disjoint sets. Each of these sets is labeled with an element of Γ p . The elements in Γ p are called symbols and f symbolization map. In general, if π ∈ Γ p is a symbol and v ∈ R p is such that f (v) = π, then we will say that v is of π-type. Next, we are interested in the application of the symbolization map f to the p, τ- is a vector (symbol) whose i-th entry provides information on whether |X t | and |X t+i | are both either smaller or greater than . Then, we want to extract information on the dynamics of the time series {X t } t∈I by using information theory based measures on its associated symbols distribution . More concretely, we can estimate the probability of a symbol π ∈ Γ p as Now, under this setting, given a time delay τ and embedding dimension p ≥ 2, we can define the symbolic entropy of a time process {X t } t∈I . This entropy is defined as the Shannon's entropy [22] of the 2 p−1 distinct symbols as follows: Symbolic entropy h(p, τ) is the information contained in comparing p, τ-histories generated by the time process. Notice that 0 ≤ h(p, τ) ≤ ln(n), where the lower bound is attained when only one symbol occurs, and the upper bound for a completely random system (i.i.d temporal sequence) where all possible symbols appear with the same probability. Then, if τ = τ * is an optimal time delay, for a positive integer k, the dependence between X t and X t+kτ * vanishes, and hence the symbolic entropy associated with the time series {X t } should be maximum. Therefore, in order to select the optimal time delay τ * , we select the first τ satisfying With respect to the optimal embedding window τ w = (p − 1)τ * , this can be associated with the mean orbital period P w of low-dimensional chaotic systems that shows pseudoperiodicity. That is, P w can be considered the time dependence of the chaotic time series. Although the chaotic systems oscillate without periodicity, low dimensional chaotic systems show pseudo-periodicity. The mean orbital period could naturally be associated with the mean time between two consecutive visits to a Poincare section [23]. For the time series with mean orbital period P w , all points at a time multiple of P w are in the same Poincare section in phase space. Therefore, a local minimum of symbolic entropy h(p, τ) is reached for τ = kP w and thus To finish this section, we are going to illustrate the symbolization procedure with an easy example. Let {X t } t∈I be the following finite time series: and assume that = 3, τ = 1 and p = 3. Then, the symbols' set remains as Under this setting, we can construct the following five p, τ-histories: 9); and X 1 3 (5) = (−1, 9,14). Then, the symbolization map f associate each p, τ-history to a symbol. Concretely, f 3 (X 1 3 (1) = (1, −7, −12)) = (0, 0) because the first entry of the m-history, 1 is in absolute value smaller than = 3 while the second and the third are both greater than = 3, and hence the agreement indicator that defines the symbolization map takes the value 0. Similarly, we find that is of (0, 1)−type, and X 1 3 (5) is of (0, 0)−type. Thus, we can estimate the symbols distribution by its relative frequency p((0, 0)) = 2 5 , p((0, 1)) = p((1, 0)) = p((1, 1)) = 1 5 and the entropy associated with them

Selection of p and for Finite Sample Sizes
When determining the parameters of phase space reconstruction of a finite chaotic time series by using the symbolic entropy, one needs to select in advance the values of p and . In addition, sample size T also plays an important role. In [21], some general criteria are recommended to select the embedding dimension p and sample size T in order to compute the symbolic entropy. First, the sample size T should be as larger than the number of symbols 2 p−1 of the symbolization map f . Second, from a statistical point of view, data sets must contain at least five times the number of possible events or symbols. Thus, the embedding dimension will be the largest positive integer p that satisfies 5 · 2 p−1 ≤ T.
To select , we propose to use a data driven method which is based on symbolic entropy. Particularly, we partially rely on the methodology described in [24] based on the construction of peak detection functions (FPs). The selected will be the largest that locally maximizes the absolute value of a pick function FP(i, x i ), where FP associates values to the symbolic entropy of a time series. More concretely, define and allows for selecting the time delay τ for which value h(p, τ) is maximum (respectively minimum) in the neighborhood of (τ − k, τ + k). As stated in [24], values of k in the range [3,5] are usually suitable. Notice that, by construction, 0 < < max{X t }. Then, the selected parameter, namely * , will be the one in the interval (0, max{X t }) that satisfies

Simulation Analysis
The following examples illustrate the performance of the proposed symbolic method when estimating the parameters time delay τ * and embedding dimension p for phase space reconstruction of a chaotic time series. The aim of this set of simulations is, firstly, to empirically evaluate the performance of the new symbolic procedure to select the "correct" parameters. Secondly, we aim to compare with the symbolic method with other competitive available methods that have been commented in the introductory section and that are fully documented in the bibliographical references of this paper.
To this end, we extract univariate time series {X t } of length T = 3000 from five chaotic systems that have been extensively studied. In all cases, we set the embedded time series in a six-dimensional space that is p = 6. To evaluate the performance of the novel symbolic method, we compare the performances of the new method with other available selection methods: the C-C method (C-C), the Nearest Neighbor method, and the method based on the first minimum of the autocorrelation function (FAC) selection parameters of these last two methods are based on the Mutual Information (MI) criteria.
Scientific literature has shown that the C-C method has a good performance when used for selecting time delays and embedding dimensions. Thus, the C-C method can be thought of as a natural competitor and therefore it is worth comparing the performance against it. For this reason, we will compare results for several well-known dynamic systems. In order to compare and evaluate the performance of each method, we will use the selected parameters of each method for reconstructing the attractor and estimating two complexity measures of each system. These two measures are theoretically known for each of the three systems, and therefore they are used as a base of comparison. A final user will prefer using reconstruction parameters that lead to estimations that are as close as possible to the theoretical ones.
Accordingly, we will use the following systems to conduct the comparisons: • Lorenz system [25]: . z = xy − bz The time series was obtained by projecting the x-coordinate of the system defined by the parameters a = 16, c = 45.92, b = 4 , with an integral step 0.01, and initial conditions x 0 = −1, y 0 = 0 y z 0 = 1. The computed optimal ratio is * = 1.2σ x , where σ x is the standard deviation of the chaotic time series under consideration. For this optimal radio, Figure 1 illustrates the series of the normalized symbolic entropy h(6, τ)/6 as a function of the time delay τ for the Lorenz system. Clearly, we observe that the first local maximum is attained at τ * = 12 and the minimum at τ w = 46. Then, an estimated value of embedding dimension p can be computed by solving τ w = (p − 1)τ * obtaining an approximate value of p = 5. For the Mutual Information method, the optimal time delay was τ * = 11, while, for the C-C method, the estimated parameters were τ * = 10 τ w = 100 and p = 11. Notice that the optimal time delay τ * estimated by the three methods are quite close to each other while the estimated time delay window τ w strongly differs between C-C and symbolic methods: • Rossler system [26]: .
The time series was obtained by projecting the x-coordinate of the system defined by the parameters d = 0.15, e = 0.2 and f = 10, with an integral step 0.05, and initial conditions x 0 = −1, y 0 = 0 and z 0 = 1. The computed optimal radio is * = 0.4σ x , where σ x is the standard deviation of the chaotic time series under consideration. Figure 2 shows the normalized symbolic entropy h(6, τ)/6 as a function of the time delay τ for the Rossler system. It can be seen that the selected parameters by the symbolic method are τ * = 18 and τ w = 121, and consequently the estimated value for p is 8. The estimated time delay for Mutual Information method is τ * = 20. For the C-C method, the estimated parameters are τ * = 17 and τ w = 191. Again, the optimal time delay τ * estimated by the three methods is very similar while the time delay window τ w estimated by the C-C method is much different than the one estimated with symbolic entropy. • Duffing System [27]: .
The time series was obtained by projecting the x-coordinate of the system defined by the parameters g = 0.05, k = 0.25, l = 7.5 and v = 1, with an integral step 0.05, and initial conditions x 0 = −1, y 0 = 0 y z 0 = 1. The computed optimal radio is * = 0.275σ x , where σ x is the standard deviation of the chaotic time series under consideration. Figure 3 shows the normalized symbolic entropy h(6, τ)/6 as a function of the time delay τ for the Duffing system. The estimated optimal time delay and time delay window with the symbolic method are τ * = 14 and τ w = 126, respectively. Then, the estimated embedding dimension is p = 10. As in the previous examples, the estimated time delay for the Mutual Information method (τ * = 12) and for the C-C method (τ * = 12) are fairly close to the one estimated by symbolic method. Again, the time delay window estimated based on the C-C method τ w = 161 is far from the one estimated with symbolic method. These first three models are well-known and well-studied and have served as a base of comparison of new techniques for similar aims as this paper. In order to complete this analysis, we have also considered the next two models that we refer to as the Mackey-Glass model and Chen model: • Mackey-Glass system [28]: The time series was obtained by fixing parameters a = 0;2, b = 0;1, c = 10, y = 17, with initial conditions x(t < 0) = 0 y x(t = 0) = 1;2. The first 2000 iterations were discarded. The computed optimal radio is * = 0.79σ x , where σ x is the standard deviation of the chaotic time series under consideration. The estimated optimal time delay and time delay window with the symbolic method are τ * = 13 and τ w = 49, respectively. Figure 4 shows the normalized symbolic entropy h(6, τ)/6 as a function of the time delay τ for the Mackey-Glass system Then, the estimated embedding dimension is p = 5. In this case, the estimated time delay for the Mutual Information method (τ * = 12) and for the C-C method (τ * = 14) are fairly close to the one estimated by symbolic method (τ = 13). Again, the time delay window estimated based on the C-C method τ w = 166 is far from the one estimated with symbolic method(τ w = 49): The time series was obtained by projecting the x-coordinate of the system defined by the parameters a = 35, b = 3, c = 28, with an integral step 0.01, and initial conditions x 0 = −1, y 0 = 0 y z 0 = 1. The first 2000 iterations were discarded. The computed optimal radio is * = 0.89σ x , where σ x is the standard deviation of the chaotic time series under consideration. The estimated optimal time delay and time delay window with the symbolic method are τ * = 11 and τ w = 60, respectively. Figure 5 shows the normalized symbolic entropy h(6, τ)/6 as a function of the time delay τ for the Chen system. Then, the estimated embedding dimension is p = 6. As in the previous examples, the estimated time delay for the Mutual Information method (τ * = 10) and for the C-C method (τ * = 9) are fairly close to the one estimated by symbolic method. Again, the time delay window estimated based on the C-C method τ w = 104 is far from the one estimated with symbolic method (τ w = 60).    Table 1 summarizes for each method the estimated parameters for phase space reconstruction of the five systems. Bold is reserved for the results obtained with the new selecting method. In order to check whether the symbolic method is reliable when estimating the parameters for phase space reconstruction, τ * , τ w , and p, we will compute, based on this estimation, two complexity measures for each one of the systems that needs these parameters for its computation. These complexity measures are the largest Lyapunov exponent LLE [30], which is a measure of the complexity of the time process, and the Correlation Dimension D [31], which is a measure of the dimension of the space occupied by the chaotic attractor. For the computation of these two geometric invariants, the time delay τ * and embedding dimension p are essential parameters, and a bad selection of them would produce a big bias in LLE and D. The largest Lyapunov exponent LLE for the five systems have been computed in [27,[32][33][34][35] and the Correlation Dimension D in [32,33,36,37]. Furthermore, we have completed the study by increasing the sample size to 10,000 observations. Tables 2  and 3 show the values of LLE and D based on the values of the estimated parameters τ * , τ w and p with symbolic and C − C methods together with the reference values, respectively. Table 2. Largest Lyapunov exponent LLE based on the estimation of the phase space parameters τ * , τ w and p if C-C method and symbolic method, together with the reference true value. Values in parentheses report estimated LLE for series of 10,000 observations.  We can observe the estimated values of the largest Lyapunov exponent and Correlation dimension based on the Symbolic method in Tables 2 and 3, respectively. Importantly, these symbolic-based estimations are very close to their reference (theoretical) values, regardless of the sample size, which suggests the good behavior of the new method for reconstruction of the dynamics of the system. On the other hand, we were wondering if the symbolic method is competitive with its main competitor, namely, the C-C method. In this regard, we can devise that the estimated values for the Lyapunov Exponents are clearly in favor of the Symbolic method as the estimation is closer to the theoretical reference value than in the case of the C-C method. This is true for the five systems. Similar conclusions can be obtained from the results regarding correlation dimension: the symbolic-based estimated dimensions are closer to the true value than C-C estimation, regardless the studied system. On the other hand, methods based on nearest neighbors and autocorrelation function are reported. Results show the symbolic based method also has better empirical behavior. All of these results could be explained by a wrong selection of delay time window τ w by the C-C method as stated in [23,[38][39][40].

EEG Dynamics under Epileptic Activity
The Electroencephalogram (EEG) is a spontaneous bioelectricity activity that is produced by the central nervous system. Therefore, EEG can be understood as a representative signal containing information about the activity of the brain. Currently, EEG is widely used in clinic and neuralelectricity physiological research. The shape of the waves may contain useful information about the state of the brain. EEG does include abundant information about the state and change of the neural system.
The dynamics of brain activity is considered to be of a nonlinear nature. Accordingly, EEG signals are studied by means of nonlinear dynamic tools. Indeed, a large body of studies have reported that the EEG was derived from chaotic systems [41][42][43][44].
In this section of the paper, we apply the symbolic-based approach for reconstruction of dynamics generated by empirical EEG recording from a public dataset by the University of Bonn [41]. Epilepsy is characterized by recurring seizures in which abnormal electrical activity in the brain causes the loss of consciousness or a whole body convulsion. From this point of view, our results will contribute to the empirical analysis of role on nonlinear dynamics in epileptology. The Bonn University EEG database is comprised of five types of EEG signals (EEG recordings from healthy volunteer with eyes open and closed, epilepsy patients in the epileptogenic zone during a seizure-free interval and in an opposite brain zone, and epilepsy patients during epileptic seizures) were studied.
To conduct this empirical analysis, we firstly use Theiler's method of surrogate data to distinguish between linearity and nonlinearity. To do so, the null hypothesis of linearity is tested against nonlinearity [45]. Chaos cannot come from a linear signal. Secondly, we test for chaoticity against pure stochasticity. Linear signals are expected to be of stochastic nature while nonlinear signals can come from either a stochastic process or a pure chaotic one. The statistical test for chaos [46] tests the null hypothesis of chaos versus the alternative of stochastic process. We also estimate correlation dimension using Theiler's approach in order to exclude time correlated states in the correlation integral estimation [47].  Table 4 collects the outcomes of all procedures. Results firstly indicate that the brain's activity is of a nonlinear nature for a healthy person with open eyes and for records of epileptic person regardless if s/he is under seizure activity or not, whenever measurement is done in the epileptogenic zone. The test for chaos applied to nonlinear signals helps to conclude that only the nonlinear dynamics found for epileptic patients are statistically compatible with chaotic dynamics, while the dynamics are nonlinear stochastic for a healthy person with open eyes. Finally, the estimated correlation dimensions show how (correlation) dimension is reduced as the process moves from stochastic to chaotic, as expected.
These results support the nonlinear deterministic structure of brain dynamics related to epileptic activity as earlier reported in [48,49]. Our estimates of correlation dimensions are in line with other previous studies [50] on the same dataset, although with different parameter configurations. Thus, the conclusion in this regard is that epileptic seizures are emergent states with reduced dimensionality compared to non-epileptic activity. This is in line with the clinical common knowledge that establishes that healthy systems evolve with time and their adaptive capability is higher, resulting in higher complexity. On the other hand, the alternations in the structural components and/or decreased functional capability of the subsystem cause dysfunction in the regularity mechanism of the overall system, which results in the loss of complexity, as indicated in [51,52].

Conclusions
In this paper, we have introduced a new method based in Symbolic Dynamics, for the estimation of the phase space reconstruction parameters τ * , τ w and p. In the simulation analysis, we applied the Symbolic method to choose the phase space reconstruction parameters from the time series generated from several dynamical models that have been well-studied and used for evaluating the ability of different reconstruction methodologies. The values found for τ * agree well with those found for the mutual information and the C-C method. The values found for τ w do not agree with the values estimated by the C-C method. For this reason, in order to clarify which method for selecting phase space reconstruction parameters is more reliable, we use them in the computation of two complexity measures, namely largest Lyapunov exponent (LLE) and Correlation dimension (D). Results indicate that the parameters estimated by the Symbolic method produces a closer approach to reference (theoretical) values of LLE and D than the C-C method. Finally, the proposed method is used to study the dynamics of brain activity under epilepsy by means of real EEG signals. The empirical results hint that epileptic patients show chaotic dynamics in EEG signals. Furthermore, our results are statistically significant and therefore hint the potential of symbolic based tools in distinguishing healthy and epileptic subjects.