Permutation Entropy: New Ideas and Challenges

Over recent years, some new variants of Permutation entropy have been introduced and applied to EEG analysis, including a conditional variant and variants using some additional metric information or being based on entropies that are different from the Shannon entropy. In some situations, it is not completely clear what kind of information the new measures and their algorithmic implementations provide. We discuss the new developments and illustrate them for EEG data.


Introduction
The concept of Permutation entropy introduced by Bandt and Pompe [1] in 2002 has been applied to data analysis in various disciplines (compare, e.g., the collection [2] and the papers [3,4]). The Permutation entropy of a time series is the Shannon entropy of the distribution of ordinal patterns in the time series (see also [5]). Such ordinal patterns, describing order types of vectors, are coded by permutations. Denoting the set of permutations of {0, 1, ..., d} for d in the natural numbers N by Π_d, we say that a vector (v_0, v_1, ..., v_d) ∈ R^{d+1} has ordinal pattern π = (r_0, r_1, ..., r_d) ∈ Π_d if

$$v_{r_0} \ge v_{r_1} \ge \dots \ge v_{r_{d-1}} \ge v_{r_d}$$

and r_{l−1} > r_l whenever v_{r_{l−1}} = v_{r_l}.
Definition 1. The empirical Permutation entropy (ePE) of order d ∈ N and of delay τ ∈ N of a time series (x_t)_{t=0}^{N−1} with N ∈ N is given by

$$\mathrm{ePE}\bigl(d, \tau, (x_t)_{t=0}^{N-1}\bigr) = -\frac{1}{d} \sum_{\pi \in \Pi_d} p_\pi^\tau \ln p_\pi^\tau, \tag{1}$$

where

$$p_\pi^\tau = \frac{\#\{t \in \{d\tau, d\tau + 1, \dots, N - 1\} \mid (x_t, x_{t-\tau}, \dots, x_{t-d\tau}) \text{ has ordinal pattern } \pi\}}{N - d\tau} \tag{2}$$

is the relative frequency of ordinal patterns π in the time series and 0 ln 0 is defined by 0 (compare [6]).
The vectors (x_t, x_{t−τ}, ..., x_{t−dτ}) related to t, d and τ are called delay vectors.
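The definition above can be turned into a few lines of code. The following is a minimal sketch, assuming that the ePE is the Shannon entropy of the ordinal pattern frequencies divided by d, that the frequencies are taken over the N − dτ delay vectors, and that ties are broken by putting the larger index first. The function names `ordinal_pattern`, `pattern_frequencies` and `epe` are chosen here for illustration only and do not refer to the algorithms of [21].

```python
import math
from collections import Counter


def ordinal_pattern(v):
    """Return (r_0, ..., r_d) with v[r_0] >= v[r_1] >= ... >= v[r_d].

    Assumption: for equal values the larger index comes first.
    """
    return tuple(sorted(range(len(v)), key=lambda i: (-v[i], -i)))


def pattern_frequencies(x, d, tau=1):
    """Relative frequencies of the ordinal patterns of the delay vectors
    (x[t], x[t - tau], ..., x[t - d*tau]) for t = d*tau, ..., N - 1."""
    n_vectors = len(x) - d * tau
    if n_vectors <= 0:
        raise ValueError("time series too short for this order and delay")
    counts = Counter(
        ordinal_pattern([x[t - l * tau] for l in range(d + 1)])
        for t in range(d * tau, len(x))
    )
    return {pattern: c / n_vectors for pattern, c in counts.items()}


def epe(x, d, tau=1):
    """Empirical Permutation entropy, normalized by the order d."""
    freqs = pattern_frequencies(x, d, tau)
    # Patterns that do not occur are absent from freqs, matching 0 ln 0 := 0.
    return -sum(p * math.log(p) for p in freqs.values()) / d


# A strictly increasing series shows a single ordinal pattern only.
monotone = list(range(10))
alternating = [0, 1, 0, 1, 0, 1, 0, 1]
print(epe(monotone, d=2, tau=2), epe(alternating, d=1))
```

Counting with a dictionary keeps memory proportional to the number of patterns that actually occur, which is usually far below (d + 1)! for short series.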
In contrast to some other presentations of Permutation entropy in the literature, the Shannon entropy obtained from the ordinal pattern distribution is normalized here by dividing by d. This is justified by a result of Bandt et al. [7], which roughly states that, for time series from a special class of systems, (1) approaches a non-negative real number for τ = 1 and d → ∞. How large the corresponding class is remains an open question. Normalizing as given allows comparing Permutation entropies for different orders. (Some further discussion on this point follows in this paper.) Another usual normalization is given by dividing the Shannon entropy by its maximal possible value ln((d + 1)!), since there are (d + 1)! ordinal patterns of order d.
In order to be more flexible in data analysis, various modifications of Permutation entropy have been developed in recent years. One class of such modifications adds some metric information on the corresponding delay vectors to the ordinal patterns derived. Entropy measures of that type are the fine-grained Permutation entropy proposed by Liu and Wang [8], the weighted Permutation entropy introduced by Fadlallah et al. [9], and the robust Permutation entropy introduced by Keller et al. [10]. Bian et al. [11] have adapted Permutation entropy to the situation of delay vectors with equal components, which are relevant when the number of possible values in a time series is small. Other variants use Tsallis or Renyi entropies instead of the Shannon entropy, which goes back to the work of Zunino et al. [12] and Liang et al. [13], or integrate information from different scales. The latter was done by Li et al. [14], Ouyang et al. [15] and Azami and Escudero [16] on the basis of averaging the original data on different scales, and by Zunino et al. [17] and Zunino and Ribeiro [18] by considering a whole spectrum of delays τ.
Unakafov and Keller [19] have proposed the Conditional entropy of (successive) ordinal patterns, which has been shown to perform better than the Permutation entropy itself in many cases.
In this paper, we discuss some of the new ("non-weighting") developments in Permutation entropy. We first give some theoretical background in order to justify and to motivate conditional variants of Permutation entropy. Then we take a closer look at Renyi and Tsallis modifications of Permutation entropy. The last part of the paper is aimed at the classification of electroencephalography (EEG) data from the viewpoint of epileptic activity. Here, we point out that ordinal time series analysis combined with other methods and machine learning could be promising for data analysis. We rely on EEG data reported in [6,20] and use algorithms developed in Unakafova and Keller [21].

The Kolmogorov-Sinai Entropy
In order to achieve a better understanding of Permutation entropy, we take a modeling approach. For this, we choose a discrete dynamical system defined on a probability space (Ω, B, P). The elements ω ∈ Ω are considered as the (hidden) states of the system, the elements of the σ-algebra B as the events one is interested in, and the probability P measures the chances of such events taking place. The dynamics of the system are given by a map T : Ω → Ω that is B-B-measurable, i.e., satisfies T^{−1}(B) ∈ B for all B ∈ B. The map T is assumed to be P-invariant, meaning that P(T^{−1}(B)) = P(B) for all B ∈ B, in order to guarantee that the distribution of the states of the system does not change under the dynamics.
For simplicity, during the whole paper, we assume that (Ω, B, P) and T are fixed; specifications are given directly where they are needed.
In dynamical systems, the Kolmogorov-Sinai entropy based on entropy rates of finite partitions is a well-established theoretical complexity measure. To explain this concept, consider a finite partition C = {C_1, C_2, ..., C_l} ⊂ B and note that the (Shannon) entropy of such a partition is defined by

$$H(\mathcal{C}) = -\sum_{i=1}^{l} P(C_i) \ln P(C_i).$$

For A = {1, 2, ..., l}, the set A^k of words a_1 a_2 ... a_k of length k over A provides a partition C_k of Ω into pieces

$$C_{a_1 a_2 \dots a_k} = \{\omega \in \Omega \mid T^{\circ (i-1)}(\omega) \in C_{a_i} \text{ for } i = 1, 2, \dots, k\}.$$

(In this paper, T^{∘l} denotes the l-fold concatenation of T, where T^{∘0} is the identity map.) The distribution of word probabilities P(C_{a_1 a_2 ... a_k}) contains some information on the complexity of the system, which, as the word length k approaches ∞, provides the entropy rate

$$\mathrm{EntroRate}(\mathcal{C}) = \lim_{k \to \infty} \frac{1}{k} H(\mathcal{C}_k) = \lim_{k \to \infty} \bigl(H(\mathcal{C}_k) - H(\mathcal{C}_{k-1})\bigr) \tag{3}$$

of C. The latter can be interpreted as the mean information per symbol. Here, let H(C_0) := 0. Note that both sequences are monotonically non-increasing. In order to have a complexity measure that does not depend on a fixed partition and measures the overall complexity, the Kolmogorov-Sinai entropy (KS entropy) is defined by

$$\mathrm{KS} = \sup_{\mathcal{C} \text{ finite partition}} \mathrm{EntroRate}(\mathcal{C}).$$

For more information, in particular concerning formula (3), see for example Walters [22]. By its nature, the KS entropy is not easy to determine and to obtain from time-dependent measurements of a system. One important point is to find finite partitions supporting as much information on the dynamics as possible. If there is no feasible so-called generating partition containing all information (for a definition, see Walters [22]), this is not easy. The approach of Bandt and Pompe builds up appropriate partitions based on the ordinal structure of a system. Let us explain the idea in a general context.
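The two finite-k approximations of the entropy rate, the averaged block entropy and the increment of block entropies, can be illustrated for a symbol sequence. The following is only a sketch under the assumption that word probabilities are replaced by relative word frequencies in a single observed sequence; the names `block_entropy` and `entropy_rate_estimates` are hypothetical.

```python
import math
from collections import Counter


def block_entropy(symbols, k):
    """Shannon entropy (natural logarithm) of the empirical distribution
    of words of length k in the symbol sequence."""
    if k == 0:
        return 0.0  # convention H(C_0) := 0
    n_words = len(symbols) - k + 1
    if n_words <= 0:
        raise ValueError("sequence shorter than word length")
    counts = Counter(tuple(symbols[i:i + k]) for i in range(n_words))
    return -sum(c / n_words * math.log(c / n_words) for c in counts.values())


def entropy_rate_estimates(symbols, k):
    """Return (H_k / k, H_k - H_{k-1}) for the empirical block entropies."""
    h_k = block_entropy(symbols, k)
    h_prev = block_entropy(symbols, k - 1)
    return h_k / k, h_k - h_prev


# A periodic sequence has entropy rate 0; the increment reflects this
# already for small k, whereas H_k / k decays only like 1 / k.
periodic = [0, 1] * 50
direct, conditional = entropy_rate_estimates(periodic, 3)
print(direct, conditional)
```

In this toy example the increment is close to zero for k = 3, while the averaged block entropy is still near (ln 2)/3, which is the kind of difference in convergence speed that motivates the conditional approach.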
Since it is no additional effort to consider more than one observable, let X = (X_1, X_2, ..., X_n) : Ω → R^n be a random vector with observables X_1, X_2, ..., X_n as components. Originally, Bandt and Pompe [1] discussed the case that the measuring values coincide with the states of a one-dimensional system, which in our language means that Ω ⊂ R, n = 1, and that there is only one observable which coincides with the identity map.
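For each order d ∈ N, we are interested in the partition

$$\mathcal{C}^X(d) = \{C_{(\pi_1, \pi_2, \dots, \pi_n)} \mid \pi_i \in \Pi_d \text{ for } i = 1, 2, \dots, n\}$$

with

$$C_{(\pi_1, \pi_2, \dots, \pi_n)} = \{\omega \in \Omega \mid (X_i(T^{\circ d}(\omega)), \dots, X_i(T(\omega)), X_i(\omega)) \text{ has ordinal pattern } \pi_i \text{ for } i = 1, 2, \dots, n\},$$

called ordinal partition of order d. Here, ordinal patterns are of order d and the delay is one.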

No Information Loss
For the rest of this section, assume that the measurements by the observables do not lose information on the modeling system, roughly meaning that the random variables X_i ∘ T^{∘t} (i = 1, 2, ..., n; t = 0, 1, 2, ...) separate states of Ω and, more precisely, that for each event B ∈ B, there exists some C in the σ-algebra generated by these random variables with P(C ∆ B) = 0. Note that, already for one observable, the separation property described is mostly satisfied in a certain sense (see Takens [23], Gutman [24]).
Moreover, assume that T is ergodic, meaning that T^{−1}(B) = B implies P(B) ∈ {0, 1} for all B ∈ B. Ergodicity says that the given system does not separate into proper subsystems and, by Birkhoff's ergodic theorem (see [22]), allows one to obtain properties of the whole system on the basis of single orbits ω, T(ω), T^{∘2}(ω), ... Under these assumptions, it holds

$$\mathrm{KS} = \lim_{d \to \infty} \mathrm{EntroRate}(\mathcal{C}^X(d)). \tag{4}$$

This statement, shown in Antoniouk et al. [25] and generalized in [26], can be interpreted as follows: If there is no loss of information caused by the measuring process, the whole information of the process can be obtained by only considering ordinal relations between the measurements for each observable. Formula (4), to be read together with formula (3), is illustrated by Figures 1 and 2, where directions of arrows indicate the direction of convergence.
Figure 1. Approximation of Kolmogorov-Sinai (KS) entropy: the direct way.
Figure 2. Approximation of KS entropy: the conditional way.

Conditional Entropy of Ordinal Patterns
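In order to motivate a complexity measure based on considering successive ordinal patterns, we deduce some useful inequalities from the above statement. First, fix some d ∈ N and note that, since the increments H(C^X(d)_k) − H(C^X(d)_{k−1}) are monotonically non-increasing in k and (1/k) H(C^X(d)_k) is the average of the first k increments,

$$\mathrm{EntroRate}(\mathcal{C}^X(d)) \le H(\mathcal{C}^X(d)_k) - H(\mathcal{C}^X(d)_{k-1}) \le \frac{1}{k} H(\mathcal{C}^X(d)_k) \quad \text{for all } k \in \mathbb{N}.$$

Because EntroRate(C^X(d)) is converging to the KS entropy for d → ∞ in a monotonically non-decreasing way, the following is valid for each sequence (k_d)_{d=1}^∞ of natural numbers:

$$\mathrm{KS} \le \liminf_{d \to \infty} \bigl(H(\mathcal{C}^X(d)_{k_d}) - H(\mathcal{C}^X(d)_{k_d - 1})\bigr) \le \limsup_{d \to \infty} \bigl(H(\mathcal{C}^X(d)_{k_d}) - H(\mathcal{C}^X(d)_{k_d - 1})\bigr) \le \limsup_{d \to \infty} \frac{1}{k_d} H(\mathcal{C}^X(d)_{k_d}).$$

Note that the faster the sequence (k_d)_{d=1}^∞ increases, the nearer are the terms in the inequality, and good choices of the sequence provide equality of all terms in the last inequality. In particular, it holds

$$\mathrm{KS} \le \liminf_{d \to \infty} \bigl(H(\mathcal{C}^X(d)_2) - H(\mathcal{C}^X(d))\bigr),$$

being the background for the following definition (compare Unakafov and Keller [19]):
Definition 2. The quantity H(C^X(d)_2) − H(C^X(d)) is called the Conditional entropy of ordinal patterns of order d with respect to X.
Indeed, the concept given is a conditional entropy; however, since C^X(d)_2 is a finer partition than C^X(d), it reduces to a difference of entropies.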

Permutation Entropy
The description of KS entropy given by formula (4) includes a double limit, where the inner one is non-increasing and the outer is non-decreasing. Bandt et al. [7] (see also [1]) have proposed the concept of Permutation entropy, which only needs one limit and was the starting point of using ordinal pattern methods. Here, the concept is given in our general context.
Definition 3. The quantity PE^X(d) = (1/d) H(C^X(d)) is called the Permutation entropy of order d with respect to X. Moreover, by the Permutation entropy with respect to X, we understand

$$\mathrm{PE}^X = \limsup_{d \to \infty} \mathrm{PE}^X(d).$$

The definition of Permutation entropy given is justified by the result of Bandt et al. [7] that if T is a piecewise monotone interval map, then it holds PE^id = KS. Here, Ω is an interval and X = id, where id denotes the identity on Ω. We do not want to say more about this result, but mention that general equality of KS entropy and Permutation entropy is an open problem. However, as shown in [25] (under the assumptions above), it holds KS ≤ PE^X.
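Assuming that the increments of entropy of successive ordinal partitions are well-behaved in the sense that lim_{d→∞} (H(C^X(d + 1)) − H(C^X(d))) exists, by the Stolz-Cesàro theorem, one obtains the inequality

$$\mathrm{KS} \le \liminf_{d \to \infty} \bigl(H(\mathcal{C}^X(d)_2) - H(\mathcal{C}^X(d))\bigr) \le \limsup_{d \to \infty} \bigl(H(\mathcal{C}^X(d)_2) - H(\mathcal{C}^X(d))\bigr) \le \lim_{d \to \infty} \bigl(H(\mathcal{C}^X(d+1)) - H(\mathcal{C}^X(d))\bigr) = \mathrm{PE}^X$$

(compare [19]). This inequality sheds some light on the relationship of KS entropy, Permutation entropy and the various conditional entropy differences (see Figure 2) considered.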
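Summarizing, all quantities considered are related to the entropies H(C^X(d)_k). For simplicity, we now switch to only one observable X. The generalization is simple but not necessary for the following. In the restricted case, for d, k ∈ N, it holds

$$H(\mathcal{C}^X(d)_k) = -\sum_{(\pi_1, \pi_2, \dots, \pi_k) \in \Pi_d^k} P(C_{\pi_1 \pi_2 \dots \pi_k}) \ln P(C_{\pi_1 \pi_2 \dots \pi_k}),$$

where

$$P(C_{\pi_1 \pi_2 \dots \pi_k}) = P(\{\omega \in \Omega \mid (X(T^{\circ (d+l-1)}(\omega)), \dots, X(T^{\circ l}(\omega)), X(T^{\circ (l-1)}(\omega))) \text{ has ordinal pattern } \pi_l \text{ for } l = 1, 2, \dots, k\}).$$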

The Practical Viewpoint
Although the general relation of KS entropy and Permutation entropy is not settled, the asymptotics of entropies taken from ordinal pattern distributions are relatively well understood by the above considerations. This allows a good interpretation of what they measure. The practical problem, however, is that the asymptotics can be very slow and thus cause problems in the estimation of H(C^X(d)_k) from orbits of a system when d or k is high. Here, for d, k ∈ N, a naive estimator of P(C_{π_1 π_2 ... π_k}) is the relative frequency of the times at which the k successive ordinal patterns π_1, π_2, ..., π_k are observed in an orbit; the resulting estimate of H(C^X(d)_k) is denoted by h(d, k). The problem is that one needs too many measurements for a reliable estimation if d and k are large. This is demonstrated for the logistic map with "maximal chaos" in the one-dimensional case.
Here, Ω = [0, 1], T is defined by T(ω) = 4ω(1 − ω) for ω ∈ [0, 1], P is the Lebesgue measure, and X is the identity map. Note that this map and maps of the whole logistic family are often used for testing complexity measures. (For more information, see [27,28].) In Figure 3, the approximation of the KS entropy is rather good for d = 7, but a long orbit is needed. In general, the higher d is, the longer a stabilization of the estimate takes. A similar situation is demonstrated in the graphic in the middle, now with fixed d = 7 but increasing k. In the graphic on the bottom, three of the curves from the upper graphic are drawn again as a contrast to curves showing h(d, 1)/d for d = 5, 6, 7. The latter can be considered as estimates of the corresponding Permutation entropies; the estimates, however, are bad since d is not large enough. Note that according to the result of Bandt et al., estimates for very high d must be near to the KS entropy. The results illustrate that using the Conditional entropy of ordinal patterns is a good compromise. Good performance of the Conditional entropy of ordinal patterns for the logistic family is reported in Unakafov and Keller [19].
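The experiment described above can be imitated on a small scale. The following sketch assumes that h(d, k) is the entropy of the relative frequencies of words of k successive ordinal patterns (delay one) along one floating-point orbit; the starting value 0.3, the orbit length and the function names are arbitrary choices made here, not those used for Figure 3.

```python
import math
from collections import Counter


def logistic_orbit(omega0, n):
    """First n points of the orbit of omega0 under T(w) = 4 w (1 - w)."""
    orbit = [omega0]
    for _ in range(n - 1):
        orbit.append(4.0 * orbit[-1] * (1.0 - orbit[-1]))
    return orbit


def ordinal_pattern(v):
    # Descending values; for equal values the larger index comes first.
    return tuple(sorted(range(len(v)), key=lambda i: (-v[i], -i)))


def h_hat(x, d, k):
    """Naive estimate of H(C^X(d)_k) from words of k successive patterns."""
    patterns = [ordinal_pattern(x[t - d:t + 1][::-1]) for t in range(d, len(x))]
    n_words = len(patterns) - k + 1
    counts = Counter(tuple(patterns[i:i + k]) for i in range(n_words))
    return -sum(c / n_words * math.log(c / n_words) for c in counts.values())


orbit = logistic_orbit(0.3, 20_000)
d = 5
conditional = h_hat(orbit, d, 2) - h_hat(orbit, d, 1)  # h(d, 2) - h(d, 1)
permutation = h_hat(orbit, d, 1) / d                   # h(d, 1) / d
print(f"ln 2 = {math.log(2):.4f}")
print(f"conditional entropy estimate, d = {d}: {conditional:.4f}")
print(f"Permutation entropy estimate, d = {d}: {permutation:.4f}")
```

Repeating the last lines for several orbit lengths and orders gives a rough impression of how slowly the estimates stabilize when d or k grows.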

The Concept
It is natural to generalize Permutation entropy to Tsallis and Renyi entropy variants, which was first done by Zunino et al. [12]. Liang et al. [13] discuss the performance of a large list of complexity measures, among them the classical as well as the Tsallis and Renyi Permutation entropies, in tracking changes of EEG during different anesthesia states. They report that the class of Permutation entropies shows good performance in some features relative to the other measures, with the best results for the Renyi variant. Let us have a closer look at the new concepts.
Definition 4. For some given positive α ≠ 1, the empirical Renyi Permutation entropy (eRPE) and the empirical Tsallis Permutation entropy (eTPE) of order d ∈ N and of delay τ ∈ N of a time series (x_t)_{t=0}^{N−1} with N ∈ N are given by

$$\mathrm{eRPE}\bigl(\alpha, d, \tau, (x_t)_{t=0}^{N-1}\bigr) = \frac{1}{d(1 - \alpha)} \ln \sum_{\pi \in \Pi_d} (p_\pi^\tau)^\alpha$$

and

$$\mathrm{eTPE}\bigl(\alpha, d, \tau, (x_t)_{t=0}^{N-1}\bigr) = \frac{1}{d(\alpha - 1)} \Bigl(1 - \sum_{\pi \in \Pi_d} (p_\pi^\tau)^\alpha\Bigr),$$

respectively, with p_π^τ as given in (2). (We include the factor 1/d in the entropy formulas only for reasons of comparability with the classical Permutation entropy.)
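A minimal sketch of the two variants follows, assuming the standard Renyi and Tsallis entropy formulas applied to the ordinal pattern frequencies and divided by d. The function names and the test signal are hypothetical and serve only to show the relation to the ePE for α close to 1.

```python
import math
from collections import Counter


def pattern_frequencies(x, d, tau=1):
    """Relative frequencies of ordinal patterns of order d and delay tau."""
    n_vectors = len(x) - d * tau
    counts = Counter(
        tuple(sorted(range(d + 1), key=lambda l: (-x[t - l * tau], -l)))
        for t in range(d * tau, len(x))
    )
    return [c / n_vectors for c in counts.values()]


def epe(x, d, tau=1):
    return -sum(p * math.log(p) for p in pattern_frequencies(x, d, tau)) / d


def erpe(x, alpha, d, tau=1):
    """Empirical Renyi Permutation entropy for positive alpha != 1."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    power_sum = sum(p ** alpha for p in pattern_frequencies(x, d, tau))
    return math.log(power_sum) / (d * (1.0 - alpha))


def etpe(x, alpha, d, tau=1):
    """Empirical Tsallis Permutation entropy for positive alpha != 1."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    power_sum = sum(p ** alpha for p in pattern_frequencies(x, d, tau))
    return (1.0 - power_sum) / (d * (alpha - 1.0))


# Hypothetical deterministic test signal.
signal = [math.sin(0.7 * i * i) for i in range(500)]
for alpha in (0.5, 1.00001, 2.0):
    print(alpha, erpe(signal, alpha, 3), etpe(signal, alpha, 3))
print("ePE:", epe(signal, 3))
```

For fixed α, the eTPE can be written as (1 − exp(d(1 − α) · eRPE)) / (d(α − 1)), so both quantities order time series segments in the same way.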

Some Properties
As, in general, the Renyi and the Tsallis entropy of a distribution converge to the Shannon entropy for α → 1, convergence of eRPE and eTPE to the ePE holds as well. In principle, the two concepts can be used in data analysis to further emphasize the role of small ordinal pattern probabilities if α < 1 or of large ones if α > 1 (compare the graphs of the functions x → x ln x and x → x^α for different α and x ∈ [0, 1]). The consequences of this weighting become obvious for the eRPE when considering limits for α → 0 and α → ∞. One easily sees that

$$\lim_{\alpha \to \infty} \mathrm{eRPE}\bigl(\alpha, d, \tau, (x_t)_{t=0}^{N-1}\bigr) = -\frac{1}{d} \ln \max_{\pi \in \Pi_d} p_\pi^\tau \quad \text{and} \quad \lim_{\alpha \to 0} \mathrm{eRPE}\bigl(\alpha, d, \tau, (x_t)_{t=0}^{N-1}\bigr) = \frac{1}{d} \ln \#\{\pi \in \Pi_d \mid p_\pi^\tau > 0\},$$

meaning that for large α, the eRPE mainly measures the largest relative ordinal pattern frequency (on a logarithmic scale), with low entropy for high relative frequency; for small α, the eRPE mainly measures the number of occurring ordinal patterns (on a logarithmic scale). Since the eTPE is only a monotone functional of the eRPE for fixed α, it has, despite a different scale, similar properties as the eRPE.
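For α = 2, the eRPE has a nice interpretation. Having N_π = (N − dτ) p_π^τ occurrences of the ordinal pattern π in the time series, there are N_π(N_π − 1)/2 pairs of different times providing ordinal pattern π. Since, all in all, we have (N − dτ)(N − dτ − 1)/2 different time pairs, the quantity

$$\frac{\sum_{\pi \in \Pi_d} N_\pi (N_\pi - 1)}{(N - d\tau)(N - d\tau - 1)} \approx \sum_{\pi \in \Pi_d} (p_\pi^\tau)^2 = \exp\bigl(-d \cdot \mathrm{eRPE}(2, d, \tau, (x_t)_{t=0}^{N-1})\bigr)$$

can be interpreted as the degree of recurrence in the time series. This quantity is, in fact, the symbolic correlation integral recently introduced by Caballero et al. [29].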
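The limiting behavior and the recurrence interpretation can be checked numerically. The sketch below assumes the standard Renyi formula with the factor 1/d; the eRPE is evaluated with a rescaling by the largest frequency so that large α does not cause numerical underflow. All names and the test signal are hypothetical.

```python
import math
from collections import Counter


def pattern_counts(x, d, tau=1):
    """Absolute frequencies N_pi of ordinal patterns of order d, delay tau."""
    return Counter(
        tuple(sorted(range(d + 1), key=lambda l: (-x[t - l * tau], -l)))
        for t in range(d * tau, len(x))
    )


def erpe_from_counts(counts, alpha, d):
    """eRPE via ln sum p^alpha = alpha ln p_max + ln sum (p / p_max)^alpha."""
    n = sum(counts.values())
    p_max = max(counts.values()) / n
    scaled = sum((c / n / p_max) ** alpha for c in counts.values())
    return (alpha * math.log(p_max) + math.log(scaled)) / (d * (1.0 - alpha))


def recurrence_degree(counts):
    """Fraction of pairs of different times showing the same pattern."""
    n = sum(counts.values())
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


# Hypothetical deterministic test signal.
x = [math.sin(0.9 * t) + math.sin(0.31 * t * t) for t in range(2000)]
d = 3
counts = pattern_counts(x, d)
n = sum(counts.values())
largest_frequency_term = -math.log(max(counts.values()) / n) / d
pattern_number_term = math.log(len(counts)) / d

print("alpha = 200 :", erpe_from_counts(counts, 200.0, d), "vs", largest_frequency_term)
print("alpha = 0.01:", erpe_from_counts(counts, 0.01, d), "vs", pattern_number_term)
print("recurrence  :", recurrence_degree(counts),
      "vs", math.exp(-d * erpe_from_counts(counts, 2.0, d)))
```

The difference between the pair-counting quantity and the sum of squared frequencies is at most 1/(N − dτ − 1), so for windows of a few hundred patterns both readings are practically interchangeable.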

Demonstration
We demonstrate the performance of eRPE and eTPE for different α on the basis of EEG data discussed in [6]. For this, we consider two parts of a 19-channel scalp EEG of a boy with lesions predominantly in the left temporal lobe caused by a connatal toxoplasmosis. The data were sampled at a rate of 256 Hz, meaning that 256 measurements were obtained each second. The first data part (data set 1) was taken from an EEG derived at an age of eight years, and the second one (data set 2) was derived at an age of 11 years, four months after the implantation of a vagus stimulator. Note that epileptic activity was significant before vagus stimulation and was the reason for the implantation. (For some more details on the data, see [6].)
Figure 4 shows ePE, eRPE for α = 0.5 and α = 2, and eTPE for α = 2 for the two data sets in dependence on a shifted time window, where d = 3 and τ = 4. Each graphic represents the 19 channels for a fixed entropy by 19 entropy curves. Among the channels are T3 and P3, which are of particular interest in the following. The curve related to a fixed channel contains pairs (t, h_t), where h_t is the entropy of the segment of the related time series ending at time t and containing 2 × 256 + 3 × 4 successive measurements. (Each segment provides 512 = 2 × 256 ordinal patterns representing a time segment of two seconds.) The t are chosen from a time segment of 100 seconds, where the beginning time is set to 0 for simplicity. We have added a fat black curve representing the entropy of the whole brain instead of the single channels. Here, the relative frequencies used for entropy determination were obtained by counting ordinal patterns in the considered time segment pooled over all channels. The two EEG parts reflect changes of brain dynamics. Whereas, in the first EEG part, the entropies of P3 and, partially, of T3 are low relative to the other channels, the entropies of both channels are relatively higher in the second EEG part. More information with additional data sets derived directly before and after the vagus stimulator implantation is available from [6], where only the ePE was considered. Here, we want to note that P3 and T3 are from the left part of the brain with the lesions, and P3 seems to mainly reflect some kind of irregular behavior related to them. The most interesting point is that before vagus stimulator implantation, P3 and partially T3 are conspicuous both in phases with and without epileptic activity. For orientation, the graphics given in Figure 4 for data set 1 and in Figure 5 show a part from around 30 to 90 seconds with low entropies for pooled ordinal patterns (fat black curve) related to a generalized epileptic seizure, meaning epileptic activity in the whole brain. Figure 4 suggests that a visual inspection of the data using eRPE and eTPE instead of ePE does not seem to gain further insights when α is chosen close to 1. Here, it is important to note that, for the parameters considered, all ordinal patterns occur many times (with some significant frequency). Our guess is supported by Table 1. For each of the channels Fp2, T3, and P3 and given α, the relative frequency of concordant pairs of the observations ePE, eRPE at time s and ePE, eRPE at time t among all pairs (s, t) with s < t is shown. A pair (s, t) is said to be concordant if the difference between ePE at times s and t and the difference between eRPE at times s and t have the same sign. The results particularly reflect the fact that for channel Fp2, providing measurements from the front of the brain, the ordinal patterns are more equally distributed than for T3; for P3, the distribution of ordinal patterns is the farthest from equidistribution. For contrast, we also consider α = 250. Figure 5, related to data set 1, indicates that extreme choices of α could be useful in order to analyze and visualize changes in the brain dynamics more forcefully. The upper graphic of eRPE for α = 0.01 shows that, at the beginning of an epileptic seizure, the number of ordinal patterns abruptly decreases for nearly all channels and, after some increase, stays at a relatively low level until the end of the seizure. For α = 35, it is interesting to look at the whole ensemble of the entropies.
Here, eRPE indicates much more variability of the largest ordinal pattern frequencies of the channels in the seizure-free parts than in the seizure epochs. The very low entropy at the beginning of the seizure can be traced back to ordinal patterns that are mainly only increasing or only decreasing. Here, the special kind of scaling given by the eTPE allows one to emphasize the special situation at the beginning of a seizure and can therefore be interesting for an automatic detection of epileptic seizures, whereby the correct tuning of α is important.
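The concordance measure used for Table 1 can be sketched as follows. The sketch assumes that a pair with a vanishing difference in one of the two entropy curves is not counted as concordant, which is a choice made here since the text does not specify the treatment of ties; the function name `concordance` is hypothetical.

```python
def concordance(a, b):
    """Relative frequency of pairs (s, t), s < t, for which a[s] - a[t]
    and b[s] - b[t] have the same (non-zero) sign."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = 0
    total = 0
    for s in range(len(a)):
        for t in range(s + 1, len(a)):
            total += 1
            # Assumption: ties (zero differences) count as non-concordant.
            if (a[s] - a[t]) * (b[s] - b[t]) > 0:
                concordant += 1
    return concordant / total


# Toy entropy curves sampled at three window positions.
print(concordance([1.0, 3.0, 2.0], [1.0, 2.0, 3.0]))
```

Applied to the ePE and eRPE curves of one channel, a value near 1 means that both entropies rank the time windows almost identically, so the second curve adds little for visual inspection.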

Classification on the Base of Different Entropies
Classification is an important issue in the analysis of EEG data, for which in many cases entropy measures can be exploited.Here, it is often unclear which of these measures are the best and how much they are 'overlapping'.In view of the discussion in Section 2, note that by the asymptotic statements considered, ordinal pattern based measures can behave very similarly for high orders, but for low orders their performance can be rather different.Also including different delays in the analysis, one obtains much flexibility on one hand, but the problem of a high number of parameter combinations arises on the other hand.
Here, we want to discuss the classification of EEG data using ePE, empirical Conditional entropies of ordinal patterns (eCE, see below) and, additionally, Approximate entropy (ApEn) (see Pincus [30]) and Sample entropy (SampEn) (see Richman and Moorman [31], and extending work in Keller et al. [10]). For the definition and usage of ApEn and SampEn, in particular for the parameter choice ("ApEn(2, 0.2σ, x)", "SampEn(2, 0.2σ, x)"), we directly refer to [10]. The eCE, motivated and defined for dynamical systems in Section 2, is given by the following definition (compare also [10]):
Definition 5. Given a time series (x_t)_{t=0}^{N−1} with N ∈ N, the quantity ...
• group E: intracranial EEGs recorded from subjects with epilepsy during a seizure period.
In contrast to [10], where the groups A and B and the groups C and D were pooled, each of the groups is considered separately in the following.

Visualization and Classification for Delay One
In order to give a visual impression of how the considered entropies separate the data, we provide three figures. For the ordinal pattern based measures, we have chosen order d = 5 and delay τ = 1. Figure 6 shows the values of the four considered entropies for all data sets. The values obtained are drawn from left to right, starting with those from group A and ending with those from group E. For a better presentation of the distribution of all entropies, we have added boxplots (see Figure 7). Figure 6 shows that none of the considered entropies separates the groups well on its own; however, one can see different kinds of separation properties for the ordinal pattern based entropies and the two other entropies. A better separation is seen in Figure 8, where one entropy measure is plotted versus another one, in four different combinations. Here, the discrimination between E, the union of A and B, and the union of C and D is rather good, confirming the results in [10], but both A and B as well as C and D are strongly overlapping. The general separation seems to be slightly better using three entropies, which is illustrated by Figure 9.
Here, however, a two-dimensional representation is chosen by plotting the second principal component versus the first one, both obtained by principal component analysis from the three entropy variables. In order to obtain more objective and reliable results, we have tested a classification of the data on the basis of the chosen entropies and of Random Forests (see Breiman [32]), a popular and efficient machine learning method. For this purpose, the data were randomly divided into a learning group consisting of 80% of the data sets and a testing group consisting of the remaining 20%. The obtained accuracy of the classification, i.e., the relative frequency of the correctly classified data sets, was averaged over 1000 testing trials for each entropy combination considered. The results of this procedure, summarized in Tables 2-4, show that including an additional entropy results in a higher accuracy, but that the way in which entropies are combined is crucial for the results. Note that combining all four entropies provides an accuracy of only 70.6%.
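The evaluation protocol described above (random 80%/20% splits, accuracy averaged over many trials) can be sketched without external libraries. Note that the sketch swaps in a simple nearest-centroid classifier as a stand-in for Random Forests, so it illustrates only the protocol and not the classifier used in this paper; the feature values and all names are hypothetical.

```python
import random


def fit_centroids(features, labels):
    """Nearest-centroid stand-in classifier: mean feature vector per class."""
    sums = {}
    for f, y in zip(features, labels):
        acc, n = sums.get(y, ([0.0] * len(f), 0))
        sums[y] = ([a + v for a, v in zip(acc, f)], n + 1)
    return {y: [a / n for a in acc] for y, (acc, n) in sums.items()}


def predict(centroids, f):
    """Label of the centroid closest to f in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda y: sum((c - v) ** 2 for c, v in zip(centroids[y], f)),
    )


def mean_accuracy(features, labels, trials=1000, train_fraction=0.8, seed=0):
    """Accuracy averaged over repeated random learning/testing splits."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    n_train = int(round(train_fraction * len(indices)))
    total = 0.0
    for _ in range(trials):
        rng.shuffle(indices)
        train, test = indices[:n_train], indices[n_train:]
        model = fit_centroids([features[i] for i in train],
                              [labels[i] for i in train])
        hits = sum(predict(model, features[i]) == labels[i] for i in test)
        total += hits / len(test)
    return total / trials


# Hypothetical entropy pairs (e.g., ePE and SampEn) for two groups.
features = [(0.40 + 0.001 * i, 1.0 + 0.002 * i) for i in range(10)]
features += [(0.20 + 0.001 * i, 0.5 + 0.002 * i) for i in range(10)]
labels = ["A"] * 10 + ["E"] * 10
print(mean_accuracy(features, labels, trials=200))
```

Replacing `fit_centroids` and `predict` by a Random Forest implementation leaves the splitting and averaging logic unchanged.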

Other Delays
Clearly, the accuracy of the classification using ePE and eCE depends on the choice of the parameters of the entropies. Whereas the choice of higher orders does not make sense for statistical reasons, as already mentioned in Section 2, testing different delays τ is useful since the delay contains important scale information. Note that, for both ePE and eCE, the classification accuracy already varies by more than 14%. Considering only delays τ = 1, 2, ..., 9 for d = 5, the maximal accuracies for ePE and eCE are 45.01% and 48.0% for τ = 6 and τ = 5, respectively. Note that both results are better than those for the SampEn (see Table 2) and that τ = 1 provides the worst results. Combining two ePEs and two eCEs for delays in {1, 2, ..., 9}, one reaches an accuracy of 61.79% (for delays 1 and 2) and 62.28% (for delays 1 and 9), respectively.
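Scanning delays amounts to recomputing the entropy for each τ with the order fixed. The following sketch shows this for the ePE with d = 5 and τ = 1, 2, ..., 9, assuming the normalization by d; the test signal is a hypothetical mixture of a slow and a fast oscillation and is not related to the EEG data.

```python
import math
from collections import Counter


def epe(x, d, tau):
    """Empirical Permutation entropy of order d and delay tau (divided by d)."""
    n_vectors = len(x) - d * tau
    counts = Counter(
        tuple(sorted(range(d + 1), key=lambda l: (-x[t - l * tau], -l)))
        for t in range(d * tau, len(x))
    )
    return -sum(c / n_vectors * math.log(c / n_vectors)
                for c in counts.values()) / d


def delay_scan(x, d, delays):
    """ePE for each delay; the values can serve as scale-dependent features."""
    return {tau: epe(x, d, tau) for tau in delays}


# Hypothetical signal with two time scales.
signal = [math.sin(0.05 * t) + 0.5 * math.sin(1.7 * t) for t in range(3000)]
scan = delay_scan(signal, d=5, delays=range(1, 10))
for tau, value in scan.items():
    print(tau, round(value, 4))
```

Each delay yields one feature per data set, so combining several delays quickly enlarges the feature set that a classifier has to cope with.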

Resume
Throughout this paper, we have discussed ordinal pattern based complexity measures both from the viewpoint of their theoretical foundation and their application in EEG analysis, centered upon Permutation entropy and its conditional variants.We have pointed out that, as in many situations in (model-based) data analysis, one must give attention to the discrepancy of theoretical asymptotics and statistical requirements, here in view of estimating KS entropy.In the case of moderately but not extremely long data sets, the concept of Conditional entropy of ordinal patterns, as discussed, could be a compromise.It has been shown to have better performance than the classical Permutation entropy in many situations.
A good way of further investigating the performance of ordinal pattern based measures is an extensive testing of these measures for data classification. In this direction, the results of this paper for a restricted parameter choice are already promising; however, systematic studies are required and planned. For this purpose, based on the considerations in Sections 3 and 4, the authors also propose including Renyi and Tsallis variants of Permutation entropy (with an extreme parameter choice), ordinal pattern based disequilibrium measures as considered by Zunino et al. [17] and Zunino and Ribeiro [18], and classical concepts such as Approximate entropy and Sample entropy. The latter are interesting in combination with ordinal complexity measures since they possibly address other features. The most important challenge, however, is to deal with the great number of entropy measures and parameters which, in the opinion of the authors, can be faced by using machine learning ideas in a sophisticated way.
Figure 3. h(d, k) − h(d, k − 1) and h(d, 1)/d for different d, k ∈ N in dependence on orbit lengths of T between 10^2 and 10^5 or 10^6, respectively. The horizontal black line indicates the KS entropy of T, which is ln 2. The graphic on the top provides curves for h(d, k) − h(d, k − 1) for k = 2 and d = 2, 3, ..., 7, being estimates of the Conditional entropy of ordinal patterns of the corresponding orders d.

Figure 4. Comparison of empirical Renyi Permutation entropy (eRPE) for α = 0.5, 2, empirical Tsallis Permutation entropy (eTPE) for α = 2 and ePE, computed from EEG recordings before stimulator implantation (data set 1) and after stimulator implantation (data set 2) for 19 channels using a shifted time window, order d = 3 and delay τ = 4. The channels T3 (green line) and P3 (red line) are highlighted, as is the entropy over all channels (fat black line). The sampling rate is 256 Hz.

Figure 6. Entropy of the EEG data sorted by groups for four different entropy measures.

Figure 7. Boxplots for four different entropy measures sorted by groups.

Figure 8. One entropy versus another entropy for four entropy combinations.

Figure 9. Second principal component versus the first one obtained from principal component analysis on three entropy variables.

Table 1. Concordance of the sign of entropy differences of ePE and eRPE for given α.

Table 2. Results of the classification on the basis of one entropy.

Table 3. Results of the classification on the basis of two entropies.

Table 4. Results of the classification on the basis of three entropies.