On the Statistical Properties of Multiscale Permutation Entropy: Characterization of the Estimator’s Variance

Permutation Entropy (PE) and Multiscale Permutation Entropy (MPE) have been extensively used in the analysis of time series searching for regularities. Although PE has been explored and characterized, there is still a lack of theoretical background regarding MPE. Therefore, we expand the available MPE theory by developing an explicit expression for the estimator’s variance as a function of time scale and ordinal pattern distribution. We derived the MPE Cramér–Rao Lower Bound (CRLB) to test the efficiency of our theoretical result. We also tested our formulation against MPE variance measurements from simulated surrogate signals. We found the MPE variance symmetric around the point of equally probable patterns, showing clear maxima and minima. This implies that the MPE variance is directly linked to the MPE measurement itself, and there is a region where the variance is maximum. This effect arises directly from the pattern distribution, and it is unrelated to the time scale or the signal length. The MPE variance also increases linearly with time scale, except when the MPE measurement is close to its maximum, where the variance presents quadratic growth. The expression approaches the CRLB asymptotically, with fast convergence. The theoretical variance is close to the results from simulations, and appears consistently below the actual measurements. By knowing the MPE variance, it is possible to have a clear precision criterion for statistical comparison in real-life applications.


Introduction
Information entropy, originally defined by Shannon [1], has been used as a measure of information content in the field of communications. Several other applications of entropy measurements have been proposed, such as the analysis of physiological electrical signals [2], where a reduction in entropy has been linked to aging [3] and various motor diseases [4]. Another application is the characterization of electrical load behavior, which can be used to perform non-intrusive load disaggregation and to design smart grid applications [5].
Multiple types of entropy measures [6][7][8] have been proposed in recent years to assess the information content of time series. One notable approach is the Permutation Entropy (PE) [9], used to measure the recurrence of ordinal patterns within a discrete signal. PE is fast to compute and robust even in the presence of outliers [9]. To better measure the information content at different time scales, Multiscale Permutation Entropy (MPE) [10] was formulated as an extension of PE, by using the multiscale approach proposed in [11]. Multiscaling is particularly useful in measuring the information

Materials and Methods
In this section, we establish the concepts and tools necessary for the derivation of the MPE model. In Section 2.1 we review the formulation of PE and MPE in detail. In Section 2.2, we show the derivation of the MPE variance. In Section 2.3, we derive the expression for the CRLB of the MPE, and we compare it to the variance. Finally, in Section 2.4, we explain the statistical model used to generate surrogate signals, which are used to test the MPE model.

Permutation Entropy
PE [9] measures the information content by counting the ordinal patters present within a signal. An ordinal pattern is defined as the comparison between the values of adjacent data points in a segment of size d, known as the embedded dimension. For example, for a discrete signal of length N, x 1 , . . . , x n , . . . , x N , and d = 2, only two possible patterns can be found within the series, x n < x n+1 and x n > x n+1 . For d = 3, there are six possible patterns present, as shown in Figure 1. In general, for any integer value of d, there exists d factorial (d!) possible patterns within any given signal segment. To calculate PE, we must first obtain the sample probabilities within the signal, by counting the number of times a certain pattern i = 1, . . . , d! occurs. This is formally expressed as follows: where π i is the label of a particular ordinal pattern i, andp i is the estimated pattern probability. Some authors [15] also introduce a downsampling parameter in Equation (1), to address the analysis of an oversampled signal. For the purposes of this article, we assume a properly sampled signal, and thus, we do not include this parameter.
Using these estimations, we can apply the entropy definition [1] to obtain the PE of the signal whereĤ is the estimated normalized PE from the data. PE is a very simple and fast estimator to calculate, and it is invariant to non-linear monotonous transformations [15]. It is also convenient to note that we need no prior assumptions on the probability distribution of the patterns, which makes PE a very robust estimator. The major disadvantage in PE involves the length of the signal, where we need N d! for PE to be meaningful. This imposes a practical constraint in the use of PE for short signals, or in conditions where the data length is reduced. All possible patterns for embedded dimension d = 3, from a three-point sequence {x n , x n+1 , x n+2 }.

Multiscale Permutation Entropy
The MPE consists of applying the classical PE analysis on coarse-grained signals [10]. First, we define m, as the time scale parameter for the MPE analysis. The coarse-graining procedure consists in dividing the original signal into adjacent, non-overlapping segments of size m, where m is a positive integer less than N. The data points within each segment are averaged to produce a single data point, which represents the segment at the given time scale [11].
MPE consists in applying the PE procedure in Equation (2) on the coarse-grained signals in Equation (3) for different time scales m. This technique allows us to measure the information inside longer trends and time scales, which is not possible using PE. Nonetheless, the resulting coarse-grained signals have a length of N/m, which limits the analysis. Moreover, for a sufficiently large m, the condition N/m d! is eventually not satisfied. At this point, it is important to discuss some constraints regarding the interaction between time scale m and the signal length N. First, strictly speaking, N/m must be an integer. In practice, the coarse-grained signal length will be the integer number immediately below N/m. Second, the proper domain of m is (0, N), as the segments size, at most, can be the same length as the signal itself. It is handy to use a normalized scale m N with domain (0, 1) which does not change between signals. This normalized scale is the inverse of the coarse-grained signal length. Taking the length constraints from the previous section, this means that m will result in a coarse-grained signal of 2 data points, which is not meaningful for this analysis. For d = 3, the normalized scale must be significantly less than 1 6 , which corresponds to a coarse-grained signal of six elements. Therefore, for practical applications, we restrict our analysis for values of m N close to zero, by selecting a small time scale m, a large signal length N, or both.

Variance of MPE Statistic
The calculated MPE can be interpreted as a sample statistic that estimates the true entropy value, with an associated expected value, variance, and bias, for each time scale. We have previously developed the calculation of the expected value of the MPE in [17], where the bias has been found to be independent from the pattern probabilities. We expand on this result by first proposing an explicit equation to the MPE statistic, and then we formulate the variance of MPE, as a function of the true pattern probabilities and time scale.
For the following development, we use H, the non-normalized version of Equation (2), By taking the Taylor expansion of Equation (4) on a coarse-grained signal in Equation (3), we get whereĤ is the MPE estimator, H is the true unknown MPE value, m is the time scale, N is the signal length, and d the embedded dimension. The probabilities p i correspond to the true probabilities (unknown parameters) of each pattern, and ∆Y i correspond to the random part in the multinomially (Mu) distributed frequency of each pattern, Y i being the number of pattern counts of type i in the signal. If we take into consideration the length constraints discussed in Section 2.1.2, we know that the normalized time scale m N is very close to zero. This implies that the high-order terms of Equation (5) will quickly vanish for increasing values of k. Therefore, we propose to make the simplest approximation of Equation (5) by taking only the term k = 1. By doing this, we get the following expression: Using our previous results involving MPE [17], we know the expected value of Equation (7) is The statistic presents a bias that does not depend on the pattern probabilities p i , and thus, can be easily corrected for any signal. Now, we move further and calculate the variance of the MPE estimator. First, it is convenient to express Equation (4) p being the column vector of pattern probability estimators,l the column vector of the logarithm of each pattern probability estimator, and T is the transpose symbol. We can now rewrite Equation (7) aŝ The circle notation • represents the Hadamard power (element-wise). Now, we can obtain the variance of the MPE estimator (11), Now, we know that E ∆Y∆Y T is the Covariance matrix of ∆Y, which is the multinomial random variable defined in Equation (6).
where diag(p) is a diagonal matrix, where the diagonal elements are the elements of p.
We can further summarize the covariance matrix (14) as follows: We also rewrite the following term in Equation (13), By combining Equations (14) to (18) explicitly in Equation (13), we get the expression, We note that Equation (19) is a cubic polynomial respect to the normalized scale m/N. Since we are still working in the region where m/N is close to (but not including) zero, we propose to further simplify this expression using only the linear term. This means that we can approximate Equation (19) as follows: We note that Equation (20) will be equal to zero for a perfectly uniform pattern distribution (which yields a maximum PE). In this particular case, Equation (20) will not be a good approximation for the MPE variance, and we will need, at least, the quadratic term in Equation (19). Except for extremely high or low values of MPE, we expect the variance linear approximation to be accurate. We discuss more properties of this statistic in the Section 3.2.

MPE Cramér-Rao Lower Bound
In this section, we compare the MPE variance (20) to the CRLB, to test the efficiency of our estimator. The CRLB is defined as where B is the bias of the MPE expected value from Equation (8) and where I(H) is the Fisher Information, and f H (y; p) is the probability distribution function of H.
Although we do not have an explicit expression for f H , we can express CRLB(H) as a function of CRLB(p) as follows [18]: I(p) is the Fisher information matrix for p, which we know has a multinomial distribution related to Equation (6). Each element of I(p) is defined as where fp(p; p) is the probability distribution ofp, which is identically distributed to Equation (6) (for the full calculation of CRLB(H), see Appendix B). Thus, by obtaining the inverse of I(p) and all partial derivatives of H with respect to each element of p, we obtain the CRLB(H).
As we recall from our results in Equations (19) and (20), the CRLB corresponds to the first term of the Taylor series expansion of the MPE variance. The high-order terms in Equation (19) increase the MPE variance above the CRLB. For small values of m N , the higher-order terms in Equations (20) become neglectable, which make the MPE variance approximation converge to CRLB(H).

Simulations
To test the MPE variance, we need an appropriate benchmark. We need to design a proper signal model with the following goals in mind: First, the model must preserve the pattern probabilities across all the signal generated; second, the function must have the pattern probability as an explicit parameter, easily modifiable for testing. The following equation satisfies these criteria for dimension d = 2: where This is a non-stationary process, with a trend function δ(p). is a Gaussian noise term with variance σ 2 = 1, without loss of generality. Although different values of σ 2 will indeed modify the trend function, it will not affect the pattern probabilities in the simulation, as PE is invariant to non-linear monotonous transformations [15].
For dimension d = 2, p = p 1 = P(x n < x n+1 ) and 1 − p = p 2 = P(x n > x n+1 ), which are the probabilities of each of the two possible patterns. The formulation of δ(p) comes from the Gaussian Complementary Cumulative Distribution function, taking p as a parameter. This guarantees that we can directly modify the pattern probabilities p and 1 − p for simulation.
Although p is not invariant at different time scales, it is consistent within each coarse-grained signal, which suffices for our purposes. We chose to restrict our simulation analysis to the embedded dimension d = 2. Although our theoretical work holds for any value of d (see Section 2.2), it is difficult to visualize the results at higher dimensions.
This surrogate model (28) was implemented in the Matlab environment. For the test, we generated 1000 signals each, for 99 different values for p = 0.01, 0.02, . . . 0.99. The signal length was set to N = 1000. Some sample paths are shown in Figure 2. These signals were then subject to the coarse-graining procedure (3) for time scale m = 1, . . . , 10. The MPE measurement was taken for each coarse-grained simulated signal using Equation (2). Finally, we obtained the mean and variance of the resulting MPE's for each scale. This simulation results are discussed in Section 3.2.

Results and Discussion
In this section, we explore the results from the theoretical MPE variance. In Section 3.1, we contrast the results from the model against the MPE variance measured from the simulations. In Section 3.2, we discuss these findings in detail.

Results
Here, we compare the theoretical results with the surrogate data obtained by means of the procedure described in Section 2.4. We use the cubic model from Equation (19) instead of the linear approximation in Equation (20), to take in consideration non-linear effects that could arise from simulations. Figure 3 shows the theoretical predictions in dotted lines, and simulation measurements in solid lines.    Figure 3d shows the case where we have extreme pattern probability distributions, with entropy close to zero. Although we use the complete cubic model (19), the predicted curves are almost linear.
In general, we can observe that the simulation results have greater variance than the prediction of the model. The real values from the simulations correspond to the sample variance from the signals, calculated fromp instead of p. Nonetheless, the simulations and theoretical graphs have the same shape. It is interesting to note that the discrepancies increase with the scale. This effect is addressed in Section 3.2.

Discussion
It is interesting to explore the particular structure of the variance. As we can observe from Figure 3b, the MPE variance increases linearly with respect to the time scale for a wide range of pattern distributions, as described in Equation (20). This is even true with highly uneven distributions, where the entropy is very close to zero, as shown in Figure 3d. Nonetheless, when we observe the expression (20), the equation collapses to zero when all pattern probabilities p i are the same. For any embedded dimension d, all probabilities p i = 1 d! . This particular pattern probability distribution is the discrete uniform distribution, which yield to the maximum possible entropy in Equation (2). As we can observe from Figure 3c, the linear approximation in Equation (19) is not enough to estimate the variance in this case. Nonetheless, the results suggest a quadratic increase. This agrees with previous results by Little and Kane [16], where they characterized the classical normalized PE variance for white noise under finite-length constraints.
This result matches our model in Equation (19) (taking the quadratic term) for the specific case of uniform pattern distribution and time scale m = 1. This suggest that, when we approach the maximum entropy, the quadratic approximation is necessary.
Contrary to the bias in the expected value of MPE [17], the variance is strongly dependent on the pattern probabilities present in the signal, as shown in Figure 3a. For the MPE with embedded dimension d = 2, the variance of MPE has a symmetric shape around equally probable patterns. We observe, for a fixed time scale, that the variance increases the further we deviate from the center, and sharply decreases for extreme probabilities. The variance presents its lowest points at the center and the extremes of the pattern probability distributions, which corresponds to the points where the entropy is maximum and minimum, respectively. For d = 2, we can calculate the variance (20) more explicitly, It is evident that the zeros of this equation correspond to p = 0, p = 1 (points of minimum entropy), and p = 0.5 (maximum entropy). We can get the critical points by calculating the first derivative of Equation (32) with respect to p m N d dp l T Σ p l| d=2 = m N ln which is equal to zero to get the extreme points. From the first term of Equation (33), we obtain the critical point p = 0.5, which is a minimum. If the second term of the equation is equal to zero, we need to solve the transcendental function Numerically, we found the maximum points to be p = 0.083 and p = 0.917, as shown in Figure 3a. Both these values yield to a normalized PE value ofĤ = 0.413. This implies that, when we obtain values of MPE close to this value, we will have a region with maximum variance. This effect arises directly from the pattern probability distribution. Hence, in the region aroundĤ = 0.413 for d = 2, we will have maximum estimation uncertainty, even before factoring the finite length of the signal, or the time scale. Therefore, Equation (20) implies an uneven variance across all possible values of the entropy measurement, regardless of time scale and embedded dimension. It also implies a region where this variance is maximum.
Lastly, as noted in Section 2.3, the MPE variance (20) approaches the CRLB for small values of m. This is further supported by the simulation variance MPE measurements shown in Figure 3, which are consistently above the theoretical prediction. We can attribute this effect to the number of iterations of the testing model in Equation (28), where an increasing number of repetitions will yield to a more precise MPE estimation, with a reduced variance. Since we already include the effect of the time scale in Equations (19) and (20), the difference between the theoretical results and the simulations does not come directly from the signal length or the pattern distribution.

Conclusions
By following on from our previous work [17], we further explored and characterized the MPE statistic. By using a Taylor series expansion, we were able to obtain an explicit expression of the MPE variance. We also derived the Cramér-Rao Lower bound of the MPE, and compared it to our obtained expression. Finally, we proposed a suitable signal model for testing our results against simulations.
By analyzing the properties of the MPE variance graph, we found the estimator to be symmetric around the point of equally probable patterns. Moreover, the estimator is minimum in both the points of maximum and minimum entropy. This implies, first, that the variance of the MPE is dependent on the MPE estimation itself. In regions where the entropy measurement is near the maximum or minimum, the estimation will have a minimum variance. On the other hand, there is an MPE measurement where the variance will have a maximum. This effect comes solely from the pattern distribution, and not from the signal length or the MPE time scale. We should take in account this effect when comparing real entropy measurements, as it could affect the interpretation of statistically significant difference.
Regarding the time scale, the MPE variance linear approximation is sufficiently accurate for almost all pattern distributions, provided that the time scale is small compared to the signal length. An important exception to this is the case where the pattern probability distribution is almost uniform. For equally probable patterns, the linear term of the MPE variance vanishes, regardless of time scale. In this case, we need to increase the order of the approximation to the quadratic term. Hence, for MPE values close to the maximum, the variance presents a quadratic growth respect to scale.
We found the MPE variance estimator obtained in this article to be almost efficient. When the time scale is small compared to the signal length, the MPE variance resembles the MPE CRLB closely. Since the CRLB is equal to the first term on the Taylor series approximation for the variance, this implies that the efficiency limit also changes with the MPE measurement itself. Since this effect also comes purely from the pattern distribution, we cannot correct it with the established improvements of the MPE algorithm, like Composite MPE or Refined Composite MPE [13].
By knowing the variance of the MPE, we can improve the interpretation of this estimator in real-life applications. This is important because researchers can impose a precision criterion over the MPE measurements, given the characteristics of the data to analyze. For example, the electrical load behavior analysis can be achieved using short-term measurements where the time scale is a limiting factor for MPE. By knowing the variance, we can compute the maximum time scale until which the MPE is still significant.
By better understanding the advantages and disadvantages of the MPE technique, it is possible to have a clear benchmark for statistically significant comparisons between signals at any time scale.

Funding:
The work is part of the ECCO project supported by the french Centre Val-de-Loire region under the contract #15088PRI.
Acknowledgments: This work was supported by CONACyT, Mexico, with scholarship 439625.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A. Multinomial Moment Matrices
In this section we will briefly derive the expressions for the Covariance, Coskewness, and Cokurtosis matrices for a multinomial distribution. These will be necessary in the calculation of the MPE variance in Section 2.2.
First, we start with a multinomial random variable with the following characteristics, where we use the same definitions as in Equations (10) and (12), where p is the vector composed of all pattern probabilities p i , andp is the estimation of p. We will also use the definitions in Section 2.2, where d! is the number of possible patterns (number of events in the sample space), m is the MPE time scale, and N is the length of the signal. For the remainder of this section, we will assume m and N are such that n = N m is integer. We note that Y is composed of a deterministic part np i , and a random part ∆Y i . It should be evident that E[∆Y i ] = 0 and ∆Y i is identically distributed to Y.
We proceed to calculate the Covariance matrix of the binomial random variable ∆Y. We know that for i = 1, . . . , d! and j = 1, . . . , d!. Thus, if we gather all possible combinations of i and j in the Covariance matrix, we get Similarly, the skewness and coskewness can be expressed as, which yields to the Coskewness matrix where we use again the vector definitions in Equation (12). Lastly, we follow the same procedure to obtain the Cokurtosis matrix, by obtaining the values which combines in the matrix as follows By taking advantage of this expressions, we are able to calculate the MPE variance in Section 2.2.
The next step is to compute the Fisher Information matrix. We know from Equation (6)  is the count of patterns in the signal, as described in Equations (1) and (6). Sincep = m N Y, they have the same probability distribution. Using (A16), I(p * ) is calculated as follows, ln( f Y (y; p)) = ln( N m !) + y d! ln(p d! ) + where 1 · 1 T is a square matrix of ones, with rank 1. We should note that j = 1, . . . , (d! − 1) and k = 1, . . . , (d! − 1), so I(p * ) is of size (d! − 1)x(d! − 1). Because of the constraint in (A11), p d! offers no additional information. Moreover, we guarantee that I(p * ) is non-singular, and thus, invertible.