Article Some Convex Functions Based Measures of Independence and Their Application to Strange Attractor Reconstruction

The classical information-theoretic measures such as the entropy and the mutual information (MI) are widely applicable to many areas in science and engineering. Csiszar generalized the entropy and the MI by using the convex functions. Recently, we proposed the grid occupancy (GO) and the quasientropy (QE) as measures of independence. The QE explicitly includes a convex function in its definition, while the expectation of GO is a subclass of QE. In this paper, we study the effect of different convex functions on GO, QE, and Csiszar’s generalized mutual information (GMI). A quality factor (QF) is proposed to quantify the sharpness of their minima. Using the QF, it is shown that these measures can have sharper minima than the classical MI. Besides, a recursive algorithm for computing GMI, which is a generalization of Fraser and Swinney’s algorithm for computing MI, is proposed. Moreover, we apply GO, QE, and GMI to chaotic time series analysis. It is shown that these measures are good criteria for determining the optimum delay in strange attractor reconstruction.


Introduction
The advent of information theory was hallmarked by Shannon's seminal paper [1].In that paper, some fundamental measures of information were established, among which is the entropy.The entropy is a concept that originally resided in thermodynamics and statistical physics as a measure of the degree of disorder [2].For a different purpose, Shannon used the entropy to measure the amount of information.He also proposed the prototype of mutual information (MI) to measure the amount of information transmitted through a communication channel.The Shannon MI can be viewed as the Kullback divergence (also known as the relative entropy) between the joint probability density function (PDF) and the product of marginal PDFs.It reaches its minimum, zero, if and only if the variables are independent.Hence MI can be viewed as a measure of independence.Since the 1960s, generalizations of the Shannon entropy and MI have attracted the attention of many researchers, yielding various forms of non-Shannon information-theoretic measures [2][3][4][5][6][7][8][9].For example, Renyi suggested some properties that the entropy should satisfy, such as the additivity, and proposed a class of entropies with a parameter.The Shannon entropy is the limit of these entropies when the parameter approaches 1 [3].Harvrda and Charvat proposed a generalization of the Shannon entropy that is different from the Renyi's entropy, which is called structural -entropy [4].Using convex functions, Csiszar proposed the f-divergence [5], which is a generalization of the relative entropy.A generalized version of Shannon MI, which is referred to as generalized mutual information (GMI) in this paper, can be readily derived from f-divergence [5].Tsallis postulated a generalized form of entropy [6].Kapur also gave many definitions of entropies [7].More recently, we proposed the grid occupancy (GO) [8] and the quasientropy (QE) [9] as measures of independence.The QE, which explicitly includes a convex function, is a general enough definition that includes the -entropy, Tsallis entropy, and some of the Kapur's entropies as special cases.For the GO, it can be shown that its expectation is a subclass of QE [8].
As we can see, many information-theoretic measures involve convex functions.Thus, it is interesting to pose the question that how different convex functions affect the behavior of the related measures.Previous studies were usually carried out on a restricted type of convex function.For example, Tsallis studied the power function [6].In this paper, however, we study the effect of arbitrary convex functions.We propose a quality factor (QF) to quantify the sharpness of the minima of GO, QE, and GMI.We show that the order of QF with respect to l, the number of quantization levels, correctly predicts the sharpness of minima.Especially, some convex functions yield QFs with higher order than that of the minus Shannon entropy and MI, which implies that the related measures have sharper minima than the minus Shannon entropy and MI do.
Nowadays, applications of entropy and information theory have extended, far beyond the original scope of communication theory, to a wide variety of fields such as statistical inference, non-parametric density estimation, time series analysis, pattern recognition, biological, ecological and medical modeling and so on [10][11][12][13][14][15].In this paper, besides the above theoretical results, we also deal with a specific application of the independence measures to chaotic time series analysis, the background of which is described as follows.In many cases, the commonly seen one-dimensional scalar signals or time series are observations of a certain variable of a multivariate dynamical system.An interesting property of dynamical systems is that their multivariate behavior can be approximately "reconstructed" from observations of just one variable by a method called state space (phase portrait) reconstruction.For chaotic systems, this method is also termed strange attractor reconstruction, which has become a fundamental approach in chaotic signal analysis.Basic reconstruction can be done using delay coordinates [16].Namely, we take the current observed value of the time series and the values after equally spaced time lags  , 2 , 3 … to obtain "observations" of several reconstructed variables where  is the delay.If  is not appropriate, the effect of reconstruction might not be good.So we should consider how to choose a proper delay.Since reconstruction aims at releasing the information of the whole system "condensed" in one variable, generally the reconstructed variables should be as independent as possible.Thus, a measure of independence can be used as a criterion for choosing  .For example, Fraser and Swinney used the first minimum of the Shannon MI for choosing delay according to Shaw's suggestion.They proposed a recursive algorithm for computing Shannon MI, and they showed that the MI is actually more advantageous than the correlation function that only takes into account second order dependence [17].In this paper, we shall develop a recursive algorithm for computing GMI, which is a generalization of Fraser and Swinney's algorithm for computing Shannon MI.In addition, we are going to show that GO, QE, and GMI are even better criteria for choosing  than the Shannon MI.It should be noted that this paper is by no means a thorough treatment of all the measures of independence, there are many not covered such as the Hilbert-Schmidt independence criterion proposed by Gretton et al. [18] and the distance correlation proposed by Szekely et al. [19].
The rest of the paper is organized as follows.In Section 2, we first give the definition of quasientropy (QE), and then derive the quality factor (QF) of QE.In Section 3, we illustrate the principle of grid occupancy (GO), introduce the relation between GO and QE, and define the QF of GO based on the QF of QE.Section 4 reveals the relation between QE and the generalized mutual information (GMI), and deduces the QF of GMI, also from the QF of QE.The recursive algorithm for computing GMI is mentioned in Section 4, while the details are placed in the Appendix for better organization of the paper.Section 5 is devoted to a study of the order of QF with respect to l, the number of quantization levels, elucidating the cases when this order is higher than, equal to, and lower than that of the MI's.These theoretical results are verified by the numerical experiments on delay reconstruction of the Rössler and the Lorenz attractors presented in Section 6.Finally, Section 7 concludes the paper.In the following, the word "entropy," when appears alone, refers to the Shannon entropy and, the phrase "MI," when appears alone, refers to the Shannon MI, as is the common usage.

Quasientropy
Let us consider the problem of how to measure the independence of several variables.We denote by , where

 
Prob A denotes the probability that event A occurs, and   r p  is the probability density function (PDF) of r .Without loss of generality, consider two continuous variables 1 r and 2 r .In the past one or two decades, the study of copulas has become a blooming field of statistical research [20].
Copula is the joint CDF of the transformed variables by their respective CDFs.The rationale of the study of copulas is that, to study the relation between two or more variables, we should nullify the effect of their marginal distributions and concentrate on their joint distribution.Based on this principle, we transform 1 r and 2 r by their respective CDFs as follows: , 0,1 0,1 z z   , and we have [8,9]: , Lemma 1 shows that we can measure the independence of 1 r and 2 r by measuring the uniformity of the distribution of   . To this end, let us partition the region      uniform grid, and denote by   , i j  the   , i j th square in the grid: Denote by   , p i j the probability that   then, the QE for measuring the independence of 1 r and 2 r is defined as [9]: , , where   f  is a differentiable strictly convex function on   0,1 l .Clearly, when   log , QE is nothing but the entropy of   Due to the Jensen's inequality, we have [9]: with equality if and only if   , Then the equality in (5) holds and   r and 2 r are not independent, then there exists 0 l such that for any , r r  cannot reach its minimum [9].Thus, for a large enough l , the minimal  

Quality Factor (QF) of QE
There are infinitely many strictly convex functions.Let us consider the effect of different convex functions on the performance of QE.In [9], we proved that: In order to better differentiate between independence and "near" independence, it is preferable that QE have high sensitivity around its minimum to the variance of the uniformity of probability density.
Thus, we write   , p i j in the following form: Then, the Taylor expansion of (4) gives [9]: Based on ( 5), (6), and ( 8), we can define the quality factor (QF) of QE as: where, the numerator is twice the amplification rate of the variance between   , p i j and the uniform probability distribution in (8), and the denominator is the maximum dynamic range of   1 2 , r r  derived from ( 6) and ( 5).The greater the QF, the more sensitive is QE to the change in the uniformity of probability distribution around its minimum and, thus, the sharper is the shape of the minimum of QE.
An illustration of the QF of QE is shown in Figure 1.Suppose that QE is plotted versus ,


, and  is the angle that is tangent to the curve of QE at the minimum of QE, then: Clearly, when Q  is large (small), QE is sharp (blunt) around its minimum.

QF of Grid Occupancy (GO)
In [8], we described an interesting phenomenon that independence can be measured by simply counting the number of occupied squares: Given N realizations (observed points) of   and let: where   Then, the independence measure GO is defined as: A visual illustration of GO is depicted in Figure 2. , , z z , respectively, where It is easy to see that the points in Figure 2c  Indeed, we can prove that the expectation of GO is a subclass of QE [8]: where   E  denotes the mathematical expectation.Therefore, we can define the QF of GO as: , Since Q  is not easy to be analyzed, we study it using numerical methods as shown in Figure 3.We vary l from 2 to 1,000.For each l , find the N , denoted max N , that maximizes Figure 3, we can find:

QF of GMI
Let us now consider the generalized mutual information (GMI) proposed by Csiszar [5]: where   f  is strictly convex on   0,  .Clearly, when   log , GMI reduces to the classical MI: Like MI, GMI is invariant under componentwise invertible transformations [21].Therefore: where 1 z and 2 z are as defined in (1), and we have utilized the following facts [9]: It is easy to verify that: where   , p i j is as defined in (3).As we are mainly concerned about the situation when l is fairly large, according to (21), we can define the QF of GMI as: We have developed a recursive algorithm for computing GMI.The algorithm is described in the Appendix, where some related issues are also addressed.

Existence of GMI
is a sufficient condition to ensure the existence of GMI.Sometimes, however, , z z f p u v might not be continuous, so let us investigate the existence of GMI in more depth.As already done in (22), the .Thus, ( 5) and ( 6) can be applied to this QE, which yield: Namely, Taking the limit as l   and applying (21), we get: Equation ( 25) gives the lower and upper bounds of GMI.The lower bound is reached when 1 r and 2 r are independent [5].The upper bound is reached when, for example, 1 2 r r  , which is one of the most   , which reaches the upper bound shown in (26).The MI of continuous variables is the limit of the MI of their quantized versions as the number of quantization levels goes to infinity [9].Therefore, when 1 2 r r r   and thus 1 2 We see that the MI between a continuous variable and itself is infinity.This is because that is infinite along the diagonal u v  and causes the integration in (19) to diverge.Compared with the Shannon MI, there are more stable forms of GMI.For example, when     This shows that the GMI with      (about 0.3679) and upper bounded by 1 and never diverge.It takes the finite value, 1, even for the most dependent case.

Orders of QFs with Respect to l
Table 1 shows the Q  's and Q  's of some convex functions and their orders with respect to l as l   .The function sin u   , which appears in the last line, is not convex on   0,  , so it cannot be used in GMI and Q  does not exist.However, it is convex on   0,1 , so it can be used in QE and , the order of the minus entropy and the MI.For the case of GO, ( 16) already shows that it can have the QF of   . This means that, when l is large enough, these measures will have sharper minima than the minus entropy and the MI.
Table 1.QFs of QE and GMI.
 

Numerical Experiments
In this section, we show some numerical results of applying GO, QE, and GMI to delay reconstruction of strange attractors.Figure 4 shows the time evolution       , x t y t of Rössler chaotic system [20], where t is the sampling number.The sampling time interval is 100  We can see that the order of QF correctly predicts the shape of minima: The higher (lower) the order, the sharper (blunter) the minima.Figures 5b-5g are plots of QE, where we have arranged them so that their orders of QFs decrease from high to low.Observing Figures 5b-5g, we can see the process that the minima of these curves of QE vary from sharp to blunt, and until very vague.In [9], it is shown that the order of QF should be at least   l  to ensure that  keeps a well-defined minimum as l increases.The orders of QFs of Figures 5b-5d are all higher than   l  .The minima in these plots are all very prominent.Figure 5c is minus the entropy of   , p i j (labeled H  in the plot).Adding reaches the theoretical lower bound proposed in [9].However, we can see that the neighboring areas of the minima are too flat and the positions of minima are not easy to locate.This is due to that   e t and   e t   cannot be completely independent.Therefore, generally we should choose convex functions whose orders of QFs are higher than   l  .Figure 5e is indeed the variance between   , p i j and the uniform probability distribution.Figure 5e shows that such form of variance, though easily conceived of, is obviously not a good measure of independence.The shape of the QE plot of   2 f u u  is identical with Figure 5e.According to Table 1, under the precondition 1 a  , the smaller a , the higher the order of QF of a u .Thus the nearer a approaches 1, the better the effect of QE.For instance, Figures 5d and 5e show that the effect of 1.001  u is better than that of 2 u .Tsallis proved that the limit of his entropy is the Shannon entropy when the parameter q in his entropy, which corresponds to the a in   a f u u  here, approaches 1 [6].Figures 5c and 5d are good manifestations of that statement.We can see that the shapes of curves in Figures 5c and 5d look very close.What is more, that statement can be testified by the order of QFs.The order of QF of a u where 1 , is lower than but approaches the order of QF of log u u,   , as a approaches 1.
We have used in Figure 5g the sine function sin u   .This can be traced back to Kapur who used trigonometric functions to create entropies [7].The QF of sin u Therefore, the neighborhoods of the minima in Figure 5g are even more flat than those in Figures 5e  and 5f, and the positions of minima are hardly located.Finally, we can see that the minima in Figures 5a, 5b, and 5h, whose order is   , are sharper, and their positions are more definite than those in Figures 5c and 5i whose order is   . Namely, these measures outperform the minus entropy and the MI in the prominence of minima.Of course, the minima in Figures 5a-5d, 5h, and 5i are all distinct enough, and their positions coincide.The first (leftmost) minimum, 47   , is a good choice of the delay for reconstruction.To verify this point, Figures 6a, 6b, and 6c show the delay portraits using the first (leftmost) minimum, a non-minimum, and the seventh minimum, respectively.Compared with Figure 4, it is clear that Figure 6a reproduces well the folding structure of the original Rössler attractor.The covariance function reflects the correlation between signals and sometimes can also be used for choosing  [23].However, generally speaking, the independence measures GO, QE, and GMI, like MI, are better criteria than the covariance.To illustrate this point, let us see the second example, which is on the Lorenz system [22,24].The sampling time interval is 1000  .The observed sequence   e t is also obtained by adding noise to   e t e t   versus  .Figure 8 shows that the reconstruction portrait with the first minimum of GO reproduces the double wing structure of the Lorenz attractor fairly well.
The results are similar when using the QE or GMI.By contrast, none of the minima of the covariance function in Figure 7b are good choices.For instance, Figure 8c is the delay portrait using the first minimum of covariance function.It looks much more complex than Figures 8a and 8b.Actually, the curve of covariance function misses the best choice of  , which appears earlier than its first minimum.Note that GO and QE are much more efficient to compute than MI and GMI so that they work well with real-time applications such as independent component analysis [8] and blind source separation [9].MI and GMI, however, can give more accurate estimate of the degree of independence.

Conclusions
We have studied some convex functions based measures of independence.Essentially, the independence of variables is equivalent to the uniformity of the joint PDF of the variables obtained by transforming the original variables by their respective CDFs.The MI just uses the convex function log u u to penalize the nonuniformity of this joint PDF, and thus measures the independence.Similarly, the QE uses an arbitrary convex function to penalize the nonuniformity of the dicretized version of this PDF.Since any convex functions can appear in it, the QE is a broad enough definition that actually includes quite a few existing definitions of entropies.In this paper, we have focused on how different selections of convex functions affect the performance of the corresponding independence measure.When the QE is plotted versus the variance from the uniform distribution, the QF of QE is twice the slope of QE at its minimum over the dynamic range of QE and, therefore, indicates the sharpness of QE around its minimum.The expectation of GO is a subclass of QE, and the Csiszar's GMI is the limit of a special form of QE as the number of quantization levels approaches infinity.Based on these two facts, the definition of QF can be generalized to be made also applicable for GO and GMI.The order of QF with respect to the number of quantization levels well predicts the sharpness of the minima of these measures of independence.Furthermore, it indicates that GO, QE, and GMI can have more prominent minima than the minus entropy and the MI.Besides theoretical study, we have also applied the convex functions based measures of independence to chaotic signal processing.Initiated by Fraser and Swinney, the MI has become the most widely used criterion for choosing delay in strange attractor reconstruction.However, there is no evidence to show that it is the optimum criterion.Indeed, GO, QE, and GMI are good alternative and sometimes even better criteria for choosing delay.We have proposed a recursive algorithm for computing GMI that is a generalization of Fraser and Swinney's algorithm for computing MI.In addition, we have conducted numerical experiments on chaotic systems that well exemplify the theory of QF that we have developed.
Finally we note that whether a measure of information is good or not is application dependent.In some applications, a sharp minimum might not be crucial.We shall describe such application in a forthcoming paper.Future work also includes applying the convex function based measures to real world problems such as EEG analysis.To conclude, the results presented in this paper is instructive in understanding the behavior of convex functions based information measures and helping selecting the most suitable measures for various applications.
where   , p i j is as defined in (3).Thus, when l is fairly large,   Alternatively, a more delicate approach can be taken.Namely, different numbers of quantization levels are used according to the bumpiness of As shown in Figure 9, each square in the 2 2 m m  grid consists of four squares in the , z z is uniformly distributed over S , then: 4 Hence there is no need to further divide S into 1 S , 2 S , 3 S , and 4 S .Thus, the following recursive algorithm for computing     uniform grid over and let: where, if It is easy to verify that, when   log , the above algorithm reduces to Fraser and Swinney's recursive algorithm for computing MI [17].

B. Uniformity Test
In the above recursive algorithm for computing GMI, we need to judge whether . This can be done using the 2  test proposed in [17].However, the 20% confidence level in [17] was chosen arbitrarily.Experiments show that the uniformity test with the 20% confidence level may be too stringent and cause the GMI estimate to be too large.To solve this problem, we here propose a simple method for choosing the confidence level.We mix two independent variables, 1 s and 2 s , with identical distributions by a 2 2  rotation matrix with rotation angle  to get two variables, 1 r and 2 r , i.e., Laplacian variables, we can spot that the 5% confidence level produces the minimum estimation error in both cases.Comparing the estimation results using 20% and 5% confidence levels with the standard curves, we can see that the results with the 20% confidence level are obviously too large, whereas those with the 5% confidence level are quite accurate, as shown in Figures 10 and 11.Therefore, 5% is a better choice for the confidence level.

D. Output of the Recursive Algorithm of GMI
If 1 r and 2 r are independent,   , z z will be uniform in     0,1 0,1  .The recursive algorithm for computing GMI will terminate at the 0 m  hierarchy and produce the minimal output   . The maximal output is generated for the most dependent case, e.g., 1 2 r r  .Assume that we use 2 k N  sample points of two same variables 1 2 r r  to compute GMI.The corresponding sample points of , z z will line up along the diagonal.The 2  uniformity tests, no matter using 20, 10, or 5% confidence levels, all lead to the same final partition that divides the sample points of   , z z into every four points in a square.For example, Figure 12 shows the case of 16 N  sample points.In this case, the computation of GMI proceeds as follows: Thus, we obtain the lower and upper bounds of GMI computed using the recursive algorithm as follows.
        We can check whether overflow might occur with (45) when a convex function such as   u f u a  ( 1 a  ) is used.

,
QE becomes the Tsallis entropy[6] of   , p i j .

Figures 2a and 2b plot 500 observed points of two independent variables   1 2 ,
r r and two dependent variables   are uniformly distributed whereas those in Figure 2d are not.Partition the region     0,1 0,1  in Figures 2c and 2d into a grid of l l  , say, 10 10  , same-sized squares.A square is said to be occupied if there is at least one point in it.Then, being uniform is the most efficient way to occupy maximum number of squares, and this is confirmed in Figures 2e and 2f where the situations of Figures 2c and 2d are shown, respectively.(The occupied squares are shaded.)The grid occupancy (GO) defined in (13) is exactly minus the ratio of occupied squares.As Figures 2e and 2f clearly show,

Figure 3 .
Figure 3. Numerical study of QF of GO.Left:

z and 2 z
) gives the lower and upper bounds of Shannon MI of l -level uniformly quantized versions of 1


. The observed sequence   e t is obtained by adding a white noise uniformly distributed in  0.1, 0.1  to   x t .We use       , e t e t  to reconstruct the original chaotic attractor, where  is the delay to be determined.We examine the curves of versus  . can be chosen at the minima of these curves.So we hope these curves provide apparent minima.
Figure 5i.The QFs of Figures 5e and 5f are both   l  .Hence the shapes of their minima are basically

Figure 5 .
Figure 5. GO, QE, and GMI of delay-coordinate variables of Rössler chaotic system versus delay  .All plots are calculated using 65,536 sample points.100 l  is used for GO and QE.Delay is measured in sampling numbers.The convex functions   f u 's used in

Figure 6 .
Figure 6.Delay reconstruction portraits of the Rössler attractor using time delays corresponding to (a) the first minimum, (b) a non-minimum, and (c) the seventh minimum, respectively, of the curves in Figures 5a-5d, 5h, and 5i.
x t .Figures7a and 7bplot the curves of

Figure 7 .
Figure 7. GO and auto-covariance function of reconstructed variables of the Lorenz attractor versus  .50000 N  sample points are used in both.200 l  in GO.

Figure 8 .
Figure 8.Comparison of reconstruction effects of the Lorenz attractor.The reconstruction delays of (b) and (c) are the first minima of GO and auto-covariance function in Figure 7, respectively.

p
in different local regions.Larger l should be taken in more fluctuant regions to avoid the estimated  from being too small, whereas smaller l should be taken in rather flat regions to avoid the estimated  from being too large due to limited sample size.Specifically, let 2

Figure 9 .
Figure 9.Each square in the 2 2 m m grid is composed of four squares in the

r and 2 r 2 s and 2 s
is then a function of  , i.e., distribution table, we choose five confidence levels, 20, 10, 5, 2, and 1%, to estimate   I  .If 1 have some standard distributions, then standard values of   I  are easy to obtain.Examining in turn the two cases where 1 s and 2 s are both uniform variables and both

Figure 12
Figure 12.Partition of 16 N  sample points of   1 2, z z of two same variables.