Article Nonparametric Estimation of Information-Based Measures of Statistical Dispersion

We address the problem of non-parametric estimation of the recently proposed measures of statistical dispersion of positive continuous random variables. The measures are based on the concepts of differential entropy and Fisher information and describe the “spread” or “variability” of the random variable from a different point of view than the ubiquitously used concept of standard deviation. The maximum penalized likelihood estimation of the probability density function proposed by Good and Gaskins is applied and a complete methodology of how to estimate the dispersion measures with a single algorithm is presented. We illustrate the approach on three standard statistical models describing neuronal activity.


Introduction
Frequently, the dispersion (variability) of measured data needs to be described.Although standard deviation is used ubiquitously for quantification of variability, such approach has limitations.The dispersion of the probability distribution can be understood in different points of view: as "spread" with respect to the expected value, "evenness" ("randomness") or "smoothness".For example highly variable data might not be random at all if it consists only of "extremely small" and "extremely large" measurements.Although the probability density function or its estimate provides a complete view, quantitative methods are needed in order to compare different models or experimental results.
In a series of recent studies [1,2] we proposed and justified alternative measures of dispersion.The effort was inspired by various information-based measures of signal regularity or randomness and their interpretations that have gained significant popularity in various branches of science [3][4][5][6][7][8][9].For convenience, in what follows we discuss only the relative dispersion coefficients (i.e., the data or the probability density function is first normalized to unit mean).Besides the coefficient of variation, c v , which is the relative dispersion measure based on standard deviation, we employ the entropy-based dispersion coefficient c h , and the Fisher information-based coefficient c J .The difference between these coefficients lies in the fact that the Fisher information-based coefficient, c J , describes how "smooth" is the distribution and it is sensitive to the modes of the probability density, while the entropy-based coefficient, c h , describes how "even" it is, hence being sensitive to the overall spread of the probability density over the entire support.Since multimodal densities can be more evenly spread than unimodal ones, the behavior of c h cannot be generally deduced from c J (and vice versa).
If a complete description of the data is available, i.e., the probability density function is known, the values of the above mentioned dispersion coefficients can be calculated analytically or numerically.However, the estimation of these coefficients from data is more problematic, and so far we employed either the parametric approach [2] or non-parametric estimation of c h based on the popular Vasicek's estimator of differential entropy [10,11].The goal of this paper is to provide a self-contained method of non-parametric estimation.We describe a method that can be used to estimate both c h and c J as a result of a single procedure.

Dispersion Coefficients
We briefly review the proposed dispersion measures, for more details see [2].Let T be a continuous positive random variable with probability density function f (t) defined on [0, ∞).By far, the most common measure of dispersion of T is the standard deviation, σ, defined as the square root of the second central moment of the distribution.The corresponding relative dispersion measure is known as the coefficient of variation, c v , where E(T ) is the mean value of T .The entropy-based dispersion coefficient is defined as where h(T ) is the differential entropy [12], The numerator in (2), σ h = exp[h(T ) − 1], is the entropy-based dispersion, "analogous" to the notion of standard deviation σ.The interpretation of σ h relies on the asymptotic equipartition property theorem [12].Informally, the theorem states that almost any sequence of n realizations of the random variable T comes from a rather small subset (the typical set) in the n-dimensional space of all possible values.The volume of this subset is approximately σ n h = exp[nh(T )], and the volume is bigger for those random variables, which generate more diverse (or unpredictable) realizations.The values of σ h and c h quantify how "evenly" is the probability distributed over the entire support.From this point of view, σ h is more appropriate than σ to describe the randomness of outcomes generated by the random variable T .
The Fisher information-based dispersion coefficient, c J , is defined as where The Fisher information is traditionally interpreted by means of the Cramer-Rao bound, i.e., the value of 1/J(T ) is the lower bound on the error of any unbiased estimator of a location parameter of the distribution.Due to the derivative in (5), certain regularity conditions are required on f (t) in order for J(T ) to be interpreted according to the Cramer-Rao bound, namely continuous differentiability for all t > 0 and f (0) = f (0) = 0, [13].However, the integral in (5) exists and is finite also for, e.g., an exponential distribution.Any locally steep slope or the presence of modes in the shape of f (t) increases J(T ), [4].

Methods of Non-Parametric Estimation
In the following we assume that {t 1 , t 2 , . . ., t N } are N independent realizations of the random variable The c v is most often estimated by the ratio of sample standard deviation to sample mean, however, the estimate may be considerably biased, [14].
Estimation of c h relies on the estimate ĥ of the differential entropy h(T ), as follows from (2).The problem of differential entropy estimation is well exploited in literature [11,15,16].It is preferable to avoid estimations based on data binning (histograms), because discretization affects the results.The most popular approaches are represented by the class of Kozachenko-Leonenko estimators [17,18] and the Vasicek's estimator [10].Our experience shows that the simple Vasicek's estimator gives good results on a wide range of data [19][20][21], thus in this paper we employ it for the sake of comparison with the estimation method described further below.Given the ranked observations t where The integer parameter m < N/2 is set prior to the calculation, roughly one may set m to be the integer part of √ N .The bias-correcting factor is where Ψ(z) = d dz ln Γ(z) denotes the digamma function, [22].The estimation of c J requires estimation of J(T ), and it is more problematic and as far as we know no "standard" algorithms have been proposed.
Huber [23] showed that there exists a unique interpolation of the empirical cumulative distribution function such that the resulting Fisher information is maximized.In theory, one may use this method to estimate J(T ).Unfortunately, it is very complicated and probably not feasible even for moderate values of N .
Kernel density estimators are widely used for estimation of the density.They can be successfully used for the calculation of the differential entropy too [24,25].But, according to our experience, they are unsuitable for estimation of the Fisher information, mainly due to the inability to control the "smoothness" of the ratio f /f , which is the principal term in the integral (5).Inappropriate choice of the kernel and the bandwidth leads to many local extremes of the kernel density estimator which substantially increases the Fisher information.Theoretically, the kernel should be "optimal" in the sense of minimizing the mean square error of the ratio f /f , which is difficult to derive in a closed form, generally.One may also derive Vasicek-like estimator of J(T ), based on the empirical cumulative distribution function and considering differences of higher order, however, we discovered that this approach is numerically unstable.
We found that the maximum penalized likelihood (MPL) method of Good and Gaskins [26] for probability density estimation offers a possible solution.The idea is to represent the whole probability density function using a suitable orthogonal base of functions (Hermite functions), and the "best" estimate of f (t) is obtained by solving a set of equations for the base coefficients given the sample {t 1 , . . ., t N }.
In order to proceed, we first log-transform and normalize the sample, where The idea here is to obtain a new random variable X, X = (ln T − µ)/ √ 2σ 2 , with a more Gaussian-like density shape unlike the shape of the distribution of T , which is usually highly skewed.Furthermore, the limitation of the support of T to the real half-line may cause numerical difficulties.The probability density function of X is expressed by the real probability amplitude c(x), so that X ∼ c 2 (x).For the purpose of numerical implementation, c(x) is represented by the first r Hermite functions, [22], as where and H m (x) denotes the Hermite polynomial ).The goal is to find such c = c(x), for which the score, ω, is maximized, subject to constraint (ensuring that c 2 (x) is a probability density function) The term l(c; x 1 , . . ., x N ) in ( 12) is the log-likelihood function, and Φ is the roughness penalty proposed by [26], controlling the smoothness of the density of X (and hence the smoothness of f (t)), The two nonnegative parameters α and β ( α + β > 0 ) should be set prior to the calculation.Note that the first term in ( 15) is equal to α-times the Fisher information J(X).
The system of equations for c i 's which maximize (12) can be written as [26] for k = 0, 1, . . ., r − 1.The system can be solved iteratively as follows.Initially c 0 = 1, c m = 0 for m = 1, 2, . . .and λ = N , which gives the approximation of X by a Gaussian random variable.Then, in each iteration step, current values of c 0 , . . ., c r−1 are substituted into the right hand side of ( 16), and the system is solved as a linear system with unknown variables c 0 , . . ., c r−1 appearing on the left hand side of (16).Once the linear system is solved, the coefficients are normalized to satisfy (13).The corresponding Lagrange multiplier is calculated as The r + 1 variables (r coefficients and the Lagrange multiplier λ) are computed iteratively.In accordance with [26], the algorithm is stopped when λ does not change within desired precision in subsequent iterations.
After the score-maximizing c i 's are obtained, and thus the density of X is estimated, the estimates of dispersion coefficients in (2) and ( 4) are calculated from the following entropy and Fisher information estimators.The change of variables gives where h | dx is the estimator of the differential entropy of X, µ and σ are given by ( 9) and 2 dx is the estimated expected value of X. (Numerically, E(X) is usually small and can be neglected.Its value is influenced by the number, r, of considered Hermite functions.) The Fisher information estimator then follows, by an analogous change of variables, as The calculation requires the first derivative of c(x) [22],

Results
Neurons communicate via the process of synaptic transmission, which is triggered by an electrical discharge called the action potential or spike.Since the time intervals between individual spikes are relatively large when compared to the spike duration, and since for any particular neuron the "shape" or character of a spike remains constant, the spikes are usually treated as point events in time.Spike train consists of times of spike occurrences τ 0 , τ 1 , . . ., τ n , equivalently described by a set of n interspike intervals (ISIs) t i = τ i − τ i−1 , i = 1 . . .n, and these ISIs are treated as independent realizations of the random variable T ∼ f (t).The probabilistic description of the spiking results from the fact that the positions of spikes cannot be predicted deterministically due to presence of intrinsic noise, only the probability that a spike occurs can be given [27][28][29].In real neuronal data, however, the non-renewal property of the spike trains is often observed [30,31].Taking the serial correlation of the ISIs as well as any other statistical dependence into account would result in the decrease of the entropy and hence of the value of c h , see [12].
We compare exact and estimated values of the dispersion coefficients on three widely used statistical models of ISIs: gamma, inverse Gaussian and lognormal.Since only the relative coefficients are discussed, c v , c h , c J , we parameterize each distributions by its c v while keeping E(T ) = 1.
The gamma distribution is one of the most frequent statistical descriptors of ISIs used in analysis of experimental data [32].Probability density function of gamma distribution can be written as where Γ(z) = ∞ 0 t z−1 exp(−t)dt is the gamma function, [22].The differential entropy is equal to, [2], where Ψ(z) = d dz ln Γ(z) denotes the digamma function, [22].The Fisher information about the location parameter is The Fisher information diverges for c v ≥ 1/ √ 2, with the exception of c v = 1 (corresponds to exponential distribution) where J Γ = 1, but the Cramer-Rao based interpretation of J Γ does not hold in this case since f Γ (0) = 0, see [13] for details.
The inverse Gaussian distribution is often used for description of ISIs and fitted to experimental data.It arises as result of spiking activity of a stochastic variant of the perfect integrate-and-fire neuronal model [33].The density of this distribution is The differential entropy of this distribution is equal to where K (1,0) (ν, z) denotes the first derivative of the modified Bessel function of the second kind [22], The Fisher information of the inverse Gaussian distribution results in The lognormal distribution is rarely presented as model distribution of ISIs.However, it represents a common descriptor in analysis of experimental data [33], with density The differential entropy of this distribution is equal to and the Fisher information is given by The theoretical values of the coefficients of c h and c J as functions of c v are shown in Figure 1 for all the three distributions mentioned above.We see that the functions form hill-shaped curves with local maxima achieved for different values of c v .Asymptotically, c h as well as c J tend to zero for c v → 0 or c v → ∞ for the three models.
Figure 1.Variability represented by the coefficient of variation c v , the entropy-based coefficient c h (dashed curves, right-hand-side axis) and the Fisher information-based coefficient c J (solid curves, left-hand-side axis) of three probability distributions: gamma (green curves), inverse Gaussian (blue curves) and lognormal (red curves).The entropy-based coefficient, c h , expresses the evenness of the distribution.In dependency on c v , it shows a maximum at c v = 1 for gamma distribution (corresponds to exponential distribution) and between c v = 1 and c v = 1.5 for inverse Gaussian and lognormal distributions.For all the distributions holds c h → 0 as c v → 0 or c v → ∞.The Fisher information-based coefficient, c J , grows as the distributions become "smoother".The overall dependence on c v shows a maximum around c v = 0.5.Similarly to c h (c v ) dependencies, c h → 0 as c v → 0 or c v → ∞ (does not hold for gamma distribution, where c J can be calculated only for c v < 1/ √ 2).To explore the accuracy of the estimators of the coefficients c h and c J when the MPL estimations of the densities are employed, we did three separate simulation studies.All the simulations and calculations were performed in the free software package R [34].For each model with the probability distribution ( 21), ( 24) and ( 27), respectively, the coefficient of variation, c v , varied from 0.1 to 3.0 in steps of 0.1.One thousand samples, each consisting of 1000 random numbers (a common number of events in experimental records of neuronal firing), were taken for each value of c v from the three distributions.
The MPL method was employed on each generated sample to estimate the density.We chose the number of the base functions equal to r = 20.The larger bases, r = 50 or r = 100, were examined too, with negligible differences in the estimation for the selected models.The values of the parameters for (15) were chosen as α = 3 and β = 4, in accordance with the suggestion ( [26], Appendix A).
The outcome of the simulation study for the gamma density ( 21) is presented in Figure 2, where the theoretical and estimated values of c h and c J for given c v are plotted.In addition, the values of coefficient c h calculated by Vasicek's estimator are shown.We see that the MPL method results in precise estimation of c h for low values of c v and in slightly overestimated values of c h for c v > 0.7.The Vasicek's estimator gives slightly overestimated values too.Overall, the performances of Vasicek's and MPL estimators are comparable.The maximum of the MPL estimator of c h is achieved at the theoretical value, c v = 1.We can see that the standard deviation of the c h estimate is higher as c v grows from zero.As c v > 1, the standard deviation begins to decrease slowly.
We conclude that the MPL estimator of c J is accurate for low c v .For c v > 0.5 it results in underestimated c J and it tends to zero as c v → 1/ √ 2 as well as the theoretical values.The high bias of c J for high c v is caused by inappropriate choice of the parameters α and β, which were kept fixed in accordance with the suggestion of [26].Nevertheless, the main shape of the dependency on c v remains and the MPL estimates of c J achieves local maximum at c v .= 0.55, which is slightly lower than the theoretical value.The entropy-based variability coefficient c h (panel a) and the Fisher information-based variability coefficient c J (panel b), calculated nonparametrically from ( 18) and ( 19), respectively, for gamma distribution (24).The mean values (indicated by red discs) accompanied by the standard error (red error bars) are plotted in dependency on the coefficient of variation, c v .The dashed lines are the theoretical curves.In panel a, the results obtained by estimation (6)  Figure 3 shows the corresponding results for the inverse Gaussian density.The Vasicek's and MPL estimates of c h are almost precise.Both the estimator of c h based on MPL method and that based on the Vasicek's entropy give the same mean values together with the same standard deviations.The MPL estimator of c J gives accurate and precise results for c v < 0.5.For higher c v , the estimated value is lower than the true one.The maximum of estimated c J is achieved at the same point, c v .= 0.5.The asymptotical decrease of the estimated c J to zero is faster than the true dependency is.By our experience, this can be improved by setting higher α and β in order to give higher impact to the roughness penalty (15).The results of the simulation study on the lognormal distribution with density ( 27) is plotted in Figure 4, together with the theoretical dependencies and the results for the Vasicek's estimator of the entropy.Both the MPL estimators of c h and c J have qualitatively same accuracy and precision features as the analogous estimators for the inverse Gaussian model.

Discussion and Conclusions
In proposing the dispersion measures based on entropy and Fisher information we were motivated by the difference between frequently mixed up notions of ISI variability and randomness, which, however, represent two different concepts [1].The proposed measures have been so far successfully applied mainly to examine differences between various neuronal activity regimes, obtained either by simulation of neuronal models or from experimental measurements [32].There, the comparison of neuronal spiking activity under different conditions plays a key role in resolving the question of neuronal coding.However, the methodology is not specific to the research of neuronal coding; it is generally applicable whenever one needs to quantify some additional properties of positive continuous random data.
In this paper, we used the MPL method of Good and Gaskins [26] to estimate the dispersion coefficients nonparametrically from data.We found that the method performs comparably with the classical Vasicek's estimator [10] in the case of entropy-based dispersion.
The estimation of Fisher information-based dispersion is more complicated, but we found that the MPL method gives reasonable results.In fact, so far the MPL method is the best option for c J estimation among the possibilities we tested (modified kernel methods, spline interpolation and approximation methods).The key parameters of the MPL method which affect the estimated value of J(T ) (and consequently of c J ) are the values of α and β in (15).In this paper we employed the suggestion of [26], however, we found that different setting may sometimes lead to dramatic improvement in the estimation of J(T ).We tested the performance of the estimation for sample sizes less than 1,000 and we found out that significant and systematic improvement resulting in low bias can be reached if α and β are allowed to depend somehow on the sample size.In this sense, the parameters play a similar role to the parameter m in the Vasicek's estimator (6).We are currently working on a systematic approach to determine α, β optimally, but the fine tuning of α and β is a difficult numerical task.Nevertheless, even without the fine-tuning, the performance of the entropy estimation is essentially the same as in the case of Vasicek's estimator.
The length of the neuronal record and hence the sample size is another issue related to the choice of these parameters.As emphasized, e.g., by [35,36], particularly short record can considerably modify the empirical distribution.This can be adjusted by the parameter values, choosing whether the distribution should fit the data or it should be rather robust.

Figure 2 .
Figure 2.The entropy-based variability coefficient c h (panel a) and the Fisher information-based variability coefficient c J (panel b), calculated nonparametrically from (18) and (19), respectively, for gamma distribution(24).The mean values (indicated by red discs) accompanied by the standard error (red error bars) are plotted in dependency on the coefficient of variation, c v .The dashed lines are the theoretical curves.In panel a, the results obtained by estimation (6) are added (blue triangles indicate mean values and blue error bars stand for standard error).The results are based on 1000 trials of samples of size 1000 for each value of c v .
Figure 2.The entropy-based variability coefficient c h (panel a) and the Fisher information-based variability coefficient c J (panel b), calculated nonparametrically from (18) and (19), respectively, for gamma distribution(24).The mean values (indicated by red discs) accompanied by the standard error (red error bars) are plotted in dependency on the coefficient of variation, c v .The dashed lines are the theoretical curves.In panel a, the results obtained by estimation (6) are added (blue triangles indicate mean values and blue error bars stand for standard error).The results are based on 1000 trials of samples of size 1000 for each value of c v .

Figure 3 .
Figure 3. Estimations of the variability coefficients c h (panel a) and c J (panel b) for inverse Gaussian distribution(24).The notation and the layout is the same as in Figure2.

Figure 4 .
Figure 4. Estimations of the variability coefficients c h (panel a) and c J (panel b) for lognormal distribution(27).The notation and the layout is the same as in Figure2.