Entropy and its discontents: A note on definitions

The usual definitions of entropy and differential entropy show inconsistencies that make them mutually incoherent. We propose a few possible modifications of these quantities so that 1) they no longer show incongruities, and 2) they go one into the other in a suitable limit as the result of a renormalization. The properties of the new quantities would differ slightly from those of the usual entropies in a few other respects.


Introduction
As it is usually defined, the Shannon entropy of a discrete law $p_k = P\{x_k\}$ associated to the values $x_k$ of some random variable is

$$H = -\sum_k p_k \ln p_k \tag{1}$$

and is apparently a non negative, dimensionless quantity. As a matter of fact, however, it does not depend on all the details of the distribution: only the $p_k$ are relevant, while the $x_k$ play no role at all. This means that if we modify our distribution just by moving the $x_k$, the entropy is left unchanged: this entails, among other things, that $H$ does not always change along with the variance (or other typical parameters) of the distribution, which instead depends on the values $x_k$. In particular $H$ is invariant under every linear transformation $a x_k + b$ (centering and rescaling) of the random quantities: in this sense every type of laws [1] is isoentropic. Surprisingly enough, despite the simplicity of definition (1), and beyond a few elementary examples, explicit formulas displaying the dependence of the entropy $H$ on the parameters of the most common discrete distributions are not known. If for instance we take the binomial distributions $B_{n,p}$ with

$$x_k = k = 0, 1, \ldots, n \qquad p_k = \binom{n}{k} p^k (1-p)^{n-k} \tag{2}$$

then, although it is always possible to calculate the entropy $H$ for every particular example, no general formula giving its explicit dependence on $n$ and $p$ is available, and only its asymptotic behavior for large $n$ is known in the literature [2,3]:

$$H[B_{n,p}] = \frac{1}{2}\ln\left[2\pi e\, np(1-p)\right] + O\!\left(\frac{1}{n}\right) \tag{3}$$

Remark moreover that, while this formula explicitly contains $np(1-p)$, namely the variance of $B_{n,p}$, it is easy to recognize that, as long as we leave the probabilities $p_k$ untouched, the entropy $H[B_{n,p}]$ remains the same when we change the variance by moving the points $x_k$ away from their usual locations $x_k = k$. In particular this is true for the standardized (centered, unit variance) binomial $B^*_{n,p}$ with

$$x_k = \frac{k - np}{\sqrt{np(1-p)}} \qquad k = 0, 1, \ldots, n \tag{4}$$

and the same $p_k$ of (2), which entails $H[B_{n,p}] = H[B^*_{n,p}]$.
All this hints at the fact that what seems to be relevant to the entropy is not the variance itself, but some other feature, possibly related to the shape of the distribution. In a similar vein, for the Poisson distributions $P_\lambda$ with

$$x_k = k = 0, 1, 2, \ldots \qquad p_k = e^{-\lambda}\,\frac{\lambda^k}{k!} \tag{5}$$

the entropy is

$$H[P_\lambda] = \lambda(1 - \ln\lambda) + e^{-\lambda}\sum_{k=0}^{\infty}\frac{\lambda^k \ln k!}{k!} \tag{6}$$

with an asymptotic expression for large $\lambda$

$$H[P_\lambda] = \frac{1}{2}\ln(2\pi e\lambda) + O\!\left(\frac{1}{\lambda}\right) \tag{7}$$

which explicitly contains the parameter $\lambda$ (also playing the role of the variance), but which is also completely independent of the values of the $x_k$. As a consequence a standardized Poisson distribution $P^*_\lambda$, with

$$x_k = \frac{k - \lambda}{\sqrt{\lambda}} \qquad k = 0, 1, 2, \ldots \tag{8}$$

and the same probabilities $p_k$, has the same entropy as $P_\lambda$, namely $H[P_\lambda] = H[P^*_\lambda]$.

When on the other hand we consider continuous laws with a pdf $f(x)$, the definition (1) no longer applies, and we are led to introduce another quantity commonly known as differential entropy (we acknowledge that this name could be misleading for an integral, but we will retain it in the following to abide by a long established habit):

$$h = -\int f(x)\ln f(x)\,dx \tag{9}$$

which in several respects differs from the entropy (1) of the discrete distributions. First of all, explicit formulas of the entropy (9) are known for most of the usual laws: for example (see also Appendix A) the distributions $U(a)$ uniform on $[0, a]$ with $a > 0$ have entropy

$$h[U(a)] = \ln a \tag{10}$$

while for the centered Gaussian laws $N(a)$ with variance $a^2$ we have

$$h[N(a)] = \ln\left(a\sqrt{2\pi e}\right) \tag{11}$$

An exhaustive list of similar formulas for other families of laws is widely available in the literature, but even from these two examples alone it is apparent that:

1. unlike the discrete case, the differential entropies explicitly depend on a scaling parameter $a$, and hence either on the variance, or on some other dispersion index such as the interquantile ranges (IQnR): this means in particular that the types of continuous laws are no longer isoentropic;
2. the differential entropies can take negative values when the parameters of the laws are chosen in such a way that the logarithm arguments fall below 1;
3. the logarithm arguments are not in general dimensionless quantities, in apparent violation of the homogeneity rule that the scalar arguments of transcendental functions (as logarithms are) must be dimensionless; this entails in particular that the entropy depends on the units of measurement.

These three remarks make it abundantly clear that something is inscribed in the definition (9) which is not present in the definition (1), and vice versa. Finally, the two definitions seem not to be mutually consistent in the sense that, when for instance a continuous law is weakly approximated by a sequence of discrete laws, we would like to see the entropies of the discrete distributions converge toward the entropy of the continuous one. That this is not the case is apparent from a few counterexamples. It is well known for instance that, for every $0 < p < 1$, the sequence of the standardized binomial laws $B^*_{n,p}$ weakly converges to the Gaussian $N(1)$ when $n \to \infty$; however, since the binomial probabilities $p_k$ are unaffected by a standardization, the entropies $H[B^*_{n,p}]$ still obey formula (3), and hence their sequence diverges as $\ln\sqrt{n}$ instead of converging to the differential entropy of $N(1)$, which from (11) is $\ln\sqrt{2\pi e}$. In the same vein the cdf $F(x)$ of a uniform law $U(a)$ can be approximated by the sequence $F_n(x)$ of the discrete uniform laws $U_n(a)$ concentrated with equal probabilities $p_1 = \ldots = p_n = 1/n$ on the $n$ equidistant points $x_1, \ldots, x_n$, where $x_k = k\Delta$ for $k = 1, 2, \ldots, n$, and $x_k - x_{k-1} = \Delta = a/n$ with $x_0 = 0$. However it is easy to see that

$$H[U_n(a)] = \ln n \tag{12}$$

so that their sequence again diverges as $\ln n$, while the differential entropy $h[U(a)]$ of the uniform law has the finite value (10).
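The divergence of the binomial entropies can be checked numerically. The following Python sketch is only an illustration: the function names (`binomial_pmf`, `shannon_entropy`, `leading_term`) are our own choices, and the comparison uses the standard leading term $\ln\sqrt{2\pi e\,np(1-p)}$ of the large-$n$ expansion. Note that `shannon_entropy` receives only the probabilities $p_k$, so that any relocation of the points $x_k$ (standardization included) cannot change its value.

```python
from math import e, exp, lgamma, log, pi, sqrt


def binomial_pmf(n, p):
    """Probabilities p_k of B(n, p), computed through logarithms to avoid overflow."""
    log_p, log_q = log(p), log(1.0 - p)
    return [
        exp(lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log_p + (n - k) * log_q)
        for k in range(n + 1)
    ]


def shannon_entropy(probs):
    """H = -sum_k p_k ln p_k: only the probabilities enter, never the x_k."""
    return -sum(q * log(q) for q in probs if q > 0.0)


def leading_term(n, p):
    """Leading asymptotic term ln sqrt(2 pi e n p (1 - p)) for large n."""
    return log(sqrt(2.0 * pi * e * n * p * (1.0 - p)))


if __name__ == "__main__":
    p = 0.3
    for n in (10, 100, 1000, 10000):
        H = shannon_entropy(binomial_pmf(n, p))
        # H keeps growing like ln sqrt(n), while the Gaussian limit of the
        # standardized law has the finite differential entropy ln sqrt(2 pi e).
        print(n, round(H, 4), round(leading_term(n, p), 4))
```

The two printed columns should approach each other as $n$ grows, while both keep increasing without bound.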
As a consequence of these remarks, in the following sections we will propose a few elementary ways to change the two definitions (1) and (9) in order to rid them of the said inconsistencies, and to make them mutually coherent without losing too much of the essential properties of the usual quantities. These new definitions, moreover, operate an effective renormalization of the said divergences so that now, when a continuous law is weakly approximated by a sequence of discrete laws, the entropies of the discrete distributions also converge toward the entropy of the continuous one. A few additional points with examples and explicit calculations are finally collected in the appendices. It must be clearly stated at this point, however, that we do not claim here that the Shannon entropy is somehow ill-defined in itself: we rather point out a few mutual inconsistencies of the different manifestations of this time-honored concept, and we try to attune them in such a way that every probability distribution (either discrete or continuous) would now be treated on the same footing.

Entropy for continuous laws
Let us begin with some remarks about the differential entropy for continuous laws with a pdf $f(x)$: the simplest way to achieve the essentials of our aims would be to adopt some new definition of the type

$$-\int f(x)\ln\left[\kappa f(x)\right]dx \tag{13}$$

where $\kappa$ is any parameter of the law $f(x)$ with the same dimensions as $x$, and with a finite and strictly positive value for every non degenerate law. To this end the first idea which comes to the fore consists in taking the standard deviation $\sigma$ to play the role of $\kappa$ in (13), but it is also apparent that this choice would restrict our definition to the continuous laws with finite second moment, leaving out many important cases. A strong alternative candidate for the role of $\kappa$ could instead be some interquantile range (IQnR), which can represent a measure of the dispersion even when the variance does not exist. In the following we will analyze a few possible choices for the parameter $\kappa$ along with their principal consequences.

Interquantile range
The calculation of the IQnR goes through the use of the quantile function $Q(p)$, namely the inverse of the cumulative distribution function (cdf). In order to take into account possible jumps and flat spots of a given cdf $F(x)$, the quantile function is usually defined as

$$Q(p) = \inf\{x \in \mathbb{R} : p \le F(x)\} \qquad 0 \le p \le 1 \tag{14}$$

In the case of continuous laws (no jumps), however, this can be reduced to

$$Q(p) = \inf\{x \in \mathbb{R} : p = F(x)\} \tag{15}$$

and when $F(x)$ is also strictly increasing (no flat spots) we finally have

$$Q(p) = F^{-1}(p) \tag{16}$$

It is apparent then that $Q(p)$ jumps wherever $F(x)$ has flat spots, while it has flat spots wherever $F(x)$ jumps. The IQnR function is then defined as

$$\varrho(p) = Q(1-p) - Q(p) \qquad 0 < p < \tfrac{1}{2} \tag{17}$$

and the classical interquartile range (IQrR) is just the particular value

$$\varrho\left(\tfrac{1}{4}\right) = Q\left(\tfrac{3}{4}\right) - Q\left(\tfrac{1}{4}\right) \tag{18}$$

The IQnR $\varrho(p)$ is a non increasing function of $p$, and for continuous laws (since $Q(p)$ has no flat spots) it is always well defined and never vanishes, so that one of its values can be safely used to play the role of $\kappa$ in (13). Of course, when a law also has a finite second moment, the IQnR and the standard deviation $\sigma$ are both well defined, and their ratio $\gamma = \varrho/\sigma$ often has the same value for entire families of laws. We now propose to adopt a new form for the entropy of continuous laws which, by making use of some particular value of the IQnR (that we will denote $\tilde\varrho$) instead of the variance, will encompass even the case of the laws without a finite second moment:

$$\tilde h = -\int f(x)\ln\left[\tilde\varrho\, f(x)\right]dx \tag{19}$$

In particular for the continuous laws we can simply take $\tilde\varrho = \varrho(1/4)$, the IQrR. Despite the minimality of this change of definition, however, the new entropy $\tilde h$ has properties slightly different from those of $h$. The examples of Appendix A show that this new entropy $\tilde h$ has neither a minimum nor a maximum value because, according to the particular continuous law considered, it takes every possible real value, both positive and negative.
In this respect we must instead recall the well known property of the Gaussian laws $N(a)$, which have the maximum differential entropy $h$ among all the continuous laws with the same variance $\sigma^2$. It is apparent then that within our new definition (19) this special position of the Gaussian laws is simply lost.
The adoption of (19) however brings several benefits that will also be made apparent in the examples of Appendix A. First of all, the argument of the logarithm is now by definition a dimensionless quantity, so that the value of $\tilde h$ becomes invariant under a change of measurement units. Second, the new entropy $\tilde h$ no longer depends on the value of scaling parameters linked to the variance: its values are determined by the form of the distribution, rather than by its actual numerical dispersion, and are the same for entire families of laws. When in fact the variables are subject to some linear transformation $y = ax + b$ (with $a > 0$, as in the changes of unit of measurement), the differential entropy $h$ changes with the new pdf according to

$$-\int \frac{1}{a} f\!\left(\frac{y-b}{a}\right)\ln\left[\frac{1}{a} f\!\left(\frac{y-b}{a}\right)\right]dy = -\int f(x)\ln f(x)\,dx + \ln a$$

namely it explicitly depends on the scaling parameter $a$, while it is independent of the centering parameter $b$. It is apparent moreover that, according to these remarks, the quantile function of the transformed cdf is

$$a\,Q(p) + b$$

so that any IQnR is changed into $a\varrho(p)$, namely it is again sensitive only to the scaling parameter $a$, but not to the centering one. As a consequence the modifications of both $h$ and $\tilde\varrho$ under a linear transformation of the variables cancel each other out, so that $\tilde h$, as defined in (19), is always left unchanged: this means in particular that the types of laws are isoentropic.
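The cancellation described above can be illustrated with a short Python sketch. It assumes that definition (19) amounts to the usual differential entropy minus the logarithm of the interquartile range, and it uses the standard closed forms of the quantile functions and of the differential entropies of the Gaussian, uniform and Cauchy laws; all function names are illustrative choices of ours. The Cauchy law is included because it has no variance, while its IQrR is perfectly well defined.

```python
from math import e, log, pi, sqrt, tan
from statistics import NormalDist


# Quantile functions Q(p) = F^{-1}(p) of three continuous laws with scale a.
def q_gauss(p, a):
    return NormalDist(mu=0.0, sigma=a).inv_cdf(p)


def q_uniform(p, a):
    return a * p                      # U(a), uniform on [0, a]


def q_cauchy(p, a):
    return a * tan(pi * (p - 0.5))    # centered Cauchy law with scale a


def iqr(quantile, a):
    """Interquartile range Q(3/4) - Q(1/4)."""
    return quantile(0.75, a) - quantile(0.25, a)


# Usual differential entropies h (standard closed forms).
def h_gauss(a):
    return log(a * sqrt(2.0 * pi * e))


def h_uniform(a):
    return log(a)


def h_cauchy(a):
    return log(4.0 * pi * a)


def h_tilde(h, quantile, a):
    """Assumed reading of (19): h minus the logarithm of the IQrR."""
    return h(a) - log(iqr(quantile, a))


if __name__ == "__main__":
    for a in (0.01, 1.0, 250.0):
        # h itself changes with a (and is negative for small a),
        # while each h_tilde column stays constant.
        print(a,
              round(h_gauss(a), 4),
              round(h_tilde(h_gauss, q_gauss, a), 6),
              round(h_tilde(h_uniform, q_uniform, a), 6),
              round(h_tilde(h_cauchy, q_cauchy, a), 6))
```

Under these assumptions the uniform laws give $\ln 2$ and the Cauchy laws give $\ln 2\pi$ for every value of the scale $a$.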

Variance and scaling parameters
By restricting ourselves to the continuous laws with finite second moment and standard deviation $\sigma$, an alternative redefinition of the differential entropy could be considered as

$$-\int f(x)\ln\left[\sigma f(x)\right]dx$$

This form of differential entropy would bring the same benefits as $\tilde h$: the argument of the logarithm is dimensionless, and the entropy no longer depends on the value of scaling parameters. Since however the dimensional parameter is now the standard deviation, it is possible to show that the Gaussian laws would keep their usual role of maximum entropy laws, and this suggests a further possible change of definition (please remark the change of sign):

$$\hat h = \int f(x)\ln\left[\sigma\sqrt{2\pi e}\, f(x)\right]dx \tag{21}$$

As shown in Appendix A, all the Gaussian laws $N(a)$ now have $\hat h[N] = 0$ and, because of the change of sign, this value is now the minimum among all the laws, irrespective of their variance: as a consequence the entropy $\hat h$ of all the continuous distributions with finite variance is now non negative, as is the entropy $H$ of the discrete laws. For laws lacking a finite second moment (such as the Cauchy laws) there would be no $\hat h$ entropy because these distributions have no variance to speak of: this is an apparent shortcoming of the definition (21), and to get around this weakness we introduced our definition (19) of $\tilde h$ by exploiting the properties of the IQnR $\varrho(p)$, which is always well defined for every possible distribution. It is interesting to remark, however, not only that these are not the only two possible choices, but also that even seemingly harmless modifications can imply slightly different properties.
For instance, by going back to the remarks at the end of Section 2.1, it is well known that by linear transformation of the variables (with $a > 0$ to simplify) every continuous law $f(x)$ spans a type of continuous laws

$$\frac{1}{a}\, f\!\left(\frac{x-b}{a}\right)$$

As already pointed out, the centering parameter $b$ has no influence on the value of the entropy, while the scaling parameter $a$ changes the differential entropy $h$ of definition (9) by an additional $\ln a$. As a consequence, by simply adopting as a new definition

$$\bar h = -\int f(x)\ln\left[a f(x)\right]dx \tag{22}$$

where $a$ is the parameter locating the law within its type, we would get an entropy invariant under rescaling. It is apparent that the definition (22) considers $a$ just as a parameter, and not as a measure of dispersion, and it is interesting to notice that it also entails a few consequences shown in the examples of Appendix A.
In particular the entropy $\bar h$ again takes all the (positive and negative) real values, and hence there is no such thing as a maximum entropy distribution, just as in the case of the entropy $\tilde h$.
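The non negativity of the variance-based entropy can be checked on a few familiar laws. The Python sketch below assumes that definition (21) amounts to $\ln(\sigma\sqrt{2\pi e})$ minus the usual differential entropy $h$; the table `LAWS` and the function `h_hat` are illustrative names of ours, and the exponential and Laplace laws are added only as further examples with standard closed forms.

```python
from math import e, log, pi, sqrt

SQRT_2PIE = sqrt(2.0 * pi * e)

# For each law: (standard deviation, usual differential entropy h),
# both given as functions of the scale parameter a (standard closed forms).
LAWS = {
    "gaussian":    (lambda a: a,              lambda a: log(a * SQRT_2PIE)),
    "uniform":     (lambda a: a / sqrt(12.0), lambda a: log(a)),
    "exponential": (lambda a: a,              lambda a: 1.0 + log(a)),
    "laplace":     (lambda a: a * sqrt(2.0),  lambda a: 1.0 + log(2.0 * a)),
}


def h_hat(name, a):
    """Assumed reading of (21): ln(sigma * sqrt(2 pi e)) - h."""
    sigma, h = LAWS[name]
    return log(sigma(a) * SQRT_2PIE) - h(a)


if __name__ == "__main__":
    for name in LAWS:
        # Each row is constant in a; the Gaussian row is identically zero.
        print(name, [round(h_hat(name, a), 5) for a in (0.1, 1.0, 40.0)])
```

Under this reading the Gaussian laws sit exactly at zero, and every other law in the table lies strictly above it, independently of the scale $a$.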

Entropy for discrete laws
We could now naively extend our previous redefinitions to the discrete laws simply by taking $H - \ln\kappa$, with $H$ given by (1) and a suitable choice of $\kappa$, but in so doing we would miss the chance to reconcile the two forms (discrete and continuous) of our entropy in some limit behavior. We find it more convenient to introduce some further changes that, for the sake of generality, we will discuss in the setting of Section 2.1, where $\kappa$ is an IQnR.

Renormalization
In order to extend the definition (19) to the discrete distributions we must first remark that, unlike the continuous case, the IQrR $\varrho(1/4)$ can now vanish, and hence cannot be immediately adopted as $\kappa$ in our definitions. For the discrete laws in fact $F(x)$ makes jumps and hence $Q(p)$ has flat spots, so that $\varrho(p)$ can be zero for some values of $p$, and in particular this can happen also for $p = 1/4$. If however our distributions are purely discrete (a few remarks about the more general case of mixtures can be found in Appendix B), $\varrho(p)$ is a non increasing function of $p$ which changes value only by jumping, and which is constant between subsequent jumps. As a consequence, with the only exception of the degenerate laws (which have a constant $Q(p)$, and hence a $\varrho$ vanishing for every $p$), $\varrho(p)$ certainly takes non zero values for some $0 < p \le 1/4$, even when $\varrho(1/4) = 0$. We can then use in our definitions as dimensional constant $\tilde\varrho$ the smallest non zero IQnR larger than or equal to the IQrR $\varrho(1/4)$: more precisely, if $P$ is the set of all the values of $\varrho(p)$ for $0 < p \le 1/4$, and $P_0 = P \setminus \{0\}$, we will take $\tilde\varrho = \min P_0 > 0$. Remark that in particular we again have $\tilde\varrho = \varrho(1/4)$ whenever the IQrR does not vanish.

We start by remarking that if $F(x)$ is the cumulative distribution function of a discrete distribution concentrated on $x_k$ with probabilities $p_k$ for $k = 1, 2, \ldots$, by taking

$$\Delta x_k = x_k - x_{k-1} \qquad \Delta F(x_k) = F(x_k) - F(x_{k-1}) = p_k$$

(with $x_0 < x_1$ and hence $F(x_0) = 0$: for instance $x_0 = x_1 - \inf_{k \ge 2}\Delta x_k$, so that $\Delta x_1 = \inf_{k \ge 2}\Delta x_k$) we see first that the definition (1) can be immediately recast in the form

$$H = -\sum_k \Delta F(x_k)\ln \Delta F(x_k)$$

Since on the other hand many typical discrete distributions (binomial, Poisson, ...) describe counting experiments, in many instances we have $\Delta x_k = 1$, and in these cases (since $\Delta x_k$ is also dimensionless) we could also write

$$H = -\sum_k \Delta x_k\,\frac{\Delta F(x_k)}{\Delta x_k}\,\ln\frac{\Delta F(x_k)}{\Delta x_k}$$

By comparing this expression with the definition of differential entropy (9), and by recalling that for a continuous distribution we have $f(x) = F'(x)$, we are led to propose as new definition of the entropy of a discrete law the quantity

$$\tilde H = -\sum_k \Delta x_k\,\frac{\Delta F(x_k)}{\Delta x_k}\,\ln\left[\tilde\varrho\,\frac{\Delta F(x_k)}{\Delta x_k}\right] = -\sum_k p_k \ln\left[\tilde\varrho\,\frac{p_k}{\Delta x_k}\right] \tag{23}$$

In general, even for the discrete distributions, $\Delta x_k$ is not dimensionless, but this is compensated by $\tilde\varrho$. The definition (23) has properties which are similar to those of the new differential entropy defined in the previous section, but the main benefit of this new formulation is that now the differential entropy $\tilde h$ of a continuous law (19) can be recovered as a limit of the entropies $\tilde H$ of a sequence of approximating discrete laws. In fact the new definition (23) effectively renormalizes the traditional entropy $H$ in such a way that the asymptotic divergences pointed out in Section 1 are exactly compensated by our dimensional parameters. These conclusions hold also for a suitable extension of the alternative definitions (21) and (22).

For the binomial laws $B_{n,p}$ of (2), for instance, we have $\Delta x_k = 1$ and $\tilde\varrho = \gamma[B_{n,p}]\sqrt{np(1-p)}$, where $\gamma = \tilde\varrho/\sigma$ as in Section 2.1. From (3) and (23), and by taking into account (30) of Appendix A, we finally get

$$\tilde H[B_{n,p}] = H[B_{n,p}] - \ln\tilde\varrho = \ln\frac{\sqrt{2\pi e}}{\gamma[B_{n,p}]} + O\!\left(\frac{1}{n}\right) \longrightarrow \ln\frac{\sqrt{2\pi e}}{\gamma[N]} = \tilde h[N] \qquad n \to \infty$$

which is a first example of the convergence of entropies to differential entropies in the framework of the new definitions. The same result is achieved for $\tilde H[B^*_{n,p}]$ because now $\sigma = 1$, so that $\tilde\varrho = \gamma[B_{n,p}]$, while from (4) we have $\Delta x_k = 1/\sqrt{np(1-p)}$ for every $k$, and hence from (23)

$$\tilde H[B^*_{n,p}] = H[B_{n,p}] - \ln\left(\gamma[B_{n,p}]\sqrt{np(1-p)}\right) = \tilde H[B_{n,p}]$$

In a similar way for the discrete uniform distributions $U_n(a)$ introduced in Section 1, where $p_k = 1/n$ and $\Delta x_k = a/n$, we now have

$$\tilde H[U_n(a)] = -\ln\frac{\tilde\varrho}{a} \longrightarrow \ln 2 = \tilde h[U(a)] \qquad n \to \infty$$

because $\tilde\varrho \to a/2$.
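The renormalization can also be explored numerically. The following Python sketch is a tentative implementation under two assumptions of ours: that the dimensional constant is the smallest non zero value of $Q(1-p) - Q(p)$ for $0 < p \le 1/4$, and that definition (23) reads $-\sum_k p_k \ln(\tilde\varrho\, p_k/\Delta x_k)$ with the convention $\Delta x_1 = \inf_{k \ge 2}\Delta x_k$. The function names are hypothetical, and the probabilities are expected as `Fraction` objects so that the cdf values are compared exactly.

```python
from bisect import bisect_left
from fractions import Fraction
from itertools import accumulate
from math import log


def quantile(xs, cdf, p):
    """Generalized quantile Q(p) = inf{x : p <= F(x)} of a discrete law."""
    return xs[min(bisect_left(cdf, p), len(xs) - 1)]


def rho_tilde(xs, ps):
    """Smallest non-zero value of rho(p) = Q(1-p) - Q(p) for 0 < p <= 1/4."""
    cdf = list(accumulate(ps))
    quarter = Fraction(1, 4)
    # rho(p) is piecewise constant and can change only at these points.
    breaks = sorted({c for c in cdf if 0 < c < quarter}
                    | {1 - c for c in cdf if 0 < 1 - c < quarter}
                    | {quarter})
    edges = [Fraction(0)] + breaks
    # Probe every break point and the midpoint of every interval between them.
    probes = set(breaks) | {(lo + hi) / 2 for lo, hi in zip(edges, breaks)}
    values = {quantile(xs, cdf, 1 - p) - quantile(xs, cdf, p) for p in probes}
    nonzero = [v for v in values if v > 0]
    if not nonzero:
        raise ValueError("degenerate law: rho(p) vanishes for every p")
    return min(nonzero)


def renormalized_entropy(xs, ps):
    """Assumed reading of (23): -sum_k p_k ln(rho_tilde * p_k / dx_k)."""
    rho = rho_tilde(xs, ps)
    dxs = [b - a for a, b in zip(xs, xs[1:])]
    dxs = [min(dxs)] + dxs            # convention dx_1 = inf_{k>=2} dx_k
    return -sum(float(p) * log(float(rho) * float(p) / float(dx))
                for p, dx in zip(ps, dxs) if p > 0)


def discrete_uniform(n, a):
    """U_n(a): probabilities 1/n on the points x_k = k a / n, k = 1, ..., n."""
    return [k * a / n for k in range(1, n + 1)], [Fraction(1, n)] * n


if __name__ == "__main__":
    for n in (4, 40, 400, 4000):
        xs, ps = discrete_uniform(n, 3.0)
        # The usual entropy ln n diverges; the renormalized one does not.
        print(n, round(log(n), 4), round(renormalized_entropy(xs, ps), 6))
```

For the discrete uniform laws the first printed column grows as $\ln n$, while the second one stays at $\ln 2$, the value expected for the continuous uniform law under the same reading of (19).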

Conclusions
We have proposed to modify both the usual definition (1) of the entropy $H$ and the definition (9) of the differential entropy $h$ into (23) and (19) respectively, namely within our most general notation

$$\tilde H = -\sum_k \Delta F(x_k)\ln\left[\tilde\varrho\,\frac{\Delta F(x_k)}{\Delta x_k}\right] \qquad \tilde h = -\int f(x)\ln\left[\tilde\varrho\, f(x)\right]dx$$

where in general $\tilde\varrho$ coincides with the IQrR $\varrho(1/4)$ of the considered distribution, except when the IQrR vanishes (as can happen for discrete laws): in this last event $\tilde\varrho$ is taken as the smallest non zero IQnR of the distribution. There are also several other possible redefinitions which essentially differ among them by the choice of the parameter $\kappa$ in (13), and by the set of their possible values. All these definitions, moreover, bypass the anomalies listed in Section 1 and appear to go smoothly one into the other in a suitable discrete-continuous limit. As a matter of fact the introduction of the dimensional parameter $\kappa$ effectively renormalizes the divergences that we would otherwise encounter in the limiting processes leading from discrete to continuous laws. Remark finally that the discrete form (25) can also be easily customized to fit the entropy estimation from empirical data.

We end the paper by pointing out that, despite extensive similarities, the new quantities $\tilde H$ and $\tilde h$ no longer have all the same properties as $H$ and $h$. For instance at the present stage we could neither prove, nor disprove (by means of some counterexample) that $\tilde H \ge 0$, as is the case for $H$. On the other hand the examples also seem to leave no room for $\tilde h$-extremal distributions, as the normal laws were for the differential entropy $h$. Remark however that these conclusions would be different if we adopted the alternative definitions presented in Section 2.2. While all these topics seem to be interesting fields of inquiry, it would also be important to extensively review what is preserved of all the well known properties of the usual definitions, and how to adapt further ideas such as relative entropy, mutual information and whatever else is used today in information processing [4,5].

These remarks emphasize the possibilities opened by our seemingly naive changes: as a bid to connect two previous standpoints, in fact, our proposed definitions blend the properties of the older quantities, and in so doing can also break new ground. An extensive analysis of all the possible consequences both of the proposed definitions and of their articulations will be the subject of a forthcoming paper, while on this topic we will at present limit ourselves to pointing out that many relevant features of the entropy essentially derive from the properties of the logarithms, which in any case play a central role also in the new definitions. Finally it would be stimulating to explore how, if at all, it is possible to make the new definitions compatible with other celebrated extensions of the classical entropies such as, for instance, that proposed by Tsallis [6], or the more recent cumulative residual entropy [7].

A Examples
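We begin by comparing the values of the two differential entropies $h$ and $\tilde h$, respectively defined in (9) and (19), for the most common families of laws, neglecting the centrality parameters, which are irrelevant for our purposes because both entropies are independent of them. For the Gaussian laws $N(a)$ with

$$f(x) = \frac{e^{-x^2/2a^2}}{a\sqrt{2\pi}}$$

we know that

$$h[N(a)] = \ln\left(a\sqrt{2\pi e}\right) \qquad \tilde\varrho[N(a)] \simeq 1.349\,a$$

and hence we immediately have from (19)

$$\tilde h[N(a)] = \ln\frac{\sqrt{2\pi e}}{1.349} \simeq 1.120$$

For the laws $U(a)$ uniform on $[0, a]$ with

$$f(x) = \frac{\vartheta(x) - \vartheta(x-a)}{a}$$

(here $\vartheta(x)$ is the Heaviside function) we instead have

$$h[U(a)] = \ln a \qquad \tilde\varrho[U(a)] = \frac{a}{2} \qquad \tilde h[U(a)] = \ln 2$$

For the gamma laws $G_\lambda(a)$, $\lambda > 0$, with

$$f(x) = \frac{x^{\lambda-1}e^{-x/a}}{a^\lambda\,\Gamma(\lambda)}\,\vartheta(x)$$

we have

$$h[G_\lambda(a)] = \ln\left[a\,\Gamma(\lambda)\right] + \lambda + (1-\lambda)\,\psi(\lambda)$$

where $\psi$ is the digamma function. Finally, for the family of Student laws $T_\lambda(a)$, $\lambda > 0$, the variance exists only for $\lambda > 2$, but the differential entropy is well defined for every $\lambda > 0$. For brevity, the explicit form of the IQrR $\tilde\varrho$ will not be given here. Since however the argument of the logarithm in $h[T_\lambda]$ and the IQrR $\tilde\varrho[T_\lambda]$ are both proportional to the scaling parameter $a$, the entropy $\tilde h[T_\lambda]$ is independent of $a$, and as a function of $\lambda$ it is displayed in Figure 2. A particular consequence of these examples is that, as already remarked in Section 2.1, the entropy $\tilde h$ has neither a maximum nor a minimum value, and by suitably choosing the continuous law it can take every real value, both positive and negative.

Similar calculations can then be carried out also for the alternative entropy definition (21) of $\hat h$ for laws endowed with a finite second moment: we immediately get for the Gaussian laws

$$\hat h[N(a)] = 0$$

and for the uniform laws

$$\hat h[U(a)] = \ln\sqrt{\frac{\pi e}{6}} \simeq 0.176$$

For the entropy $\bar h$ of definition (22), finally, the gamma laws give

$$\bar h[G_\lambda(a)] = \ln\Gamma(\lambda) + \lambda + (1-\lambda)\,\psi(\lambda)$$

whose values, as displayed in Figure 5, go from $-\infty$ for $\lambda \to 0^+$ to $+\infty$ for $\lambda \to +\infty$. In particular for the exponential type ($\lambda = 1$) we have $\bar h = 1$.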

B Mixtures
The definition of $\tilde\varrho$ proposed in Section 3 certainly produces non zero values for both purely discrete and purely continuous laws. Some care must be exercised however for discrete-continuous mixtures. Let us take for instance the mixture displayed in Figure 7. It is apparent then that the IQrR is zero because $Q(3/4) = Q(1/4)$, while the IQnR $\varrho(p)$ is a continuous function, so that (with the notations of Section 3) also $\tilde\varrho = 0$ because $\inf P_0 = 0$. As a consequence, in the case of discrete-continuous mixtures with a cdf such as

$$F(x) = q\,F_d(x) + (1-q)\,F_c(x) \qquad 0 < q < 1$$

we cannot simply extend the definitions of Section 3. We could however consider separately both the $\tilde\varrho_d$ of the discrete distribution (as defined in Section 3) and the $\tilde\varrho_c = \varrho_c(1/4)$ of the continuous distribution, and take as dimensional constant $\kappa$ their convex combination $q\,\tilde\varrho_d + (1-q)\,\tilde\varrho_c$, which never vanishes because at least its continuous part is always non zero.