1 Introduction
Interpretation of Shannon’s entropy H(p) is usually developed in the context of an experiment, where the entropy is described as a measure of uncertainty; cf. [6], [5], [7]. Motivated by the simple (and well-known) observation that exp(H(p)) equals the size of the support of the underlying random variable when its distribution is uniform, in this short note we introduce the concept of Effective size of support (Ess). A measure of Ess should satisfy a small set of natural requirements. The class of Ess measures S(p, α) that satisfy the requirements is directly related to the family of Rényi’s α-entropies, which includes Shannon’s entropy as a special case. We address the issue of selecting the value of α such that the corresponding S(·, α) would be the most appropriate measure of Ess. Unlike entropy, Ess has an obvious meaning. From the point of view of Probability or Statistics, Ess can be seen as a more natural concept than entropy.
2 Effective size of support
Let X be a discrete random variable which can take on values from a finite set of m elements, with probabilities specified by the probability mass function (pmf) p. The support of X is the set of outcomes with positive probability, supp(p(X)) = {x_i : p_i > 0}. Let |supp(p(X))| denote the size of the support.
While the pmf p = [0.5, 0.5] makes both outcomes equally likely, the pmf q = [0.999, 0.001] characterizes a random variable that takes almost exclusively one of its two values. However, both p and q have the same size of support. This motivates the need for a quantity that measures the size of support of a random variable in a different way, so that the random variable can be placed in the range [1, m] according to its pmf. We will call the new quantity/measure the effective support size (Ess), and denote it by S(p(X)); S(p) or S(X), for short. The example makes it obvious that S(·) should be such that S(q) is close to 1, while to p it should assign the value S(p) = 2.
3 Properties of Ess
Ess should have certain properties, dictated by common sense.
P1) S(p) should be a continuous, symmetric function (i.e., invariant under exchange of p_i, p_j, i, j = 1,...,m).
P2) S(δ_m) = 1 ≤ S(p_m) ≤ S(u_m) = m; where u_m denotes the uniform pmf on an m-element support, δ_m denotes an m-element pmf with probability concentrated at one point, and p_m denotes a pmf with |supp(p_m)| = m.
P3) S([p_m, 0]) = S(p_m).
P4) S(p(X, Y)) = S(p(X)) S(p(Y)), if X and Y are independent random variables.
The first two properties are obvious. The third one states that extending the support by an impossible outcome should leave Ess unchanged. Only the fourth property needs, perhaps, a little discussion. Or, better, an example. Let p(X) = [1, 1, 1]/3 and p(Y) = [1, 1]/2, and let X be independent of Y. Then p(X, Y) = [1, 1, 1, 1, 1, 1]/6. According to P2), S(p(X)) = 3, S(p(Y)) = 2 and S(p(X, Y)) = 6 = S(p(X)) S(p(Y)). It is reasonable to require the product relationship to hold for independent random variables with arbitrary distributions.
The properties P1-P4 are satisfied by $S(p, \alpha) = \left(\sum_{i=1}^{m} p_i^{\alpha}\right)^{1/(1-\alpha)}$, where α is a positive real number different from 1. Note that S(·) of this form is the exponential of Rényi’s entropy. For α → 1, S(p, α) also satisfies P1-P4 and takes the form exp(H(p)), where $H(p) = -\sum_{i=1}^{m} p_i \log p_i$ is Shannon’s entropy; cf. [1]. It is thus reasonable to define S(p, α) for α = 1 this way (with the convention 0 log 0 = 0), so that S(·) becomes a continuous function of α.
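The definition can be tried out numerically. The following is a minimal sketch, not part of the original note: the function name ess and the test pmfs are illustrative assumptions, and the α = 1 branch uses the Shannon limit described above.

```python
import math

def ess(p, alpha):
    """Effective support size S(p, alpha): exponential of Renyi's alpha-entropy.

    The name `ess` is an illustrative choice, not notation from the note.
    """
    probs = [x for x in p if x > 0]  # zero entries drop out (0 log 0 = 0)
    if alpha == 1:
        # Limit alpha -> 1: exponential of Shannon's entropy H(p)
        return math.exp(-sum(x * math.log(x) for x in probs))
    return sum(x ** alpha for x in probs) ** (1.0 / (1.0 - alpha))

# The motivating example: same support size, very different Ess
p = [0.5, 0.5]
q = [0.999, 0.001]
print(ess(p, 1), ess(q, 1))

# P3: appending an impossible outcome leaves Ess unchanged
print(ess([0.6, 0.4, 0.0], 2), ess([0.6, 0.4], 2))

# P4: Ess multiplies over independent random variables
px, py = [0.2, 0.3, 0.5], [0.6, 0.4]
joint = [a * b for a in px for b in py]
print(ess(joint, 0.5), ess(px, 0.5) * ess(py, 0.5))
```

The pmfs px and py are deliberately non-uniform, so the last line checks P4 beyond the uniform example given earlier.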
4 Selecting α
The requirements P1-P4 define an entire class of measures of effective support size. This raises the problem of selecting α.
It is instructive to begin by considering the behavior of S(p(X), α) at the limit values of α. It can easily be seen that as α → 0, S(p(X), α) → |supp(p(X))|, i.e., the size of the support. Thus, the closer α is to zero, the more S(·, α) behaves like the standard support size |supp(p(X))|.
For α → ∞, S(p(X), α) → 1/p_max, where p_max = max_i p_i. Thus, the higher the α, the more S(·, α) judges a pmf solely by its component with the highest probability. In the limit, all pmf’s with the same p_max are seen as entirely equivalent.
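Both limits can be observed numerically. The sketch below is illustrative only: the helper names ess, ess_zero and ess_inf and the sample pmf are assumptions made for this example.

```python
import math

def ess(p, alpha):
    """S(p, alpha) for alpha > 0; alpha = 1 is the Shannon limit."""
    probs = [x for x in p if x > 0]
    if alpha == 1:
        return math.exp(-sum(x * math.log(x) for x in probs))
    return sum(x ** alpha for x in probs) ** (1.0 / (1.0 - alpha))

def ess_zero(p):
    """Limit alpha -> 0: the plain size of the support."""
    return sum(1 for x in p if x > 0)

def ess_inf(p):
    """Limit alpha -> infinity: reciprocal of the largest probability."""
    return 1.0 / max(p)

p = [0.6, 0.3, 0.1, 0.0]  # hypothetical pmf with support size 3

print("alpha -> 0 limit:", ess_zero(p))
for alpha in (1e-2, 1e-4, 1e-6):
    print(alpha, ess(p, alpha))

print("alpha -> inf limit:", ess_inf(p))
for alpha in (10, 50, 200):
    print(alpha, ess(p, alpha))
```

The values of α are kept moderate so that the powers p_i^α do not underflow in floating point; a log-domain evaluation would be needed for much larger α.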
For the sake of illustration, Table 1 gives S(p, α) for various two-element pmf’s and α = 0.001, 0.1, 0.5, 0.9, 1.0, 1.5, 2.0, 10, ∞.
Table 1. S(p, α) for α = 0.001, 0.1, 0.5, 0.9, 1.0, 1.5, 2.0, 10, ∞ and different p’s.
| | | S(p, α) | | | |
α | [0.5, 0.5] | [0.6, 0.4] | [0.7, 0.3] | [0.8, 0.2] | [0.9, 0.1] | [1.0, 0.0] |
0.001 | 2.000000 | 1.999959 | 1.999826 | 1.999554 | 1.998979 | 1.000000 |
0.1 | 2.000000 | 1.995925 | 1.982696 | 1.956233 | 1.902332 | 1.000000 |
0.5 | 2.000000 | 1.979796 | 1.916515 | 1.800000 | 1.600000 | 1.000000 |
0.9 | 2.000000 | 1.964013 | 1.856116 | 1.675654 | 1.416403 | 1.000000 |
1.0 | 2.000000 | 1.960132 | 1.842023 | 1.649385 | 1.384145 | 1.000000 |
1.5 | 2.000000 | 1.941178 | 1.777878 | 1.543210 | 1.275510 | 1.000000 |
2.0 | 2.000000 | 1.923077 | 1.724138 | 1.470588 | 1.219512 | 1.000000 |
10.0 | 2.000000 | 1.760634 | 1.486289 | 1.281379 | 1.124195 | 1.000000 |
∞ | 2.000000 | 1.666666 | 1.428571 | 1.250000 | 1.111111 | 1.000000 |
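The entries of Table 1 can be recomputed with a few lines of code. This is a sketch under the assumption that S(p, α) is the exponential of Rényi’s entropy as defined in Sect. 3; the names ess, alphas, pmfs and table are illustrative.

```python
import math

def ess(p, alpha):
    """S(p, alpha), with the alpha = 1 and alpha = infinity limits handled explicitly."""
    probs = [x for x in p if x > 0]
    if alpha == 1:
        return math.exp(-sum(x * math.log(x) for x in probs))
    if alpha == math.inf:
        return 1.0 / max(probs)
    return sum(x ** alpha for x in probs) ** (1.0 / (1.0 - alpha))

alphas = [0.001, 0.1, 0.5, 0.9, 1.0, 1.5, 2.0, 10.0, math.inf]
pmfs = [[0.5, 0.5], [0.6, 0.4], [0.7, 0.3], [0.8, 0.2], [0.9, 0.1], [1.0, 0.0]]

# One row per alpha, one column per pmf, as in Table 1
table = {alpha: [ess(p, alpha) for p in pmfs] for alpha in alphas}

for alpha, row in table.items():
    print(f"{alpha:>6} | " + " | ".join(f"{v:.6f}" for v in row))
```

Printed values are rounded to six decimals, so the last digit may differ from a table that truncates instead.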
Based on the table, in this simplest case of a two-valued random variable we would opt for S(·, ∞) as a good measure of Ess. However, for larger support sizes this choice becomes less attractive. As already noted, S(p, ∞) = 1/p_max, where p_max = max_i p_i, so all pmf’s with the same p_max are seen to have the same Ess. For instance, p = [0.95, 0.05] and q = [0.95, x], where x stands for the remaining 99 components, each with the value 0.05/99 ≈ 0.0005, are judged by S(·, ∞) to have the same Ess, equal to 1.053. Just for comparison, S(p, 1) = 1.220, while S(q, 1) = 1.535. This undesirable feature of S(·, ∞) manifests itself even more sharply in the case of continuous random variables.
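The comparison of p and q can be checked directly. The sketch below assumes the same definition of S(p, α) as in Sect. 3; the function name ess is illustrative.

```python
import math

def ess(p, alpha):
    """S(p, alpha), with the alpha = 1 and alpha = infinity limits handled explicitly."""
    probs = [x for x in p if x > 0]
    if alpha == 1:
        return math.exp(-sum(x * math.log(x) for x in probs))
    if alpha == math.inf:
        return 1.0 / max(probs)
    return sum(x ** alpha for x in probs) ** (1.0 / (1.0 - alpha))

p = [0.95, 0.05]
q = [0.95] + [0.05 / 99] * 99   # 100 components in total

# alpha = infinity sees only the largest component, so p and q coincide
print(round(ess(p, math.inf), 3), round(ess(q, math.inf), 3))

# alpha = 1 (Shannon) distinguishes the two pmfs
print(round(ess(p, 1), 3), round(ess(q, 1), 3))
```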
5 Ess in the continuous case
The continuous-case analogue of S(p, α) is $S(f, \alpha) = \left(\int f(x)^{\alpha}\, dx\right)^{1/(1-\alpha)}$, where f(x) denotes a density with respect to Lebesgue measure. The continuous-case S(f, α), though always positive, can, naturally, be smaller than one. And the discrete-case upper bound m is now replaced by ∞. It is worth stressing that S(f, α) behaves with respect to shift and scale transformations in the desired manner. Indeed, if Y = X + a, then S(Y, α) = S(X, α); if Y = aX, then S(Y, α) = |a| S(X, α).
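For the Gaussian n(µ, σ²) distribution, $S(\cdot, \alpha) = \sqrt{2\pi\sigma^2}\,\alpha^{1/(2(\alpha-1))}$; cf. [8]. For α → ∞ this converges to $\sqrt{2\pi\sigma^2}$, so that for σ² = 1 it becomes $\sqrt{2\pi}$ ≈ 2.5066. It is worth comparing with $S(\cdot, 1) = \sqrt{2\pi e\sigma^2}$ (cf. [9]), which reduces in the case of σ² = 1 to 4.1327. This makes much more sense.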
That S(·, ∞) is not the appropriate measure of Ess can be seen even more clearly in the case of the Exponential distribution. For the density βe^{−βx} with β = 1, S(·, ∞) = 1 while S(·, 1) = e.
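The Gaussian and Exponential values can be cross-checked by numerical integration. The following is a rough sketch rather than a definitive implementation: the midpoint rule, the truncation intervals and all function names (ess_density, gauss, expo, ess_gauss) are assumptions made for illustration.

```python
import math

def ess_density(f, alpha, lo, hi, n=100_000):
    """Midpoint-rule approximation of S(f, alpha) for a density f truncated to [lo, hi]."""
    h = (hi - lo) / n
    xs = (lo + (k + 0.5) * h for k in range(n))
    if alpha == 1:
        # exp of differential entropy: -integral of f log f
        total = 0.0
        for x in xs:
            fx = f(x)
            if fx > 0:
                total -= fx * math.log(fx) * h
        return math.exp(total)
    total = sum(f(x) ** alpha for x in xs) * h
    return total ** (1.0 / (1.0 - alpha))

def gauss(x, sigma=1.0):
    return math.exp(-x * x / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def expo(x, beta=1.0):
    return beta * math.exp(-beta * x) if x >= 0 else 0.0

def ess_gauss(alpha, sigma=1.0):
    """Closed form for the Gaussian, including the alpha = 1 and alpha = infinity limits."""
    base = math.sqrt(2 * math.pi) * sigma
    if alpha == 1:
        return base * math.sqrt(math.e)
    if alpha == math.inf:
        return base
    return base * alpha ** (1.0 / (2.0 * (alpha - 1.0)))

print(ess_gauss(math.inf), ess_gauss(1))                      # limits quoted in the text
print(ess_density(gauss, 2, -12, 12), ess_gauss(2))           # numeric vs closed form
print(ess_density(expo, 1, 0, 60), math.e)                    # Exponential, alpha = 1
print(ess_gauss(2, sigma=3.0), 3.0 * ess_gauss(2, sigma=1.0)) # scale behaviour Y = aX
```

The integration limits [-12, 12] and [0, 60] are chosen so that the neglected tail mass is negligible for unit-scale densities; other scales would need wider intervals.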
6 Adding another property
The above considerations suggest that S(·, 1) might be the most appropriate of the Ess measures which satisfy the requirements P1-P4. The question is whether there is some other requirement that it is reasonable to add to the properties already employed, such that it would narrow down the set of S(·, α) to S(·, 1).
To this end, let us consider two random variables X, Y that, in general, might be dependent. It is natural to extend requirement P4 to this more general setting by requiring that
P4∗) S(p(X)) S(p(Y)) ≥ S(p(X, Y)),
with equality if and only if X and Y are independent.
For α ≠ 1, it may happen in some cases that the opposite relation < holds instead of ≥. Indeed, consider for instance the following bivariate discrete random variable with joint pmf p(X, Y), where the columns correspond to the values of X and the rows to the values of Y:
 | X = 1 | X = 2 | X = 3 | p(Y) |
Y = 1 | 0.2 | 0.05 | 0.05 | 0.3 |
Y = 2 | 0.3 | 0.2 | 0.2 | 0.7 |
p(X) | 0.5 | 0.25 | 0.25 | |
The marginal pmf p(X) has S(X, ∞) = 1/0.5 = 2, and the marginal pmf p(Y) has S(Y, ∞) = 1/0.7 ≈ 1.43. Hence, S(X, ∞) S(Y, ∞) = 2.86, which is smaller than S(p(X, Y), ∞) = 1/0.3 ≈ 3.33. After a minor change in the joint pmf, such that the marginals remain unchanged, it is possible to satisfy P4∗. It is known (cf. [1]) that solely S(·, 1) always satisfies the natural requirement P4∗.
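The counterexample is easy to verify numerically. The sketch below is illustrative: the name ess, the layout of joint as a list of rows, and the choice of extra α values are assumptions made for this example.

```python
import math

def ess(p, alpha):
    """S(p, alpha), with the alpha = 1 and alpha = infinity limits handled explicitly."""
    probs = [x for x in p if x > 0]
    if alpha == 1:
        return math.exp(-sum(x * math.log(x) for x in probs))
    if alpha == math.inf:
        return 1.0 / max(probs)
    return sum(x ** alpha for x in probs) ** (1.0 / (1.0 - alpha))

# Joint pmf from the example; each inner list is one row of the table
joint = [[0.2, 0.05, 0.05],
         [0.3, 0.2, 0.2]]
p_rows = [sum(row) for row in joint]          # marginal [0.3, 0.7]
p_cols = [sum(col) for col in zip(*joint)]    # marginal [0.5, 0.25, 0.25]
flat = [x for row in joint for x in row]

for alpha in (1, 2, math.inf):
    product = ess(p_rows, alpha) * ess(p_cols, alpha)
    together = ess(flat, alpha)
    verdict = "holds" if product >= together else "violated"
    print(f"alpha={alpha}: product={product:.3f}, joint={together:.3f}, P4* {verdict}")
```

Checking a handful of α values on one joint pmf illustrates the claim but does not prove it; the general statement rests on the result cited from [1].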
7 Summary
Shannon’s entropy is a key concept of Communication Theory. In Probability and Statistics the entropy is usually interpreted as a measure of uncertainty about the realization of a random variable, or as a measure of complexity or uniformity of a probability distribution. Although within Probability and Statistics entropy is from time to time (and from area to area) blamed for failing to measure all sorts of fancy and intangible things, it remains a valuable tool.
In this note we introduced the concept of the Effective support size (Ess) of a random variable. There are a few requirements that the measure S(p(X)) of Ess of a probability distribution p(X) should satisfy. The requirements turn out to be direct analogues of those placed on entropy; cf. [5], [1]. It thus should not be surprising that they are satisfied by $S(p, \alpha) = \left(\sum_{i=1}^{m} p_i^{\alpha}\right)^{1/(1-\alpha)}$, which is the exponential of Rényi’s entropy.
Since S(·, α) is in fact a continuum of measures of Ess, it is necessary to find out which of them would be the most appropriate measure(s) of Ess. It seems that S(·, 1) = exp(H(·)), where H(·) is Shannon’s entropy, is the best choice; cf. Sect. 4 and Sect. 5. We also argued for expanding the key requirement P4 into a more general requirement P4∗. The enhanced set of requirements is satisfied solely by S(·, 1).
We maintain that, from the point of view of Probability and Statistics, Ess is a more basic concept than entropy. The two concepts are related by the exp / log link. Without the link, knowing for instance that Shannon’s entropy of the Gaussian variable is $\frac{1}{2}\log(2\pi e\sigma^2)$ does not say much. Figuratively speaking, thanks to Ess, entropy itself becomes more informative.
Ess also adds a new meaning to the Maximum Entropy method [4]. For instance, the classic finding [6] that the Gaussian distribution has the maximal value of Shannon’s entropy among all distributions with a prescribed second moment can be rephrased as stating that, among all such distributions, the one with the biggest effective support size is the Gaussian distribution.