Abstract
As a measure of randomness or uncertainty, the Boltzmann–Shannon entropy H has become one of the most widely used summary measures of a variety of attributes (characteristics) in different disciplines. This paper points out an often overlooked limitation of H: comparisons between differences in H-values are not valid. An alternative entropy H* is introduced as a preferred member of a new family of entropies for which difference comparisons are proved to be valid by satisfying a given value-validity condition. The H* is shown to have the appropriate properties for a randomness (uncertainty) measure, including a close linear relationship to a measurement criterion based on the Euclidean distance between probability distributions. This last point is demonstrated by means of computer-generated random distributions. The results are also compared with those of another member of the entropy family. A statistical inference procedure for H* is formulated.
1. Introduction
For some probability distribution P_n = (p_1, p_2, …, p_n), with p_i ≥ 0 for i = 1, …, n and ∑_{i=1}^{n} p_i = 1, the entropy H(P_n), or simply H, is defined by:

H = −∑_{i=1}^{n} p_i log p_i   (1)
where the logarithm is the natural (base-e) logarithm. The probabilities may be associated with a set of quantum states of a physical system in statistical mechanics or physics, a set of symbols or messages in a communication system, or, most generally, a set of mutually exclusive and exhaustive events. First used by Boltzmann [1] in statistical mechanics (as kH with k being the so-called Boltzmann constant) and later introduced by Shannon [2] as the basis for information theory (with base-2 logarithm and bits as the unit of measurement), this entropy H can appropriately be called the Boltzmann–Shannon entropy.
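As a point of reference, a minimal Python sketch of Equation (1) (with 0 log 0 taken as 0) may help; the function name is of course only illustrative:

```python
import math

def shannon_entropy(p, base=math.e):
    """Boltzmann-Shannon entropy H = -sum p_i log p_i (terms with p_i = 0 contribute 0)."""
    return -sum(x * math.log(x, base) for x in p if x > 0)

print(shannon_entropy([0.5, 0.5]))          # natural log: ln 2 = 0.693...
print(shannon_entropy([0.5, 0.5], base=2))  # base-2 log: 1 bit
```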
Although interpreted in a number of different ways, the most common and general interpretation of H is as a measure of randomness or uncertainty of a set of random events (e.g., [3] (pp. 67–97), [4], [5] (Chapter 2), [6] (pp. 12, 13, 90)). A number of alternative entropy formulations have been proposed as parameterized generalizations of H in Equation (1) (see, e.g., [7,8,9]), but with limited success or impact. The most notable exception is the following one-parameter family of entropies by Rényi [10]:

H_α(P_n) = (1/(1 − α)) log₂ (∑_{i=1}^{n} p_i^α),  α > 0, α ≠ 1

which reduces to Equation (1), with base-2 logarithm, when α → 1. This entropy family has, for instance, been used as a fractal dimension [11] (pp. 686–688). Another such family of entropies is that of Tsallis [12] defined as:

H_q^T(P_n) = (1 − ∑_{i=1}^{n} p_i^q)/(q − 1),  q ≠ 1

which includes the H in Equation (1) as the limiting case when q → 1.
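A minimal numerical sketch of these two families, using their standard textbook forms, shows the limiting behavior just described:

```python
import math

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (base-2 log), alpha > 0 and alpha != 1."""
    return math.log2(sum(x ** alpha for x in p if x > 0)) / (1.0 - alpha)

def tsallis_entropy(p, q):
    """Tsallis entropy of order q, q != 1."""
    return (1.0 - sum(x ** q for x in p if x > 0)) / (q - 1.0)

p = [0.5, 0.3, 0.2]
H2 = -sum(x * math.log2(x) for x in p)      # Shannon entropy, base 2
H = -sum(x * math.log(x) for x in p)        # Shannon entropy, natural log
print(renyi_entropy(p, 0.999), H2)          # alpha -> 1 recovers H (base 2)
print(tsallis_entropy(p, 1.001), H)         # q -> 1 recovers H (natural log)
```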
Since its origins in statistical physics and information theory, the entropy H in Equation (1) has proved to be remarkably versatile as a summary measure of a wide variety of attributes (characteristics) within diverse fields of study, ranging from psychology (e.g., [13]) to fractal geometry [11] (pp. 678–687). However, such widespread use of H has led to misuse, improper applications, and misleading results due to the fact that, although H has a number of desirable properties [14] (Chapter 1), it does suffer from one serious limitation: comparisons between differences in H-values are not valid. The basis for this limitation is explained in the next section of this paper.
As a clarification of such comparisons in general, consider some summary measure M and probability distributions P_n, Q_n, R_n, S_n. The various types of potential comparisons can then be defined as follows:

M(P_n) ≥ M(Q_n)   (2a)

M(P_n) − M(Q_n) ≥ M(R_n) − M(S_n)   (2b)

M(P_n) − M(Q_n) = c[M(R_n) − M(S_n)]   (2c)
where c is a constant. While, because of the properties of H in Equation (1), there is no particular reason to doubt the validity of the size comparison in Equation (2a) involving H, the difference comparisons in Equations (2b) and (2c) are not valid for H as discussed below.
In this paper, an alternative and equally simple entropy H* is introduced as:

H*(P_n) = (∑_{i=2}^{n} √p_i)² + 1 − p_1   (3)

where p_1 denotes the largest (modal) probability.
The term entropy is used for this measure of randomness or uncertainty since (a) it has many of the same properties as H in Equation (1) and (b) the entropy term has been used in such a variety of measurement situations for which H* can similarly be used. As is established in this paper, H* has the important advantage of being more informative than H in the sense that H* meets the conditions for valid difference comparisons as in Equations (2b) and (2c). It will also be argued that the H* is the preferred member of a family of entropies with similar properties. A statistical inference procedure for H* will also be outlined.
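Assuming the form of Equation (3) as given above (square-root sum over the non-modal probabilities), a minimal sketch of H* and its extreme values:

```python
import math

def h_star(p):
    """H*(P) = (sum of sqrt(p_i) over non-modal probabilities)^2 + 1 - p_1."""
    q = sorted(p, reverse=True)              # p_1 >= p_2 >= ... >= p_n
    s = sum(math.sqrt(x) for x in q[1:])     # square roots of the non-modal probabilities
    return s * s + 1.0 - q[0]

print(h_star([1.0, 0.0, 0.0]))   # degenerate distribution: 0
print(h_star([1/3, 1/3, 1/3]))   # uniform distribution, n = 3: n - 1 = 2
```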
2. Conditions for Valid Difference Comparisons
Consider that M is a measure of randomness (uncertainty) such that its value for any probability distribution P_n is bounded as:

M(P_n^0) ≤ M(P_n) ≤ M(P_n^u)   (4)

where P_n^0 and P_n^u are the degenerate and uniform distributions

P_n^0 = (1, 0, …, 0),  P_n^u = (1/n, 1/n, …, 1/n)   (5)

and where one can set M(P_n^0) = 0. In order for the difference comparisons in Equations (2b) and (2c) to be permissible or valid, some condition needs to be imposed on M (see [15]). Specifically, all intermediate values in Equation (4) have to provide numerical representations of the extent of randomness (uncertainty) that are true or realistic with respect to some acceptable criterion. While different types of validity are used in measurement theory [16] (pp. 129–134), value validity will be used here for this required property of M. In order to establish specific requirements for M to have value validity, a particular probability distribution proves useful and Euclidean distances will be used as a criterion.
Therefore, consider the recently introduced lambda distribution:

P_λ = (1 − λ + λ/n, λ/n, …, λ/n),  0 ≤ λ ≤ 1   (6)

where λ is a uniformity (evenness) parameter and with λ = 0 and λ = 1 giving P_n^0 and P_n^u in Equation (5) as special (extreme) cases [17]. This P_λ is simply the following weighted mean of P_n^0 and P_n^u:

P_λ = λP_n^u + (1 − λ)P_n^0   (7)

From Equations (4) and (6), it follows that, for any given λ ∈ [0, 1]:

M(P_n^0) = M(P_{λ=0}) ≤ M(P_λ) ≤ M(P_{λ=1}) = M(P_n^u)   (8)

so that validity conditions on M can equivalently be determined in terms of M(P_λ). With probability distributions considered as points (vectors) in n-dimensional space and with the Euclidean distance function d being the basis for a validity criterion, the following requirement seems most natural and obvious [17]:

[M(P_λ) − M(P_n^0)] / [M(P_n^u) − M(P_n^0)] = d(P_n^0, P_λ) / d(P_n^0, P_n^u) = λ   (9)

Since M(P_n^0) = 0, i.e., there is no randomness when one p_i = 1, it follows from Equation (9) that:

M(P_λ) = λM(P_n^u),  0 ≤ λ ≤ 1   (10)

as a value-validity condition. This condition also follows immediately from Equation (7) as:

M(P_λ) = λM(P_n^u) + (1 − λ)M(P_n^0)   (11)

which equals Equation (10) for M(P_n^0) = 0.
For the case when λ = 1/2, P_{1/2} is the midpoint of P_n^0 and P_n^u with coordinates ((1 + 1/n)/2, 1/(2n), …, 1/(2n)). Then:

d(P_n^0, P_{1/2}) = d(P_n^u, P_{1/2}) = d(P_n^0, P_n^u)/2   (12)

M(P_{1/2}) = M(P_n^u)/2   (13)

which is exactly as stated in Equations (10) and (11) with λ = 1/2. Of course, Equations (12) and (13) represent a weaker value-validity condition than Equations (10) and (11). Note also that it is not assumed a priori that M is a linear function of λ. This linearity is a consequence of Equations (7)–(9).
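A quick numerical check of the geometric fact behind Equation (9), namely that d(P_n^0, P_λ) = λ d(P_n^0, P_n^u), using the construction of P_λ in Equation (7):

```python
import math

def dist(p, q):
    """Euclidean distance between two probability vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

n, lam = 4, 0.3
P0 = [1.0] + [0.0] * (n - 1)                              # degenerate distribution
Pu = [1.0 / n] * n                                        # uniform distribution
Pl = [lam * u + (1 - lam) * d for u, d in zip(Pu, P0)]    # lambda distribution
print(dist(P0, Pl) / dist(P0, Pu))                        # equals lambda = 0.3
```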
The entropy H in Equation (1), with H(P_n^0) = 0 and H(P_n^u) = log n, does not meet these validity conditions. For example, for n = 2 and λ = 1/2, H(0.75, 0.25) = 0.56, which far exceeds the requirement (log 2)/2 = 0.35 in Equation (13). Similar overstatements occur for other values of n and λ. It can similarly be verified that H(P_λ) > λH(P_n^u) for all n and λ ∈ (0, 1), and hence, for all distributions covered by Equation (8), H overstates the true or realistic extent of the randomness (uncertainty) that H is supposed to measure. Consequently, difference comparisons as in Equations (2b) and (2c) based on H are invalid. An alternative measure that meets the validity conditions for such comparisons will be introduced next.
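The overstatement by H can be verified directly; the following sketch reproduces the n = 2, λ = 1/2 case cited above:

```python
import math

def shannon(p):
    """Boltzmann-Shannon entropy with natural logarithm."""
    return -sum(x * math.log(x) for x in p if x > 0)

def lam_dist(n, lam):
    """Lambda distribution: lam * uniform + (1 - lam) * degenerate."""
    return [1 - lam + lam / n] + [lam / n] * (n - 1)

n, lam = 2, 0.5
p = lam_dist(n, lam)                 # (0.75, 0.25)
print(shannon(p))                    # 0.562... (the value of H)
print(lam * shannon([1 / n] * n))    # 0.347... (the value-validity requirement)
```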
3. The New Entropy H*
3.1. Derivation of H*
The logic or reasoning behind the H* in Equation (3) as a measure of randomness or uncertainty may be outlined as follows:
- (1)
- As a matter of convenience and as used throughout the rest of the paper, all individual probabilities will be considered ordered such that:

p_1 ≥ p_2 ≥ ⋯ ≥ p_n   (14)
- (2)
- Due to the constraint that ∑_{i=1}^{n} p_i = 1, there is no loss of generality or information by focusing on the non-modal probabilities p_2, …, p_n.
- (3)
- Instead of considering H(P_n) or a weighted mean or sum of f(p_i) for some function f of the individual p_i's, one could consider the sum of the means of all pairs of the non-modal probabilities p_2, …, p_n. Since an entropy measure needs to be zero-indifferent (expansible), i.e., unaffected by the addition of events with zero probabilities (e.g., [14] (Chapter 1)), a logical choice of pairwise means would be the geometric means √(p_i p_j) for all i, j = 2, …, n (since obviously √(p_i · 0) = 0). Therefore, the measure consisting of the means √(p_i p_j), including those for i = j, can be expressed as:

H*(P_n) = 2 ∑_{j=2}^{n} ∑_{i=2}^{j} √(p_i p_j) = 2 ∑_{2≤i<j≤n} √(p_i p_j) + 2(1 − p_1)   (15)

where the multiplication factor 2 is included so that H*(1/n, …, 1/n) = n − 1 instead of (n − 1)/2. With p_1 being the modal (largest) probability, this H* in Equation (15) is twice the sum of the pairwise geometric means of all the non-modal probabilities. Furthermore, from the fact that, for a set of nonnegative numbers a_2, …, a_n, (∑_{i=2}^{n} a_i)² = ∑_{i=2}^{n} a_i² + 2∑_{2≤i<j≤n} a_i a_j, and then setting a_i = √p_i, it follows from the second expression in Equation (15) that:

H*(P_n) = (∑_{i=2}^{n} √p_i)² + 1 − p_1   (16)

which is the same as the formula in Equation (3); a brief numerical check of this equivalence is sketched below.
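A brief check that the pairwise-geometric-mean form (15) and the closed form (16), as reconstructed here, agree numerically:

```python
import math
from itertools import combinations_with_replacement

def h_star_pairwise(p):
    """Equation (15): twice the sum of pairwise geometric means of the non-modal p_i (i <= j)."""
    nm = sorted(p, reverse=True)[1:]          # drop the modal probability
    return 2 * sum(math.sqrt(a * b) for a, b in combinations_with_replacement(nm, 2))

def h_star_closed(p):
    """Equation (16)/(3): (sum of sqrt of non-modal p_i)^2 + 1 - p_1."""
    q = sorted(p, reverse=True)
    return sum(math.sqrt(x) for x in q[1:]) ** 2 + 1 - q[0]

p = [0.4, 0.3, 0.2, 0.1]
print(h_star_pairwise(p), h_star_closed(p))   # both equal 2.319...
```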
As an alternative approach, one could begin by considering the power sum, or sum of order r, of the probabilities (e.g., [18] (pp. 138–139)). Strict Schur-concavity, which is discussed below as an important property of an entropy and one that H in Equation (1) has, requires that the parameter r < 1 [18] (pp. 138–139). Since 0 ≤ p_i ≤ 1 (i = 1, …, n), a further restriction is that r be positive and hence 0 < r < 1 for the power sum. In order for such a measure to comply with the value-validity condition in Equation (11), it is clear that it can only be the power sum of the non-modal probabilities so that:

S_r(P_n) = (∑_{i=2}^{n} p_i^r)^{1/r},  0 < r < 1   (17)

with S_r(P_n^0) = 0 and S_r(P_n^u) = (n − 1)^{1/r}/n for the probability distributions in Equation (5). A reasonable upper bound would be n − 1. This requirement is met for r = 1/2 in Equation (17) and by the addition of 1 − p_1, resulting in:

H*(P_n) = (∑_{i=2}^{n} √p_i)² + 1 − p_1   (18)

which is the same as Equation (16) and for which H*(P_n^u) = n − 1.
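A small sketch illustrating why r = 1/2 is singled out in Equations (17) and (18): under the construction above, only that choice reproduces the value-validity requirement M(P_λ) = λ(n − 1); other values of r are shown for contrast.

```python
def power_sum_measure(p, r):
    """(sum over non-modal p_i of p_i^r)^(1/r) + 1 - p_1, with p ordered largest first."""
    q = sorted(p, reverse=True)
    return sum(x ** r for x in q[1:]) ** (1.0 / r) + 1 - q[0]

n, lam = 5, 0.4
P_lam = [1 - lam + lam / n] + [lam / n] * (n - 1)     # lambda distribution
for r in (0.3, 0.5, 0.7):
    # equality with lam * (n - 1) = 1.6 holds only at r = 0.5
    print(r, power_sum_measure(P_lam, r), lam * (n - 1))
```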
3.2. Properties of H*
The properties of the H* in Equation (16), some of which are readily apparent from its definition, can be outlined as follows:
Property 1.
H* is a continuous function of all its individual arguments p_1, …, p_n.
Property 2.
H* is (permutation) symmetric in the p_i.
Property 3.
H* is zero-indifferent (expansible), i.e., H*(p_1, …, p_n, 0) = H*(p_1, …, p_n).
Property 4.
For any given n and the P_n^0 and P_n^u in Equation (5):

H*(P_n^0) = 0,  H*(P_n^u) = n − 1   (19)
Property 5.
From Equation (19), H*(P_n^u) = n − 1 is strictly increasing in n.
Property 6.
H* is strictly Schur-concave so that, if P_n is majorized by Q_n (denoted by P_n ≺ Q_n):

H*(P_n) ≥ H*(Q_n)   (20)

with strict inequality unless P_n is simply a permutation of Q_n.
Property 7.
H* is concave, but not strictly concave.
Property 8.
H* meets the value-validity condition in Equation (10) with H*(P_λ) = λH*(P_n^u) = λ(n − 1).
Proof of Property 6.
The strict Schur-concavity of H* follows immediately from the partial derivatives:

∂H*/∂p_1 = −1,  ∂H*/∂p_i = (∑_{j=2}^{n} √p_j)/√p_i for i = 2, …, n

and the fact that ∂H*/∂p_i is strictly increasing in i = 1, …, n (unless some of the p_i are equal) ([18] (p. 84)). The majorization in Equation (20) is a more precise statement than the vague notion that the components of P_n are “more nearly equal” or “less spread out” than are those of Q_n. By definition, P_n ≺ Q_n if (and with the ordering in Equation (14) for P_n and Q_n):

∑_{i=1}^{k} p_i ≤ ∑_{i=1}^{k} q_i for k = 1, …, n − 1, with ∑_{i=1}^{n} p_i = ∑_{i=1}^{n} q_i = 1
([18] (p. 8)). ☐
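As an illustration of Property 6, the following sketch evaluates H* for a distribution P that is majorized by a distribution Q (hypothetical example values):

```python
import math

def h_star(p):
    q = sorted(p, reverse=True)
    return sum(math.sqrt(x) for x in q[1:]) ** 2 + 1 - q[0]

# Partial sums of P never exceed those of Q, so P is majorized by Q (P "more nearly equal"),
# and strict Schur-concavity then requires H*(P) > H*(Q).
P = [0.4, 0.3, 0.2, 0.1]
Q = [0.7, 0.1, 0.1, 0.1]
print(h_star(P), h_star(Q))   # 2.319... > 1.2
```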
Proof of Property 7.
From Equation (16) and for the probability distributions P_n = (p_1, …, p_n) and Q_n = (q_1, …, q_n) (each ordered as in Equation (14)) and all α ∈ [0, 1]:

H*(αP_n + (1 − α)Q_n) = [∑_{i=2}^{n} (αp_i + (1 − α)q_i)^{1/2}]² + 1 − [αp_1 + (1 − α)q_1]   (21)

From Minkowski’s inequality (e.g., [19] (p. 175)):

[∑_{i=2}^{n} (αp_i + (1 − α)q_i)^{1/2}]² ≥ [∑_{i=2}^{n} (αp_i)^{1/2}]² + [∑_{i=2}^{n} ((1 − α)q_i)^{1/2}]²   (22)

so that, from Equations (21) and (22):

H*(αP_n + (1 − α)Q_n) ≥ αH*(P_n) + (1 − α)H*(Q_n)   (23)

proving that H* is concave. However, and importantly, H* is not strictly concave since the inequality in Equation (23) is not strict for all P_n ≠ Q_n, such as for P_n = P_n^0 and Q_n = P_n^u in Equation (5) when

H*(αP_n^0 + (1 − α)P_n^u) = (1 − α)(n − 1) = αH*(P_n^0) + (1 − α)H*(P_n^u)   (24)

as required by the value-validity conditions in Equations (10) and (11). ☐
- Note 1: If a measure (function) M is strictly concave so that, instead of Equation (23), the inequality is strict for all P_n ≠ Q_n, then the condition in Equation (11) cannot be met. The H in Equation (1) is one such measure.
- Note 2: The extremal values H*(P_n^0) = 0 and H*(P_n^u) = n − 1 for a measure of randomness or uncertainty are also a logical requirement for valid difference comparisons. As a particular case of the proportional difference comparisons in Equation (2c), and for any integer m < n:

H*(P_{n+m}^u) − H*(P_n^u) = −[H*(P_{n−m}^u) − H*(P_n^u)]   (25)

i.e., adding an amount m to n results in the same absolute change in the value of H*(P_n^u) as does subtracting m from n in the equiprobable case. Or, in terms of the function f where f(n) = H*(P_n^u), Equation (25) can be expressed more conveniently as:

f(n + m) − f(n) = f(n) − f(n − m)   (26)

The general solution of the functional equation in Equation (26) is f(n) = an + b with real constants a and b [20] (p. 82), which equals n − 1 for a = 1 and b = −1.
- Note 3: For the binary case of n = 2, H*(0.5, 0.5) = 1, which equals H(0.5, 0.5) in Equation (1) if the base-2 logarithm is used. In fact, H(0.5, 0.5) = 1 is an axiom or required property, the normalization axiom, frequently used in information theory to justify the use of the base-2 logarithm in Equation (1) and bits as the unit of measurement [14] (Chapter 1). The binary entropy H*(p_1, p_2) follows from Equation (16) as H*(p_1, p_2) = p_2 + 1 − p_1 = 2p_2 (with p_1 ≥ p_2), i.e., twice the non-modal probability.
4. Generalization of H*
Instead of the pairwise geometric means in Equation (15), one could consider power means, or arithmetic means of order q, and hence the following parameterized family of entropies:

H*_q(P_n) = 2 ∑_{j=2}^{n} ∑_{i=2}^{j} [(p_i^q + p_j^q)/2]^{1/q},  q ≤ 0   (27)

of which the H* in Equations (15) and (16) is the particular member H*_0. Since a measure of randomness (uncertainty) should be zero-indifferent (see Property 3 of H*), it is clear from the formula in Equation (27) that q cannot be positive, i.e., q ≤ 0, where H*_0 means the limit when q → 0 (the geometric mean) and where the power mean [(p_i^q + p_j^q)/2]^{1/q} is taken to be 0 for q < 0 whenever p_i or p_j equals 0 (see, e.g., [21] (Chapter 2) for the properties of such power means). One of the important properties of the power mean is that it is a non-decreasing function of q and is strictly increasing unless p_i = p_j. Besides these power means, there are other types of means that could be considered (e.g., [18] (pp. 139–145), [22]).
Since the power mean is strictly increasing in q, it follows from Equation (27) that, for any probability distribution P_n:

H*_{-∞}(P_n) ≤ H*_q(P_n) ≤ H*_0(P_n) = H*(P_n),  −∞ ≤ q ≤ 0   (28)

where the lower limit H*_{-∞} is the limit of H*_q as q → −∞ and H*_0 = H* is defined in Equations (15) and (16) and is the limit of H*_q as q → 0. The inequalities in Equation (28) are strict unless P_n equals P_n^0 or P_n^u in Equation (5) (or, more generally, unless p_2 = ⋯ = p_n).
Each member of H*_q has the same types of properties as those of H* discussed above. The strict Schur-concavity of H*_q follows from the fact that (a) H*_q is (permutation) symmetric in the p_i and (b) the partial derivatives ∂H*_q/∂p_i of the expression in Equation (27) are clearly strictly increasing in i = 2, …, n (unless p_2 = ⋯ = p_n) for all q < 0. The case when q = 0 was proved in the preceding subsection.
As with any reasonable measure of randomness or uncertainty, each member of H*_q in Equation (27) is a compound measure consisting of two components: the dimension n of the distribution or vector P_n and the uniformity (evenness) with which the elements of P_n are distributed. For any probability distribution P_n, this fact can be most simply represented by:

H*_q(P_n) = H*_q(P_n^u) U_q(P_n) = (n − 1) U_q(P_n)   (29)

where H*_q(P_n^u) = n − 1 for the uniform distribution in Equation (5) and where U_q(P_n) = H*_q(P_n)/(n − 1) ∈ [0, 1] reflects the uniformity (evenness) of P_n. The factor n − 1 basically controls for n. For the distribution P_λ in Equation (6), U_q(P_λ) = λ.
The limiting member of H*_q in Equation (28) as q → −∞ is defined by:

H*_{-∞}(P_n) = 2 ∑_{j=2}^{n} ∑_{i=2}^{j} min(p_i, p_j) = ∑_{i=2}^{n} ∑_{j=2}^{n} min(p_i, p_j) + 1 − p_1   (30a)

H*_{-∞}(P_n) = 2 ∑_{i=1}^{n} (i − 1) p_i   (30b)

where the expression in Equation (30b) can easily be seen to follow directly from Equation (30a) (remembering again the order in Equation (14)). The second expression in Equation (30a) has been briefly mentioned by Morales et al. [23] and the form in Equation (30b), divided by 2, has been suggested by Patil and Taillie [24] as one potential measure of diversity.
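A sketch of the family in Equation (27) and its q → −∞ limit, under the pairwise-power-mean form reconstructed above; the last line evaluates the reconstructed Equation (30b):

```python
import math
from itertools import combinations_with_replacement

def power_mean(x, y, q):
    """Power mean of order q of two values (q -> 0: geometric mean; q -> -inf: minimum)."""
    if q == 0:
        return math.sqrt(x * y)
    if x == 0 or y == 0:
        return 0.0 if q < 0 else ((x ** q + y ** q) / 2) ** (1 / q)
    return ((x ** q + y ** q) / 2) ** (1 / q)

def h_star_q(p, q):
    """Equation (27): twice the sum of pairwise power means of the non-modal probabilities."""
    nm = sorted(p, reverse=True)[1:]
    return 2 * sum(power_mean(a, b, q) for a, b in combinations_with_replacement(nm, 2))

p = [0.4, 0.3, 0.2, 0.1]
for q in (0, -1, -10, -100):
    print(q, h_star_q(p, q))                        # non-increasing as q decreases
nm = sorted(p, reverse=True)
print(2 * sum(i * x for i, x in enumerate(nm)))     # Equation (30b): 2 * sum (i-1) p_i = 2.0
```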
5. Comparative Analysis
5.1. Why the Preference for H*
From a practical point of view, what sets the H* in Equation (3) or Equation (16) apart from any other member of the H*_q family in Equation (27) is its ease of computation. Its values can easily be computed on a hand calculator for any probability distribution even when the dimension n is quite large. For other members of H*_q, all of the pairwise means have to be computed, which becomes practically impossible without the use of a computer program even when n is not large. The computational effort for the member H*_{-∞} (when Equation (30b) is used) is somewhat less than for other members. Nevertheless, the apparently simpler formula for H*_{-∞} in Equation (30b) requires that all p_i be ordered as in Equation (14), which can be very tedious if done manually and nearly impossible if n is large.
The H* is also favored over other members of H*_q when considering the agreement with some other measure based on the Euclidean distance and the familiar standard deviation. Specifically, for any probability distribution P_n and the uniform P_n^u, consider the following linearly decreasing function of the distance d(P_n, P_n^u):

D(P_n) = (n − 1)[1 − √(n/(n − 1)) d(P_n, P_n^u)] = (n − 1)(1 − CNV)   (31)

where d(P_n, P_n^u) = σ√n, with σ being the standard deviation of the p_i (with divisor n rather than n − 1), and CNV = √(n/(n − 1)) d(P_n, P_n^u) is the coefficient of nominal variation [25,26]. It is clear from Equation (31) that, for P_n^0 and P_n^u in Equation (5), D(P_n^0) = 0 and D(P_n^u) = n − 1. Also, for the lambda distribution P_λ in Equation (6), D(P_λ) = λ(n − 1), so that D satisfies the value-validity condition in Equation (10). Of course, D is not zero-indifferent (see Property 3 for H*).
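Assuming the reconstructed form of Equation (31), a short check of the stated boundary and value-validity properties of D:

```python
import math

def d_measure(p):
    """D(P) = (n - 1) * [1 - sqrt(n/(n-1)) * d(P, uniform)], per the form of Equation (31) above."""
    n = len(p)
    dist = math.sqrt(sum((x - 1.0 / n) ** 2 for x in p))
    return (n - 1) * (1 - math.sqrt(n / (n - 1)) * dist)

print(d_measure([1, 0, 0, 0]))        # degenerate: 0
print(d_measure([0.25] * 4))          # uniform: n - 1 = 3
lam, n = 0.6, 4
P_lam = [1 - lam + lam / n] + [lam / n] * (n - 1)
print(d_measure(P_lam), lam * (n - 1))   # both equal 1.8 (value validity)
```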
Since the Euclidean distance and the standard deviation are such universally used measures, it is to be expected that an acceptable measure of randomness (uncertainty) should not differ substantially from D in Equation (31). From numerical examples, it is seen that values of H* in Equation (16) tend to be closer to those of D in Equation (31) than are the values of any other member of the H*_q family in Equation (27). In order to demonstrate this fact, a computer simulation was used to generate a number of random distributions using the following algorithm. For each randomly generated probability distribution P_n, n was first generated as a random integer between 3 and 20, inclusive. Then, with the ordering in Equation (14), each p_i was generated as a random number (to 5 decimal places) within intervals consistent with that ordering, with the final probability determined by the unit-sum constraint ∑_{i=1}^{n} p_i = 1. For each such generated distribution, the values of H*, H*_{-∞}, and D were computed according to the formulas in Equations (16), (30b) and (31), as were their corresponding uniformity (evenness) indices from Equation (29). After excluding some (five) distributions that were nearly equal to P_n^u in Equation (5), the results for 30 different distributions are summarized in Table 1.
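The exact generation intervals are not reproduced here; the following sketch uses a simpler scheme (random weights, sorted and normalized) merely to illustrate how the three measures can be compared over random distributions:

```python
import math, random

def h_star(p):
    q = sorted(p, reverse=True)
    return sum(math.sqrt(x) for x in q[1:]) ** 2 + 1 - q[0]

def h_star_minf(p):
    q = sorted(p, reverse=True)
    return 2 * sum(i * x for i, x in enumerate(q))

def d_measure(p):
    n = len(p)
    dist = math.sqrt(sum((x - 1.0 / n) ** 2 for x in p))
    return (n - 1) * (1 - math.sqrt(n / (n - 1)) * dist)

random.seed(1)
for _ in range(5):
    n = random.randint(3, 20)
    w = sorted((random.random() for _ in range(n)), reverse=True)
    s = sum(w)
    p = [x / s for x in w]                       # random ordered probability vector
    print(n, round(h_star(p), 2), round(h_star_minf(p), 2), round(d_measure(p), 2))
```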
Table 1.
Values of H*, H*_{-∞}, and D in Equations (16), (30b), and (31) and of their corresponding uniformity (evenness) indices from Equation (29) based on 30 randomly generated probability distributions.
It is apparent from the data in Table 1 that H* agrees quite closely with D and clearly more so than does H*_{-∞}. Exceptions are Data Sets 1, 11, and 26, for which the H*-values differ considerably from those of D, but still less so than do the H*_{-∞}-values. If D is used to predict H* (i.e., for the fitted model H* = D), it is found for the 30 data sets in Table 1 that the coefficient of determination, when properly computed as R² = 1 − (residual sum of squares)/(total sum of squares) [27], becomes R² = 0.98 (i.e., 98% of the variation of H* is explained by the fitted model) as compared to a substantially smaller value in the case of H*_{-∞}. Also, the root mean square (RMS) of the differences between the values of H* and D is found to be 0.64 as compared to 1.33 for H*_{-∞} and D. Similarly, when comparing the values of the uniformity (evenness) indices from Equation (29), the values for H* are considerably closer to those for D than are the values for H*_{-∞}.
No other member of H*_q in Equation (27) is generally in as close agreement with D as is H*, although any such member agrees more closely with D than does H*_{-∞}. This can be explained by the fact that (a) whenever there is a notable difference between the values of H* and D, those of H* tend to be less than those of D, as seen from Table 1; and (b) H*_q is a strictly increasing function of q for any given P_n other than those in Equation (5).
5.2. Comparative Weights on the p_i
The difference between values of H* and H*_{-∞} as demonstrated in Table 1, or between those of any two members of the H*_q family, is due to the fact that H*_q places different weights or emphases on the p_i (i = 1, …, n) depending upon q. When considering each pairwise mean in Equation (27), p_i and p_j are effectively weighted equally only when q = 1 (the arithmetic mean), with the smaller of the two probabilities receiving relatively more weight as q decreases. Then, since (a) the power mean is strictly increasing in q and (b) H*_q is zero-indifferent (Property 3 of H*) only for q ≤ 0, the H* = H*_0 in Equations (15) and (16) is the zero-indifferent member of H*_q that is always closest in value to H*_1 and whose pairwise means are always closest to the arithmetic means (p_i + p_j)/2 for all i and j.
Besides the weights placed on each component of all pairs (p_i, p_j), the weights given to each individual p_i can also be examined by expressing the H*_q in Equation (27) as the following weighted sum:

H*_q(P_n) = ∑_{i=2}^{n} w_{qi} p_i,  w_{qi} = 1 + (1/p_i) ∑_{j=2}^{n} [(p_i^q + p_j^q)/2]^{1/q}   (32)

which shows that the weights w_{qi} are increasing in both q and i. The explicit weights for H*_{-∞} and for H*_0 = H* follow from Equation (32) for i = 2, …, n, with those for H* being basically a compromise between the weights for H*_{-∞} and those for H*_1. Note also that the weights for H*_{-∞} from Equation (32) can differ substantially from the coefficients 2(i − 1) in Equation (30b), as can the weights for H* from Equation (32) when compared with the weights implied by the expression in Equation (16).
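A consistency check of the weighted-sum representation in Equation (32) as reconstructed above (it should agree with the pairwise form of Equation (27)):

```python
import math
from itertools import combinations_with_replacement

def h_star_q(p, q):
    """Pairwise form of Equation (27)."""
    nm = sorted(p, reverse=True)[1:]
    pm = lambda x, y: math.sqrt(x * y) if q == 0 else ((x ** q + y ** q) / 2) ** (1 / q)
    return 2 * sum(pm(a, b) for a, b in combinations_with_replacement(nm, 2))

def h_star_q_weighted(p, q):
    """Weighted-sum form: sum over i >= 2 of w_i * p_i with w_i = 1 + (1/p_i) * sum_j M_q(p_i, p_j)."""
    nm = sorted(p, reverse=True)[1:]
    pm = lambda x, y: math.sqrt(x * y) if q == 0 else ((x ** q + y ** q) / 2) ** (1 / q)
    return sum((1 + sum(pm(x, y) for y in nm) / x) * x for x in nm)

p = [0.4, 0.3, 0.2, 0.1]
for q in (0, -1, -5):
    print(q, h_star_q(p, q), h_star_q_weighted(p, q))   # the two forms agree
```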
When comparing H* and H*_{-∞}, small p_i's are given more weight by H* than by H*_{-∞}, and the addition of low-probability components to a set of events has more effect on H* than on H*_{-∞}. However, when weighing the pros and cons of such relative sensitivity to small p_i's, it is important to keep in mind the relationship in Equation (29) and not jump to conclusions. For example, when going from a four-event distribution to a six-event distribution obtained by adding two small-probability events, H* increases from 2.32 to 2.91, a 25% increase, while H*_{-∞} increases from 2.00 to 2.12, a 6% increase. However, from Equation (29), the dimensional component of both H* and H*_{-∞} increased by 67% (from n − 1 = 3 to n − 1 = 5) whereas the uniformity (evenness) components decreased by 25% in the case of H* (from 2.32/3 to 2.91/5) and 37% for H*_{-∞} (from 2.00/3 to 2.12/5). In this regard, the 25% increase in randomness (uncertainty) as measured by H* does not appear unreasonable.
5.3. Inconsistent Orderings
Although all members of the family H*_q in Equation (27) have the same types of properties, including the value-validity property in Equation (10), this does not necessarily imply that different members will always produce the same results for the comparisons in Equation (2). Such lack of consistency is inevitable whenever measures are used to summarize data sets into a single number. However, as stated by Patil and Taillie [24] (p. 551), “Inconsistent measures…are a familiar problem and should not be a cause for undue pessimism”, pointing out the fact that, for instance, the arithmetic mean and the median are not consistent measures of average (central tendency) and the standard deviation and the mean absolute deviation are inconsistent measures of variability (spread). One type of consistent result for all members of H*_q is the size (order) comparison in Equation (2a) whenever P_n is majorized by Q_n and P_n is not simply a permutation of Q_n. This is the result of Equation (20) and the fact, as proved above, that H*_q is strictly Schur-concave for all q ≤ 0.
It is only when two measures M_1 and M_2 have a perfect linear relationship that (a) the comparison results from Equation (2) will always be consistent and (b) the compliance by M_1 with the value-validity conditions in Equations (10) and (11) also implies compliance by M_2. In the case of H* and H*_{-∞}, and from the simulation results in Table 1, Pearson’s correlation coefficient between H* and H*_{-∞} is found to be r = 0.993, indicating a near perfect linear relationship between the two measures. However, since the linearity is not truly perfect or exact, H* and H*_{-∞} will not always give the same results for the comparisons in Equation (2), as is evident from some of the data in Table 1.
6. Discussion
The value-validity condition in Equation (10), as a necessary requirement for valid difference comparisons as in Equation (2), is based on Euclidean distances. Such distances are also used as a basis for the preference of H* over other potential members of the family of entropies H*_q in Equation (27). This distance metric is the standard one in engineering and science. The use of any other “distance” measures, such as the directed divergencies discussed below, would seem to require particular justification in the context of value-validity assessment.
As a simple numerical example illustrating the reasoning behind the value-validity arguments in Equations (6)–(13) and the use of Euclidean distances, consider the probability distributions P_n^0, P_n^u, and the midpoint distribution P_{0.5} based on P_λ in Equation (6).
The Euclidean distances are then d(P_n^0, P_{0.5}) = d(P_n^u, P_{0.5}) = d(P_n^0, P_n^u)/2. A measure of uncertainty (randomness) M that takes on reasonable numerical values within the general bounds M(P_n^0) = 0 and M(P_n^u) should in this example satisfy the equality M(P_{0.5}) − M(P_n^0) = M(P_n^u) − M(P_{0.5}) so that, with M(P_n^0) = 0, M(P_{0.5}) = M(P_n^u)/2. That is, since P_{0.5} is the same distance from P_n^0 as it is from P_n^u and each element of P_{0.5} is the same distance from the corresponding element of P_n^0 as it is from that of P_n^u, M would reflect this fact by taking on the value M(P_{0.5}) = M(P_n^u)/2. The H* in Equation (3) or Equation (16) meets this requirement with H*(P_{0.5}) = (n − 1)/2. However, in the case of H in Equation (1), H(P_{0.5}) > H(P_n^u)/2 = (log n)/2 for all n, a substantial overstatement of the extent of the uncertainty or randomness.
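A numerical illustration with n = 5 (an illustrative choice of n):

```python
import math

def h_star(p):
    q = sorted(p, reverse=True)
    return sum(math.sqrt(x) for x in q[1:]) ** 2 + 1 - q[0]

def shannon(p):
    return -sum(x * math.log(x) for x in p if x > 0)

n = 5
P_half = [1 - 0.5 + 0.5 / n] + [0.5 / n] * (n - 1)    # midpoint of degenerate and uniform
print(h_star(P_half), (n - 1) / 2)                    # 2.0 = (n - 1)/2, as required
print(shannon(P_half), math.log(n) / 2)               # 1.228 > 0.805: H overstates
```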
A similar comparison between H and H* for P_λ with two additional values of λ is given in Table 2 together with the results for some other probability distributions. The results are also given in terms of the normalized measures H/log n and H*/(n − 1) as well as for D in Equation (31). As seen from Table 2, while H*(P_λ)/(n − 1) = λ, H(P_λ)/log n exceeds λ for both λ-values. For all distributions in Table 2, the values of H*/(n − 1) are quite comparable to those of D/(n − 1), but those of H/log n are all considerably greater.
Table 2.
Comparative results for H in Equation (1) and H* in Equation (16) and their normalized forms, as well as D from Equation (31), for various probability distributions.
Several of the distributions are included in Table 2 to exemplify the types of contradictory results that may be obtained when making the difference comparisons in Equation (2) based on H versus H*. Two of the distributions in Table 2 are real data for the market shares (proportions) of the carbonated soft drinks industry in the U.S. and the world-wide market shares of cell phones, respectively (obtained by Googling “market shares” by industries). Some of the smaller market shares are not given in Table 2 because of space limitations, but were included in the computations. The entropy H, which has been used as a measure of market concentration or rather of its converse, deconcentration (e.g., [28]), would indicate that these two industries have nearly the same market deconcentration. By contrast, when considered in terms of H*, for which such a comparison is valid because of the value-validity property of H*, the results in Table 2 show that the cell-phone industry is about 20% more deconcentrated than the soft-drink industry. Similarly, for the fictitious distributions in Table 2, the type of difference comparison in Equation (2b) based on H* gives the reverse of the result that would have been obtained had H been used for this comparison.
Instead of using the Euclidean metric to formulate the value-validity conditions in Section 2, one could perhaps consider other potential “distance” measures such as divergencies, also referred to as “statistical distances”. The best known such measure of the divergence of the distribution P_n from the distribution Q_n is the Kullback–Leibler divergence [29] defined as:

KLD(P_n, Q_n) = ∑_{i=1}^{n} p_i log(p_i/q_i)

This measure is directional or asymmetric in P_n and Q_n. A symmetric measure is the so-called Jensen–Shannon divergence (JSD) (e.g., [30,31,32,33]), which can be expressed in terms of the Kullback–Leibler divergence (KLD) as:

JSD(P_n, Q_n) = (1/2)KLD(P_n, M_n) + (1/2)KLD(Q_n, M_n)

where M_n = (P_n + Q_n)/2. Neither KLD nor JSD is a metric, but √JSD is [34].
Consider now the family of distributions P_λ in Equation (6) and the extreme members P_n^0 and P_n^u in Equation (5). For the case of n = 5, for example, it is found that KLD(P_5^0, P_5^u) = log 5 = 1.61 (KLD(P_5^u, P_5^0) is undefined) and JSD(P_5^0, P_5^u) = 0.42. In the case of λ = 0.5, JSD(P_{0.5}, P_5^0) = 0.16 while JSD(P_{0.5}, P_5^u) = 0.09, so that the two divergences differ even though P_{0.5} is the midpoint of P_5^0 and P_5^u. Similarly, KLD(P_5^0, P_{0.5}) = 0.51 and KLD(P_{0.5}, P_5^u) = 0.38. These results differ greatly from those based on Euclidean distances for which d(P_5^0, P_{0.5}) = d(P_5^u, P_{0.5}) = d(P_5^0, P_5^u)/2. The fact that H*(P_{0.5}) = H*(P_n^u)/2 = (n − 1)/2 for all n, which is also reflected by the normalized value H*(P_{0.5})/(n − 1) = 0.5, corresponds to the fact that each component of P_{0.5} is of equal distance from the corresponding components of P_n^0 and P_n^u. However, no such correspondence exists for the divergence measures KLD and JSD.
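The divergence values quoted above for n = 5 can be recomputed as follows (standard KLD and JSD definitions; numbers rounded in the text):

```python
import math

def kld(p, q):
    """Kullback-Leibler divergence KL(P||Q), natural log; undefined when some q_i = 0 < p_i."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def jsd(p, q):
    """Jensen-Shannon divergence expressed via KLD to the midpoint distribution."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)

n = 5
P0 = [1.0] + [0.0] * (n - 1)
Pu = [1.0 / n] * n
Ph = [0.6] + [0.1] * (n - 1)               # P_{0.5} for n = 5
print(kld(P0, Pu))                         # log 5 = 1.609...
print(jsd(P0, Pu))                         # 0.42...
print(jsd(Ph, P0), jsd(Ph, Pu))            # 0.16... vs 0.09...: unequal
print(kld(P0, Ph), kld(Ph, Pu))            # 0.51... and 0.38...
```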
The derivation of H* in Equation (3) or Equation (16) is based on the exclusion of the modal probability p_1. Of course, p_1 can enter the expression for H* since p_1 = 1 − ∑_{i=2}^{n} p_i. One may wonder what the result would be if a different p_i were to be excluded. If the smallest probability p_n is excluded, the measure would not be zero-indifferent (expansible). If any p_i other than p_1 is excluded, then the measure would not be strictly Schur-concave, as can be verified from the proof of Property 6 of H*. This property is essential for any measure of uncertainty (randomness). In fact, the exclusion of p_1 makes H* unique in this regard.
It should also be emphasized that even though the entropy H in Equation (1) lacks the value-validity property, it has many of the same properties as H* and undoubtedly has numerous useful and appropriate applications, as demonstrated in the extensive published literature. The problems with H arise when it is used uncritically and indiscriminately in fields far from its origin as a statistical concept of communication theory. Both Shannon [35] and Wiener [36] cautioned against such uncritical applications. It is when H or its normalized form H/log n is used as a summary measure (statistic) of various attributes (characteristics) and when its values are interpreted and compared that its lack of the value-validity property can lead to incorrect and misleading results and conclusions. This is the motivation for introducing the new entropy H* as a measure that overcomes the lack of value validity by H.
7. Statistical Inference about H*
Consider now the case when each p_i is the multinomial sample estimate (sample proportion) of the unknown population probability π_i for i = 1, …, n, based on sample size N. It may then be of interest to investigate the potential statistical bias of the estimator H*(P_n) and to construct confidence intervals for the population measure H*(Π_n) for the population probability distribution Π_n = (π_1, …, π_n).
7.1. Bias
Treating the sample proportions p_i as random variables and expanding H*(P_n) in Equation (16) into a Taylor series about the π_i for i = 1, …, n, the following result is obtained:

H*(P_n) = H*(Π_n) + ∑_{i=1}^{n} (∂H*/∂π_i)(p_i − π_i) + (1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} (∂²H*/∂π_i∂π_j)(p_i − π_i)(p_j − π_j) + ⋯   (33)

Taking the expected value of each side of Equation (33) and using the well-known expectations E(p_i) = π_i, E[(p_i − π_i)²] = π_i(1 − π_i)/N, and E[(p_i − π_i)(p_j − π_j)] = −π_iπ_j/N for i, j = 1, …, n with i ≠ j, it is found that:

E[H*(P_n)] ≅ H*(Π_n) − (1/(4N)) [ (∑_{i=2}^{n} √π_i)(∑_{i=2}^{n} 1/√π_i) − (n − 1) ]   (34)
Equation (34) shows that the estimator H*(P_n), while asymptotically unbiased, does have a small (downward) bias of order 1/N for finite sample size N. However, unless N is small, this bias can effectively be ignored for all practical purposes.
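A quick Monte Carlo illustration of this small downward bias, using a hypothetical population distribution and N = 100:

```python
import math, random
from collections import Counter

def h_star(p):
    q = sorted(p, reverse=True)
    return sum(math.sqrt(x) for x in q[1:]) ** 2 + 1 - q[0]

random.seed(0)
pi = [0.4, 0.3, 0.2, 0.1]          # hypothetical population distribution
N, reps = 100, 10000
est = []
for _ in range(reps):
    counts = Counter(random.choices(range(len(pi)), weights=pi, k=N))
    est.append(h_star([counts.get(i, 0) / N for i in range(len(pi))]))
print(h_star(pi), sum(est) / reps)   # population value 2.319... vs a slightly smaller average
```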
7.2. Confidence Interval Construction
Under multinomial sampling and based on the delta method of large sample theory (e.g., [37] (Chapter 14)), the following convergence to the normal distribution holds:

√N [H*(P_n) − H*(Π_n)] → N(0, σ²)  as N → ∞   (35)

where, in terms of partial derivatives:

σ² = ∑_{i=1}^{n} π_i (∂H*/∂π_i)² − (∑_{i=1}^{n} π_i ∂H*/∂π_i)²   (36)
That is, for large N, H*(P_n) is approximately normally distributed with mean H*(Π_n) and variance Var[H*(P_n)] = σ²/N, or standard error σ/√N. The limiting normal distribution in Equation (35) also holds when, as is necessary in practice, the σ² is replaced with the estimated variance obtained by substituting the sample estimates (proportions) p_i for the population probabilities π_i for i = 1, …, n, resulting in the estimated standard error SE = σ̂/√N. It then readily follows from Equations (16) and (36) that:

SE = (1/√N) √[ p_1 + (n − 1)(∑_{i=2}^{n} √p_i)² − ((∑_{i=2}^{n} √p_i)² − p_1)² ]   (37)
As a simple numerical example, consider a sample distribution P_n based on sample size N = 100 for which, from Equation (16), H*(P_n) = 2.22 and, from Equation (37), SE = 0.19. Therefore, because of Equation (35), an approximate 95% confidence interval for the population measure H*(Π_n) becomes 2.22 ± 1.96(0.19), or [1.85, 2.59]. Statistical hypotheses such as H_0: H*(Π_n) = h_0 versus H_1: H*(Π_n) ≠ h_0 can also be tested based on Equations (35) and (37).
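Under the delta-method variance as reconstructed in Equations (36) and (37), and using a hypothetical four-category sample (not the example above), a sketch of the interval construction:

```python
import math

def h_star(p):
    q = sorted(p, reverse=True)
    return sum(math.sqrt(x) for x in q[1:]) ** 2 + 1 - q[0]

def se_h_star(p, N):
    """Delta-method standard error, following the variance form sketched in Equations (36)-(37)."""
    q = sorted(p, reverse=True)
    S2 = sum(math.sqrt(x) for x in q[1:]) ** 2
    var = q[0] + (len(p) - 1) * S2 - (S2 - q[0]) ** 2
    return math.sqrt(var / N)

p, N = [0.40, 0.30, 0.20, 0.10], 100       # hypothetical sample proportions
est, se = h_star(p), se_h_star(p, N)
print(est, se, (est - 1.96 * se, est + 1.96 * se))   # point estimate, SE, approximate 95% CI
```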
8. Conclusions
Since the ubiquitous Boltzmann–Shannon entropy H is only valid for making size (order or “larger than”) comparisons, the entropy H* is introduced as an alternative measure of randomness (uncertainty) that is more informative than H in the sense that H* can also be used for making valid difference comparisons as in Equations (2b) and (2c). The H*, which is a particular member of the family of entropies H*_q and is basically a compromise between the members H*_{-∞} and H*_1, has the types of desirable properties one would reasonably expect of a randomness (uncertainty) measure. One of the differences between H* and H*_{-∞} is that small probabilities have a greater influence on H* than on H*_{-∞}. The addition of some small-probability events causes a larger increase in H* than in H*_{-∞}, but causes a smaller decrease in the uniformity (evenness) index for H* than in that for H*_{-∞} as defined in Equation (29).
Besides being computationally most simple, which is certainly a practical advantage, H* is also that member of H*_q that appears to be most nearly linearly (and decreasingly) related to the Euclidean distance between the points P_n and P_n^u, or, equivalently, to the standard deviation σ of the p_i. The standard deviation is the usual measure of variability (spread) for a set of data, although it is not resistant against “outliers” (extreme and suspect data points). However, “outliers” are not a concern when dealing with probabilities p_i ∈ [0, 1]. Therefore, σ cannot justifiably be criticized for being excessively influenced by large or small p_i's, with the same argument extending to H*.
Acknowledgments
The author would like to thank the reviewers for constructive and helpful comments.
Conflicts of Interest
The author declares no conflict of interest.
References
- Boltzmann, L. Weitere Studien über das Wärmegleichgewicht unter Gasmolekülen. In Sitzungsberichte der Kaiserliche Akademie der Wissenschaften, II Abteil; (Vol. 66, Pt. 2); K.-K. Hof- und Staatsdruckerei in Commission bei C. Gerold’s Sohn: Wien, Austria, 1872; pp. 275–370. (In German)
- Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423, 623–656.
- Klir, G.J. Uncertainty and Information: Foundations of a Generalized Information Theory; Wiley: Hoboken, NJ, USA, 2006.
- Ruelle, D. Chance and Chaos; Princeton University Press: Princeton, NJ, USA, 1991.
- Han, T.S.; Kobayashi, K. Mathematics of Information and Coding; American Mathematical Society: Providence, RI, USA, 2002.
- Shannon, C.E.; Weaver, W. The Mathematical Theory of Communication; University of Illinois Press: Urbana, IL, USA, 1949.
- Arndt, C. Information Measures: Information and Its Description in Science and Engineering; Springer: Berlin/Heidelberg, Germany, 2004.
- Kapur, J.N. Measures of Information and Their Applications; Wiley: New Delhi, India, 1994.
- Kvålseth, T.O. Entropy. In International Encyclopedia of Statistical Science; Lovric, M., Ed.; Springer: Berlin/Heidelberg, Germany, 2011; Part 5; pp. 436–439.
- Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1961; University of California Press: Berkeley, CA, USA, 1961; Volume 1, pp. 547–561.
- Peitgen, H.-O.; Jürgens, H.; Saupe, D. Chaos and Fractals: New Frontiers of Science, 2nd ed.; Springer: New York, NY, USA, 2004.
- Tsallis, C. Possible generalization of Boltzmann–Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487.
- Norwich, K.H. Information, Sensation, and Perception; Academic Press: San Diego, CA, USA, 1993.
- Aczél, J.; Daróczy, Z. On Measures of Information and Their Characterizations; Academic Press: New York, NY, USA, 1975.
- Kvålseth, T.O. Entropy evaluation based on value validity. Entropy 2014, 16, 4855–4873.
- Hand, D.J. Measurement Theory and Practice; Wiley: London, UK, 2004.
- Kvålseth, T.O. The Lambda distribution and its applications to categorical summary measures. Adv. Appl. Stat. 2011, 24, 83–106.
- Marshall, A.W.; Olkin, I.; Arnold, B.C. Inequalities: Theory of Majorization and Its Applications, 2nd ed.; Springer: New York, NY, USA, 2011.
- Bullen, P.S. A Dictionary of Inequalities; Addison Wesley Longman: Essex, UK, 1998.
- Aczél, J. Lectures on Functional Equations and Their Applications; Academic Press: New York, NY, USA, 1966.
- Hardy, G.H.; Littlewood, J.E.; Pólya, G. Inequalities; Cambridge University Press: Cambridge, UK, 1934.
- Ebanks, B. Looking for a few good means. Am. Math. Mon. 2012, 119, 658–669.
- Morales, D.; Pardo, L.; Vajda, I. Uncertainty of discrete stochastic systems: General theory and statistical inference. IEEE Trans. Syst. Man Cybern. Part A 1996, 26, 681–697.
- Patil, G.P.; Taillie, C. Diversity as a concept and its measurement. J. Am. Stat. Assoc. 1982, 77, 548–567.
- Kvålseth, T.O. Coefficients of variation for nominal and ordinal categorical data. Percept. Mot. Skills 1995, 80, 843–847.
- Kvålseth, T.O. Variation for categorical variables. In International Encyclopedia of Statistical Science; Lovric, M., Ed.; Springer: Berlin/Heidelberg, Germany, 2011; Part 22; pp. 1642–1645.
- Kvålseth, T.O. Cautionary note about R2. Am. Stat. 1985, 39, 279–285.
- Nawrocki, D.; Carter, W. Industry competitiveness using Herfindahl and entropy concentration indices with firm market capitalization data. Appl. Econ. 2010, 42, 2855–2863.
- Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
- Lin, J. Divergence measures based on Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151.
- Wong, A.K.C.; You, M. Entropy and distance of random graphs with application to structural pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell. 1985, PAMI-7, 599–609.
- Sagar, R.P.; Laguna, H.G.; Guevara, N.L. Electron pair density information measures in atomic systems. Int. J. Quantum Chem. 2011, 111, 3497–3504.
- Antolin, J.; Angulo, J.C.; López-Rosa, S. Fisher and Jensen–Shannon divergences: Quantitative comparisons among distributions. Application to position and momentum atomic densities. J. Chem. Phys. 2009, 130, 074110.
- Endres, D.M.; Schindelin, J.E. A new metric for probability distributions. IEEE Trans. Inf. Theory 2003, 49, 1858–1860.
- Shannon, C.E. The bandwagon. IRE Trans. Inf. Theory 1956, 2, 3.
- Wiener, N. What is information theory? IRE Trans. Inf. Theory 1956, 2, 48.
- Bishop, Y.M.M.; Fienberg, S.E.; Holland, P.W. Discrete Multivariate Analysis: Theory and Practice; MIT Press: Cambridge, MA, USA, 1975.
© 2016 by the author; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).