1. Introduction
Entropy and information are among the most fundamental concepts in thermodynamics, statistical mechanics and information theory [1,2,3,4,5,6,7]. According to the common view, entropy and information (entropy for short in what follows) are two names for the same thing. They are a measure of disorder or statistical uncertainty associated with probability. The relationship between entropy and probability has been subject to rigorous mathematical study since the work of Shannon [2] and Khinchin [7]. Nevertheless, some confusion persists concerning the expression of entropy for continuous probability distributions [4,5,8].
The best known and most employed statistical expression of informational entropy is in the form of a logarithmic functional of probability distribution (Boltzmann–Shannon entropy), a form proposed for the first time by Boltzmann in his H-theorem [1] with continuous distribution of particles, then used by Gibbs in his work on statistical mechanics [3], and by Shannon in his information theory [2]. An axiomatic derivation (or proof of the uniqueness) of the formula was given in [2] by using discrete probability distributions. For a system having W discrete microstates, each having probability $p_i$ ($i = 1, 2, \ldots, W$), the Boltzmann–Shannon (BS) entropy $S$ is given by
$$S = -\sum_{i=1}^{W} p_i \ln p_i \qquad (1)$$
(We suppose here that the Boltzmann constant $k_B = 1$). This information measure is always positive because $0 \le p_i \le 1$.
When the variable $x$ of the states becomes continuous, entropy is sometimes called differential or continuous entropy given by [2,3,4,8]:
$$S = -\int p(x) \ln p(x)\, dx \qquad (2)$$
where $p(x)$ is the probability density distribution giving the probability $p(x)\,dx$ of finding the system in the state interval between $x$ and $x + dx$, across the range of all possible states. The first use of this integral form dates back to Boltzmann [1]. Gibbs mentioned this formula in his book [3]. Shannon [2] as well intuitively took Equation (2) for granted as an analogue of Equation (1) without giving a mathematical derivation or proof of its uniqueness. As far as we know, an axiomatic derivation of Equation (2), as was done for Equation (1) by Shannon [2] and Khinchin [7], is still missing to date.
In this work, we focus on one specific mathematical property of Equation (2) concerning its sign. As mentioned above, in the case of Equation (1) with a discrete probability distribution, since $p_i$ is positive and smaller than 1, $\ln p_i < 0$, which guarantees $S > 0$ [2]. However, in the case of a continuous probability distribution as in Equation (2), $p(x)$ can be larger than 1, leading to the possibility that $\ln p(x) > 0$ and, therefore, to negative informational entropy. A list of continuous entropies calculated from Equation (2) for many probability density distributions, most of which can be negative, is given in [8]; it includes common distributions such as the uniform, normal and exponential distributions. This negative entropy problem can also occur with many other entropy measures proposed in different contexts as generalizations of the BS formula (see, for example, [
9]). As all these formulas contain adjustable parameters and recover the BS formula when the parameters take specific values, they risk negative values with continuous probability distribution in the same way as Equation (2). For example, the continuous Tsallis entropy
[
10] and the Renyi entropy
[
11] are both negative when
, which can happen for
with
and for
with
. Most of the generalized continuous entropies containing the integral of
(counterpart of
) [
12] show negative values.
As is well known, thermodynamic entropy cannot be negative due to the third law of thermodynamics [6]. It may contain a constant in addition to the function term of probability, as proposed by Boltzmann, Gibbs and Shannon [1,2,3]. It is this function, depending only on probability, that assumes the role of entropy as a measure of disorder or probabilistic uncertainty. Logically, this function is expected to be zero in a non-probabilistic situation, as is the case for all the different forms of entropy, generalized or not, with a discrete probability distribution: when only one state has $p_i = 1$, all other states have $p_j = 0$. Hence, the functions of probability of both thermodynamic entropy and informational entropy are expected to be bounded below by zero. But this is not the case for the entropy functions of continuous distributions mentioned above. Many of them contradict the definition of entropy as an uncertainty measure and even lose any finite lower bound. For example, the continuous entropy of the uniform distribution $p(x) = 1/a$ in the interval $0 \le x \le a$ is $S = \ln a$, as given by Equation (2) [8,13]. We straightforwardly have $S = 0$ if $a = 1$, while the probabilistic uncertainty of such a distribution should not be zero. On the other hand, when $a \to 0$, $p(x) \to \infty$, which can be considered equivalent to the Dirac delta function (see Section 3.1 below) with the normalization $\int p(x)\,dx = 1$. The Dirac delta function is a continuous counterpart of the discrete probability distribution $p_1 = 1$ and all other $p_i = 0$, a situation where the probabilistic uncertainty should be zero from Equation (1). However, Equation (2) yields $S \to -\infty$. This loss of lower bound implies that it is impossible to add some constant to S to shift it to a positive domain as proposed by Jaynes (see Section 4 below) [4]. A similar paradox happens with other continuous entropies, for instance $S = 1 - \ln\lambda$ [8] of the exponential distribution $p(x) = \lambda e^{-\lambda x}$ ($x \ge 0$).
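The two examples above can be checked numerically. The following is a minimal sketch, not part of the original work: it assumes the uniform density is $1/a$ on $[0, a]$ and the exponential density is $\lambda e^{-\lambda x}$ on $x \ge 0$, evaluates Equation (2) with a simple midpoint rule, and truncates the exponential tail at a finite cutoff. The function names are hypothetical.

```python
import math

def differential_entropy(pdf, lo, hi, n=20_000):
    """Midpoint-rule estimate of -integral of p(x) ln p(x) dx over [lo, hi]."""
    dx = (hi - lo) / n
    total = 0.0
    for k in range(n):
        p = pdf(lo + (k + 0.5) * dx)
        if p > 0.0:  # p ln p -> 0 as p -> 0, so empty cells contribute nothing
            total -= p * math.log(p) * dx
    return total

def uniform_entropy(a):
    # Assumed density 1/a on [0, a]; the closed form is ln(a).
    return differential_entropy(lambda x: 1.0 / a, 0.0, a)

def exponential_entropy(lam, cutoff=50.0):
    # Assumed density lam*exp(-lam*x) on x >= 0, truncated at x = cutoff/lam;
    # the closed form is 1 - ln(lam).
    return differential_entropy(
        lambda x: lam * math.exp(-lam * x), 0.0, cutoff / lam
    )

if __name__ == "__main__":
    for a in (2.0, 1.0, 0.1, 0.001):
        print(f"uniform, a={a}: numeric={uniform_entropy(a):.4f}, "
              f"ln a={math.log(a):.4f}")
    for lam in (0.5, 10.0):
        print(f"exponential, lambda={lam}: numeric={exponential_entropy(lam):.4f}, "
              f"1 - ln lambda={1.0 - math.log(lam):.4f}")
```

Under these assumptions the numerical values follow the closed forms: the uniform entropy vanishes at $a = 1$ and decreases without bound as the interval shrinks, and the exponential entropy becomes negative once $\lambda > e$.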
These undesirable features of continuous entropy raise the question of whether Equation (2) is still suitable for measuring probabilistic uncertainty and whether it deserves the name entropy. Despite this uncertain situation, continuous entropies have been considered acceptable [8,13] and are widely used in many applications (see [11,14,15], for example), sometimes with their negative values [14]. The negativity issue can also be sidestepped by using relative entropy or Kullback–Leibler divergence instead of Equation (2) [13,14], although this is not a solution to the issue of Equation (2) itself.
As is well known, the use of Equation (2) can be traced back to Boltzmann [1]. It has been proposed rather by intuition than by mathematical proof as has been done for Equation (1) by Shannon [2]. From a mathematical point of view, Equation (1) cannot be simply replaced by Equation (2) because, when $p_i$ is replaced by its continuous counterpart $p(x)\,dx$, a divergent term $-\ln dx$ appears [4,5], which implies that the replacement of $p_i$ by $p(x)$ in Equation (1) is questionable. Jaynes has tried to avoid negative entropy using a continuous version of BS entropy $S = -\int p(x) \ln \frac{p(x)}{N m(x)}\, dx$ where $m(x)$ is called the invariant measure of the density of discrete values of x [4,5]. The term $\ln N$, divergent when $N \to \infty$, was simply removed, giving a continuous entropy $S = -\int p(x) \ln \frac{p(x)}{m(x)}\, dx$. According to Jaynes [4], it is possible to choose the invariant measure $m(x)$ in an appropriate (albeit ad hoc) way for $S$ to be positive.
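The origin of the divergent term can be made explicit with a short calculation. The following is a sketch only, under the assumption (not spelled out in this form in the text) that the continuous variable is discretized into cells of equal width $\Delta x$ centred at $x_i$, so that $p_i = p(x_i)\,\Delta x$:

```latex
\begin{align}
S &= -\sum_i p_i \ln p_i
   = -\sum_i p(x_i)\,\Delta x \,\ln\bigl[p(x_i)\,\Delta x\bigr] \\
  &= -\sum_i p(x_i)\,\ln p(x_i)\,\Delta x
     \;-\; \ln \Delta x \sum_i p(x_i)\,\Delta x \\
  &\approx -\int p(x)\,\ln p(x)\,dx \;-\; \ln \Delta x
   \qquad (\Delta x \to 0).
\end{align}
```

The last step uses the normalization $\sum_i p(x_i)\,\Delta x \to 1$. The first term is Equation (2); the second term, $-\ln \Delta x$, grows without bound as $\Delta x \to 0$. Dropping it is what allows Equation (2) to become negative.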
In what follows, we present an alternative informational entropy called varentropy, which may help to avoid negative values and other undesirable properties of continuous entropy. Varentropy is an extension of the fundamental equation of equilibrium thermodynamics $dU = TdS + dW$, where $dU$ is a variation of internal energy $U$ in a reversible process, $dS$ a variation of thermodynamic entropy, $T$ the temperature, and $dW$ the work done during the quasi-equilibrium process [6]. We show that varentropy, when extended to continuous probability distributions, allows us to avoid negative informational entropy in many cases where the BS formula yields negative values.
2. Definition of Varentropy
Varentropy was defined on the basis of the variational form of the entropy of the second law of thermodynamics [16,17]. This is a definition from scratch without any prerequisite or postulate about the properties of entropy. The motivation was to look for a probabilistic uncertainty measure that has a sound physical background and is maximizable to generate probability distributions. This objective implies that the measure cannot be defined by a given formula, because the maximization of a given functional can only yield one or two distributions; one simply cannot maximize a given functional for any distribution. For example, the Shannon entropy in Equation (1), which has been widely used as a universal uncertainty measure for any probability distribution, can only be maximized for the exponential or uniform distribution, depending on the constraints [4,5], and the Tsallis or Renyi entropy can only be maximized for q-exponential distributions [10,11]. Hence, the measure we looked for should be something more general and more fundamental than the formulas of the different entropies [9,12]: more fundamental in that it should have some physical background related to thermodynamic entropy, and more general in that it can reproduce the different entropies for different distributions, much as a differential equation expressing a physical law generates different trajectories for different interactions, as has been shown in [16,17,18].
The idea of [16] was to start from the first law of equilibrium thermodynamics given by $dU = TdS + dW$. As is well known, in classical statistical mechanics, the internal energy is the average $U = \bar{E}$ of the energy $E_i$ of all microstates i, i.e., $U = \sum_i p_i E_i$, and the work done in an infinitesimal reversible process is given by the average of the energy change $dE_i$ of each state: $dW = \sum_i p_i\, dE_i$, where $p_i$ is the probability of finding the system at the state $i$ having energy $E_i$. Since $dU = \sum_i E_i\, dp_i + \sum_i p_i\, dE_i$, the statistical expression of the fundamental equation $dU = TdS + dW$ then becomes $\sum_i E_i\, dp_i + \sum_i p_i\, dE_i = TdS + \sum_i p_i\, dE_i$, giving a statistical expression of the variation of entropy during the process: $dS = \frac{1}{T} \sum_i E_i\, dp_i$. From the above calculation, we see that this variational expression of thermodynamic entropy as a function of $dp_i$ multiplied by the random variable energy $E_i$ is a statistical form of the first law. For us, this expression reveals a general kinship between a measure of probabilistic uncertainty (S) and the related random variable ($E_i$) with its probability distribution $p_i$. This kinship was deeply hidden in the first and second laws of thermodynamics and can only be seen when these laws are expressed in statistical form as shown above. In a previous work [
10], we have extended this relationship to any single random variable
with its probability distribution
and defined varentropy
in the variational form:
. This extension is purely mathematical;
can be any random variable (frequency of words, price, population, position, etc.).
measures the uncertainty in
, and its functional form and property (extensivity, additivity, concavity, etc.) are entirely determined by
and
.
is the thermodynamic entropy only when
is energy
.
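The statistical form of the first law can be checked numerically in the thermodynamic case. The following is a minimal sketch under stated assumptions, none of which come from the original text: the variational relation is read as $dS = \frac{1}{T}\sum_i E_i\,dp_i$ with $k_B = 1$, the probabilities are canonical (Boltzmann) weights over a few fixed energy levels, and the entropy is evaluated with Equation (1). The function names and level values are hypothetical.

```python
import math

def boltzmann(energies, beta):
    """Canonical probabilities p_i = exp(-beta*E_i)/Z for fixed energy levels."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

def shannon(probs):
    """Boltzmann-Shannon entropy -sum p ln p, with k_B = 1."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def check_first_law(energies, beta, dbeta=1e-6):
    """Return (dS, beta * sum_i E_i dp_i) for a small change of beta.

    The levels E_i are held fixed, so no work is done and the change
    of internal energy reduces to sum_i E_i dp_i.
    """
    p0 = boltzmann(energies, beta)
    p1 = boltzmann(energies, beta + dbeta)
    d_entropy = shannon(p1) - shannon(p0)
    heat = sum(e * (b - a) for e, a, b in zip(energies, p0, p1))
    return d_entropy, beta * heat

if __name__ == "__main__":
    ds, rhs = check_first_law([0.0, 1.0, 2.0, 3.0], beta=1.0)
    print(f"dS = {ds:.6e}, (1/T) sum E_i dp_i = {rhs:.6e}")
```

The two numbers agree to first order in the perturbation, which is consistent with the statement that the variational relation reproduces the Boltzmann–Shannon form when the distribution is exponential.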
It turns out that this variational definition of varentropy could really generate many known entropies with discrete distributions. For example,
has the form of the Boltzmann–Shannon entropy Equation (1) when
is exponential distribution, it is the nonadditive Tsallis or additive Renyi entropy if
is
q-exponential distribution, and can take different forms for other distributions (Power law, stretched exponential, Cauchy, Gauss, etc.), and
(nonadditive) for power law for example [
16,
17,
18,
19]. A remarkable property of varentropy is that each functional form of
can be maximized using the calculus of variation
to generate its original distribution. For instance, maximizing Equation (1) generates exponential distribution, maximizing Tsallis entropy generates q-exponential, and maximizing the power law varentropy
generates the power law
(
positive) [
16]. This property of varentropy is the meaning of the statement “maximizable measure” and is an intrinsic nature of varentropy, because its origin is in the variational definition
, which is equivalent to writing
or
. This means that the maximization of
subject to the constraint of the constant
, i.e.,
, implies
. This expression can be better understood if we consider the thermodynamic case where
is energy
and
, which is the work in the variational process considered as the virtual work. In this way,
can be understood as prescribed by the principle of virtual work and yields the maximum entropy calculus (Maxent principle)
. This kinship between two fundamental principles was studied in [
20] with the conclusion that, in the case of thermodynamic entropy, maxent with the calculus
can be considered as a law of physics derived from the fundamental principle of virtual work.
In what follows, we extend this varentropy to continuous probability replacing
by
. Considering
and
, we arrive at
where
is a constant to be chosen according to the nature of
. For example, in a reversible thermodynamic process where
,
is the thermodynamic entropy with
, as shown above; for the exponential decay distribution, we can choose
or
, depending on the domain of
in the considered distribution (see below). It should be noticed in the above discussion of maximization of varentropy that A also plays the role of Lagrange multiplier in the maximum varentropy to generate probability distribution. It is then natural to see different A for different varentropies and distributions due to the different physics or processes. It is worth stressing that an important role of
is to guarantee
be positive, as required both for entropy and for any measure of statistical uncertainty, which is the main aim of this work.
Suppose
is bijective in
, so we can write
, Equation (3) becomes
. Then we define the function of the upper bound as follows:
. According to the second fundamental theorem of calculus,
is differentiable with respect to
, so that we can write
, which implies
where
is a constant.
If
is not bijective in
, we can cut the domain
into
w sub-domains
with
(
) in such a way that, in each sub-domain
k,
is bijective and a local varentropy
can be calculated using
. The total varentropy
is then the sum
[
19] or:
If
is not bijective in
but is differentiable, one can use the relationships
and
to change the definition of varentropy in the following way:
, which implies
which can be used for any differentiable distribution.
4. Discussion
To summarize, despite some undesirable features, the continuous entropies, as a heritage of the long history of statistical mechanics and information theory, continue to be considered as a measure of the uncertainty in continuous random variables, sometimes with the help of relative entropies to avoid negative values [11,13,14,15]. We have proposed here an alternative measure called varentropy to avoid several undesirable features of continuous entropies. Examples of varentropy for several well-known continuous probabilities have been calculated from its variational definition, showing the following features with respect to continuous entropies.
Varentropy is positive for the distributions studied in this work. The continuous entropies of these distributions can be negative.
Varentropy is zero for the deterministic case, as expected for a measure of probabilistic uncertainty, while continuous entropy goes to minus infinity.
Varentropy can avoid the use of the logarithm $\ln p(x)$, which is improper
because the probability density distribution is a dimensioned quantity. Other generalized continuous entropies [
10,
12] containing the term
have the same undesirable feature: loss of scale invariance for example.
Varentropy has a sound physical background because it is defined with a variational equation, which is just the statistical form of the first law of thermodynamics. Unlike other entropies defined with given functional formulas, it has great flexibility to generate different functionals for different distributions.
As each given formula of varentropy is maximized for its distribution, it is the optimal measure of the uncertainty of that distribution, meaning that its value is always the largest one among all the possible measures (Shannon, Tsallis, etc.) for the same distribution. A case study of this feature was presented in [
19,
21].
Further investigation is necessary to confirm these features of varentropy for other continuous distributions.
The reader will have noticed that, in this work, for varentropy to be positive, a constant A must be chosen according to the kind of distribution function. The existence of such a constant is quite natural in such a general definition of a quantity whose nature differs in different situations. Such a constant is also present in the Boltzmann–Shannon formula, written as $S = -k \sum_i p_i \ln p_i$. For a binary system ($W = 2$), the choice of $k = 1/\ln 2$ defines the unit of the information measure as the bit. When the formula is used for an ideal gas to measure the probabilistic uncertainty of the distribution of internal energy, we must write $k = k_B$, the Boltzmann constant, for $S$ to be the thermodynamic entropy of the gas. But it is impossible to choose a single constant $k$ to make continuous entropies always positive because, as discussed above in Section 3, they can be positive and negative for a given distribution, and sometimes go to minus infinity.
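The role of the multiplicative constant can be illustrated with a few lines of code. This is a sketch under assumptions that are not taken from the original text: the constant is denoted `k`, the discrete formula is $-k\sum_i p_i \ln p_i$, and the continuous entropy of a uniform density on an interval of length `a` is taken in its standard closed form $k \ln a$. The function names are hypothetical.

```python
import math

def bs_entropy(probs, k=1.0):
    """Discrete Boltzmann-Shannon entropy -k * sum p ln p (k is an assumed symbol)."""
    return -k * sum(p * math.log(p) for p in probs if p > 0.0)

def uniform_continuous_entropy(a, k=1.0):
    """Closed form k * ln(a) for a uniform density on an interval of length a."""
    return k * math.log(a)

# Choosing k = 1/ln 2 expresses the discrete entropy in bits.
K_BIT = 1.0 / math.log(2.0)

if __name__ == "__main__":
    print("fair binary system:", bs_entropy([0.5, 0.5], k=K_BIT), "bit")
    print("deterministic case:", bs_entropy([1.0, 0.0], k=K_BIT), "bit")
    for k in (0.1, 1.0, K_BIT, 100.0):
        narrow = uniform_continuous_entropy(0.5, k)
        wide = uniform_continuous_entropy(2.0, k)
        print(f"k={k:.4f}: S(a=0.5)={narrow:.4f}, S(a=2)={wide:.4f}")
```

Rescaling by any positive `k` changes the unit but not the sign: the narrow uniform distribution keeps a negative continuous entropy and the wide one a positive entropy, whatever value of `k` is chosen.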
Finally, we would like to mention that information measures for continuous probability distributions are needed in both classical and quantum physics. An example is the calculation of the path entropy [22,23] of random dynamics using the path integral method [24] and the classical path probability [23] or the quantum propagator [24], which are both continuous exponential functions of the classical action [25,26]; hence, the use of the BS formula (or other generalized ones) risks giving negative values. The path entropy is an increasing function of time of random motion [23,25,26] and an important ingredient in the study of irreversible processes in both the classical and quantum worlds [22].