Article

A Two-Moment Inequality with Applications to Rényi Entropy and Mutual Information

1 Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708, USA
2 Department of Statistical Science, Duke University, Durham, NC 27708, USA
Entropy 2020, 22(11), 1244; https://doi.org/10.3390/e22111244
Submission received: 14 September 2020 / Accepted: 6 October 2020 / Published: 1 November 2020

Abstract:
This paper explores some applications of a two-moment inequality for the integral of the $r$th power of a function, where $0 < r < 1$. The first contribution is an upper bound on the Rényi entropy of a random vector in terms of two different moments. When one of the moments is the zeroth moment, these bounds recover previous results based on maximum entropy distributions under a single moment constraint. More generally, evaluation of the bound with two carefully chosen nonzero moments can lead to significant improvements with a modest increase in complexity. The second contribution is a method for upper bounding mutual information in terms of certain integrals with respect to the variance of the conditional density. The bounds have a number of useful properties arising from the connection with variance decompositions.

1. Introduction

The interplay between inequalities and information theory has a rich history, with notable examples including the relationship between the Brunn–Minkowski inequality and the entropy power inequality as well as the matrix determinant inequalities obtained from differential entropy [1]. In this paper, the focus is on a "two-moment" inequality that provides an upper bound on the integral of the $r$th power of a function. Specifically, if $f$ is a nonnegative function defined on $\mathbb{R}^n$ and $p, q, r$ are real numbers satisfying $0 < r < 1$ and $p < 1/r - 1 < q$, then
$$\left( \int f(x)^r \, dx \right)^{\frac{1}{r}} \le C_{n,p,q,r} \left( \int \|x\|^{np} f(x) \, dx \right)^{\frac{qr+r-1}{(q-p)r}} \left( \int \|x\|^{nq} f(x) \, dx \right)^{\frac{1-r-pr}{(q-p)r}}, \tag{1}$$
where the best possible constant C n , p , q , r is given exactly; see Propositions 2 and 3 ahead. The one-dimensional version of this inequality is a special case of the classical Carlson–Levin inequality [2,3,4], and the multidimensional version is a special case of a result presented by Barza et al. [5]. The particular formulation of the inequality used in this paper was derived independently in [6], where the proof follows from a direct application of Hölder’s inequality and Jensen’s inequality.
In the context of information theory and statistics, a useful property of the two-moment inequality is that it provides a bound on a nonlinear functional, namely the $r$-quasi-norm $\|\cdot\|_r$, in terms of integrals that are linear in $f$. Consequently, this inequality is well suited to settings where $f$ is a mixture of simple functions whose moments can be evaluated. We note that this reliance on moments to bound a nonlinear functional is closely related to bounds obtained from variational characterizations such as the Donsker–Varadhan representation of Kullback divergence [7] and its generalizations to Rényi divergence [8,9].
The first application considered in this paper concerns the relationship between the entropy of a probability measure and its moments. This relationship is fundamental to the principle of maximum entropy, which originated in statistical physics and has since been applied to statistical inference problems [10]. It also plays a prominent role in information theory and estimation theory, where many results rest on the fact that the Gaussian distribution maximizes differential entropy under second moment constraints ([11], Theorem 8.6.5). Moment–entropy inequalities for Rényi entropy were studied in a series of works by Lutwak et al. [12,13,14], as well as related works by Costa et al. [15,16] and Johnson and Vignat [17], in which it is shown that, under a single moment constraint, Rényi entropy is maximized by a family of generalized Gaussian distributions. The connection between these moment–entropy inequalities and the Carlson–Levin inequality was noted recently by Nguyen [18].
In this direction, one of the contributions of this paper is a new family of moment–entropy inequalities. This family of inequalities follows from applying Inequality (1) in the setting where $f$ is a probability density function, and thus there is a one-to-one correspondence between the integral of the $r$th power and the Rényi entropy of order $r$. In the special case where one of the moments is the zeroth moment, this approach recovers the moment–entropy inequalities given in previous work. More generally, the additional flexibility provided by considering two different moments can lead to stronger results. For example, in Proposition 6, it is shown that if $f$ is the standard Gaussian density function defined on $\mathbb{R}^n$, then the difference between the Rényi entropy and the upper bound given by the two-moment inequality (equivalently, the ratio between the left- and right-hand sides of (1)) is bounded uniformly with respect to $n$ under the following specification of the moments:
$$p_n = \frac{1-r}{r} - \frac{1}{\sqrt{r}} \sqrt{\frac{2(1-r)}{n+1}}, \qquad q_n = \frac{1-r}{r} + \frac{1}{\sqrt{r}} \sqrt{\frac{2(1-r)}{n+1}}.$$
Conversely, if one of the moments is restricted to be equal to zero, as is the case in the usual moment–entropy inequalities, then the difference between the Rényi entropy and the upper bound diverges with n.
The second application considered in this paper is the problem of bounding mutual information. In conjunction with Fano’s inequality and its extensions, bounds on mutual information play a prominent role in establishing minimax rates of statistical estimation [19] as well as the information-theoretic limits of detection in high-dimensional settings [20]. In many cases, one of the technical challenges is to provide conditions under which the dependence between the observations and an underlying signal or model parameters converges to zero in the limit of high dimension.
This paper introduces a new method for bounding mutual information, which can be described as follows. Let $P_{X,Y}$ be a probability measure on $\mathcal{X} \times \mathcal{Y}$ such that $P_{Y|X=x}$ and $P_Y$ have densities $f(y|x)$ and $f(y)$ with respect to the Lebesgue measure on $\mathbb{R}^n$. We begin by showing that the mutual information between $X$ and $Y$ satisfies the upper bound
$$I(X;Y) \le \int \sqrt{\operatorname{Var}(f(y \mid X))} \, dy, \tag{3}$$
where $\operatorname{Var}(f(y \mid X)) = \int \left( f(y \mid x) - f(y) \right)^2 \, dP_X(x)$ is the variance of $f(y \mid X)$; see Proposition 8 ahead. In view of (3), an application of the two-moment Inequality (1) with $r = 1/2$ leads to an upper bound in terms of moments of the variance of the conditional density:
$$\int \|y\|^{ns} \operatorname{Var}(f(y \mid X)) \, dy, \tag{4}$$
where this expression is evaluated at $s \in \{p, q\}$ with $p < 1 < q$. A useful property of this bound is that the integrated variance is quadratic in $P_X$, and thus Expression (4) can be evaluated by swapping the integration over $y$ with the expectation over two independent copies of $X$. For example, when $P_{X,Y}$ is a Gaussian scale mixture, this approach provides closed-form upper bounds in terms of the moments of the Gaussian density. An early version of this technique was used to prove Gaussian approximations for random projections [21] arising in the analysis of a random linear estimation problem appearing in wireless communications and compressed sensing [22,23].

2. Moment Inequalities

Let $L^p(S)$ be the space of Lebesgue measurable functions from $S$ to $\mathbb{R}$ whose $p$th power is absolutely integrable, and for $p \ne 0$, define
$$\|f\|_p := \left( \int_S |f(x)|^p \, dx \right)^{1/p}.$$
Recall that $\|\cdot\|_p$ is a norm for $p \ge 1$ but only a quasi-norm for $0 < p < 1$ because it does not satisfy the triangle inequality. The $s$th moment of $f$ is defined as
$$M_s(f) := \int_S \|x\|^s \, |f(x)| \, dx,$$
where $\|\cdot\|$ denotes the standard Euclidean norm on vectors.
The two-moment Inequality (1) can be derived straightforwardly using the following argument. For $r \in (0,1)$, the mapping $f \mapsto \|f\|_r$ is concave on the subset of nonnegative functions and admits the variational representation
$$\|f\|_r = \inf \left\{ \frac{\|fg\|_1}{\|g\|_{r^*}} : g \in L^{r^*} \right\}, \tag{5}$$
where $r^* = r/(r-1) \in (-\infty, 0)$ is the Hölder conjugate of $r$. Consequently, each $g \in L^{r^*}$ leads to an upper bound on $\|f\|_r$. For example, if $f$ has bounded support $S$, choosing $g$ to be the indicator function of $S$ leads to the basic inequality $\|f\|_r \le (\operatorname{Vol}(S))^{(1-r)/r} \|f\|_1$. The upper bound on $\|f\|_r$ given in Inequality (1) can be obtained by restricting the minimum in Expression (5) to the parametric class of functions of the form $g(x) = \nu_1 \|x\|^{np} + \nu_2 \|x\|^{nq}$ with $\nu_1, \nu_2 > 0$ and then optimizing over the parameters $(\nu_1, \nu_2)$. Here, the constraints on $p, q$ are necessary and sufficient to ensure that $g \in L^{r^*}(\mathbb{R}^n)$.
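The basic bounded-support inequality above can be checked numerically. The following is a quick sanity check; the test function $f(x) = x$ on $S = [0,1]$ with $r = 1/2$ is an illustrative assumption, not an example from the paper:

```python
import math

# Sanity check of ||f||_r <= Vol(S)^((1-r)/r) * ||f||_1 for a function with
# bounded support: f(x) = x on S = [0, 1] and r = 1/2 (illustrative choice).

def trapezoid(g, a, b, m=100000):
    """Composite trapezoid rule for int_a^b g(x) dx."""
    h = (b - a) / m
    return (0.5 * (g(a) + g(b)) + sum(g(a + i * h) for i in range(1, m))) * h

r = 0.5
f = lambda x: x
lhs = trapezoid(lambda x: f(x) ** r, 0.0, 1.0) ** (1.0 / r)   # ||f||_{1/2} = (2/3)^2 = 4/9
rhs = 1.0 ** ((1 - r) / r) * trapezoid(f, 0.0, 1.0)           # Vol(S) = 1, ||f||_1 = 1/2
assert lhs <= rhs
```

Here the left-hand side is $4/9$ and the right-hand side is $1/2$, so the bound holds with a visible gap.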
In the following sections, we provide a more detailed derivation, starting with the problem of maximizing f r under multiple moment constraints and then specializing to the case of two moments. For a detailed account of the history of the Carlson type inequalities as well as some further extensions, see [4].

2.1. Multiple Moments

Consider the following optimization problem:
$$\begin{aligned} \text{maximize} \quad & \|f\|_r \\ \text{subject to} \quad & f(x) \ge 0 \ \text{for all } x \in S, \\ & M_{s_i}(f) \le m_i \ \text{for } 1 \le i \le k. \end{aligned}$$
For $r \in (0,1)$, this is a convex optimization problem because $\|\cdot\|_r^r$ is concave and the moment constraints are linear. By standard theory in convex optimization (e.g., [24]), it can be shown that if the problem is feasible and the maximum is finite, then the maximizer has the form
$$f^*(x) = \left( \sum_{i=1}^k \nu_i^* \|x\|^{s_i} \right)^{\frac{1}{r-1}}, \quad \text{for all } x \in S.$$
The parameters $\nu_1^*, \ldots, \nu_k^*$ are nonnegative and the $i$th moment constraint holds with equality for all $i$ such that $\nu_i^*$ is strictly positive, that is, $\nu_i^* > 0 \implies M_{s_i}(f^*) = m_i$. Consequently, the maximum can be expressed in terms of a linear combination of the moments:
$$\|f^*\|_r^r = \left\| (f^*)^r \right\|_1 = \left\| f^* \, (f^*)^{r-1} \right\|_1 = \sum_{i=1}^k \nu_i^* m_i.$$
For the purposes of this paper, it is useful to consider a relative inequality in terms of the moments of the function itself. Given a number $0 < r < 1$ and vectors $s \in \mathbb{R}^k$ and $\nu \in \mathbb{R}_+^k$, the function $c_r(\nu, s)$ is defined according to
$$c_r(\nu, s) = \left( \int_0^\infty \left( \sum_{i=1}^k \nu_i x^{s_i} \right)^{-\frac{r}{1-r}} dx \right)^{\frac{1-r}{r}},$$
if the integral exists. Otherwise, $c_r(\nu, s)$ is defined to be positive infinity. It can be verified that $c_r(\nu, s)$ is finite provided that there exist $i, j$ such that $\nu_i$ and $\nu_j$ are strictly positive and $s_i < (1-r)/r < s_j$.
The following result can be viewed as a consequence of the constrained optimization problem described above. We provide a different and very simple proof that depends only on Hölder’s inequality.
Proposition 1.
Let $f$ be a nonnegative Lebesgue measurable function defined on the positive reals $\mathbb{R}_+$. For any number $0 < r < 1$ and vectors $s \in \mathbb{R}^k$ and $\nu \in \mathbb{R}_+^k$, we have
$$\|f\|_r \le c_r(\nu, s) \sum_{i=1}^k \nu_i M_{s_i}(f).$$
Proof. 
Let $g(x) = \sum_{i=1}^k \nu_i x^{s_i}$. Then, we have
$$\|f\|_r^r = \left\| g^{-r} (gf)^r \right\|_1 \le \left\| g^{-r} \right\|_{\frac{1}{1-r}} \left\| (gf)^r \right\|_{\frac{1}{r}} = \left\| g^{-r} \right\|_{\frac{1}{1-r}} \|gf\|_1^r = \left[ c_r(\nu, s) \sum_{i=1}^k \nu_i M_{s_i}(f) \right]^r,$$
where the second step is Hölder's inequality with conjugate exponents $1/(1-r)$ and $1/r$. □

2.2. Two Moments

For $a, b > 0$, the beta function $B(a,b)$ and gamma function $\Gamma(a)$ are given by
$$B(a,b) = \int_0^1 t^{a-1} (1-t)^{b-1} \, dt, \qquad \Gamma(a) = \int_0^\infty t^{a-1} e^{-t} \, dt,$$
and satisfy the relation $B(a,b) = \Gamma(a)\Gamma(b)/\Gamma(a+b)$ for all $a, b > 0$. To lighten the notation, we define the normalized beta function
$$\tilde{B}(a,b) = B(a,b) \, \frac{(a+b)^{a+b}}{a^a b^b}. \tag{6}$$
Properties of these functions are provided in Appendix A.
The next result follows from Proposition 1 for the case of two moments.
Proposition 2.
Let $f$ be a nonnegative Lebesgue measurable function defined on $[0, \infty)$. For any numbers $p, q, r$ with $0 < r < 1$ and $p < 1/r - 1 < q$,
$$\|f\|_r \le \left[ \psi_r(p,q) \right]^{\frac{1-r}{r}} [M_p(f)]^{\lambda} \, [M_q(f)]^{1-\lambda},$$
where $\lambda = (q + 1 - 1/r)/(q - p)$ and
$$\psi_r(p,q) = \frac{1}{q-p} \, \tilde{B}\!\left( \frac{r\lambda}{1-r}, \frac{r(1-\lambda)}{1-r} \right), \tag{7}$$
where $\tilde{B}(\cdot,\cdot)$ is defined in Equation (6).
Proof. 
Letting $s = (p, q)$ and $\nu = (\gamma^{1-\lambda}, \gamma^{-\lambda})$ with $\gamma > 0$, we have
$$[c_r(\nu, s)]^{\frac{r}{1-r}} = \int_0^\infty \left( \gamma^{1-\lambda} x^p + \gamma^{-\lambda} x^q \right)^{-\frac{r}{1-r}} dx.$$
Making the change of variable $x \to (\gamma u)^{\frac{1}{q-p}}$ leads to
$$[c_r(\nu, s)]^{\frac{r}{1-r}} = \frac{1}{q-p} \int_0^\infty \frac{u^{b-1}}{(1+u)^{a+b}} \, du = \frac{B(a,b)}{q-p},$$
where $a = \frac{r\lambda}{1-r}$ and $b = \frac{r(1-\lambda)}{1-r}$, and the second step follows from recognizing the integral representation of the beta function given in Equation (A3). Therefore, by Proposition 1, the inequality
$$\|f\|_r \le \left( \frac{B(a,b)}{q-p} \right)^{\frac{1-r}{r}} \left( \gamma^{1-\lambda} M_p(f) + \gamma^{-\lambda} M_q(f) \right)$$
holds for all $\gamma > 0$. Evaluating this inequality with
$$\gamma = \frac{\lambda M_q(f)}{(1-\lambda) M_p(f)}$$
leads to the stated result. □
The special case $r = 1/2$ admits the simplified expression
$$\psi_{1/2}(p,q) = \frac{\pi}{\lambda^{\lambda} (1-\lambda)^{1-\lambda} \, (q-p) \sin(\pi \lambda)}, \tag{8}$$
where we have used Euler's reflection formula for the beta function ([25], Theorem 1.2.1).
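As a numerical sanity check of Proposition 2, the sketch below evaluates both sides for $f(x) = e^{-x}$ with $r = 1/2$, $p = 0$, $q = 2$ (an illustrative choice of function and moments, not an example from the paper), using the simplified expression for $\psi_{1/2}$ above:

```python
import math

# Check ||f||_{1/2} <= psi_{1/2}(p,q) * M_p(f)^lambda * M_q(f)^(1-lambda)
# for f(x) = exp(-x) on (0, inf) with p = 0, q = 2 (illustrative choice).

def trapezoid(g, a, b, m=100000):
    h = (b - a) / m
    return (0.5 * (g(a) + g(b)) + sum(g(a + i * h) for i in range(1, m))) * h

r, p, q = 0.5, 0.0, 2.0
lam = (q + 1 - 1 / r) / (q - p)                        # lambda = 1/2
psi = math.pi / (lam ** lam * (1 - lam) ** (1 - lam)
                 * (q - p) * math.sin(math.pi * lam))  # psi_{1/2}(0, 2) = pi
f = lambda x: math.exp(-x)
lhs = trapezoid(lambda x: f(x) ** r, 0.0, 80.0) ** (1 / r)   # ||f||_{1/2} = 4
Mp = trapezoid(lambda x: x ** p * f(x), 0.0, 80.0)           # M_0(f) = 1
Mq = trapezoid(lambda x: x ** q * f(x), 0.0, 80.0)           # M_2(f) = 2
rhs = psi ** ((1 - r) / r) * Mp ** lam * Mq ** (1 - lam)
assert lhs <= rhs
```

With these values the bound reads $4 \le \pi\sqrt{2} \approx 4.443$, so it holds with a modest gap.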
Next, we consider an extension of Proposition 2 for functions defined on $\mathbb{R}^n$. Given any measurable subset $S$ of $\mathbb{R}^n$, we define
$$\omega(S) = \operatorname{Vol}\left( B^n \cap \operatorname{cone}(S) \right), \tag{9}$$
where $B^n = \{u \in \mathbb{R}^n : \|u\| \le 1\}$ is the $n$-dimensional Euclidean ball of radius one and
$$\operatorname{cone}(S) = \{x \in \mathbb{R}^n : tx \in S \ \text{for some } t > 0\}.$$
The function $\omega(S)$ is proportional to the surface measure of the projection of $S$ on the Euclidean sphere and satisfies
$$\omega(S) \le \omega(\mathbb{R}^n) = \frac{\pi^{\frac{n}{2}}}{\Gamma\left( \frac{n}{2} + 1 \right)}, \tag{10}$$
for all $S \subseteq \mathbb{R}^n$. Note that $\omega(\mathbb{R}_+) = 1$ and $\omega(\mathbb{R}) = 2$.
Proposition 3.
Let $f$ be a nonnegative Lebesgue measurable function defined on a subset $S$ of $\mathbb{R}^n$. For any numbers $p, q, r$ with $0 < r < 1$ and $p < 1/r - 1 < q$,
$$\|f\|_r \le \left[ \omega(S) \, \psi_r(p,q) \right]^{\frac{1-r}{r}} [M_{np}(f)]^{\lambda} \, [M_{nq}(f)]^{1-\lambda},$$
where $\lambda = (q + 1 - 1/r)/(q - p)$ and $\psi_r(p,q)$ is given by Equation (7).
Proof. 
Let $f$ be extended to $\mathbb{R}^n$ using the rule $f(x) = 0$ for all $x$ outside of $S$, and let $g : \mathbb{R}_+ \to \mathbb{R}_+$ be defined according to
$$g(y) = \frac{1}{n} \int_{S^{n-1}} f(y^{1/n} u) \, d\sigma(u),$$
where $S^{n-1} = \{u \in \mathbb{R}^n : \|u\| = 1\}$ is the Euclidean sphere of radius one and $\sigma(u)$ is the surface measure of the sphere. In the following, we will show that
$$\|f\|_r \le \left[ \omega(S) \right]^{\frac{1-r}{r}} \|g\|_r, \tag{11}$$
$$M_{ns}(f) = M_s(g). \tag{12}$$
The stated inequality then follows from applying Proposition 2 to the function $g$.
To prove Inequality (11), we begin with a transformation into polar coordinates:
$$\|f\|_r^r = \int_0^\infty \int_{S^{n-1}} f(tu)^r \, t^{n-1} \, d\sigma(u) \, dt. \tag{13}$$
Letting $1_{\operatorname{cone}(S)}(x)$ denote the indicator function of the set $\operatorname{cone}(S)$, the integral over the sphere can be bounded using
$$\begin{aligned} \int_{S^{n-1}} f(tu)^r \, d\sigma(u) &= \int_{S^{n-1}} 1_{\operatorname{cone}(S)}(u) \, f(tu)^r \, d\sigma(u) \\ &\overset{(a)}{\le} \left( \int_{S^{n-1}} 1_{\operatorname{cone}(S)}(u) \, d\sigma(u) \right)^{1-r} \left( \int_{S^{n-1}} f(tu) \, d\sigma(u) \right)^{r} \\ &\overset{(b)}{=} n \left[ \omega(S) \right]^{1-r} g^r(t^n), \end{aligned} \tag{14}$$
where (a) follows from Hölder's inequality with conjugate exponents $\frac{1}{1-r}$ and $\frac{1}{r}$, and (b) follows from the definition of $g$ and the fact that
$$\omega(S) = \int_0^1 \int_{S^{n-1}} 1_{\operatorname{cone}(S)}(u) \, t^{n-1} \, d\sigma(u) \, dt = \frac{1}{n} \int_{S^{n-1}} 1_{\operatorname{cone}(S)}(u) \, d\sigma(u).$$
Plugging Inequality (14) back into Equation (13) and then making the change of variable $t \to y^{\frac{1}{n}}$ yields
$$\|f\|_r^r \le n \left[ \omega(S) \right]^{1-r} \int_0^\infty g^r(t^n) \, t^{n-1} \, dt = \left[ \omega(S) \right]^{1-r} \|g\|_r^r.$$
The proof of Equation (12) follows along similar lines. We have
$$M_{ns}(f) \overset{(a)}{=} \int_0^\infty \int_{S^{n-1}} t^{ns} f(tu) \, t^{n-1} \, d\sigma(u) \, dt \overset{(b)}{=} \frac{1}{n} \int_0^\infty \int_{S^{n-1}} y^s f(y^{\frac{1}{n}} u) \, d\sigma(u) \, dy = M_s(g),$$
where (a) follows from a transformation into polar coordinates and (b) follows from the change of variable $t \to y^{\frac{1}{n}}$.
Having established Inequality (11) and Equation (12), an application of Proposition 2 completes the proof. □
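Proposition 3 can also be checked numerically. The sketch below uses the radial test function $f(x) = e^{-\|x\|}$ on $\mathbb{R}^2$ (an illustrative assumption, not an example from the paper), so that every integral reduces to one dimension via polar coordinates:

```python
import math

# Check of Proposition 3 for n = 2 with f(x) = exp(-||x||), r = 1/2,
# p = 0, q = 2.  For radial f, int_{R^2} h(||x||) dx = 2*pi*int t*h(t) dt.

def trapezoid(g, a, b, m=200000):
    h = (b - a) / m
    return (0.5 * (g(a) + g(b)) + sum(g(a + i * h) for i in range(1, m))) * h

n, r, p, q = 2, 0.5, 0.0, 2.0
lam = (q + 1 - 1 / r) / (q - p)
psi = math.pi / (lam ** lam * (1 - lam) ** (1 - lam)
                 * (q - p) * math.sin(math.pi * lam))
omega = math.pi                                  # omega(R^2) = pi
# ||f||_{1/2} = (int sqrt(f) dx)^2 = (2*pi*int t*exp(-t/2) dt)^2 = (8*pi)^2
lhs = (2 * math.pi * trapezoid(lambda t: t * math.exp(-t / 2), 0.0, 120.0)) ** 2
Mnp = 2 * math.pi * trapezoid(lambda t: t ** (n * p) * t * math.exp(-t), 0.0, 120.0)
Mnq = 2 * math.pi * trapezoid(lambda t: t ** (n * q) * t * math.exp(-t), 0.0, 120.0)
rhs = (omega * psi) ** ((1 - r) / r) * Mnp ** lam * Mnq ** (1 - lam)
assert lhs <= rhs
```

Numerically the two sides are about $632$ and $679$, so the bound holds and is reasonably tight for this example.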

3. Rényi Entropy Bounds

Let $X$ be a random vector that has a density $f(x)$ with respect to the Lebesgue measure on $\mathbb{R}^n$. The differential Rényi entropy of order $r \in (0,1) \cup (1,\infty)$ is defined according to [11]:
$$h_r(X) = \frac{1}{1-r} \log \int_{\mathbb{R}^n} f^r(x) \, dx.$$
Throughout this paper, it is assumed that the logarithm is defined with respect to the natural base and entropy is measured in nats. The Rényi entropy is continuous and nonincreasing in $r$. If the support set $S = \{x \in \mathbb{R}^n : f(x) > 0\}$ has finite measure, then the limit as $r$ converges to zero is given by $h_0(X) = \log \operatorname{Vol}(S)$. If the support does not have finite measure, then $h_r(X)$ increases to infinity as $r$ decreases to zero. The case $r = 1$ is given by the Shannon differential entropy:
$$h_1(X) = -\int_S f(x) \log f(x) \, dx.$$
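As a quick illustration of the definition (the exponential distribution is an example chosen here, not one from the paper), the Rényi entropy of $\mathrm{Exp}(1)$ has the closed form $h_r = (\log r)/(r-1)$, which decreases to the Shannon entropy $h_1 = 1$ nat as $r$ increases to one:

```python
import math

# Renyi entropy of Exp(1), density f(x) = exp(-x) on (0, inf):
# h_r = (1/(1-r)) * log(int f^r dx) = (log r)/(r - 1).

def trapezoid(g, a, b, m=200000):
    h = (b - a) / m
    return (0.5 * (g(a) + g(b)) + sum(g(a + i * h) for i in range(1, m))) * h

f = lambda x: math.exp(-x)
for r in (0.3, 0.5, 0.9):
    h_r = math.log(trapezoid(lambda x: f(x) ** r, 0.0, 200.0)) / (1 - r)
    assert abs(h_r - math.log(r) / (r - 1)) < 1e-6
# Shannon case r = 1: h_1 = -int f log f = int x*exp(-x) dx = 1 nat.
h1 = trapezoid(lambda x: x * math.exp(-x), 0.0, 200.0)
assert abs(h1 - 1.0) < 1e-6
```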
Given a random variable $X$ that is not identically zero and numbers $p, q, r$ with $0 < r < 1$ and $p < 1/r - 1 < q$, we define the function
$$L_r(X; p, q) = \frac{r\lambda}{1-r} \log \mathbb{E}|X|^p + \frac{r(1-\lambda)}{1-r} \log \mathbb{E}|X|^q,$$
where $\lambda = (q + 1 - 1/r)/(q - p)$.
The next result, which follows directly from Proposition 3, provides an upper bound on the Rényi entropy.
Proposition 4.
Let $X$ be a random vector with a density on $\mathbb{R}^n$. For any numbers $p, q, r$ with $0 < r < 1$ and $p < 1/r - 1 < q$, the Rényi entropy satisfies
$$h_r(X) \le \log \omega(S) + \log \psi_r(p,q) + L_r(\|X\|^n; p, q), \tag{15}$$
where $\omega(S)$ is defined in Equation (9) and $\psi_r(p,q)$ is defined in Equation (7).
Proof. 
This result follows immediately from Proposition 3 and the definition of Rényi entropy. □
The relationship between Proposition 4 and previous results depends on whether the moment p is equal to zero:
  • One-moment inequalities: If p = 0 , then there exists a distribution such that Inequality (15) holds with equality. This is because the zero-moment constraint ensures that the function that maximizes the Rényi entropy integrates to one. In this case, Proposition 4 is equivalent to previous results that focused on distributions that maximize Rényi entropy subject to a single moment constraint [12,13,15]. With some abuse of terminology, we refer to these bounds as one-moment inequalities. (A more accurate name would be two-moment inequalities under the constraint that one of the moments is the zeroth moment.)
  • Two-moment inequalities: If $p \ne 0$, then the right-hand side of Inequality (15) corresponds to the Rényi entropy of a nonnegative function that might not integrate to one. Nevertheless, the expression provides an upper bound on the Rényi entropy for any density with the same moments. We refer to the bounds obtained using a general pair $(p, q)$ as two-moment inequalities.
The contribution of two-moment inequalities is that they lead to tighter bounds. To quantify the tightness, we define $\Delta_r(X; p, q)$ to be the gap between the right-hand side and the left-hand side of Inequality (15) corresponding to the pair $(p, q)$, that is,
$$\Delta_r(X; p, q) = \log \omega(S) + \log \psi_r(p,q) + L_r(\|X\|^n; p, q) - h_r(X).$$
The gaps corresponding to the optimal two-moment and one-moment inequalities are defined according to
$$\Delta_r(X) = \inf_{p,q} \Delta_r(X; p, q), \qquad \tilde{\Delta}_r(X) = \inf_{q} \Delta_r(X; 0, q).$$

3.1. Some Consequences of These Bounds

By Lyapunov's inequality, the mapping $s \mapsto \frac{1}{s} \log \mathbb{E}|X|^s$ is nondecreasing on $[0, \infty)$, and thus
$$L_r(X; p, q) \le L_r(X; 0, q) = \frac{1}{q} \log \mathbb{E}|X|^q, \qquad p \ge 0. \tag{16}$$
In other words, the case $p = 0$ provides an upper bound on $L_r(X; p, q)$ for nonnegative $p$. Alternatively, we also have the lower bound
$$L_r(X; p, q) \ge \frac{r}{1-r} \log \mathbb{E}|X|^{\frac{1-r}{r}}, \tag{17}$$
which follows from the convexity of $s \mapsto \log \mathbb{E}|X|^s$.
A useful property of $L_r(X; p, q)$ is that it is additive with respect to the product of independent random variables. Specifically, if $X$ and $Y$ are independent, then
$$L_r(XY; p, q) = L_r(X; p, q) + L_r(Y; p, q). \tag{18}$$
One consequence is that multiplication by a bounded random variable cannot increase the Rényi entropy by an amount that exceeds the gap of the two-moment inequality with nonnegative moments.
Proposition 5.
Let $Y$ be a random vector on $\mathbb{R}^n$ with finite Rényi entropy of order $0 < r < 1$, and let $X$ be an independent random variable that satisfies $0 < X \le t$. Then,
$$h_r(XY) \le h_r(tY) + \Delta_r(Y; p, q),$$
for all $0 < p < 1/r - 1 < q$.
Proof. 
Let $Z = XY$ and let $S_Z$ and $S_Y$ denote the support sets of $Z$ and $Y$, respectively. The assumption that $X$ is nonnegative means that $\operatorname{cone}(S_Z) = \operatorname{cone}(S_Y)$. We have
$$\begin{aligned} h_r(Z) &\overset{(a)}{\le} \log \omega(S_Z) + \log \psi_r(p,q) + L_r(\|Z\|^n; p, q) \\ &\overset{(b)}{=} h_r(Y) + L_r(|X|^n; p, q) + \Delta_r(Y; p, q) \\ &\overset{(c)}{\le} h_r(Y) + n \log t + \Delta_r(Y; p, q), \end{aligned}$$
where (a) follows from Proposition 4, (b) follows from Equation (18) and the definition of $\Delta_r(Y; p, q)$, and (c) follows from Inequality (16) and the assumption $|X| \le t$. Finally, recalling that $h_r(tY) = h_r(Y) + n \log t$ completes the proof. □

3.2. Example with Log-Normal Distribution

If $W \sim \mathcal{N}(\mu, \sigma^2)$, then the random variable $X = \exp(W)$ has a log-normal distribution with parameters $(\mu, \sigma^2)$. The Rényi entropy is given by
$$h_r(X) = \mu + \frac{1}{2} \, \frac{1-r}{r} \, \sigma^2 + \frac{1}{2} \log\left( 2\pi r^{\frac{1}{r-1}} \sigma^2 \right),$$
and the logarithm of the $s$th moment is given by
$$\log \mathbb{E}|X|^s = \mu s + \frac{1}{2} \sigma^2 s^2.$$
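Both closed-form expressions can be verified by numerical integration in $w = \log x$ coordinates; the following sketch does so for $\mu = 0$, $\sigma^2 = 1$, $r = 1/2$ (illustrative parameter choices):

```python
import math

# Sanity check of the log-normal Renyi entropy and moment formulas for
# mu = 0, sigma^2 = 1, working in w = log(x) coordinates where X = exp(W),
# W ~ N(0, 1):  int f_X(x)^r dx = int phi(w)^r * exp((1-r)*w) dw.

def trapezoid(g, a, b, m=200000):
    h = (b - a) / m
    return (0.5 * (g(a) + g(b)) + sum(g(a + i * h) for i in range(1, m))) * h

phi = lambda w: math.exp(-w * w / 2) / math.sqrt(2 * math.pi)
mu, sig2, r = 0.0, 1.0, 0.5
Ir = trapezoid(lambda w: phi(w) ** r * math.exp((1 - r) * w), -30.0, 30.0)
h_num = math.log(Ir) / (1 - r)
h_formula = mu + 0.5 * (1 - r) / r * sig2 \
    + 0.5 * math.log(2 * math.pi * r ** (1 / (r - 1)) * sig2)
assert abs(h_num - h_formula) < 1e-6
# Moment formula: log E|X|^s = mu*s + sig2*s^2/2, here with s = 1.3.
s = 1.3
m_num = math.log(trapezoid(lambda w: math.exp(s * w) * phi(w), -30.0, 30.0))
assert abs(m_num - (mu * s + 0.5 * sig2 * s * s)) < 1e-6
```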
With a bit of work, it can be shown that the gap of the optimal two-moment inequality does not depend on the parameters $(\mu, \sigma^2)$ and is given by
$$\Delta_r(X) = \log\left( \tilde{B}\!\left( \frac{r}{2(1-r)}, \frac{r}{2(1-r)} \right) \sqrt{\frac{r}{4(1-r)}} \right) + \frac{1}{2} - \frac{1}{2} \log\left( 2\pi r^{\frac{1}{r-1}} \right).$$
The details of this derivation are given in Appendix B.1. Meanwhile, the gap of the optimal one-moment inequality is given by
$$\tilde{\Delta}_r(X) = \inf_{q} \left\{ \log\left( \frac{1}{q} \, \tilde{B}\!\left( \frac{r}{1-r} - \frac{1}{q}, \frac{1}{q} \right) \right) + \frac{1}{2} q \sigma^2 - \frac{1}{2} \, \frac{1-r}{r} \, \sigma^2 - \frac{1}{2} \log\left( 2\pi r^{\frac{1}{r-1}} \sigma^2 \right) \right\}.$$
The functions $\Delta_r(X)$ and $\tilde{\Delta}_r(X)$ are illustrated in Figure 1 as a function of $r$ for various $\sigma^2$. The function $\Delta_r(X)$ is bounded uniformly with respect to $r$ and converges to zero as $r$ increases to one. The tightness of the two-moment inequality in this regime follows from the fact that the log-normal distribution maximizes Shannon entropy subject to a constraint on $\mathbb{E}[\log X]$. By contrast, the function $\tilde{\Delta}_r(X)$ varies with the parameter $\sigma^2$. For any fixed $r \in (0,1)$, it can be shown that $\tilde{\Delta}_r(X)$ increases to infinity if $\sigma^2$ converges to zero or infinity.

3.3. Example with Multivariate Gaussian Distribution

Next, we consider the case where $Y \sim \mathcal{N}(0, I_n)$ is an $n$-dimensional Gaussian vector with mean zero and identity covariance. The Rényi entropy is given by
$$h_r(Y) = \frac{n}{2} \log\left( 2\pi r^{\frac{1}{r-1}} \right),$$
and the $s$th moment of the magnitude $\|Y\|$ is given by
$$\mathbb{E}\|Y\|^s = 2^{\frac{s}{2}} \, \frac{\Gamma\left( \frac{n+s}{2} \right)}{\Gamma\left( \frac{n}{2} \right)}.$$
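The moment formula can be checked against the chi density of $\|Y\|$; the sketch below does so for $n = 3$ and $s = 1.7$ (an arbitrary illustrative choice):

```python
import math

# Check E||Y||^s = 2^(s/2) * Gamma((n+s)/2) / Gamma(n/2) for Y ~ N(0, I_n),
# using the chi density of ||Y||; here n = 3 and s = 1.7 (illustrative).

def trapezoid(g, a, b, m=100000):
    h = (b - a) / m
    return (0.5 * (g(a) + g(b)) + sum(g(a + i * h) for i in range(1, m))) * h

n, s = 3, 1.7
const = 1.0 / (2 ** (n / 2 - 1) * math.gamma(n / 2))
chi_pdf = lambda t: const * t ** (n - 1) * math.exp(-t * t / 2)
m_num = trapezoid(lambda t: t ** s * chi_pdf(t), 0.0, 40.0)
m_formula = 2 ** (s / 2) * math.gamma((n + s) / 2) / math.gamma(n / 2)
assert abs(m_num - m_formula) < 1e-6
```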
The next result shows that as the dimension n increases, the gap of the optimal two-moment inequality converges to the gap for the log-normal distribution. Moreover, for each r ( 0 , 1 ) , the following choice of moments is optimal in the large-n limit:
$$p_n = \frac{1-r}{r} - \frac{1}{\sqrt{r}} \sqrt{\frac{2(1-r)}{n+1}}, \qquad q_n = \frac{1-r}{r} + \frac{1}{\sqrt{r}} \sqrt{\frac{2(1-r)}{n+1}}. \tag{21}$$
The proof is given in Appendix B.3.
Proposition 6.
If $Y \sim \mathcal{N}(0, I_n)$, then, for each $r \in (0,1)$,
$$\lim_{n \to \infty} \Delta_r(Y) = \lim_{n \to \infty} \Delta_r(Y; p_n, q_n) = \Delta_r(X),$$
where X has a log-normal distribution and ( p n , q n ) are given by (21).
Figure 2 provides a comparison of $\Delta_r(Y)$, $\Delta_r(Y; p_n, q_n)$, and $\tilde{\Delta}_r(Y)$ as a function of $n$ for $r = 0.1$. Here, we see that both $\Delta_r(Y)$ and $\Delta_r(Y; p_n, q_n)$ converge rapidly to the asymptotic limit given by the gap of the log-normal distribution. By contrast, the gap of the optimal one-moment inequality $\tilde{\Delta}_r(Y)$ increases without bound.

3.4. Inequalities for Differential Entropy

Proposition 4 can also be used to recover some known inequalities for differential entropy by considering the limiting behavior as $r$ converges to one. For example, it is well known that the differential entropy of an $n$-dimensional random vector $X$ with finite second moment satisfies
$$h(X) \le \frac{n}{2} \log\left( 2\pi e \, \mathbb{E}\!\left[ \tfrac{1}{n} \|X\|^2 \right] \right), \tag{22}$$
with equality if and only if the entries of $X$ are i.i.d. zero-mean Gaussian. A generalization of this result in terms of an arbitrary positive moment is given by
$$h(X) \le \log \frac{\Gamma\left( \frac{n}{s} + 1 \right)}{\Gamma\left( \frac{n}{2} + 1 \right)} + \frac{n}{2} \log \pi + \frac{n}{s} \log\left( e s \, \mathbb{E}\!\left[ \tfrac{1}{n} \|X\|^s \right] \right), \tag{23}$$
for all $s > 0$. Note that Inequality (22) corresponds to the case $s = 2$.
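The following sketch checks Inequality (23) numerically for a standard Gaussian in one dimension with $s = 4$ (so $\mathbb{E}|X|^4 = 3$), and confirms that $s = 2$ reproduces the second-moment bound with equality:

```python
import math

# Check of the s-th moment bound on differential entropy for X ~ N(0, 1),
# n = 1:  h(X) = (1/2) log(2*pi*e) must lie below the bound for s = 4,
# and the bound at s = 2 must reduce to (22) with equality.
n = 1
h_gauss = 0.5 * math.log(2 * math.pi * math.e)

def moment_bound(s, EXs):
    return math.log(math.gamma(n / s + 1) / math.gamma(n / 2 + 1)) \
        + (n / 2) * math.log(math.pi) + (n / s) * math.log(math.e * s * EXs / n)

bound4 = moment_bound(4.0, 3.0)     # E|X|^4 = 3 for a standard Gaussian
bound2 = moment_bound(2.0, 1.0)     # E|X|^2 = 1: equality case
assert h_gauss <= bound4
assert abs(bound2 - h_gauss) < 1e-12
```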
Inequality (23) can be proved as an immediate consequence of Proposition 4 and the fact that $h_r(X)$ is nonincreasing in $r$. Using properties of the beta function given in Appendix A, it is straightforward to verify that
$$\lim_{r \to 1} \psi_r(0, q) = (eq)^{\frac{1}{q}} \, \Gamma\left( \tfrac{1}{q} + 1 \right), \quad \text{for all } q > 0.$$
Combining this result with Proposition 4 and Inequality (16) leads to
$$h(X) \le \log \omega(S) + \log \Gamma\left( \tfrac{1}{q} + 1 \right) + \frac{1}{q} \log\left( e q \, \mathbb{E}\|X\|^{nq} \right).$$
Using Inequality (10) and making the substitution $s = nq$ leads to Inequality (23).
Another example follows from the fact that the log-normal distribution maximizes the differential entropy of a positive random variable $X$ subject to constraints on the mean and variance of $\log(X)$, and hence
$$h(X) \le \mathbb{E}[\log X] + \frac{1}{2} \log\left( 2\pi e \operatorname{Var}(\log X) \right),$$
with equality if and only if $X$ is log-normal. In Appendix B.4, it is shown how this inequality can be proved using our two-moment inequalities by studying the behavior as both $p$ and $q$ converge to zero while $r$ increases to one.

4. Bounds on Mutual Information

4.1. Relative Entropy and Chi-Squared Divergence

Let $P$ and $Q$ be distributions defined on a common probability space that have densities $p$ and $q$ with respect to a dominating measure $\mu$. The relative entropy (or Kullback–Leibler divergence) is defined according to
$$D(P \,\|\, Q) = \int p \log \frac{p}{q} \, d\mu,$$
and the chi-squared divergence is defined as
$$\chi^2(P \,\|\, Q) = \int \frac{(p - q)^2}{q} \, d\mu.$$
Both of these divergences can be seen as special cases of the general class of $f$-divergence measures, and there exists a rich literature on comparisons between different divergences [8,26,27,28,29,30,31,32]. The chi-squared divergence can be viewed as the squared $L^2(\mu)$ distance between $p/\sqrt{q}$ and $\sqrt{q}$, and it can also be interpreted as the first nonzero term in the power series expansion of the relative entropy ([26], Lemma 4). More generally, the chi-squared divergence provides an upper bound on the relative entropy via
$$D(P \,\|\, Q) \le \log\left( 1 + \chi^2(P \,\|\, Q) \right). \tag{25}$$
The proof of this inequality follows straightforwardly from Jensen’s inequality and the concavity of the logarithm; see [27,31,32] for further refinements.
Given a random pair $(X, Y)$, the mutual information between $X$ and $Y$ is defined according to
$$I(X;Y) = D\left( P_{X,Y} \,\|\, P_X \otimes P_Y \right).$$
From Inequality (25), we see that the mutual information can always be upper bounded using
$$I(X;Y) \le \log\left( 1 + \chi^2(P_{X,Y} \,\|\, P_X \otimes P_Y) \right).$$
The next section provides bounds on the mutual information that can improve upon this inequality.

4.2. Mutual Information and Variance of Conditional Density

Let $(X, Y)$ be a random pair such that the conditional distribution of $Y$ given $X$ has a density $f_{Y|X}(y|x)$ with respect to the Lebesgue measure on $\mathbb{R}^n$. Note that the marginal density of $Y$ is given by $f_Y(y) = \mathbb{E}[f_{Y|X}(y|X)]$. To simplify notation, we will write $f(y|x)$ and $f(y)$ where the subscripts are implicit. The support set of $Y$ is denoted by $S_Y$.
The measure of the dependence between $X$ and $Y$ that is used in our bounds can be understood in terms of the variance of the conditional density. For each $y$, the conditional density $f(y|X)$ evaluated with a random realization of $X$ is a random variable. The variance of this random variable is given by
$$\operatorname{Var}(f(y|X)) = \mathbb{E}\left[ \left( f(y|X) - f(y) \right)^2 \right],$$
where we have used the fact that the marginal density $f(y)$ is the expectation of $f(y|X)$. The $s$th moment of the variance of the conditional density is defined according to
$$V_s(Y|X) = \int_{S_Y} \|y\|^s \operatorname{Var}(f(y|X)) \, dy.$$
The variance moment V s ( Y | X ) is nonnegative and equal to zero if and only if X and Y are independent.
The function $\kappa(t)$ is defined according to
$$\kappa(t) = \sup_{u \in (0,\infty)} \frac{\log(1+u)}{u^t}, \qquad t \in (0, 1]. \tag{29}$$
The proof of the following result is given in Appendix C. The behavior of κ ( t ) is illustrated in Figure 3.
Proposition 7.
The function $\kappa(t)$ defined in Equation (29) can be expressed as
$$\kappa(t) = \frac{\log(1+u)}{u^t}, \qquad t \in (0, 1],$$
where
$$u = \exp\left( W\!\left( -\tfrac{1}{t} \exp\left( -\tfrac{1}{t} \right) \right) + \tfrac{1}{t} \right) - 1,$$
and $W(\cdot)$ denotes Lambert's $W$-function, i.e., $W(z)$ is the unique solution to the equation $z = w \exp(w)$ on the interval $[-1, \infty)$. Furthermore, the function $g(t) = t \kappa(t)$ is strictly increasing on $(0, 1]$ with $\lim_{t \to 0} g(t) = 1/e$ and $g(1) = 1$, and thus
$$\frac{1}{et} \le \kappa(t) \le \frac{1}{t}, \qquad t \in (0, 1],$$
where the lower bound $1/(et)$ is tight for small values of $t \in (0,1)$ and the upper bound $1/t$ is tight for values of $t$ close to 1.
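The closed form for $\kappa(t)$ can be checked against a brute-force maximization over $u$. The sketch below implements Lambert's $W$ with a few Newton iterations (a simple stand-in for a library routine, valid for the principal branch on $(-1/e, 0)$) and recovers $\kappa(1/2) \approx 0.8047$:

```python
import math

# Evaluate kappa(t) two ways: via the Lambert-W closed form above and via
# a grid search over u; at t = 1/2 both give approximately 0.8047.

def lambert_w(z, w=0.0):
    """Principal-branch Lambert W via Newton's method on w*exp(w) = z."""
    for _ in range(100):
        ew = math.exp(w)
        w -= (w * ew - z) / (ew * (w + 1))
    return w

def kappa_closed(t):
    u = math.exp(lambert_w(-math.exp(-1 / t) / t) + 1 / t) - 1
    return math.log(1 + u) / u ** t

def kappa_grid(t):
    return max(math.log(1 + u) / u ** t
               for u in (k * 0.001 for k in range(1, 200000)))

t = 0.5
assert abs(kappa_closed(t) - kappa_grid(t)) < 1e-4
assert 1 / (math.e * t) <= kappa_closed(t) <= 1 / t   # sandwich bounds
```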
We are now ready to give the main results of this section, which are bounds on the mutual information. We begin with a general upper bound in terms of the variance of the conditional density.
Proposition 8.
For any $0 < t \le 1$, the mutual information satisfies
$$I(X;Y) \le \kappa(t) \int_{S_Y} f(y)^{1-2t} \left[ \operatorname{Var}(f(y|X)) \right]^t \, dy.$$
Proof. 
We use the following series of inequalities:
$$\begin{aligned} I(X;Y) &\overset{(a)}{=} \int f(y) \, D\left( P_{X|Y=y} \,\|\, P_X \right) dy \\ &\overset{(b)}{\le} \int f(y) \log\left( 1 + \chi^2(P_{X|Y=y} \,\|\, P_X) \right) dy \\ &\overset{(c)}{=} \int f(y) \log\left( 1 + \frac{\operatorname{Var}(f(y|X))}{f^2(y)} \right) dy \\ &\overset{(d)}{\le} \kappa(t) \int f(y) \left( \frac{\operatorname{Var}(f(y|X))}{f^2(y)} \right)^t dy, \end{aligned}$$
where (a) follows from the definition of mutual information, (b) follows from Inequality (25), and (c) follows from Bayes' rule, which allows us to write the chi-squared divergence in terms of the variance of the conditional density:
$$\chi^2(P_{X|Y=y} \,\|\, P_X) = \mathbb{E}\left[ \left( \frac{f(y|X)}{f(y)} - 1 \right)^2 \right] = \frac{\operatorname{Var}(f(y|X))}{f^2(y)}.$$
Inequality (d) follows from the nonnegativity of the variance and the definition of $\kappa(t)$. □
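Proposition 8 can be tested on a toy channel. The example below is an assumption for illustration, not one from the paper: $X$ uniform on $\{-1, +1\}$ and $Y = X + N(0,1)$, with the $t = 1/2$ case, where the bound reduces to $\kappa(1/2) \int \sqrt{\operatorname{Var}(f(y|X))} \, dy$:

```python
import math

# Toy check of Proposition 8 at t = 1/2: X uniform on {-1, +1}, Y = X + N(0,1),
# so f(y|x) = phi(y - x) and f(y) = (phi(y-1) + phi(y+1)) / 2.

def trapezoid(g, a, b, m=100000):
    h = (b - a) / m
    return (0.5 * (g(a) + g(b)) + sum(g(a + i * h) for i in range(1, m))) * h

phi = lambda y: math.exp(-y * y / 2) / math.sqrt(2 * math.pi)
fy = lambda y: 0.5 * (phi(y - 1) + phi(y + 1))

def mi_integrand(y):
    # sum_x P(x) f(y|x) log(f(y|x) / f(y))
    return sum(0.5 * phi(y - x) * math.log(phi(y - x) / fy(y))
               for x in (-1.0, 1.0))

I = trapezoid(mi_integrand, -15.0, 15.0)                 # about 0.34 nats
var = lambda y: 0.25 * (phi(y - 1) - phi(y + 1)) ** 2    # Var(f(y|X))
bound = 0.80477 * trapezoid(lambda y: math.sqrt(var(y)), -15.0, 15.0)
assert 0.0 < I <= bound
```

For this example the bound evaluates to roughly $0.55$ nats, comfortably above the true mutual information.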
Evaluating Proposition 8 with $t = 1$ recovers the well-known inequality $I(X;Y) \le \chi^2(P_{X,Y} \,\|\, P_X \otimes P_Y)$. The next two results follow from the cases $0 < t < 1/2$ and $t = 1/2$, respectively.
Proposition 9.
For any $0 < r < 1$, the mutual information satisfies
$$I(X;Y) \le \kappa(t) \left[ e^{h_r(Y)} \, V_0(Y|X) \right]^t,$$
where $t = (1-r)/(2-r)$.
Proof. 
Starting with Proposition 8 and applying Hölder's inequality with conjugate exponents $1/(1-t)$ and $1/t$ leads to
$$I(X;Y) \le \kappa(t) \left( \int f^r(y) \, dy \right)^{1-t} \left( \int \operatorname{Var}(f(y|X)) \, dy \right)^{t} = \kappa(t) \, e^{t h_r(Y)} \, V_0^t(Y|X),$$
where we have used the fact that $r = (1-2t)/(1-t)$. □
Proposition 10.
For any $p < 1 < q$, the mutual information satisfies
$$I(X;Y) \le C(\lambda) \sqrt{ \frac{ \omega(S_Y) \, V_{np}^{\lambda}(Y|X) \, V_{nq}^{1-\lambda}(Y|X) }{ q - p } },$$
where $\lambda = (q-1)/(q-p)$ and
$$C(\lambda) = \kappa\left( \tfrac{1}{2} \right) \sqrt{ \frac{\pi}{ \lambda^{\lambda} (1-\lambda)^{1-\lambda} \sin(\pi \lambda) } },$$
with $\kappa(\tfrac{1}{2}) \approx 0.80477$.
Proof. 
Evaluating Proposition 8 with $t = 1/2$ gives
$$I(X;Y) \le \kappa\left( \tfrac{1}{2} \right) \int_{S_Y} \sqrt{ \operatorname{Var}(f(y|X)) } \, dy.$$
Evaluating Proposition 3 with $r = 1/2$ leads to
$$\left( \int_{S_Y} \sqrt{ \operatorname{Var}(f(y|X)) } \, dy \right)^2 \le \omega(S_Y) \, \psi_{1/2}(p,q) \, V_{np}^{\lambda}(Y|X) \, V_{nq}^{1-\lambda}(Y|X).$$
Combining these inequalities with the expression for $\psi_{1/2}(p,q)$ given in Equation (8) completes the proof. □
The contribution of Propositions 9 and 10 is that they provide bounds on the mutual information in terms of quantities that can be easy to characterize. One application of these bounds is to establish conditions under which the mutual information corresponding to a sequence of random pairs $(X_k, Y_k)$ converges to zero. In this case, Proposition 9 provides a sufficient condition in terms of the Rényi entropy of $Y_k$ and the quantity $V_0(Y_k|X_k)$, while Proposition 10 provides a sufficient condition in terms of $V_s(Y_k|X_k)$ evaluated with two different values of $s$. These conditions are summarized in the following result.
Proposition 11.
Let $(X_k, Y_k)$ be a sequence of random pairs such that the conditional distribution of $Y_k$ given $X_k$ has a density on $\mathbb{R}^n$. The following are sufficient conditions under which the mutual information $I(X_k; Y_k)$ converges to zero as $k$ increases to infinity:
  • There exists $0 < r < 1$ such that
    $$\lim_{k \to \infty} e^{h_r(Y_k)} \, V_0(Y_k|X_k) = 0.$$
  • There exists $p < 1 < q$ such that
    $$\lim_{k \to \infty} V_{np}^{q-1}(Y_k|X_k) \, V_{nq}^{1-p}(Y_k|X_k) = 0.$$

4.3. Properties of the Bounds

The variance moment $V_s(Y|X)$ has a number of interesting properties. The variance of the conditional density can be expressed in terms of an expectation with respect to two independent random variables $X_1$ and $X_2$ with the same distribution as $X$ via the decomposition:
$$\operatorname{Var}(f(y|X)) = \mathbb{E}\left[ f(y|X) \, f(y|X) \right] - \mathbb{E}\left[ f(y|X_1) \, f(y|X_2) \right].$$
Consequently, by swapping the order of the integration and expectation, we obtain
$$V_s(Y|X) = \mathbb{E}\left[ K_s(X, X) \right] - \mathbb{E}\left[ K_s(X_1, X_2) \right], \tag{30}$$
where
$$K_s(x_1, x_2) = \int \|y\|^s f(y|x_1) \, f(y|x_2) \, dy.$$
The function K s ( x 1 , x 2 ) is a positive definite kernel that does not depend on the distribution of X. For s = 0 , this kernel has been studied previously in the machine learning literature [33], where it is referred to as the expected likelihood kernel.
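As a quick sanity check, the decomposition above can be verified numerically in a toy example. The following sketch is not part of the paper's development; it compares a direct quadrature of ∫ |y|^s Var(f(y|X)) dy against E[K_s(X,X)] − E[K_s(X_1,X_2)] for a one-dimensional Gaussian channel with a symmetric two-point input (all names are illustrative):

```python
import math

def phi(y, x):
    """Conditional density f(y|x) for Y = X + N(0, 1)."""
    return math.exp(-0.5 * (y - x) ** 2) / math.sqrt(2 * math.pi)

xs = [-1.0, 1.0]   # X uniform on {-1, +1}
s = 1.0            # moment order
lo, hi, m = -10.0, 10.0, 20000
dy = (hi - lo) / m
grid = [lo + i * dy for i in range(m + 1)]

# Left-hand side: quadrature of |y|^s Var(f(y|X)) over the grid
lhs = 0.0
for y in grid:
    m1 = sum(phi(y, x) for x in xs) / 2       # E[f(y|X)]
    m2 = sum(phi(y, x) ** 2 for x in xs) / 2  # E[f(y|X)^2]
    lhs += abs(y) ** s * (m2 - m1 ** 2) * dy

# Right-hand side: E[K_s(X,X)] - E[K_s(X1,X2)] with independent copies X1, X2
def K(x1, x2):
    return sum(abs(y) ** s * phi(y, x1) * phi(y, x2) * dy for y in grid)

rhs = sum(K(x, x) for x in xs) / 2 - sum(K(a, b) for a in xs for b in xs) / 4

assert abs(lhs - rhs) < 1e-8
```

The two sides agree up to floating-point error, since the identity is exact for any input distribution.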
The variance of the conditional density also satisfies a data processing inequality. Suppose that U → X → Y forms a Markov chain. Then, the square of the conditional density of Y given U can be expressed as
f_{Y|U}^2(y|u) = E[ f_{Y|X}(y|X_1) f_{Y|X}(y|X_2) | U = u ],
where (U, X_1, X_2) ∼ P_U P_{X_1|U} P_{X_2|U}, so that X_1 and X_2 are conditionally independent given U. Combining this expression with Equation (30) yields
V_s(Y|U) = E[ K_s(X_1, X_2) ] − E[ K_s(X̃_1, X̃_2) ],     (31)
where (X_1, X_2) are conditionally independent given U as above, and (X̃_1, X̃_2) denote independent copies of X.
Finally, it is easy to verify that the function V_s(Y|X) satisfies the scaling relation
V_s(aY|X) = |a|^{s−n} V_s(Y|X), for all a ≠ 0.
Using this scaling relationship, we see that the sufficient conditions in Proposition 11 are invariant to scaling of Y.
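The scaling relation can likewise be checked numerically. The sketch below (illustrative only, not from the paper) estimates the variance moment by quadrature for the same two-point input, once for Y and once for aY with a = 2, and compares the ratio with |a|^{s−n} in the case n = 1, s = 2:

```python
import math

def density(y, x, a):
    """Density of aY = a(X + W) at y, given X = x, with W ~ N(0, 1)."""
    z = y / a
    return math.exp(-0.5 * (z - x) ** 2) / (abs(a) * math.sqrt(2 * math.pi))

def V(s, a, xs=(-1.0, 1.0), lo=-20.0, hi=20.0, m=40000):
    """Quadrature estimate of the integral of |y|^s Var(f_{aY}(y|X)) dy."""
    dy = (hi - lo) / m
    total = 0.0
    for i in range(m + 1):
        y = lo + i * dy
        m1 = sum(density(y, x, a) for x in xs) / len(xs)
        m2 = sum(density(y, x, a) ** 2 for x in xs) / len(xs)
        total += abs(y) ** s * (m2 - m1 ** 2) * dy
    return total

s, n, a = 2.0, 1, 2.0
ratio = V(s, a) / V(s, 1.0)
assert abs(ratio - abs(a) ** (s - n)) < 1e-3
```

The ratio matches |a|^{s−n} up to quadrature error, consistent with the scale invariance of the sufficient conditions in Proposition 11.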

4.4. Example with Additive Gaussian Noise

We now provide a specific example of our bounds on the mutual information. Let X ∈ R^n be a random vector with distribution P_X and let Y be the output of a Gaussian noise channel
Y = X + W,     (32)
where W ∼ N(0, I_n) is independent of X. If X has finite second moment, then the mutual information satisfies
I(X;Y) ≤ (n/2) log( 1 + E[‖X‖^2]/n ),
where equality is attained if and only if X has a zero-mean isotropic Gaussian distribution. This inequality follows straightforwardly from the fact that the Gaussian distribution maximizes differential entropy subject to a second moment constraint [11]. One limitation of this bound is that it can be loose when the second moment is dominated by events of small probability. In fact, it is easy to construct examples for which X does not have a finite second moment, and yet I(X;Y) is arbitrarily close to zero.
Our results provide bounds on I(X;Y) that are less sensitive to the effects of rare events. Let ϕ_n(x) = (2π)^{−n/2} exp(−‖x‖^2/2) denote the density of the standard Gaussian distribution on R^n. The product of the conditional densities can be factored according to
f(y − x_1) f(y − x_2) = ϕ_{2n}( (y − x_1, y − x_2) ) = ϕ_{2n}( (√2 y − (x_1 + x_2)/√2, (x_1 − x_2)/√2) ) = ϕ_n( √2 y − (x_1 + x_2)/√2 ) ϕ_n( (x_1 − x_2)/√2 ),
where the second step follows because ϕ_{2n}(·) is invariant to orthogonal transformations. Integrating with respect to y leads to
K_s(x_1, x_2) = 2^{−(n+s)/2} E[ ‖W + (x_1 + x_2)/√2‖^s ] ϕ_n( (x_1 − x_2)/√2 ),
where we recall that W ∼ N(0, I_n). For the case s = 0, we see that K_0(x_1, x_2) is a Gaussian kernel, and thus
V_0(Y|X) = (4π)^{−n/2} ( 1 − E[ e^{−‖X_1 − X_2‖^2/4} ] ).     (34)
A useful property of V_0(Y|X) is that the conditions under which it converges to zero are weaker than the conditions needed for other measures of dependence. Observe that the expectation in Equation (34) is bounded uniformly with respect to (X_1, X_2). In particular, for every ϵ > 0 and x ∈ R^n, we have
1 − E[ e^{−‖X_1 − X_2‖^2/4} ] ≤ ϵ^2 + 2 P( ‖X − x‖ ≥ ϵ ),
where we have used the inequality 1 − e^{−u} ≤ u and the fact that P( ‖X_1 − X_2‖ ≥ 2ϵ ) ≤ 2 P( ‖X − x‖ ≥ ϵ ). Consequently, V_0(Y|X) converges to zero whenever X converges in probability to a constant value x.
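For intuition, the closed-form expression for V_0(Y|X) can be compared against direct numerical integration of ∫ Var(f(y|X)) dy in a scalar example. This is an illustrative sketch, not part of the paper; it takes X uniform on {−1, +1}, so that ‖X_1 − X_2‖^2 equals 0 with probability 1/2 and 4 with probability 1/2:

```python
import math

def phi(y, x):
    """Conditional density of Y = X + N(0, 1) at y given X = x."""
    return math.exp(-0.5 * (y - x) ** 2) / math.sqrt(2 * math.pi)

xs = [-1.0, 1.0]
lo, hi, m = -12.0, 12.0, 24000
dy = (hi - lo) / m

# direct quadrature of the integral of Var(f(y|X)) dy
v0 = 0.0
for i in range(m + 1):
    y = lo + i * dy
    m1 = sum(phi(y, x) for x in xs) / 2
    m2 = sum(phi(y, x) ** 2 for x in xs) / 2
    v0 += (m2 - m1 ** 2) * dy

# closed form for n = 1: (4*pi)^(-1/2) * (1 - E[exp(-(X1 - X2)^2 / 4)])
closed = (4 * math.pi) ** -0.5 * (1.0 - (0.5 + 0.5 * math.exp(-1.0)))

assert abs(v0 - closed) < 1e-5
```

The quadrature and the closed form agree up to discretization error.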
To study some further properties of these bounds, we now focus on the case where X is a Gaussian scalar mixture generated according to
X = √U · A,  A ∼ N(0,1),  U ≥ 0,     (35)
with A and U independent. In this case, the expectations with respect to the kernel K_s(x_1, x_2) can be computed explicitly, leading to
V_s(Y|X) = ( Γ((1+s)/2) / (2π) ) E[ (1 + 2U)^{s/2} − (1 + U_1)^{s/2} (1 + U_2)^{s/2} (1 + (U_1 + U_2)/2)^{−(s+1)/2} ],     (36)
where (U_1, U_2) are independent copies of U. It can be shown that this expression depends primarily on the magnitude of U. This is not surprising, given that X converges to a constant if and only if U converges to zero.
Our results can also be used to bound the mutual information I(U;Y) by noting that U → X → Y forms a Markov chain, and taking advantage of the characterization provided in Equation (31). Letting X_1 = A_1 √U and X_2 = A_2 √U with (A_1, A_2, U) mutually independent leads to
V_s(Y|U) = ( Γ((1+s)/2) / (2π) ) E[ (1 + U_1)^{(s−1)/2} − (1 + U_1)^{s/2} (1 + U_2)^{s/2} (1 + (U_1 + U_2)/2)^{−(s+1)/2} ].     (37)
In this case, V_s(Y|U) is a measure of the variation in U. To study its behavior, we consider the simple upper bound
V_s(Y|U) ≤ ( Γ((1+s)/2) / (2π) ) P( U_1 ≠ U_2 ) E[ (1 + U)^{(s−1)/2} ],
which follows from noting that the term inside the expectation in Equation (37) is zero on the event U_1 = U_2. This bound shows that, if s ≤ 1, then V_s(Y|U) is bounded uniformly with respect to the distribution of U, and, if s > 1, then V_s(Y|U) is bounded in terms of the ((s−1)/2)-th moment of U.
In conjunction with Propositions 9 and 10, the function V s ( Y | U ) provides bounds on the mutual information I ( U ; Y ) that can be expressed in terms of simple expectations involving two independent copies of U. Figure 4 provides an illustration of the upper bound in Proposition 10 for the case where U is a discrete random variable supported on two points, and X and Y are generated according to Equations (32) and (35). This example shows that there exist sequences of distributions for which our upper bounds on the mutual information converge to zero while the chi-squared divergence between P X Y and P X P Y is bounded away from zero.

5. Conclusions

This paper provides bounds on Rényi entropy and mutual information that are based on a relatively simple two-moment inequality. Extensions to inequalities with more moments are worth exploring. Another potential application is to provide a refined characterization of the “all-or-nothing” behavior seen in a sparse linear regression problem [34,35], where the current methods of analysis depend on a complicated conditional second moment method.

Funding

This research was supported in part by the National Science Foundation under Grant 1750362 and in part by the Laboratory for Analytic Sciences (LAS). Any opinions, findings, conclusions, and recommendations expressed in this material are those of the author and do not necessarily reflect the views of the sponsors.

Conflicts of Interest

The author declares no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A. The Gamma and Beta Functions

This section reviews some properties of the gamma and beta functions. For x > 0, the gamma function is defined according to Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt. Binet's formula for the logarithm of the gamma function ([25], Theorem 1.6.3) gives
log Γ(x) = ( x − 1/2 ) log x − x + (1/2) log(2π) + θ(x),     (A1)
where the remainder term θ(x) is convex and nonincreasing with lim_{x→0} θ(x) = ∞ and lim_{x→∞} θ(x) = 0. Euler's reflection formula ([25], Theorem 1.2.1) gives
Γ(x) Γ(1−x) = π / sin(πx),  0 < x < 1.     (A2)
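Both identities are easy to confirm numerically with the standard library; for instance, the following illustrative sketch checks the reflection formula at a few points:

```python
import math

# Gamma(x) * Gamma(1 - x) should equal pi / sin(pi * x) for 0 < x < 1
for x in (0.1, 0.25, 0.5, 0.9):
    lhs = math.gamma(x) * math.gamma(1.0 - x)
    rhs = math.pi / math.sin(math.pi * x)
    assert abs(lhs - rhs) < 1e-9 * rhs
```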
For x, y > 0, the beta function can be expressed as follows:
B(x,y) = Γ(x) Γ(y) / Γ(x+y) = ∫_0^1 t^{x−1} (1−t)^{y−1} dt = ∫_0^∞ u^{x−1} (1+u)^{−(x+y)} du,     (A3)
where the second integral expression follows from the change of variables t = u/(1+u). Recall that B̃(x,y) = B(x,y) (x+y)^{x+y} x^{−x} y^{−y}. Using Equation (A1) leads to
log( B̃(x,y) √( xy / (2π(x+y)) ) ) = θ(x) + θ(y) − θ(x+y).     (A4)
It can also be shown that ([36], Equation (2), p. 2)
B̃(x,y) ≥ (x+y)/(xy).     (A5)
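The function B̃ and the lower bound above can be checked numerically as follows (an illustrative sketch; the grid of test points is arbitrary):

```python
import math

def B(x, y):
    """Beta function via the gamma function."""
    return math.gamma(x) * math.gamma(y) / math.gamma(x + y)

def B_tilde(x, y):
    """B(x,y) * (x+y)^(x+y) / (x^x * y^y)."""
    return B(x, y) * (x + y) ** (x + y) / (x ** x * y ** y)

# check the lower bound B_tilde(x,y) >= (x+y)/(x*y) on a small grid
for x in (0.2, 0.5, 1.0, 3.0):
    for y in (0.3, 1.0, 2.5):
        assert B_tilde(x, y) >= (x + y) / (x * y)
```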

Appendix B. Details for Rényi Entropy Examples

This appendix studies properties of the two-moment inequalities for Rényi entropy described in Section 3.

Appendix B.1. Log-Normal Distribution

Let X be a log-normal random variable with parameters (μ, σ^2) and consider the parametrization
p = (1−r)/r − (1−λ) √( (1−r)u / (rλ(1−λ)) ),  q = (1−r)/r + λ √( (1−r)u / (rλ(1−λ)) ),
where λ ∈ (0,1) and u ∈ (0,∞). Then, we have
ψ_r(p,q) = B̃( rλ/(1−r), r(1−λ)/(1−r) ) √( rλ(1−λ) / ((1−r)u) ),  L_r(X; p, q) = μ + (1/2)((1−r)/r) σ^2 + (1/2) u σ^2.
Combining these expressions with Equation (A4) leads to
Δ_r(X; p, q) = θ( rλ/(1−r) ) + θ( r(1−λ)/(1−r) ) − θ( r/(1−r) ) + (1/2) u σ^2 − (1/2) log( u σ^2 ) − (1/2) log( r^{1/(r−1)} ).     (A6)
We now characterize the minimum with respect to the parameters (λ, u). Note that the mapping λ ↦ θ( rλ/(1−r) ) + θ( r(1−λ)/(1−r) ) is convex and symmetric about the point λ = 1/2. Therefore, the minimum with respect to λ is attained at λ = 1/2. Meanwhile, the mapping u ↦ u σ^2 − log( u σ^2 ) is convex and attains its minimum at u = 1/σ^2. Evaluating Equation (A6) with these values, we see that the optimal two-moment inequality can be expressed as
Δ_r(X) = 2 θ( r/(2(1−r)) ) − θ( r/(1−r) ) + (1/2) log( e · r^{1/(1−r)} ).
By Equation (A4), this expression is equivalent to Equation (A1). Moreover, the fact that Δ_r(X) decreases to zero as r increases to one follows from the fact that θ(x) decreases to zero as x increases to infinity.
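The limiting behavior of the optimal gap can be confirmed numerically by computing θ through Binet's formula, i.e., θ(x) = log Γ(x) − [(x − 1/2) log x − x + (1/2) log(2π)]. The sketch below is illustrative only and assumes the expression for Δ_r(X) stated above; it evaluates the gap at several values of r and checks that it decreases toward zero:

```python
import math

def theta(x):
    """Binet remainder: log Gamma(x) - [(x - 1/2) log x - x + (1/2) log(2 pi)]."""
    return math.lgamma(x) - ((x - 0.5) * math.log(x) - x + 0.5 * math.log(2 * math.pi))

def gap(r):
    """Optimal two-moment gap for the log-normal distribution (expression above)."""
    return (2 * theta(r / (2 * (1 - r))) - theta(r / (1 - r))
            + 0.5 * (1 + math.log(r) / (1 - r)))

vals = [gap(r) for r in (0.5, 0.9, 0.99)]
assert vals[0] > vals[1] > vals[2] > 0
assert vals[2] < 1e-4
```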
Next, we express the gap in terms of the pair (p, q). Taking the difference between Δ_r(X; p, q) and Δ_r(X) leads to
Δ_r(X; p, q) = Δ_r(X) + (1/2) φ( rλ(1−λ)(q−p)^2 σ^2 / (1−r) ) + θ( rλ/(1−r) ) + θ( r(1−λ)/(1−r) ) − 2 θ( r/(2(1−r)) ),
where φ(x) = x − log(x) − 1. In particular, if p = 0, then we obtain the simplified expression
Δ_r(X; 0, q) = Δ_r(X) + (1/2) φ( ( q − (1−r)/r ) σ^2 ) + θ( r/(1−r) − 1/q ) + θ( 1/q ) − 2 θ( r/(2(1−r)) ).
This characterization shows that the gap of the optimal one-moment inequality Δ̃_r(X) increases to infinity in the limit as either σ^2 → 0 or σ^2 → ∞.

Appendix B.2. Multivariate Gaussian Distribution

Let Y ∼ N(0, I_n) be an n-dimensional Gaussian vector and consider the parametrization
p = (1−r)/r − ((1−λ)/r) √( 2(1−r)z / (λ(1−λ)n) ),  q = (1−r)/r + (λ/r) √( 2(1−r)z / (λ(1−λ)n) ),
where λ ∈ (0,1) and z ∈ (0,∞). We can write
log ω(S_Y) = (n/2) log π − log(n/2) − log Γ(n/2),  ψ_r(p,q) = B̃( rλ/(1−r), r(1−λ)/(1−r) ) √( n r^2 λ(1−λ) / (2z(1−r)) ).
Furthermore, if
(1−λ) √( 2(1−r)z / (λ(1−λ)n) ) < 1,     (A7)
then L_r(Y^n; p, q) is finite and is given by
L_r(Y^n; p, q) = Q_{r,n}(λ, z) + (n/2) log 2 + (r/(1−r)) [ log Γ( n/(2r) ) − log Γ( n/2 ) ],
where
Q_{r,n}(λ, z) = ( rλ/(1−r) ) log Γ( n/(2r) − (1−λ) √( (1−r)nz / (2 r^2 λ(1−λ)) ) ) + ( r(1−λ)/(1−r) ) log Γ( n/(2r) + λ √( (1−r)nz / (2 r^2 λ(1−λ)) ) ) − ( r/(1−r) ) log Γ( n/(2r) ).     (A8)
Here, we note that the scaling in Equation (21) corresponds to λ = 1/2 and z = n/(n+1), and thus the condition in Inequality (A7) is satisfied for all n ≥ 1. Combining the above expressions and then using Equations (A1) and (A4) leads to
Δ_r(Y; p, q) = θ( rλ/(1−r) ) + θ( r(1−λ)/(1−r) ) − θ( r/(1−r) ) + Q_{r,n}(λ, z) − (1/2) log z − (1/2) log( r^{1/(r−1)} ) + (r/(1−r)) θ( n/(2r) ) − (1/(1−r)) θ( n/2 ).     (A9)
Next, we study some properties of Q_{r,n}(λ, z). By Equation (A1), the logarithm of the gamma function can be expressed as the sum of convex functions:
log Γ(x) = φ(x) + (1/2) log(1/x) + (1/2) log(2π) − 1 + θ(x),
where φ(x) = x log x + 1 − x. Starting with the definition of Q_{r,n}(λ, z) and then using Jensen's inequality yields
Q_{r,n}(λ, z) ≥ ( rλ/(1−r) ) φ( n/(2r) − (1−λ) √( (1−r)nz / (2 r^2 λ(1−λ)) ) ) + ( r(1−λ)/(1−r) ) φ( n/(2r) + λ √( (1−r)nz / (2 r^2 λ(1−λ)) ) ) − ( r/(1−r) ) φ( n/(2r) ) = (λ/a) φ( 1 − √( (1−λ)az/λ ) ) + ((1−λ)/a) φ( 1 + √( λaz/(1−λ) ) ),
where a = 2(1−r)/n. Using the inequality φ(x) ≥ (3/2)(x−1)^2/(x+2) leads to
Q_{r,n}(λ, z) ≥ (z/2) [ (1−λ) ( 1 − √( (1−λ)bz/λ ) )^{−1} + λ ( 1 + √( λbz/(1−λ) ) )^{−1} ] ≥ (z/2) ( 1 + √( λbz/(1−λ) ) )^{−1},     (A10)
where b = 2(1−r)/(9n).
Observe that the right-hand side of Inequality (A10) converges to z/2 as n increases to infinity. It turns out that this limiting behavior is tight. Using Equation (A1), it is straightforward to show that Q_{r,n}(λ, z) converges pointwise to z/2 as n increases to infinity; that is,
lim_{n→∞} Q_{r,n}(λ, z) = z/2,     (A11)
for any fixed pair (λ, z) ∈ (0,1) × (0,∞).

Appendix B.3. Proof of Proposition 6

Let D = (0,1) × (0,∞). For fixed r ∈ (0,1), we use Q_n(λ, z) to denote the function Q_{r,n}(λ, z) defined in Equation (A8), and we use G_n(λ, z) to denote the right-hand side of Equation (A9). These functions are defined to be equal to positive infinity for any pair (λ, z) ∈ D such that Inequality (A7) does not hold.
Note that the terms θ(n/(2r)) and θ(n/2) converge to zero in the limit as n increases to infinity. In conjunction with Equation (A11), this shows that G_n(λ, z) converges pointwise to a limit G(λ, z) given by
G(λ, z) = θ( rλ/(1−r) ) + θ( r(1−λ)/(1−r) ) − θ( r/(1−r) ) + z/2 − (1/2) log z − (1/2) log( r^{1/(r−1)} ).
At this point, the correspondence with the log-normal distribution can be seen from the fact that G(λ, z) is equal to the right-hand side of Equation (A6) evaluated with u σ^2 = z.
To show that the gap corresponding to the log-normal distribution provides an upper bound on the limit, we use
lim sup_{n→∞} Δ_r(Y) = lim sup_{n→∞} inf_{(λ,z)∈D} G_n(λ, z) ≤ inf_{(λ,z)∈D} lim sup_{n→∞} G_n(λ, z) = inf_{(λ,z)∈D} G(λ, z) = Δ_r(X).     (A12)
Here, the last equality follows from the analysis in Appendix B.1, which shows that the minimum of G(λ, z) is attained at λ = 1/2 and z = 1.
Proving the lower bound requires a bit more work. Fix any ϵ ∈ (0,1) and let D_ϵ = (0, 1−ϵ] × (0,∞). Using the lower bound on Q_n(λ, z) given in Inequality (A10), it can be verified that
lim inf_{n→∞} inf_{(λ,z)∈D_ϵ} [ Q_{r,n}(λ, z) − (1/2) log z ] ≥ 1/2.
Consequently, we have
lim inf_{n→∞} inf_{(λ,z)∈D_ϵ} G_n(λ, z) ≥ inf_{(λ,z)∈D_ϵ} G(λ, z) ≥ Δ_r(X).     (A13)
To complete the proof, we will show that, for any sequence λ_n that converges to one as n increases to infinity, we have
lim inf_{n→∞} inf_{z∈(0,∞)} G_n(λ_n, z) = ∞.     (A14)
To see why this is the case, note that, by Equation (A4) and Inequality (A5),
θ( rλ/(1−r) ) + θ( r(1−λ)/(1−r) ) − θ( r/(1−r) ) ≥ (1/2) log( (1−r) / (2π r λ(1−λ)) ).
Therefore, we can write
G_n(λ, z) ≥ Q_n(λ, z) − (1/2) log( λ(1−λ) z ) + c_n,
where c_n is bounded uniformly for all n. Making the substitution u = λ(1−λ) z, we obtain
inf_{z>0} G_n(λ, z) ≥ inf_{u>0} [ Q_n( λ, u/(λ(1−λ)) ) − (1/2) log u ] + c_n.     (A15)
Next, let b_n = 2(1−r)/(9n). The lower bound in Inequality (A10) leads to
inf_{u>0} [ Q_n( λ, u/(λ(1−λ)) ) − (1/2) log u ] ≥ inf_{u>0} [ u / ( 2λ( (1−λ) + √(b_n u) ) ) − (1/2) log u ].     (A16)
The limiting behavior in Equation (A14) can now be seen as a consequence of Inequality (A15) and the fact that, for any sequence λ_n converging to one, the right-hand side of Inequality (A16) increases without bound as n increases. Combining Inequality (A12), Inequality (A13), and Equation (A14) establishes that the large-n limit of Δ_r(Y) exists and is equal to Δ_r(X). This concludes the proof of Proposition 6.

Appendix B.4. Proof of Inequality (24)

Given any λ ∈ (0,1) and u ∈ (0,∞), let
p(r) = (1−r)/r − √( (1−r)(1−λ)u / (rλ) ),  q(r) = (1−r)/r + √( (1−r)λu / (r(1−λ)) ).
We need the following results, which characterize the terms in Proposition 4 in the limit as r increases to one.
Lemma A1.
The function ψ_r(p(r), q(r)) satisfies
lim_{r→1} ψ_r( p(r), q(r) ) = √( 2π/u ).
Proof. 
Starting with Equation (A4), we can write
ψ_r(p,q) = ( 1/(q−p) ) √( 2π(1−r) / (rλ(1−λ)) ) exp( θ( rλ/(1−r) ) + θ( r(1−λ)/(1−r) ) − θ( r/(1−r) ) ).
As r converges to one, the terms in the exponent converge to zero. Noting that q(r) − p(r) = √( (1−r)u / (rλ(1−λ)) ) completes the proof. □
Lemma A2.
If X is a random variable such that s ↦ E[|X|^s] is finite in a neighborhood of zero, then E[log|X|] and Var(log|X|) are finite, and
lim_{r→1} L_r( X; p(r), q(r) ) = E[log|X|] + (u/2) Var(log|X|).
Proof. 
Let Λ(s) = log E[|X|^s]. The assumption that E[|X|^s] is finite in a neighborhood of zero means that E[(log|X|)^m] is finite for all positive integers m, and thus Λ(s) is real analytic in a neighborhood of zero. Hence, there exist constants δ > 0 and C < ∞, depending on the distribution of X, such that
| Λ(s) − a s − b s^2 | ≤ C |s|^3, for all |s| ≤ δ,
where a = E[log|X|] and b = (1/2) Var(log|X|). Consequently, for all r such that −δ < p(r) < (1−r)/r < q(r) < δ, it follows that
| L_r( X; p(r), q(r) ) − a − u b | ≤ C ( r/(1−r) ) ( λ |p(r)|^3 + (1−λ) |q(r)|^3 ).
Taking the limit as r increases to one completes the proof. □
We are now ready to prove Inequality (24). Combining Proposition 4 with Lemma A1 and Lemma A2 yields
lim sup_{r→1} h_r(X) ≤ (1/2) log( 2π/u ) + E[log|X|] + (u/2) Var(log|X|).
The stated inequality follows from evaluating the right-hand side with u = 1/Var(log|X|) and recalling that h(X) corresponds to the limit of h_r(X) as r increases to one.

Appendix C. Proof of Proposition 7

The function κ : (0,1] → R_+ can be expressed as
κ(t) = sup_{u∈(0,∞)} ρ_t(u),
where ρ_t(u) = log(1+u)/u^t. For t = 1, the bound log(1+u) ≤ u implies that ρ_1(u) ≤ 1. Noting that lim_{u→0} ρ_1(u) = 1, we conclude that κ(1) = 1.
Next, we consider the case t ∈ (0,1). The function ρ_t is continuously differentiable on (0,∞) with
sgn( ρ_t′(u) ) = sgn( u − t(1+u) log(1+u) ).     (A18)
Under the assumption t ∈ (0,1), we see that ρ_t(u) is increasing for all u sufficiently close to zero and decreasing for all u sufficiently large, and thus the supremum is attained at a stationary point of ρ_t(u) on (0,∞). Making the substitution w = log(1+u) − 1/t leads to
ρ_t′(u) = 0 ⟺ w e^w = −(1/t) e^{−1/t}.
For t ∈ (0,1), it follows that −(1/t) e^{−1/t} ∈ (−e^{−1}, 0), and thus ρ_t′(u) has a unique root that can be expressed as
u_t* = exp( W( −(1/t) e^{−1/t} ) + 1/t ) − 1,
where Lambert's function W(z) is the solution to the equation z = w e^w on the interval [−1, ∞).
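Numerically, κ(t) can be evaluated without Lambert's function by solving the stationarity condition directly: writing v = log(1+u), the equation ρ_t′(u) = 0 becomes 1 − e^{−v} = t v, which can be solved by bisection. The sketch below (illustrative only, not part of the paper) recovers the constant κ(1/2) ≈ 0.80477 used in Proposition 10:

```python
import math

def kappa(t):
    """kappa(t) = sup_{u>0} log(1+u)/u^t for t in (0,1), computed by
    solving the stationarity condition 1 - exp(-v) = t*v with v = log(1+u)."""
    f = lambda v: 1.0 - math.exp(-v) - t * v
    lo, hi = 1e-9, 2.0 / t      # f(lo) > 0 and f(hi) < 0 for all t in (0,1)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    v = 0.5 * (lo + hi)
    u = math.expm1(v)           # u_t^* = e^v - 1
    return v / u ** t

# matches the constant kappa(1/2) quoted in Proposition 10
assert abs(kappa(0.5) - 0.80477) < 1e-3

# g(t) = t * kappa(t) is increasing, consistent with Lemma A3 below
gs = [t * kappa(t) for t in (0.2, 0.4, 0.6, 0.8)]
assert all(a < b for a, b in zip(gs, gs[1:]))
```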
Lemma A3.
The function g ( t ) = t κ ( t ) is strictly increasing on ( 0 , 1 ] with lim t 0 g ( t ) = 1 / e and g ( 1 ) = 1 .
Proof. 
The fact that g(1) = 1 follows from κ(1) = 1. By the envelope theorem [37], the derivative of g(t) can be expressed as
g′(t) = d/dt [ t ρ_t(u) ] |_{u = u_t*} = log(1 + u_t*)/(u_t*)^t − t log(u_t*) log(1 + u_t*)/(u_t*)^t.
In view of Equation (A18), it follows that ρ_t′(u_t*) = 0 can be expressed equivalently as
u_t* / ( (1 + u_t*) log(1 + u_t*) ) = t,     (A19)
and thus
sgn( g′(t) ) = sgn( 1 − u_t* log(u_t*) / ( (1 + u_t*) log(1 + u_t*) ) ).     (A20)
Noting that u log u < (1+u) log(1+u) for all u ∈ (0,∞), it follows that g′(t) is strictly positive, and thus g(t) is strictly increasing.
To prove the small-t limit, we use Equation (A19) to write
log g(t) = log( u_t*/(1 + u_t*) ) − u_t* log(u_t*) / ( (1 + u_t*) log(1 + u_t*) ).     (A21)
Now, as t decreases to zero, Equation (A19) shows that u_t* increases to infinity. By Equation (A21), it then follows that log g(t) converges to negative one, which proves the desired limit. □

References

  1. Dembo, A.; Cover, T.M.; Thomas, J.A. Information Theoretic Inequalities. IEEE Trans. Inf. Theory 1991, 37, 1501–1518. [Google Scholar] [CrossRef] [Green Version]
  2. Carlson, F. Une inégalité. Ark. Mat. Astron. Fys. 1934, 25, 1–5. [Google Scholar]
  3. Levin, V.I. Exact constants in inequalities of the Carlson type. Doklady Akad. Nauk. SSSR (N. S.) 1948, 59, 635–638. [Google Scholar]
  4. Larsson, L.; Maligranda, L.; Persson, L.E.; Pečarić, J. Multiplicative Inequalities of Carlson Type and Interpolation; World Scientific Publishing Company: Singapore, 2006. [Google Scholar]
  5. Barza, S.; Burenkov, V.; Pečarić, J.E.; Persson, L.E. Sharp multidimensional multiplicative inequalities for weighted Lp spaces with homogeneous weights. Math. Inequalities Appl. 1998, 1, 53–67. [Google Scholar] [CrossRef] [Green Version]
  6. Reeves, G. Two-Moment Inequalities for Rényi Entropy and Mutual Information. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 664–668. [Google Scholar]
  7. Gray, R.M. Entropy and Information Theory; Springer-Verlag: Berlin/Heidelberg, Germany, 2013. [Google Scholar]
  8. van Erven, T.; Harremoës, P. Rényi Divergence and Kullback–Leibler Divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820. [Google Scholar] [CrossRef] [Green Version]
  9. Atar, R.; Chowdhary, K.; Dupuis, P. Robust Bounds on Risk-Sensitive Functionals via Rényi Divergence. SIAM/ASA J. Uncertain. Quantif. 2015, 3, 18–33. [Google Scholar] [CrossRef] [Green Version]
  10. Rosenkrantz, R. (Ed.) E. T. Jaynes: Papers on Probability, Statistics and Statistical Physics; Springer: Berlin/Heidelberg, Germany, 1989. [Google Scholar]
  11. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley-Interscience: Hoboken, NJ, USA, 2006. [Google Scholar]
  12. Lutwak, E.; Yang, D.; Zhang, G. Moment-entropy inequalities. Ann. Probab. 2004, 32, 757–774. [Google Scholar] [CrossRef]
  13. Lutwak, E.; Yang, D.; Zhang, G. Moment-Entropy Inequalities for a Random Vector. IEEE Trans. Inf. Theory 2007, 53, 1603–1607. [Google Scholar] [CrossRef]
  14. Lutwak, E.; Lv, S.; Yang, D.; Zhang, G. Affine Moments of a Random Vector. IEEE Trans. Inf. Theory 2013, 59, 5592–5599. [Google Scholar] [CrossRef]
  15. Costa, J.A.; Hero, A.O.; Vignat, C. A Characterization of the Multivariate Distributions Maximizing Rényi Entropy. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Lausanne, Switzerland, 30 June–5 July 2002. [Google Scholar] [CrossRef]
  16. Costa, J.A.; Hero, A.O.; Vignat, C. A Geometric Characterization of Maximum Rényi Entropy Distributions. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Seattle, WA, USA, 9–14 July 2006; pp. 1822–1826. [Google Scholar]
  17. Johnson, O.; Vignat, C. Some results concerning maximum Rényi entropy distributions. Ann. de l’Institut Henri Poincaré (B) Probab. Stat. 2007, 43, 339–351. [Google Scholar] [CrossRef] [Green Version]
  18. Nguyen, V.H. A simple proof of the Moment-Entropy inequalities. Adv. Appl. Math. 2019, 108, 31–44. [Google Scholar] [CrossRef]
  19. Barron, A.; Yang, Y. Information-theoretic determination of minimax rates of convergence. Ann. Stat. 1999, 27, 1564–1599. [Google Scholar] [CrossRef]
  20. Wu, Y.; Xu, J. Statistical problems with planted structures: Information-theoretical and computational limits. In Information-Theoretic Methods in Data Science; Rodrigues, M.R.D., Eldar, Y.C., Eds.; Cambridge University Press: Cambridge, UK, 2020; Chapter 13. [Google Scholar]
  21. Reeves, G. Conditional Central Limit Theorems for Gaussian Projections. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 3055–3059. [Google Scholar]
  22. Reeves, G.; Pfister, H.D. The Replica-Symmetric Prediction for Random Linear Estimation with Gaussian Matrices is Exact. IEEE Trans. Inf. Theory 2019, 65, 2252–2283. [Google Scholar] [CrossRef]
  23. Reeves, G.; Pfister, H.D. Understanding Phase Transitions via Mutual Information and MMSE. In Information-Theoretic Methods in Data Science; Rodrigues, M.R.D., Eldar, Y.C., Eds.; Cambridge University Press: Cambridge, UK, 2020; Chapter 7. [Google Scholar]
  24. Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 1970. [Google Scholar]
  25. Andrews, G.E.; Askey, R.; Roy, R. Special Functions; Vol. 71, Encyclopedia of Mathematics and its Applications; Cambridge University Press: Cambridge, UK, 1999. [Google Scholar]
  26. Nielsen, F.; Nock, R. On the Chi Square and Higher-Order Chi Distances for Approximating f-Divergences. IEEE Signal Process. Lett. 2014, 21, 10–13. [Google Scholar]
  27. Sason, I.; Verdú, S. f-Divergence Inequalities. IEEE Trans. Inf. Theory 2016, 62, 5973–6006. [Google Scholar] [CrossRef]
  28. Sason, I. On the Rényi Divergence, Joint Range of Relative Entropy, and a Channel Coding Theorem. IEEE Trans. Inf. Theory 2016, 62, 23–34. [Google Scholar] [CrossRef]
  29. Sason, I.; Verdú, S. Improved Bounds on Lossless Source Coding and Guessing Moments via Rényi Measures. IEEE Trans. Inf. Theory 2018, 64, 4323–4326. [Google Scholar] [CrossRef] [Green Version]
  30. Sason, I. On f-divergences: Integral representations, local behavior, and inequalities. Entropy 2018, 20, 383. [Google Scholar] [CrossRef] [Green Version]
  31. Melbourne, J.; Madiman, M.; Salapaka, M.V. Relationships between certain f-divergences. In Proceedings of the Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 24–27 September 2019; pp. 1068–1073. [Google Scholar]
  32. Nishiyama, T.; Sason, I. On Relations Between the Relative Entropy and χ2-Divergence, Generalizations and Applications. Entropy 2020, 22, 563. [Google Scholar] [CrossRef]
  33. Jebara, T.; Kondor, R.; Howard, A. Probability Product Kernels. J. Mach. Learn. Res. 2004, 5, 818–844. [Google Scholar]
  34. Reeves, G.; Xu, J.; Zadik, I. The All-or-Nothing Phenomenon in Sparse Linear Regression. In Proceedings of the Conference On Learning Theory (COLT), Phoenix, AZ, USA, 25–28 June 2019. [Google Scholar]
  35. Reeves, G.; Xu, J.; Zadik, I. All-or-nothing phenomena from single-letter to high dimensions. In Proceedings of the IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), Guadeloupe, France, 15–18 December 2019. [Google Scholar]
  36. Grenié, L.; Molteni, G. Inequalities for the beta function. Math. Inequalities Appl. 2015, 18, 1427–1442. [Google Scholar] [CrossRef] [Green Version]
  37. Milgrom, P.; Segal, I. Envelope Theorems for Arbitrary Choice Sets. Econometrica 2002, 70, 583–601. [Google Scholar] [CrossRef] [Green Version]
Figure 1. Comparison of upper bounds on Rényi entropy in nats for the log-normal distribution as a function of the order r for various σ 2 .
Figure 2. Comparison of upper bounds on Rényi entropy in nats for the multivariate Gaussian distribution N ( 0 , I n ) as a function of the dimension n with r = 0.1 . The solid black line is the gap of the optimal two-moment inequality for the log-normal distribution.
Figure 3. Graphs of κ ( t ) and t κ ( t ) as a function of t.
Figure 4. Bounds on the mutual information I ( U ; Y ) in nats when U ( 1 ϵ ) δ 1 + ϵ δ a ( ϵ ) , with a ( ϵ ) = 1 + 1 / ϵ , and X and Y are generated according to Equations (32) and (35). The bound from Proposition 10 is evaluated with p = 0 and q = 2 .