1. Introduction
The interplay between inequalities and information theory has a rich history, with notable examples including the relationship between the Brunn–Minkowski inequality and the entropy power inequality as well as the matrix determinant inequalities obtained from differential entropy [
1]. In this paper, the focus is on a “two-moment” inequality that provides an upper bound on the integral of the 
rth power of a function. Specifically, if 
f is a nonnegative function defined on 
 and 
 are real numbers satisfying 
 and 
, then
      
      where the best possible constant 
 is given exactly; see Propositions 2 and 3 ahead. The one-dimensional version of this inequality is a special case of the classical Carlson–Levin inequality [
2,
3,
4], and the multidimensional version is a special case of a result presented by Barza et al. [
5]. The particular formulation of the inequality used in this paper was derived independently in [
6], where the proof follows from a direct application of Hölder’s inequality and Jensen’s inequality.
In the context of information theory and statistics, a useful property of the two-moment inequality is that it provides a bound on a nonlinear functional, namely the 
r-quasi-norm 
, in terms of integrals that are linear in 
f. Consequently, this inequality is well suited to settings where 
f is a mixture of simple functions whose moments can be evaluated. We note that this reliance on moments to bound a nonlinear functional is closely related to bounds obtained from variational characterizations such as the Donsker–Varadhan representation of the Kullback–Leibler divergence [
7] and its generalizations to Rényi divergence [
8,
9].
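As a simple illustration of this point, note that the moments are linear in f: if f is a finite mixture of components g_i with known moments, then
\[
f = \sum_{i=1}^{k} w_i\, g_i, \qquad w_i \ge 0,\ \sum_{i=1}^{k} w_i = 1
\quad\Longrightarrow\quad
\int \|x\|^{s} f(x)\, dx = \sum_{i=1}^{k} w_i \int \|x\|^{s} g_i(x)\, dx,
\]
so the moments of the mixture are weighted averages of the component moments, whereas the quasi-norm of f does not decompose in this way.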
The first application considered in this paper concerns the relationship between the entropy of a probability measure and its moments. This relationship is fundamental to the principle of maximum entropy, which originated in statistical physics and has since been applied to statistical inference problems [
10]. It also plays a prominent role in information theory and estimation theory, where extensive use is made of the fact that the Gaussian distribution maximizes differential entropy under second-moment constraints ([
11], Theorem 8.6.5). Moment–entropy inequalities for Rényi entropy were studied in a series of works by Lutwak et al. [
12,
13,
14], as well as related works by Costa et al. [
15,
16] and Johnson and Vignat [
17], in which it is shown that, under a single moment constraint, Rényi entropy is maximized by a family of generalized Gaussian distributions. The connection between these moment–entropy inequalities and the Carlson–Levin inequality was noted recently by Nguyen [
18].
In this direction, one of the contributions of this paper is a new family of moment–entropy inequalities. This family of inequalities follows from applying Inequality (
1) in the setting where 
f is a probability density function, and thus there is a one-to-one correspondence between the integral of the 
rth power and the Rényi entropy of order 
r. In the special case where one of the moments is the zeroth moment, this approach recovers the moment–entropy inequalities given in previous work. More generally, the additional flexibility provided by considering two different moments can lead to stronger results. For example, in Proposition 6, it is shown that if 
f is the standard Gaussian density function defined on 
, then the difference between the Rényi entropy and the upper bound given by the two-moment inequality (equivalently, the ratio between the left- and right-hand sides of (
1)) is bounded uniformly with respect to 
n under the following specification of the moments:
	  Conversely, if one of the moments is restricted to be the zeroth moment, as is the case in the usual moment–entropy inequalities, then the difference between the Rényi entropy and the upper bound diverges with 
n.
The second application considered in this paper is the problem of bounding mutual information. In conjunction with Fano’s inequality and its extensions, bounds on mutual information play a prominent role in establishing minimax rates of statistical estimation [
19] as well as the information-theoretic limits of detection in high-dimensional settings [
20]. In many cases, one of the technical challenges is to provide conditions under which the dependence between the observations and an underlying signal or model parameters converges to zero in the limit of high dimension.
This paper introduces a new method for bounding mutual information, which can be described as follows. Let 
 be a probability measure on 
 such that 
 and 
 have densities 
 and 
 with respect to the Lebesgue measure on 
. We begin by showing that the mutual information between 
X and 
Y satisfies the upper bound
      
      where 
 is the variance of 
; see Proposition 8 ahead. In view of (
3), an application of the two-moment Inequality (
1) with 
 leads to an upper bound in terms of the moments of the variance of the density:
      where this expression is evaluated at 
 with 
. A useful property of this bound is that the integrated variance is quadratic in 
, and thus Expression (
4) can be evaluated by swapping the integration over 
y with the expectation over two independent copies of 
X. For example, when 
 is a Gaussian scale mixture, this approach provides closed-form upper bounds in terms of the moments of the Gaussian density. An early version of this technique was used to prove Gaussian approximations for random projections [
21] arising in the analysis of a random linear estimation problem appearing in wireless communications and compressed sensing [
22,
23].
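To make the "two independent copies" device concrete, the following sketch (an illustration with our own toy example and notation, not the bound in Proposition 8) considers the scalar channel Y = X + N(0, 1) with a two-point prior on X. The integrated variance of the conditional density, \int \mathrm{Var}_X(f_{Y|X}(y \mid X))\, dy, is computed once by direct integration over y and once by swapping the integral with the expectation over two independent copies of X, using the fact that \int \varphi(y - x)\,\varphi(y - x')\, dy is the N(0, 2) density evaluated at x - x'.

import numpy as np
from scipy import integrate

def phi(t):
    # standard normal density
    return np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

# Illustrative two-point prior on X (our own choice).
atoms = np.array([-1.0, 2.0])
weights = np.array([0.3, 0.7])

def marginal(y):
    # f_Y(y) = E_X[ phi(y - X) ]
    return np.sum(weights * phi(y - atoms))

def conditional_second_moment(y):
    # E_X[ f_{Y|X}(y | X)^2 ]
    return np.sum(weights * phi(y - atoms)**2)

# Direct evaluation of  int Var_X( f_{Y|X}(y|X) ) dy.
direct, _ = integrate.quad(
    lambda y: conditional_second_moment(y) - marginal(y)**2, -np.inf, np.inf
)

# "Two independent copies" evaluation: swap the y-integral with the
# expectation over (X, X').  Here k(d) = int phi(y - x) phi(y - x') dy
# depends only on d = x - x' and equals the N(0, 2) density at d.
def k(d):
    return np.exp(-d**2 / 4) / np.sqrt(4 * np.pi)

pair_weights = np.outer(weights, weights)
pair_diffs = atoms[:, None] - atoms[None, :]
two_copies = k(0.0) - np.sum(pair_weights * k(pair_diffs))

print(direct, two_copies)   # the two evaluations agree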
  2. Moment Inequalities
Let 
 be the space of Lebesgue measurable functions from 
S to 
 whose 
pth power is absolutely integrable, and for 
, define
      
	  Recall that 
 is a norm for 
 but only a quasi-norm for 
 because it does not satisfy the triangle inequality. The 
sth moment of 
f is defined as
      
      where 
 denotes the standard Euclidean norm on vectors.
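For concreteness, the following minimal sketch evaluates these quantities numerically, assuming the standard forms \|f\|_p = (\int_S |f(x)|^p\, dx)^{1/p} and \int_S \|x\|^s f(x)\, dx for the sth moment, with an exponential test function of our own choosing on S = (0, \infty).

import numpy as np
from scipy import integrate

f = lambda x: np.exp(-x)    # test function on S = (0, infinity)

def quasi_norm(f, p):
    # ( int_S |f(x)|^p dx )^(1/p); a norm for p >= 1, a quasi-norm for 0 < p < 1
    val, _ = integrate.quad(lambda x: abs(f(x))**p, 0, np.inf)
    return val**(1.0 / p)

def moment(f, s):
    # int_S |x|^s f(x) dx
    val, _ = integrate.quad(lambda x: abs(x)**s * f(x), 0, np.inf)
    return val

print(quasi_norm(f, 0.5))   # ( int_0^inf e^{-x/2} dx )^2 = 4
print(moment(f, 2.0))       # int_0^inf x^2 e^{-x} dx = 2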
The two-moment Inequality (
1) can be derived straightforwardly using the following argument. For 
, the mapping 
 is concave on the subset of nonnegative functions and admits the variational representation
      
      where 
 is the Hölder conjugate of 
r. Consequently, each 
 leads to an upper bound on 
. For example, if 
f has bounded support 
S, choosing 
g to be the indicator function of 
S leads to the basic inequality 
. The upper bound on 
 given in Inequality (
1) can be obtained by restricting the minimum in Expression (
5) to the parametric class of functions of the form 
 with 
 and then optimizing over the parameters 
. Here, the constraints on 
 are necessary and sufficient to ensure that 
.
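The Hölder step behind this argument can be checked numerically. Writing f^r = (fg)^r g^{-r} and applying Hölder's inequality with exponents 1/r and 1/(1-r) gives, for 0 < r < 1 and any positive trial function g, the bound \int f^r\, dx \le (\int f g\, dx)^r (\int g^{-r/(1-r)}\, dx)^{1-r}; this is the form assumed in the sketch below (our own rendering of the argument, not a verbatim restatement of (5)). The choice of g proportional to f^{r-1} attains equality, which is consistent with the variational representation.

import numpy as np
from scipy import integrate

r = 0.5
f = lambda x: np.exp(-x)    # test function on (0, infinity)

def int_f_pow_r():
    val, _ = integrate.quad(lambda x: f(x)**r, 0, np.inf)
    return val

def holder_bound(g):
    # ( int f g dx )^r * ( int g^{-r/(1-r)} dx )^(1-r)
    a, _ = integrate.quad(lambda x: f(x) * g(x), 0, np.inf)
    b, _ = integrate.quad(lambda x: g(x)**(-r / (1 - r)), 0, np.inf)
    return a**r * b**(1 - r)

print(int_f_pow_r())                          # int_0^inf e^{-x/2} dx = 2
print(holder_bound(lambda x: (1 + x)**2))     # a valid but loose upper bound
print(holder_bound(lambda x: np.exp(x / 4)))  # another valid upper bound
print(holder_bound(lambda x: f(x)**(r - 1)))  # g = f^{r-1} attains equality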
In the following sections, we provide a more detailed derivation, starting with the problem of maximizing 
 under multiple moment constraints and then specializing to the case of two moments. For a detailed account of the history of Carlson-type inequalities as well as some further extensions, see [
4].
  2.1. Multiple Moments
Consider the following optimization problem: 
		For 
, this is a convex optimization problem because 
 is concave and the moment constraints are linear. By standard theory in convex optimization (e.g., [
24]), it can be shown that if the problem is feasible and the maximum is finite, then the maximizer has the form
        
		The parameters 
 are nonnegative and the 
ith moment constraint holds with equality for all 
i such that 
 is strictly positive—that is, 
. Consequently, the maximum can be expressed in terms of a linear combination of the moments:
For the purposes of this paper, it is useful to consider a relative inequality in terms of the moments of the function itself. Given a number 
 and vectors 
 and 
, the function 
 is defined according to
        
        if the integral exists. Otherwise, 
 is defined to be positive infinity. It can be verified that 
 is finite provided that there exists 
 such that 
 and 
 are strictly positive and 
.
The following result can be viewed as a consequence of the constrained optimization problem described above. We provide a different and very simple proof that depends only on Hölder’s inequality.
Proposition 1. Let f be a nonnegative Lebesgue measurable function defined on the positive reals . For any number  and vectors  and , we have  Proof.  Let 
. Then, we have
          
          where the second step is Hölder’s inequality with conjugate exponents 
 and 
. □
   2.2. Two Moments
For 
, the beta function 
 and gamma function 
 are given by
        
        and satisfy the relation 
, 
. To lighten the notation, we define the normalized beta function
        
		Properties of these functions are provided in 
Appendix A.
The next result follows from Proposition 1 for the case of two moments.
Proposition 2. Let f be a nonnegative Lebesgue measurable function defined on . For any numbers  with  and ,where  andwhere  is defined in Equation (6).  Proof.  Letting 
 and 
 with 
, we have
          
		  Making the change of variable 
 leads to
          
          where 
 and 
 and the second step follows from recognizing the integral representation of the beta function given in Equation (
A3). Therefore, by Proposition 1, the inequality
          
          holds for all 
. Evaluating this inequality with
          
          leads to the stated result. □
 The special case 
 admits the simplified expression
        
        where we have used Euler’s reflection formula for the beta function ([
25], Theorem 1.2.1).
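Euler's reflection formula invoked here states that B(t, 1 - t) = \Gamma(t)\Gamma(1 - t) = \pi / \sin(\pi t) for 0 < t < 1. A quick numerical confirmation:

import numpy as np
from scipy.special import beta, gamma

for t in (0.1, 0.25, 0.5, 0.9):
    lhs = beta(t, 1 - t)                 # B(t, 1 - t) = Gamma(t) Gamma(1 - t)
    rhs = np.pi / np.sin(np.pi * t)      # Euler's reflection formula
    print(t, lhs, rhs, gamma(t) * gamma(1 - t))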
Next, we consider an extension of Proposition 2 for functions defined on 
. Given any measurable subset 
S of 
, we define
        
        where 
 is the 
n-dimensional Euclidean ball of radius one and
        
		The function 
 is proportional to the surface measure of the projection of 
S on the Euclidean sphere and satisfies
        
        for all 
. Note that 
 and 
.
Proposition 3. Let f be a nonnegative Lebesgue measurable function defined on a subset S of . For any numbers  with  and ,where  and  is given by Equation (7).  Proof.  Let 
f be extended to 
 using the rule 
 for all 
x outside of 
S and let 
 be defined according to
          
          where 
 is the Euclidean sphere of radius one and 
 is the surface measure of the sphere. In the following, we will show that
          
		  The stated inequality then follows from applying Proposition 2 to the function 
g.
To prove Inequality (
11), we begin with a transformation into polar coordinates:
          
		  Letting 
 denote the indicator function of the set 
, the integral over the sphere can be bounded as follows:
          
          where (a) follows from Hölder’s inequality with conjugate exponents 
 and 
, and (b) follows from the definition of 
g and the fact that
          
		  Plugging Inequality (
14) back into Equation (
13) and then making the change of variable 
 yields
          
The proof of Equation (12) follows along similar lines. We have
          
          where (a) follows from a transformation into polar coordinates and (b) follows from the change of variable 
.
Having established Inequality (
11) and Equation (12), an application of Proposition 2 completes the proof. □
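The polar-coordinate step in this proof uses the standard identity \int_{\mathbb{R}^n} h(\|x\|)\, dx = \omega_n \int_0^\infty h(\rho)\, \rho^{n-1}\, d\rho, where \omega_n = 2\pi^{n/2}/\Gamma(n/2) is the surface measure of the unit sphere. The following sketch (with a Gaussian radial profile of our own choosing) confirms the identity numerically for n = 3:

import numpy as np
from scipy import integrate
from scipy.special import gamma

n = 3
h = lambda rho: np.exp(-rho**2 / 2)     # radial test profile

# For this profile, int_{R^n} h(||x||) dx has the closed form (2*pi)^(n/2).
lhs = (2 * np.pi)**(n / 2)

# omega_n * int_0^inf h(rho) rho^(n-1) drho
omega_n = 2 * np.pi**(n / 2) / gamma(n / 2)
radial, _ = integrate.quad(lambda rho: h(rho) * rho**(n - 1), 0, np.inf)
rhs = omega_n * radial

print(lhs, rhs)     # both approximately 15.75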
   3. Rényi Entropy Bounds
Let 
X be a random vector that has a density 
 with respect to the Lebesgue measure on 
. The differential Rényi entropy of order 
 is defined according to [
11]:
          h_r(X) = \frac{1}{1-r}\,\log \int_{\mathbb{R}^n} f_X^r(x)\, dx.
	  Throughout this paper, it is assumed that the logarithm is defined with respect to the natural base and entropy is measured in nats. The Rényi entropy is continuous and nonincreasing in 
r. If the support set 
 has finite measure, then the limit as 
r converges to zero is given by 
. If the support does not have finite measure, then 
 increases to infinity as 
r decreases to zero. The case r = 1 is given by the Shannon differential entropy:
          h(X) = -\int_{\mathbb{R}^n} f_X(x)\, \log f_X(x)\, dx.
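As a numerical illustration of this definition (a sketch using a scalar standard Gaussian, not tied to any particular result in this paper), the Rényi entropy can be computed by direct integration and compared with its Shannon limit \tfrac{1}{2}\log(2\pi e) as r increases to one:

import numpy as np
from scipy import integrate

def gaussian_pdf(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def renyi_entropy(pdf, r):
    # h_r(X) = (1 / (1 - r)) * log( int pdf(x)^r dx )
    val, _ = integrate.quad(lambda x: pdf(x)**r, -np.inf, np.inf)
    return np.log(val) / (1 - r)

for r in (0.5, 0.9, 0.99, 0.999):
    print(r, renyi_entropy(gaussian_pdf, r))    # nonincreasing in r

print(0.5 * np.log(2 * np.pi * np.e))           # Shannon entropy of N(0, 1)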
Given a random variable 
X that is not identically zero and numbers 
 with 
 and 
, we define the function
      
      where 
.
The next result, which follows directly from Proposition 3, provides an upper bound on the Rényi entropy.
Proposition 4. Let X be a random vector with a density on . For any numbers  with  and , the Rényi entropy satisfieswhere  is defined in Equation (9) and  is defined in Equation (7).  Proof.  This result follows immediately from Proposition 3 and the definition of Rényi entropy. □
 The relationship between Proposition 4 and previous results depends on whether the moment p is equal to zero:
One-moment inequalities: If 
, then there exists a distribution such that Inequality (
15) holds with equality. This is because the zero-moment constraint ensures that the function that maximizes the Rényi entropy integrates to one. In this case, Proposition 4 is equivalent to previous results that focused on distributions that maximize Rényi entropy subject to a single moment constraint [
12,
13,
15]. With some abuse of terminology, we refer to these bounds as one-moment inequalities. (A more accurate name would be two-moment inequalities under the constraint that one of the moments is the zeroth moment.)
 Two-moment inequalities: If 
, then the right-hand side of Inequality (
15) corresponds to the Rényi entropy of a nonnegative function that might not integrate to one. Nevertheless, the expression provides an upper bound on the Rényi entropy for any density with the same moments. We refer to the bounds obtained using a general pair 
 as two-moment inequalities.
 
The contribution of two-moment inequalities is that they lead to tighter bounds. To quantify the tightness, we define 
 to be the gap between the right-hand side and left-hand side of Inequality (
15) corresponding to the pair 
—that is,
      
	  The gaps corresponding to the optimal two-moment and one-moment inequalities are defined according to
      
  3.1. Some Consequences of These Bounds
By Lyapunov’s inequality, the mapping 
 is nondecreasing on 
, and thus
        
		In other words, the case 
 provides an upper bound on 
 for nonnegative 
p. Alternatively, we also have the lower bound
        
        which follows from the convexity of 
.
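As a side check on the Lyapunov step above (assuming the mapping in question is s \mapsto (\mathbb{E}[\|X\|^s])^{1/s}, which is our reading of the statement; Lyapunov's inequality asserts that this quantity is nondecreasing in s > 0), the claimed monotonicity is easy to verify by simulation:

import numpy as np

rng = np.random.default_rng(1)
x = np.abs(rng.standard_normal(10**6))      # samples of ||X|| for a scalar Gaussian X

for s in (0.5, 1.0, 2.0, 3.0, 4.0):
    print(s, np.mean(x**s)**(1.0 / s))      # nondecreasing in s (Lyapunov's inequality)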
A useful property of 
 is that it is additive with respect to the product of independent random variables. Specifically, if 
X and 
Y are independent, then
        
		One consequence is that multiplication by a bounded random variable cannot increase the Rényi entropy by an amount that exceeds the gap of the two-moment inequality with nonnegative moments.
Proposition 5. Let Y be a random vector on  with finite Rényi entropy of order , and let X be an independent random variable that satisfies . Then,for all .  Proof.  Let 
 and let 
 and 
 denote the support sets of 
Z and 
Y, respectively. The assumption that 
X is nonnegative means that 
. We have
          
          where (a) follows from Proposition 4, (b) follows from Equation (
18) and the definition of 
, and (c) follows from Inequality (
16) and the assumption 
. Finally, recalling that 
 completes the proof. □
   3.2. Example with Log-Normal Distribution
If 
, then the random variable 
 has a log-normal distribution with parameters 
. The Rényi entropy is given by
        
        and the logarithm of the 
sth moment is given by
        
		With a bit of work, it can be shown that the gap of the optimal two-moment inequality does not depend on the parameters 
 and is given by
        
		The details of this derivation are given in 
Appendix B.1. Meanwhile, the gap of the optimal one-moment inequality is given by
        
The functions 
 and 
 are illustrated in 
Figure 1 as a function of 
r for various 
. The function 
 is bounded uniformly with respect to 
r and converges to zero as 
r increases to one. The tightness of the two-moment inequality in this regime follows from the fact that the log-normal distribution maximizes Shannon entropy subject to a constraint on 
. By contrast, the function 
 varies with the parameter 
. For any fixed 
, it can be shown that 
 increases to infinity if 
 converges to zero or infinity.
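The closed-form ingredients of this example are easy to verify numerically. The sketch below checks the sth moment of a log-normal variable against the standard formula \mathbb{E}[X^s] = \exp(s\mu + s^2\sigma^2/2) and evaluates the Rényi entropy of order r by direct integration of the definition (the parameter values are our own).

import numpy as np
from scipy import integrate

mu, sigma, r, s = 0.3, 0.8, 0.6, 1.5

def lognormal_pdf(x):
    return np.exp(-(np.log(x) - mu)**2 / (2 * sigma**2)) / (x * sigma * np.sqrt(2 * np.pi))

# s-th moment: closed form versus numerical integration.
moment_closed = np.exp(s * mu + s**2 * sigma**2 / 2)
moment_numeric, _ = integrate.quad(lambda x: x**s * lognormal_pdf(x), 0, np.inf)
print(moment_closed, moment_numeric)

# Renyi entropy of order r by direct numerical integration of the definition.
val, _ = integrate.quad(lambda x: lognormal_pdf(x)**r, 0, np.inf)
print(np.log(val) / (1 - r))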
  3.3. Example with Multivariate Gaussian Distribution
Next, we consider the case where 
 is an 
n-dimensional Gaussian vector with mean zero and identity covariance. The Rényi entropy is given by
        
        and the 
sth moment of the magnitude 
 is given by
        
The next result shows that as the dimension 
n increases, the gap of the optimal two-moment inequality converges to the gap for the log-normal distribution. Moreover, for each 
, the following choice of moments is optimal in the large-
n limit:
		The proof is given in 
Appendix B.3.
Proposition 6. If , then, for each ,where X has a log-normal distribution and  are given by (21).  Figure 2 provides a comparison of 
, 
, and 
 as a function of 
n for 
. Here, we see that both 
 and 
 converge rapidly to the asymptotic limit given by the gap of the log-normal distribution. By contrast, the gap of the optimal one-moment inequality 
 increases without bound.
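The moment formula used in this example can be sanity-checked directly. For an n-dimensional standard Gaussian X, the sth moment of \|X\| has the well-known closed form 2^{s/2}\,\Gamma((n+s)/2)/\Gamma(n/2), which the following sketch compares against a Monte Carlo estimate (the dimension and moment order are chosen arbitrarily):

import numpy as np
from scipy.special import gamma

rng = np.random.default_rng(2)
n, s = 8, 1.7

closed_form = 2**(s / 2) * gamma((n + s) / 2) / gamma(n / 2)

x = rng.standard_normal((200_000, n))
monte_carlo = np.mean(np.linalg.norm(x, axis=1)**s)

print(closed_form, monte_carlo)     # approximately equal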
   3.4. Inequalities for Differential Entropy
Proposition 4 can also be used to recover some known inequalities for differential entropy by considering the limiting behavior as 
r converges to one. For example, it is well known that the differential entropy of an 
n-dimensional random vector 
X with finite second moment satisfies
        
        with equality if and only if the entries of 
X are i.i.d. zero-mean Gaussian. A generalization of this result in terms of an arbitrary positive moment is given by
        
        for all 
. Note that Inequality (
22) corresponds to the case 
.
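For reference, and assuming that Inequality (22) takes the familiar form h(X) \le \tfrac{n}{2}\log\big(\tfrac{2\pi e}{n}\,\mathbb{E}[\|X\|^2]\big), the stated equality condition can be verified directly: if the entries of X are i.i.d. N(0, \sigma^2), then
\[
\mathbb{E}\big[\|X\|^2\big] = n\sigma^2
\quad\Longrightarrow\quad
\frac{n}{2}\log\!\Big(\frac{2\pi e}{n}\,\mathbb{E}\big[\|X\|^2\big]\Big)
= \frac{n}{2}\log\big(2\pi e\,\sigma^2\big) = h(X),
\]
so the bound is attained with equality in this case.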
Inequality (
23) can be proved as an immediate consequence of Proposition 4 and the fact that 
 is nonincreasing in 
r. Using properties of the beta function given in 
Appendix A, it is straightforward to verify that
        
		Combining this result with Proposition 4 and Inequality (
16) leads to
        
		Using Inequality (
10) and making the substitution 
 leads to Inequality (
23).
Another example follows from the fact that the log-normal distribution maximizes the differential entropy of a positive random variable 
X subject to constraints on the mean and variance of 
, and hence
        
        with equality if and only if 
X is log-normal. In 
Appendix B.4, it is shown how this inequality can be proved using our two-moment inequalities by studying the behavior as both 
p and 
q converge to zero while 
r increases to one.
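For comparison, one standard way to see this inequality, independent of the two-moment approach, is to write Y = \log X and combine the change-of-variables identity h(X) = h(Y) + \mathbb{E}[Y] with the Gaussian maximum-entropy bound applied to Y:
\[
h(X) = h(\log X) + \mathbb{E}[\log X]
\le \mathbb{E}[\log X] + \tfrac{1}{2}\log\!\big(2\pi e\,\mathrm{Var}(\log X)\big),
\]
with equality if and only if \log X is Gaussian, that is, if and only if X is log-normal.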