Article

Reversing Jensen’s Inequality for Information-Theoretic Analyses

The Viterbi Faculty of Electrical and Computer Engineering, Technion—Israel Institute of Technology, Haifa 3200003, Israel
Information 2022, 13(1), 39; https://doi.org/10.3390/info13010039
Submission received: 15 December 2021 / Revised: 12 January 2022 / Accepted: 13 January 2022 / Published: 13 January 2022
(This article belongs to the Section Information Theory and Methodology)

Abstract
In this work, we propose both an improvement and extensions of a reverse Jensen inequality due to Wunder et al. (2021). The new proposed inequalities are fairly tight and reasonably easy to use in a wide variety of situations, as demonstrated in several application examples that are relevant to information theory. Moreover, the main ideas behind the derivations turn out to be applicable to generate bounds to expectations of multivariate convex/concave functions, as well as functions that are not necessarily convex or concave.

1. Introduction

It is very well known that Jensen's inequality is one of the most elementary mathematical tools, which is extremely useful in a variety of fields, including information theory. At the general level, it covers many other well-known inequalities, which are useful in their own right, as special cases. Examples in general applied mathematics include the Cauchy–Schwarz inequality (which in turn serves the uncertainty principle and the Cramér–Rao inequality), the Hölder inequality, the Lyapunov inequality, and the inequalities of arithmetic, geometric, and harmonic means, just to name a few. More specifically, in information theory, it is the basis of the information inequality (i.e., the non-negativity of the Kullback–Leibler divergence), the data processing inequality (which, in turn, leads to the Fano inequality), and the property that conditioning reduces entropy. Moreover, it has a pivotal role in the generation of single-letter formulas in Shannon theory and in the theory of maximum entropy under moment constraints (see, for example, Chapter 12 of [1]).
As often as not, however, applied mathematicians, and in particular, information theorists (the author included), encounter the somewhat frustrating situation that Jensen's inequality works in the opposite direction than the one they would hope for along their way to obtaining their desired results. It is conceivable to speculate that it is this fact that has triggered a considerable research effort in developing a variety of versions of the so-called reverse Jensen inequality (RJI) (see, e.g., Refs. [2,3,4,5,6,7,8,9,10,11] for a non-exhaustive sample of articles on this topic). In most of these (and other) works, the inequalities derived are demonstrated in a variety of applications, for example, useful relationships between arithmetic and geometric means, reverse bounds on the entropy, the Kullback–Leibler divergence (and, more generally, Csiszár's f-divergence), reverse versions of the Hölder inequality, etc. In many of the abovementioned papers, the main results are given in the form of an upper bound on the difference, $E\{f(X)\}-f(E\{X\})$, where f is a convex function, $E\{\cdot\}$ is the expectation operator, and X is a random variable. These upper bounds, however, depend mostly on global properties of the function f, such as its range and domain, but not quite on the underlying probability density function (PDF) of X (or its probability mass function, in the discrete case). For example, a desirable property of an RJI would be tightness when the PDF of X is well concentrated in the vicinity of its mean, just like the same well-known property of the ordinary Jensen inequality for concave f,
$$E\{f(X)\}\ \le\ f(E\{X\}).$$
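As a quick numerical illustration of this last point (an illustration of ours, not taken from the article; the Gamma family, its parameters, and the sample size are arbitrary choices), the Jensen gap $f(E\{X\})-E\{f(X)\}$ indeed shrinks as the PDF of X concentrates around its mean:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.log1p  # concave f(x) = ln(1 + x)
mu = 4.0

for shape in (1.0, 10.0, 100.0):
    # Gamma samples with mean mu; a larger shape parameter means
    # stronger concentration around the mean (variance mu^2 / shape).
    x = rng.gamma(shape, mu / shape, size=200_000)
    gap = f(mu) - f(x).mean()  # Jensen gap, >= 0 for concave f
    print(f"shape={shape:6.1f}  Jensen gap={gap:.4f}")
```

The printed gap is large for the heavy-spread case (shape 1, i.e., an exponential PDF) and nearly vanishes for the concentrated case (shape 100).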
In this work, we build on [9] and provide several new variants of the RJI, which possess the abovementioned desired property of tightness under measure concentration. Our starting point is the same as in (the proof of) Lemma 1 of [9], but the continuation of our derivation is substantially different. As a result of this difference, the proposed approach yields considerably tighter bounds, which are reasonably convenient to analyze and calculate in many cases of interest, as we demonstrate throughout this work. We also extend the scope to functions of more than one variable, which are convex (or concave) in each variable alone, but not necessarily so jointly, in all variables. Finally, using similar ideas, we also derive upper and lower bounds on expectations of functions that are not necessarily convex or concave altogether.
To summarize, the main contributions of this work relative to [9] are as follows.
  • Improvement of the tightness of the lower bound on the expectation of a concave function.
  • Relaxing some of the assumptions on the concave function.
  • Proposing a more convenient (and perhaps more natural) passage from lower bounds to expectations of concave functions to upper bounds on expectations of convex functions.
  • Extension to bivariate (and multivariate) functions that are concave (or convex) in each variable.
  • Providing examples of usefulness in information theory (other than the mutual information estimation of [9]).
The outline of this article is as follows. In Section 2, we present the basic inequality, rooted in Lemma 1 of [9], and discuss its weaknesses. In Section 3, we present two alternative approaches to improve on the basic bound of [9] and discuss the relative strengths and weaknesses of each one. In Section 4, we provide several examples for the usefulness of our proposed bounds. In Section 5, we mostly discuss variations and modifications of our main results, and finally, in Section 6, we extend the scope to bivariate (and multivariate) functions.

2. The Basic Reverse Inequality

Our starting point, which is very similar to Lemma 1 of [9], is the following Lemma.
Lemma 1.
Let $f:\mathbb{R}^+\to\mathbb{R}$ be a concave function with $f(x)\ge f(0)$ for every $x\ge 0$. Let X be a non-negative random variable with a finite mean, $E\{X\}=\mu$. Then,
$$E\{f(X)\}\ \ge\ \sup_{a>0}\left\{\frac{\mu}{a}\cdot f(a)+\left(1-\frac{\mu}{a}\right)\cdot f(0)-\frac{f(a)-f(0)}{a}\cdot E\left\{X\cdot I[X>a]\right\}\right\},$$
where I [ X > a ] denotes the indicator function of event { X > a } .
The proof is almost the same as in the first few steps in the proof of ([9], Lemma 1), but nevertheless, we present it here for completeness.
Proof of Lemma 1.
For every $x\in[0,a]$,
$$f(x)=f\left(\left(1-\frac{x}{a}\right)\cdot 0+\frac{x}{a}\cdot a\right)\ \ge\ \left(1-\frac{x}{a}\right)\cdot f(0)+\frac{x}{a}\cdot f(a)\ =\ f(0)+x\cdot\frac{f(a)-f(0)}{a},$$
and so,
$$\frac{f(x)-f(0)}{x}\ \ge\ \frac{f(a)-f(0)}{a}\qquad\mbox{whenever }0<x\le a.$$
Thus,
$$\begin{aligned}
E\{f(X)\}&=E\left\{\frac{f(X)-f(0)}{X}\cdot X\right\}+f(0)\\
&\ge E\left\{\frac{f(X)-f(0)}{X}\cdot X\cdot I[X\le a]\right\}+f(0)\\
&\ge E\left\{\frac{f(a)-f(0)}{a}\cdot X\cdot I[X\le a]\right\}+f(0)\\
&=\frac{f(a)-f(0)}{a}\cdot E\{X\cdot I[X\le a]\}+f(0)\\
&=\frac{f(a)-f(0)}{a}\cdot E\{X\cdot(1-I[X>a])\}+f(0)\\
&=\frac{f(a)-f(0)}{a}\cdot\mu+f(0)-\frac{f(a)-f(0)}{a}\cdot E\{X\cdot I[X>a]\}\\
&=\frac{\mu}{a}\cdot f(a)+\left(1-\frac{\mu}{a}\right)\cdot f(0)-\frac{f(a)-f(0)}{a}\cdot E\{X\cdot I[X>a]\},
\end{aligned}$$
where in the first inequality, we relied on the assumption that f ( x ) f ( 0 ) for every x 0 and in the second inequality we have used (6). Since this chain of inequalities holds true for every a > 0 , we may take the supremum of the right-hand side (r.h.s.) over all a > 0 . This completes the proof of Lemma 1. □
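Before proceeding, here is a small numerical sanity check of Lemma 1 (our own illustration, not part of the original text): for $f(x)=\sqrt{x}$ and X exponential with mean $\mu$, the quantity $q(a)=E\{X\cdot I[X>a]\}=(a+\mu)e^{-a/\mu}$ is available in closed form, so the r.h.s. of the lemma can be maximized over a by a simple grid search.

```python
import numpy as np

# Concave f with f(0) = 0, X ~ Exp(mean mu): the r.h.s. of Lemma 1
# reduces to (f(a)/a) * (mu - q(a)) with q(a) = (a + mu) * exp(-a/mu).
f = np.sqrt
mu = 1.0

a = np.linspace(0.01, 20.0, 4000)
q = (a + mu) * np.exp(-a / mu)
lower = np.max((f(a) / a) * (mu - q))     # sup over a of the Lemma 1 r.h.s.

exact = np.sqrt(np.pi) / 2 * np.sqrt(mu)  # E{sqrt(X)} = Gamma(3/2) * sqrt(mu)
jensen = f(mu)                            # Jensen upper bound f(E{X})
print(lower, exact, jensen)               # lower <= exact <= jensen
```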
Lemma 1 serves as the basis of our subsequent derivations throughout this paper. The main remaining issue now is how to assess the term
$$q(a)=E\left\{X\cdot I[X>a]\right\}.$$
In simple cases, q ( a ) can be calculated exactly by a closed form expression, for example, when the PDF of X is uniform, or triangular, or exponential, etc. In most cases of interest, however, it is difficult, if not impossible, to derive an exact, closed-form expression for q ( a ) , and then one has to resort to upper bounds in order to further lower bound the r.h.s. of Equation (2).
In [9] (Lemma 1), it was proposed to upper bound q ( a ) by applying the Hölder inequality and then the Markov inequality, to obtain
$$q(a)\ \le\ \inf_{p>1}\left[E\{X^p\}\right]^{1/p}\cdot\left(\frac{\mu}{a}\right)^{1-1/p}.$$
This is a very interesting bound, but unfortunately, it suffers from two main weaknesses. The first is that in many cases, the calculation of high order moments of X may not be a trivial task, let alone the step of taking the infimum over p. More importantly, the second weakness is that, as mentioned in the Introduction, it does not yield a tight lower bound to $E\{f(X)\}$ when the PDF of X concentrates strongly around $\mu$. To see the intuition, consider the following argument. When X fluctuates in the vicinity of $\mu$ with high probability, it is clear that the optimal choice of a, in the sense of maximizing the r.h.s. of (2), is slightly larger than $\mu$, because then $q(a)$ would be small and the main term of (2) would be close to $f(\mu)$, which, in turn, is also the Jensen upper bound to $E\{f(X)\}$ and hence it is fairly tight. Now, by the Lyapunov inequality, the factor $[E\{X^p\}]^{1/p}$ is increasing in p, and so, for every $p\ge 1$, $[E\{X^p\}]^{1/p}\ge E\{X\}=\mu$. At the same time, for every $a>\mu$, we have $(\mu/a)^{1-1/p}\ge\mu/a$. Consequently, if we take, for the sake of simplicity, $f(0)=0$, the lower bound of [9] (Lemma 1) cannot be larger than
$$\sup_{a>\mu}\frac{f(a)}{a}\cdot\left(\mu-\frac{\mu^2}{a}\right)=\sup_{a>\mu}f(a)\cdot\frac{\mu}{a}\left(1-\frac{\mu}{a}\right)=\sup_{0\le\theta<1}f\left(\frac{\mu}{\theta}\right)\theta(1-\theta).$$
In fact, this is a very generous assessment of the bound of ([9], Lemma 1), as the inequality $[E\{X^p\}]^{1/p}\ge E\{X\}=\mu$ is met with equality only for $p=1$, whereas the inequality $(\mu/a)^{1-1/p}\ge\mu/a$ is met with equality only as $p\to\infty$.
For example, if $f(x)=\sqrt{x}$, then [9] (Lemma 1) yields a lower bound which is smaller than
$$\sup_{0\le\theta<1}\sqrt{\frac{\mu}{\theta}}\cdot\theta(1-\theta)=\sqrt{\mu}\cdot\sup_{0\le\theta<1}\sqrt{\theta}(1-\theta)=\frac{2\sqrt{3}}{9}\cdot\sqrt{\mu}\approx 0.3849\sqrt{\mu},$$
even if $X=\mu$ with probability one, in which case $E\{\sqrt{X}\}=\sqrt{\mu}$.
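This ceiling is easy to check numerically (a quick illustration of ours):

```python
import numpy as np

# Ceiling of the bound of [9] (Lemma 1) for f(x) = sqrt(x), mu = 1:
# sup over theta in [0, 1) of sqrt(mu/theta) * theta * (1 - theta).
theta = np.linspace(1e-6, 1 - 1e-6, 100_000)
ceiling = np.max(np.sqrt(1.0 / theta) * theta * (1 - theta))
print(ceiling, 2 * np.sqrt(3) / 9)  # both ~ 0.3849
```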
These observations motivate the quest for improvements of the upper bound on q ( a ) , which is the subject of the next section.

3. Alternative Upper Bounds to q ( a )

For cases where q ( a ) does not lend itself to an exact closed-form expression, we propose two basic alternative approaches for upper bounding q ( a ) , which both have the property that when X concentrates around μ , q ( a ) is small even when a is only slightly larger than μ , which in turn yields a tight bound very close to f ( μ ) .
The choice between the two approaches depends on the problem at hand and the capability to obtain closed-form expressions for the moments involved, if they exist at all.
1. The Chernoff approach. The first approach is to upper bound the indicator function, $I[x>a]$, by the exponential function $e^{s(x-a)}$ ($s\ge 0$), exactly like in the Chernoff bound. This would yield
$$q(a)\ \le\ \inf_{s\ge 0}E\left\{Xe^{s(X-a)}\right\}=\inf_{s\ge 0}e^{-as}E\left\{Xe^{sX}\right\}=\inf_{s\ge 0}e^{-as}\Phi'(s)\ \triangleq\ q_{\mathrm{Chernoff}}(a),$$
where $\Phi'(s)$ is the derivative of the moment generating function (MGF), $\Phi(s)=E\{e^{sX}\}$. Thus, Equation (2) is further lower bounded as
$$E\{f(X)\}\ \ge\ \sup_{a>0}\left\{\frac{\mu}{a}\cdot f(a)+\left(1-\frac{\mu}{a}\right)\cdot f(0)-\frac{f(a)-f(0)}{a}\cdot q_{\mathrm{Chernoff}}(a)\right\}.$$
This bound is useful when X has a finite MGF, Φ ( s ) , at least in some range of s > 0 , and Φ ( s ) is differentiable in that range. Moreover, for the bound to be useful, q C h e r n o f f ( a ) must lend itself to a reasonably simple closed-form expression.
A slight variation of the Chernoff approach is to upper bound the function $x\cdot I[x>a]$ itself (and not just the indicator function factor) by an exponential function of the form $a\cdot e^{s(x-a)}$, where s is such that the derivative w.r.t. x at $x=a$ is not less than 1, so that the exponential function is at least tangential to $x\cdot I[x>a]$ for $x\ge a$. This means $as\ge 1$, or $s\ge 1/a$. Thus,
$$q(a)\ \le\ a\cdot\inf_{s\ge 1/a}\left\{e^{-as}\Phi(s)\right\}\ \triangleq\ \tilde{q}_{\mathrm{Chernoff}}(a),$$
which, of course, may replace q C h e r n o f f ( a ) in Equation (19). The usefulness of this version of the bound is essentially under the same circumstances as those of q C h e r n o f f ( a ) . It has the small advantage that there is no need to differentiate Φ ( s ) , but the range of the optimization over s is somewhat smaller.
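As an illustration (ours, with an arbitrary choice of distribution), for $X\sim\mathrm{Exp}(1)$ one can compare $q_{\mathrm{Chernoff}}(a)$ with the exact $q(a)=(a+1)e^{-a}$ and verify that it is indeed an upper bound (note that any grid point of s gives a valid upper bound, so a grid search is safe):

```python
import numpy as np

# X ~ Exp(1): Phi(s) = 1/(1 - s) and Phi'(s) = 1/(1 - s)^2 for s < 1.
a = 3.0
s = np.linspace(0.0, 0.999, 100_000)
q_chernoff = np.min(np.exp(-a * s) / (1.0 - s) ** 2)  # grid search for the inf
q_exact = (a + 1.0) * np.exp(-a)                      # closed form for Exp(1)
print(q_exact, q_chernoff)  # q_exact <= q_chernoff
```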
2. The Chebychev–Cantelli approach. According to this approach, the function $x\cdot I[x>a]$ is upper bounded by a quadratic function, in the spirit of the Chebychev–Cantelli inequality, i.e.,
$$x\cdot I[x>a]\ \le\ \frac{a(x+s)^2}{(a+s)^2},$$
where the parameter $s\ge 0$ is optimized under the constraint that the derivative at $x=a$, which is $2a/(a+s)$, is at least 1 (again, to be at least tangential to the function itself at $x\ge a$), which is equivalent to the requirement $s\le a$. In this case, denoting $\sigma^2=\mathrm{Var}\{X\}$, we obtain
$$q(a)\ \le\ \frac{a\,E\{(X+s)^2\}}{(a+s)^2}=\frac{a\left[\sigma^2+(\mu+s)^2\right]}{(a+s)^2},$$
which, when minimized over $s\in[0,a]$, yields
$$s^*=\min\left\{a,\ \frac{\sigma^2}{a-\mu}-\mu\right\},$$
and then the best bound is given by
$$q(a)\ \le\ q_{\mathrm{ChebCant}}(a)=\begin{cases}\dfrac{\sigma^2+(a+\mu)^2}{4a},&a<a_c\\[2mm]\dfrac{a\sigma^2}{\sigma^2+(a-\mu)^2},&a\ge a_c\end{cases}$$
where $a_c=\sqrt{\sigma^2+\mu^2}$.
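A quick Monte Carlo sanity check of this piecewise bound (our own illustration; the Gamma distribution and the values of a are arbitrary choices):

```python
import numpy as np

def q_cheb_cant(a, mu, sigma2):
    """Chebychev-Cantelli upper bound on q(a) = E{X * 1[X > a]}."""
    a_c = np.sqrt(sigma2 + mu ** 2)
    if a < a_c:
        return (sigma2 + (a + mu) ** 2) / (4.0 * a)
    return a * sigma2 / (sigma2 + (a - mu) ** 2)

rng = np.random.default_rng(1)
x = rng.gamma(4.0, 1.0, size=500_000)  # mean 4, variance 4
for a in (2.0, 5.0, 10.0):
    q_mc = np.mean(x * (x > a))        # Monte Carlo estimate of q(a)
    print(a, q_mc, q_cheb_cant(a, 4.0, 4.0))  # q_mc <= bound
```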
The Chernoff approach yields better bounds than the Chebychev–Cantelli approach in many cases. Suppose, for example, that $X=\sum_{i=1}^n Y_i$, where $Y_1,\ldots,Y_n$ are independently and identically distributed (i.i.d.) random variables, all having mean $\mu_Y$, variance $\sigma_Y^2$, and MGF $\Phi_Y(s)$. Then, of course, $\mu=n\mu_Y$, $\sigma^2=n\sigma_Y^2$, and $\Phi(s)=[\Phi_Y(s)]^n$. For simplicity, suppose also that $f(0)=0$. In this case, the Chernoff approach yields
$$\begin{aligned}
E\left\{f\left(\sum_{i=1}^n Y_i\right)\right\}&\ \ge\ \frac{n\mu_Y}{a}\cdot f(a)-\frac{f(a)}{a}\inf_{s\ge 0}\left\{e^{-sa}\frac{\mathrm{d}}{\mathrm{d}s}[\Phi_Y(s)]^n\right\}\\
&=\frac{n\mu_Y}{a}\cdot f(a)-\frac{nf(a)}{a}\inf_{s\ge 0}\left\{e^{-sa}[\Phi_Y(s)]^{n-1}\Phi_Y'(s)\right\}\\
&=\frac{nf(a)}{a}\left[\mu_Y-\inf_{s\ge 0}\left\{e^{-sa}[\Phi_Y(s)]^n\cdot\frac{\mathrm{d}\ln\Phi_Y(s)}{\mathrm{d}s}\right\}\right].
\end{aligned}$$
Now, if $Y_1,Y_2,\ldots$ obey a large deviations principle, the second term in the square brackets tends to zero exponentially for the choice $a=n(\mu_Y+\epsilon)$ with arbitrarily small $\epsilon>0$. In this case, let $s^*>0$ be the maximizer of $[s(\mu_Y+\epsilon)-\ln\Phi_Y(s)]$, yielding $I(\epsilon)=s^*(\mu_Y+\epsilon)-\ln\Phi_Y(s^*)$. Then,
$$E\left\{f\left(\sum_{i=1}^n Y_i\right)\right\}\ \ge\ \frac{f[(\mu_Y+\epsilon)n]}{\mu_Y+\epsilon}\cdot\left[\mu_Y-e^{-nI(\epsilon)}\cdot\left.\frac{\mathrm{d}\ln\Phi_Y(s)}{\mathrm{d}s}\right|_{s=s^*}\right].$$
For large enough n, the second term in the square brackets becomes negligible, and the lower bound becomes arbitrarily close to $f[(\mu_Y+\epsilon)n]\cdot\mu_Y/(\mu_Y+\epsilon)$. On the other hand, Jensen's upper bound is $f(\mu_Y n)$. In some cases, the difference is not very large, at least for asymptotic evaluations. For example, if $f(x)=\ln(1+x)$, which is a frequently encountered concave function in information theory, $\ln[1+n(\mu_Y+\epsilon)]\approx\ln n+\ln(\mu_Y+\epsilon)$, whereas $\ln(1+n\mu_Y)=\ln n+\ln(\mu_Y+1/n)$; these are very close for large n and small $\epsilon>0$. In Section 4.4, we will get back to the example of the logarithm of a sum of independent random variables.
In the Chebychev–Cantelli approach, on the other hand, we have $a_c=\sqrt{n^2\mu_Y^2+n\sigma_Y^2}\approx n\mu_Y$ for large n. Thus, if we take $a=n(\mu_Y+\epsilon)>a_c$, we have
$$\frac{q_{\mathrm{ChebCant}}[n(\mu_Y+\epsilon)]}{n(\mu_Y+\epsilon)}=\frac{n\sigma_Y^2}{n\sigma_Y^2+n^2\epsilon^2}=\frac{\sigma_Y^2}{\sigma_Y^2+n\epsilon^2},$$
which tends to zero, but only at the rate of 1/n, as opposed to the exponential decay in the Chernoff approach. Still, for large n, the main term of the bound becomes asymptotically tight, as before.
In spite of the superiority of the Chernoff approach relative to the Chebychev–Cantelli approach, as we demonstrated, one should keep in mind that there are also situations where the random variable X does not have an MGF (i.e., when the PDF of X has a heavy tail), yet it does have a mean and a variance. In such cases, the Chebychev–Cantelli approach is applicable while the Chernoff approach is not. However, even when the MGF exists, in certain cases, the calculation of the first and the second moment is easier than the calculation of the exponential moment.
To conclude this section, we summarize its main findings in the form of a theorem.
Theorem 1.
Under the conditions of Lemma 1,
$$E\{f(X)\}\ \ge\ \sup_{a>0}\left\{\frac{\mu}{a}\cdot f(a)+\left(1-\frac{\mu}{a}\right)\cdot f(0)-\frac{f(a)-f(0)}{a}\cdot q_{\min}(a)\right\},$$
where
$$q_{\min}(a)=\min\left\{q_{\mathrm{Chernoff}}(a),\ \tilde{q}_{\mathrm{Chernoff}}(a),\ q_{\mathrm{ChebCant}}(a)\right\}.$$

4. Examples

In this section, we demonstrate the lower bound in several information-theoretic application examples.

4.1. Example 1—Capacity of the Gaussian Channel with Random SNR

We begin with a simple example of a zero-mean, circularly symmetric complex Gaussian channel whose signal-to-noise ratio (SNR), Z, is a random variable (e.g., due to fading), known to both the transmitter and the receiver. The capacity is given by C = E { ln ( 1 + g Z ) } , where g is a certain deterministic gain factor and the expectation is with respect to (w.r.t.) the randomness of Z. For simplicity, let us assume that Z is distributed exponentially, i.e.,
$$p_Z(z)=\theta e^{-\theta z},\qquad z\ge 0,$$
where the parameter $\theta>0$ is given. In this case, $f(x)=\ln(1+gx)$, $\mu=1/\theta$, and $q(a)$ can be easily derived in closed form, to obtain
$$q(a)=\theta\int_a^\infty ze^{-\theta z}\,\mathrm{d}z=\left(a+\frac{1}{\theta}\right)\cdot e^{-\theta a}.$$
Consequently,
$$C\ \ge\ \sup_{a\ge 1/\theta}\left\{\frac{\ln(1+ga)}{a}\left[\frac{1}{\theta}-\left(a+\frac{1}{\theta}\right)\cdot e^{-a\theta}\right]\right\}=\sup_{s\ge 1}\left\{\frac{1-(s+1)e^{-s}}{s}\cdot\ln\left(1+\frac{gs}{\theta}\right)\right\},$$
whereas the Jensen upper bound is $C\le\ln(1+g/\theta)$. Figure 1 displays both the upper bound and the lower bound to C as functions of θ. As can be seen, the gap between the bounds depends strongly on θ. For large θ, which is the case where $p_Z$ becomes narrower and more concentrated, the gap between the bounds decays rapidly in the sense that not only does the difference shrink (which by itself is not surprising, since both bounds tend to zero), but their ratio also becomes closer to unity.
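The behavior just described can be reproduced numerically; the following sketch (ours; the grid ranges and the values of θ are arbitrary choices) evaluates both bounds and their ratio:

```python
import numpy as np

def capacity_bounds(theta, g=1.0):
    """Jensen upper bound and reverse-Jensen lower bound on
    C = E{ln(1 + g Z)}, with Z ~ Exp(theta)."""
    upper = np.log1p(g / theta)
    s = np.linspace(1.0, 60.0, 200_000)
    lower = np.max((1.0 - (s + 1.0) * np.exp(-s)) / s * np.log1p(g * s / theta))
    return lower, upper

for theta in (0.5, 2.0, 10.0):
    lo, up = capacity_bounds(theta)
    print(f"theta={theta:5.1f}  lower={lo:.4f}  upper={up:.4f}  ratio={lo/up:.3f}")
```

Consistent with the discussion above, the ratio of the two bounds approaches unity as θ grows.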

4.2. Example 2—Moments of the Number of Guesses in Randomized Guessing

Consider the problem of guessing the realization of a random variable, which takes on values in a finite alphabet, using a sequence of yes/no questions of the form “Is U = u 1 ?”, “Is U = u 2 ?”, etc., until a positive response is provided by a party that observes the actual realization of U (see, e.g., Ref. [12] whose Introduction includes an overview on this problem). Given a distribution of U, a commonly used performance metric for this problem is the ρ th moment of the number of guesses until U is guessed successfully (for a given ρ > 0 ). For guessing random vectors of length n, minimizing the moments of the number of guesses by different (deterministic or randomized) guessing strategies has several applications and motivations in information theory, such as sequential decoding, guessing passwords, etc., and it is also strongly related to lossless source coding. In this vector case, the moments of the number of guesses behave asymptotically exponentially in n. Here, we refer to randomized guessing strategies, rather than deterministic strategies.
Let the random variable U take on values in a finite alphabet U . Consider a random guessing strategy where the guesser sequentially submits a sequence of independently drawn random guesses according to a certain probability distribution, P ( · ) , defined on U . Randomized guessing strategies have the advantage that they can be used by multiple asynchronous agents, which submit their guesses concurrently (again, see [12] and references therein). Let u U n be any realization of U, and let the guessing distribution, P, be given. The random number, X, of independent guesses until success has a geometric distribution:
$$\Pr\{X=k\}=[1-P(u)]^{k-1}\cdot P(u),\qquad k=1,2,\ldots$$
and we are interested in a lower bound to $E\{X^\rho\}$ for $\rho\in(0,1)$, which is the case where $f(x)=x^\rho$ is concave. Obviously, the condition $f(x)=x^\rho\ge 0=f(0)$ is met as well in this case. We next use the shorthand notation p instead of $P(u)$, and later we will think of this quantity as an exponentially decaying function of n, denoted $e^{-nE}$, where $E>0$ is some constant that depends on the type of u. We will also denote $q=1-p$. Then,
$$E\{X^\rho\}\ \ge\ \frac{\mu}{a}\cdot a^\rho-\frac{a^\rho}{a}\inf_{s\ge 0}\left\{e^{-sa}\Phi'(s)\right\}=a^{\rho-1}\left[\mu-\inf_{s\ge 0}\left\{e^{-sa}\Phi'(s)\right\}\right].$$
Now, X is a geometric random variable with μ = 1 / p , and
$$\Phi(s)=\sum_{k=1}^{\infty}e^{ks}\Pr\{X=k\}=\frac{pe^s}{1-qe^s},\qquad s<\ln\frac{1}{q},$$
which yields
$$\Phi'(s)=\frac{pe^s}{(1-qe^s)^2}.$$
Choosing $s=s^*=\ln\frac{a-1}{aq}$, with $a=e^{n(E+\epsilon)}$ (for some arbitrarily small $\epsilon>0$), gives
$$e^{-as^*}\Phi'(s^*)=\frac{p}{q}\cdot a(a-1)\cdot\left(1-\frac{1}{a}\right)^{-a}\cdot q^a\ \le\ \frac{e^{-nE}}{1-e^{-nE}}\cdot e^{2n(E+\epsilon)}\cdot\frac{\left(1-e^{-nE}\right)^{e^{n(E+\epsilon)}}}{\left(1-e^{-n(E+\epsilon)}\right)^{e^{n(E+\epsilon)}}}.$$
Now, as $n\to\infty$, the denominator, $\left(1-e^{-n(E+\epsilon)}\right)^{e^{n(E+\epsilon)}}$, tends to $1/e$, whereas the factor $\left(1-e^{-nE}\right)^{e^{n(E+\epsilon)}}$ tends to zero double-exponentially fast, and so the entire expression goes to zero double-exponentially. Thus, the main term of the lower bound is
$$\mu a^{\rho-1}=e^{n[E\rho-(1-\rho)\epsilon]},$$
which is roughly $e^{nE\rho}$ on the exponential scale. The Jensen upper bound is $(E\{X\})^\rho=1/p^\rho=e^{nE\rho}$, which is of the same exponential order if $\epsilon$ is neglected.
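A numerical sanity check of this example (our own sketch; the values of p and ρ and the choice of a well above the mean are arbitrary, and the infimum over s is approximated by a grid search, which is safe since every grid point yields a valid upper bound on q(a)):

```python
import numpy as np

p, rho = 0.01, 0.5           # success probability and moment order
q_, mu = 1.0 - p, 1.0 / p    # X ~ Geometric(p), mean mu = 1/p

def q_chernoff(a):
    # Phi'(s) = p e^s / (1 - q e^s)^2, valid for s < ln(1/q)
    s = np.linspace(0.0, np.log(1.0 / q_) * 0.999, 100_000)
    return np.min(np.exp(-a * s) * p * np.exp(s) / (1.0 - q_ * np.exp(s)) ** 2)

a = 10.0 * mu                # well above the mean, where q(a) is small
lower = a ** (rho - 1.0) * (mu - q_chernoff(a))

rng = np.random.default_rng(2)
mc = np.mean(rng.geometric(p, size=400_000) ** rho)
print(lower, mc, mu ** rho)  # lower <= E{X^rho} <= Jensen bound mu^rho
```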

4.3. Example 3—Moments of the Error in Parameter Estimation

Consider the estimation of a parameter θ from given observations. Suppose that we have an expression (or a lower bound), $\epsilon^2(\theta)$, for the mean square error (MSE) of a certain estimator, and the question is how to obtain a lower bound to other moments of the estimation error, such as $E\{|\hat{\theta}-\theta|^q\}$ where $q<2$. Using the proposed approach, we have, in this example, $X=(\hat{\theta}-\theta)^2$ and $f(x)=x^{q/2}$. Consequently,
$$E\{|\hat{\theta}-\theta|^q\}=E\left\{\left[(\hat{\theta}-\theta)^2\right]^{q/2}\right\}\ \ge\ \frac{a^{q/2}}{a}\cdot\left[E\{(\hat{\theta}-\theta)^2\}-E\left\{(\hat{\theta}-\theta)^2\cdot I[(\hat{\theta}-\theta)^2>a]\right\}\right]\ \ge\ a^{q/2-1}\cdot\left[\epsilon^2(\theta)-E\left\{(\hat{\theta}-\theta)^2\cdot I[(\hat{\theta}-\theta)^2>a]\right\}\right].$$
Now, if the given estimator has the additional large-deviations property to make the second term exponentially small for large n, then we can lower bound E { | θ ^ θ | q } by an expression whose main term is [ ϵ 2 ( θ ) ] q / 2 . For example, if θ is the mean of a Bernoulli source and θ ^ is the empirical relative frequency, or θ is the mean of Gaussian observations, we can calculate the bound relatively easily. We will not pursue this example any further here.

4.4. Logarithms of Sums of Independent Random Variables

Referring to the discussion near the end of Section 3, where X is given by the sum of n independent random variables, we now focus on the special case where f ( x ) = ln ( 1 + x ) , which was also mentioned therein (see the discussion around Equation (26)). Here, we list several examples where this example is encountered in information theory applications.

4.4.1. Example 4—Universal Source Coding

Consider the evaluation of the expected code length associated with the universal lossless source code due to Krichevsky and Trofimov [13]. In a nutshell, this is a universal code for memoryless sources. In the binary case, at each time instant t, it sequentially assigns probabilities to the next binary symbol according to (a biased version of) the empirical distribution pertaining to the source data observed so far, s 1 , , s t . Specifically, consider the ideal code-length function (in nats),
$$L(s^n)=-\sum_{t=0}^{n-1}\ln Q(s_{t+1}|s_1,\ldots,s_t),$$
where
$$Q(s_{t+1}=s|s_1,\ldots,s_t)=\frac{N_t(s)+1}{t+2},$$
and $N_t(s)$, $s\in\{0,1\}$, is the number of occurrences of the symbol s in $(s_1,\ldots,s_t)$. Therefore,
$$\begin{aligned}
E\{L(S^n)\}&=\sum_{t=0}^{n-1}\ln(t+2)-\sum_{t=0}^{n-1}E\left\{\ln[N_t(S_{t+1})+1]\right\}\\
&=\ln[(n+1)!]-\sum_{t=0}^{n-1}E\left\{\ln\left(1+\sum_{i=1}^{t}I[S_i=S_{t+1}]\right)\right\}\\
&=\ln[(n+1)!]-p\cdot\sum_{t=0}^{n-1}E\left\{\ln\left(1+\sum_{i=1}^{t}I[S_i=1]\right)\right\}-(1-p)\cdot\sum_{t=0}^{n-1}E\left\{\ln\left(1+\sum_{i=1}^{t}I[S_i=0]\right)\right\},
\end{aligned}$$
where $I[\cdot]$ are indicator functions of the corresponding events and where p and $1-p$ are the probabilities of "1" and "0", respectively. Now, in order to obtain an upper bound to $E\{L(S^n)\}$, one can lower bound each of the terms $E\{\ln(1+\sum_{i=1}^{t}I[S_i=1])\}$ and $E\{\ln(1+\sum_{i=1}^{t}I[S_i=0])\}$. This falls in the framework of logarithms of sums of i.i.d. random variables, where the summands are binary.
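Each such term can be bounded with Theorem 1. As an illustration (ours; the parameters are arbitrary, and the infimum is approximated on a grid in the log domain to avoid numerical overflow), take the sum of t Bernoulli(p) indicators, i.e., a Binomial(t, p) count B, and lower bound $E\{\ln(1+B)\}$ via the Chernoff form of q(a):

```python
import numpy as np

t_len, p = 1000, 0.3     # number of symbols and Pr{S_i = 1}
mu = t_len * p           # mean of B ~ Binomial(t_len, p)

def q_chernoff(a):
    # q(a) <= inf_s e^{-as} d/ds[(1 - p + p e^s)^t_len], computed in log domain
    s = np.linspace(1e-6, 2.0, 50_000)
    log_phi = np.log(1.0 - p + p * np.exp(s))
    log_val = -a * s + (t_len - 1) * log_phi + np.log(t_len * p) + s
    return np.exp(np.min(log_val))

a = 1.1 * mu             # slightly above the mean, where q(a) is already small
lower = np.log1p(a) / a * (mu - q_chernoff(a))

rng = np.random.default_rng(4)
mc = np.mean(np.log1p(rng.binomial(t_len, p, size=100_000)))
print(lower, mc, np.log1p(mu))  # lower <= E{ln(1+B)} <= ln(1+mu)
```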

4.4.2. Example 5—Ergodic Capacity of the Rayleigh SIMO Channel

Consider the single-input, multiple-output (SIMO) channel with L receiving antennas and assume that the channel transfer coefficients, h 1 , , h L , are independent, zero-mean, circularly symmetric complex Gaussian random variables with variances σ 1 2 , , σ L 2 . The ergodic capacity (in nats per channel use) of the SIMO channel is given by
$$C=E\left\{\ln\left(1+A\sum_{\ell=1}^{L}|h_\ell|^2\right)\right\}=E\left\{\ln\left(1+A\sum_{\ell=1}^{L}\left(f_\ell^2+g_\ell^2\right)\right)\right\},$$
where $f_\ell=\mathrm{Re}\{h_\ell\}$, $g_\ell=\mathrm{Im}\{h_\ell\}$, and A is the SNR (see, e.g., Refs. [14,15] and many references therein).
Once again, here $f(x)=\ln(1+x)$, and now $X=A\sum_{\ell=1}^{L}|h_\ell|^2$. Here, the transfer coefficients are assumed independent, but not identically distributed. Therefore, the means should be summed, and so should the variances, whereas the MGFs should be multiplied. Clearly, the case where they all share the same PDF is a special case.

4.4.3. Example 6—Differential Entropy of the Generalized Multivariate Cauchy Distribution

Let $(Y_1,\ldots,Y_n)$ be a multivariate Cauchy random vector, whose PDF is given by
$$p(y_1,\ldots,y_n)=\frac{C_n}{\left(1+\sum_{i=1}^{n}y_i^2\right)^{(n+1)/2}}.$$
Using the Laplace transform relation,
$$\frac{1}{s^{(n+1)/2}}=\frac{1}{\Gamma((n+1)/2)}\int_0^\infty t^{(n-1)/2}e^{-st}\,\mathrm{d}t,\qquad \mathrm{Re}(s)>0,$$
the PDF p can be represented as a mixture of product measures:
$$p(y_1,\ldots,y_n)=\frac{C_n}{\left(1+\sum_{i=1}^{n}y_i^2\right)^{(n+1)/2}}=\frac{C_n}{\Gamma((n+1)/2)}\int_0^\infty t^{(n-1)/2}e^{-t}\exp\left\{-t\sum_{i=1}^{n}y_i^2\right\}\mathrm{d}t.$$
Defining
$$Z(t)=\int_{-\infty}^{\infty}e^{-ty^2}\,\mathrm{d}y=\sqrt{\frac{\pi}{t}},\qquad t>0,$$
we get from (46),
$$1=\frac{C_n}{\Gamma((n+1)/2)}\int_0^\infty t^{(n-1)/2}e^{-t}\int_{\mathbb{R}^n}\exp\left\{-t\sum_{i=1}^{n}y_i^2\right\}\mathrm{d}y_1\cdots\mathrm{d}y_n\,\mathrm{d}t=\frac{C_n}{\Gamma((n+1)/2)}\int_0^\infty t^{(n-1)/2}e^{-t}\left[\int_{-\infty}^{\infty}e^{-ty^2}\,\mathrm{d}y\right]^n\mathrm{d}t=\frac{C_n}{\Gamma((n+1)/2)}\int_0^\infty t^{(n-1)/2}e^{-t}\left(\frac{\pi}{t}\right)^{n/2}\mathrm{d}t,$$
and so,
$$C_n=\frac{\Gamma((n+1)/2)}{\int_0^\infty t^{(n-1)/2}e^{-t}(\pi/t)^{n/2}\,\mathrm{d}t}.$$
The calculation of the differential entropy is associated with the evaluation of the expectation $E\{\ln(1+\sum_{i=1}^{n}Y_i^2)\}$. This falls within the framework of this application example, where $X=\sum_{i=1}^{n}Y_i^2$. Here, the $Y_i^2$ are not i.i.d., but they are distributed under a mixture of i.i.d. Gaussian distributions, as can be seen in (46). In particular,
$$p(y_1,\ldots,y_n)=\frac{C_n}{\left(1+\sum_{i=1}^{n}y_i^2\right)^{(n+1)/2}}=\frac{C_n}{\Gamma((n+1)/2)}\int_0^\infty t^{(n-1)/2}e^{-t}\left(\frac{\pi}{t}\right)^{n/2}\cdot\frac{\exp\left\{-t\sum_{i=1}^{n}y_i^2\right\}}{(\pi/t)^{n/2}}\,\mathrm{d}t=\frac{C_n\pi^{n/2}}{\Gamma((n+1)/2)}\int_0^\infty t^{-1/2}e^{-t}\cdot\frac{\exp\left\{-t\sum_{i=1}^{n}y_i^2\right\}}{(\pi/t)^{n/2}}\,\mathrm{d}t.$$
Let us denote
$$w(t)=\frac{C_n\pi^{n/2}e^{-t}}{\Gamma((n+1)/2)\sqrt{t}},\qquad t>0.$$
Then, $(Y_1,\ldots,Y_n)$ is governed by a mixture of i.i.d., zero-mean Gaussians of variance $1/(2t)$ with weight function $w(t)$, $t>0$. Therefore,
$$E\left\{\ln\left(1+\sum_{i=1}^{n}Y_i^2\right)\right\}=\int_0^\infty\mathrm{d}t\,w(t)\,E\left\{\ln\left(1+\sum_{i=1}^{n}Y_i^2\right)\bigg|t\right\},$$
and we can now lower bound $E\{\ln(1+\sum_{i=1}^{n}Y_i^2)|t\}$ for each $t>0$ and, finally, integrate the result (choosing possibly different a and s for every t). Following the discussion near the end of Section 3, the result will be roughly
$$\int_0^\infty\mathrm{d}t\,w(t)\ln\left(1+\frac{n}{2t}\right),$$
which is in agreement with the Jensen upper bound.
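As a side check of this mixture representation (our own numerical sketch; the quadrature grid is an arbitrary choice), the normalization constant it implies matches the known closed forms $C_1=1/\pi$ (standard Cauchy) and $C_3=\Gamma(2)/\pi^2$:

```python
import numpy as np
from math import gamma, pi

def C_n(n):
    """Normalization constant of the multivariate Cauchy PDF,
    computed from the mixture representation by trapezoidal quadrature."""
    t = np.logspace(-8, 3, 400_000)
    integrand = t ** ((n - 1) / 2) * np.exp(-t) * (pi / t) ** (n / 2)
    integral = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t))
    return gamma((n + 1) / 2) / integral

print(C_n(1), 1 / pi)       # standard Cauchy: C_1 = 1/pi
print(C_n(3), 1 / pi ** 2)  # n = 3: C_3 = Gamma(2)/pi^2
```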

5. Discussion

In this section, we discuss several additional aspects and observations regarding our main result, as well as modifications and suggestions for relaxing certain assumptions. We summarize our comments below.
1. The maximization over a is not necessary. Our first comment is quite trivial but, nevertheless, it is important to mention at least as a reminder. The explicit maximization over the parameter a may not be trivial to carry out in most examples, but for certain purposes, it may not be necessary. One can select an arbitrary value of a and obtain a legitimate lower bound. In some cases, moreover, it is not too difficult to guess what could be a good choice of this value, as we saw in some of the examples of Section 4.
2. Softening the assumption $f(x)\ge f(0)$. One may partially relax the assumption $f(x)\ge f(0)$ for all $x\ge 0$, and replace it with the softer assumption that there exists $\Delta\ge 0$, such that $f(x)+\Delta\cdot x\ge f(0)$ for all $x\ge 0$ (or more precisely, within the support of the PDF of X). By applying Lemma 1 and Theorem 1 to $f(x)+\Delta\cdot x$ and compensating for the term $E\{\Delta\cdot X\}=\Delta\mu$, we can easily use exactly the same technique and obtain the modified lower bound,
$$E\{f(X)\}\ \ge\ \sup_{a>0}\left\{\frac{\mu}{a}\cdot f(a)+\left(1-\frac{\mu}{a}\right)f(0)-\left[\frac{f(a)-f(0)}{a}+\Delta\right]\cdot q(a)\right\},$$
and so the cost of this relaxation is the extra $\Delta$ in the last term. This means that the best choice of $\Delta$ is the smallest one for which $f(x)+\Delta x\ge f(0)$ for all $x\ge 0$, namely,
$$\Delta=\Delta^*=\sup_{x>0}\frac{f(0)-f(x)}{x}.$$
Nevertheless, if we can make $q(a)$ small by selecting a to be just slightly above $\mu$, this cost can be kept small. The results presented in Section 3 then become the special case of $\Delta=0$. Moreover, the artificially added (and subtracted) term, $\Delta x$, can also be replaced by a non-linear term, say, $g(x)$, provided that $f(x)+g(x)$ remains concave and $f(x)+g(x)\ge f(0)+g(0)$. This is useful, of course, only if $E\{g(X)\}$ lends itself to an explicit calculation, for example, $g(x)=x^n$, or $g(x)=e^{ax}$.
3. Convex functions. So far we have dealt with RJIs for concave functions. RJIs associated with expectations of convex functions (upper bounds) can be obtained in exactly the same manner, except that the signs are flipped. Specifically, by replacing f with $-f$, we have similar statements for convex functions: Let $f:\mathbb{R}^+\to\mathbb{R}$ be a convex function with $f(x)\le f(0)$ for every $x\ge 0$. Then,
$$E\{f(X)\}\ \le\ \inf_{a>0}\left\{\frac{\mu}{a}\cdot f(a)+\left(1-\frac{\mu}{a}\right)\cdot f(0)+\frac{f(0)-f(a)}{a}\cdot q(a)\right\},$$
and, again, the assumption $f(x)\le f(0)$ can be softened in the same manner as described in item 2 above to obtain
$$E\{f(X)\}\ \le\ \inf_{a>0}\left\{\frac{\mu}{a}\cdot f(a)+\left(1-\frac{\mu}{a}\right)\cdot f(0)+\left[\frac{f(0)-f(a)}{a}+\Delta\right]\cdot q(a)\right\}.$$
4. Functions that are neither convex nor concave. Using the same line of thought as in item 2 above, one can obtain upper and lower bounds to expectations of general functions that are not necessarily convex or concave. Indeed, let $f:[a,b]\to\mathbb{R}$ be a real, continuous function that satisfies the following condition: $y>x$ implies $f(y)\le f(x)+\Delta(y-x)$ for all $x,y$. Assume also that f is bounded from below by a constant, $f_{\min}$. Then, $f(x)+\Delta\cdot(b-x)-f_{\min}$ is monotonically non-increasing and positive. Thus, for every $u\in[a,b]$,
$$f(x)+\Delta\cdot(b-x)-f_{\min}\ \ge\ \left[f(u)+\Delta(b-u)-f_{\min}\right]\cdot I[x\le u],$$
and so,
$$E\{f(X)\}+\Delta\cdot(b-\mu)-f_{\min}\ \ge\ \left[f(u)+\Delta(b-u)-f_{\min}\right]\cdot\Pr\{X\le u\},$$
or
$$\begin{aligned}
E\{f(X)\}&\ \ge\ f_{\min}-\Delta\cdot(b-\mu)+\left[f(u)+\Delta(b-u)-f_{\min}\right]\cdot\left(1-\Pr\{X>u\}\right)\\
&=f(u)-\Delta\cdot(u-\mu)-\left[f(u)+\Delta(b-u)-f_{\min}\right]\cdot\Pr\{X>u\},
\end{aligned}$$
so, finally,
$$E\{f(X)\}\ \ge\ \sup_{u\in[a,b]}\left\{f(u)-\Delta\cdot(u-\mu)-\left[f(u)+\Delta(b-u)-f_{\min}\right]\cdot\Pr\{X>u\}\right\},$$
with the understanding that we can further lower bound $E\{f(X)\}$ by using any of the available upper bounds on $\Pr\{X>u\}$ (Markov, Chebychev, Chebychev–Cantelli, Chernoff, Hoeffding, etc.). The choice depends on considerations of tightness and the calculability of the bound, as described before.
Likewise, if $f:[a,b]\to\mathbb{R}$ is a real, continuous function that satisfies the following condition: $y>x$ implies $f(y)\ge f(x)-\Delta(y-x)$ for all $x,y$, and is bounded from above by $f_{\max}<\infty$, we obtain, in the same manner,
$$E\{f(X)\}\ \le\ \inf_{u\in[a,b]}\left\{f(u)+\Delta\cdot(u-\mu)-\left[f(u)-\Delta(b-u)-f_{\max}\right]\cdot\Pr\{X>u\}\right\}.$$
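As an illustration of the lower bound in item 4 (our own sketch, with arbitrary choices: $f(x)=x+\sin x$ on $[0,10]$, which is neither convex nor concave, with $\Delta=2$ and $f_{\min}=0$; X approximately Gaussian with mean 5 and standard deviation 0.5, whose mass outside $[0,10]$ is negligible; the tail probability bounded via Chebychev–Cantelli):

```python
import numpy as np

# f(x) = x + sin(x): f(y) <= f(x) + 2(y - x) for y > x (since f' <= 2),
# and f >= 0 on [0, 10].
Delta, f_min, b = 2.0, 0.0, 10.0
f = lambda x: x + np.sin(x)
mu, sigma = 5.0, 0.5

u = np.linspace(0.0, b, 100_000)
# One-sided Chebychev-Cantelli: Pr{X > u} <= sigma^2/(sigma^2 + (u - mu)^2), u > mu
tail = np.where(u > mu, sigma ** 2 / (sigma ** 2 + (u - mu) ** 2), 1.0)
lower = np.max(f(u) - Delta * (u - mu) - (f(u) + Delta * (b - u) - f_min) * tail)

rng = np.random.default_rng(3)
x = rng.normal(mu, sigma, size=400_000)
print(lower, np.mean(f(x)))  # lower bound vs Monte Carlo value of E{f(X)}
```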

6. Extension to Bivariate (and Multivariate) Concave Functions

We conclude this article by outlining a possible extension of our main results to bivariate concave functions, with the understanding that the main ideas also extend to multivariate functions of more than two variables. It should also be understood that a parallel upper bound for convex functions can easily be obtained by flipping the signs as before. For simplicity, we assume parallel assumptions to those of Lemma 1 and Theorem 1. In this section, we confine attention to the Chernoff approach, but the other approaches are valid as well.
For a pair of random variables (X, Y), governed by a given joint PDF, we define the following functions:
$$\begin{aligned}
\Phi(s,t) &= E\{e^{sX+tY}\}\\
\Phi_s(s) &= \frac{\partial \Phi(s,0)}{\partial s} = E\{X e^{sX}\}\\
\Phi_t(t) &= \frac{\partial \Phi(0,t)}{\partial t} = E\{Y e^{tY}\}\\
\Phi_{st}(s,t) &= \frac{\partial^2 \Phi(s,t)}{\partial s\,\partial t} = E\{XY e^{sX+tY}\}\\
q_1(a) &= \inf_{s\ge 0}\left\{e^{-sa}\,\Phi_s(s)\right\}\\
q_2(b) &= \inf_{t\ge 0}\left\{e^{-tb}\,\Phi_t(t)\right\}\\
q_3(a) &= \inf_{s\ge 0}\left\{e^{-sa}\,\Phi_{st}(s,0)\right\}\\
q_4(b) &= \inf_{t\ge 0}\left\{e^{-tb}\,\Phi_{st}(0,t)\right\}.
\end{aligned}$$
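To make these Chernoff-style quantities concrete, here is a small numerical sketch under an assumed model (chosen only for illustration): X and Y i.i.d. Exponential(1), for which $\Phi_s(s) = E\{X e^{sX}\} = 1/(1-s)^2$ for $s < 1$. The script evaluates $q_1(a)$ by a grid search over s and confirms that it upper-bounds the tail moment $E\{X\cdot I\{X \ge a\}\} = (a+1)e^{-a}$, which is precisely the role these functions play in the theorem below.

```python
import numpy as np

# Assumed model (illustration only): X ~ Exponential(1), so that
# Phi_s(s) = E{X e^{sX}} = 1/(1 - s)^2, finite for s < 1.
def q1(a: float) -> float:
    """q1(a) = inf_{s >= 0} e^{-sa} * Phi_s(s), evaluated by grid search."""
    s = np.linspace(0.0, 0.999, 100001)       # restrict to s < 1, where the MGF exists
    return float(np.min(np.exp(-s * a) / (1.0 - s) ** 2))

a = 3.0
exact_tail_moment = (a + 1.0) * np.exp(-a)    # E{X * 1{X >= a}} for Exp(1), in closed form
assert q1(a) >= exact_tail_moment             # q1(a) is a valid upper bound on the tail moment
print(f"q1({a}) = {q1(a):.4f} >= E[X 1(X>=a)] = {exact_tail_moment:.4f}")
```

In this model the optimizer is $s^\ast = 1 - 2/a$ (for $a > 2$), giving $q_1(3) = (9/4)e^{-1} \approx 0.828$, versus the exact tail moment $4e^{-3} \approx 0.199$: the bound is valid, though not tight.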
We then have the following theorem.
Theorem 2.
Let $f:\mathbb{R}_+^2 \to \mathbb{R}$ have the property $f(x,y) \ge f(x,0)$ and $f(x,y) \ge f(0,y)$ for all positive x and y in their domains, and assume that f is concave in x for fixed y, and concave in y for fixed x (but not necessarily jointly concave in both variables). Then,
$$\begin{aligned}
E\{f(X,Y)\} \ge \sup_{a>0}\sup_{b>0}\bigg[&\frac{f(a,b)-f(a,0)}{ab}\left[E\{XY\} - q_3(a) - q_4(b)\right] + \frac{f(a,0)}{a}\left[E\{X\} - q_1(a)\right]\\
&- \frac{1}{a}\,E\left\{X\cdot f(0, E\{Y|X\})\right\} + \frac{f(0,b)-f(0,0)}{b}\cdot\left[E\{Y\} - q_2(b)\right] + f(0,0)\bigg].
\end{aligned}$$
Likewise,
$$\begin{aligned}
E\{f(X,Y)\} \ge \sup_{a>0}\sup_{b>0}\bigg[&\frac{f(a,b)-f(0,b)}{ab}\left[E\{XY\} - q_3(a) - q_4(b)\right] + \frac{f(0,b)}{b}\left[E\{Y\} - q_2(b)\right]\\
&- \frac{1}{b}\,E\left\{Y\cdot f(E\{X|Y\}, 0)\right\} + \frac{f(a,0)-f(0,0)}{a}\cdot\left[E\{X\} - q_1(a)\right] + f(0,0)\bigg].
\end{aligned}$$
Proof of Theorem 2.
We begin with the simple identity,
E { f ( X , Y ) } = E { E { f ( X , Y ) | Y } } .
Now, applying Theorem 1 to f ( X , y ) (for fixed y) w.r.t. the conditional PDF of X given Y = y , we obtain:
$$E\{f(X,y)\,|\,Y=y\} \ge \frac{f(a,y)-f(0,y)}{a}\cdot E\{X\cdot I\{X \le a\}\,|\,Y=y\} + f(0,y).$$
Thus,
$$\begin{aligned}
E\{f(X,Y)\} &\overset{(a)}{\ge} \frac{1}{a}\,E\Big\{\big[f(a,Y)-f(0,Y)\big]\,X\cdot I\{X\le a\}\Big\} + E\{f(0,Y)\}\\
&\overset{(b)}{\ge} \frac{1}{a}\,E\big[f(a,Y)\,X\cdot I\{X\le a\}\big] - \frac{1}{a}\,E\big[f(0,Y)\,X\cdot I\{X\le a\}\big]\\
&\qquad + \frac{f(0,b)-f(0,0)}{b}\cdot\big(E\{Y\}-E\{Y\cdot I\{Y\ge b\}\}\big) + f(0,0)\\
&\overset{(c)}{\ge} \frac{1}{a}\,E\left\{\left[\frac{f(a,b)-f(a,0)}{b}\cdot Y\cdot I\{Y\le b\} + f(a,0)\right] X\cdot I\{X\le a\}\right\} - \frac{1}{a}\,E\big[f(0,Y)\,X\big]\\
&\qquad + \frac{f(0,b)-f(0,0)}{b}\cdot\big(E\{Y\}-E\{Y\cdot I\{Y\ge b\}\}\big) + f(0,0)\\
&\overset{(d)}{=} \frac{f(a,b)-f(a,0)}{ab}\,E\{XY\cdot I\{X\le a\}\cdot I\{Y\le b\}\} + \frac{f(a,0)}{a}\,E\{X\cdot I\{X\le a\}\} - \frac{1}{a}\,E\Big\{E\big\{f(0,Y)\,X\,\big|\,X\big\}\Big\}\\
&\qquad + \frac{f(0,b)-f(0,0)}{b}\cdot\big(E\{Y\}-E\{Y\cdot I\{Y\ge b\}\}\big) + f(0,0)\\
&\overset{(e)}{\ge} \frac{f(a,b)-f(a,0)}{ab}\,\big[E\{XY\} - E\{XY\cdot I\{X\ge a\}\} - E\{XY\cdot I\{Y\ge b\}\}\big] + \frac{f(a,0)}{a}\,\big[E\{X\} - E\{X\cdot I\{X\ge a\}\}\big]\\
&\qquad - \frac{1}{a}\,E\big\{X\cdot f(0,E\{Y|X\})\big\} + \frac{f(0,b)-f(0,0)}{b}\cdot\big(E\{Y\}-E\{Y\cdot I\{Y\ge b\}\}\big) + f(0,0)\\
&\overset{(f)}{\ge} \frac{f(a,b)-f(a,0)}{ab}\,\big[E\{XY\} - q_3(a) - q_4(b)\big] + \frac{f(a,0)}{a}\,\big[E\{X\} - q_1(a)\big] - \frac{1}{a}\,E\big\{X\cdot f(0,E\{Y|X\})\big\}\\
&\qquad + \frac{f(0,b)-f(0,0)}{b}\cdot\big[E\{Y\} - q_2(b)\big] + f(0,0),
\end{aligned}$$
where (a) is by taking the expectation over Y on both sides of (74), (b) is by applying Theorem 1 to f(0, Y), (c) is by applying Theorem 1 to f(a, Y) (first, w.r.t. the conditional PDF of Y given X, and then by averaging w.r.t. X), (d) is by rearranging terms, (e) is by the inequality $I\{X \le a\}\cdot I\{Y \le b\} \ge 1 - I\{X \ge a\} - I\{Y \ge b\}$ (together with the conditional Jensen inequality applied to the concave function $f(0,\cdot)$), and (f) is by the definitions of the functions $q_1(\cdot), \ldots, q_4(\cdot)$. The second inequality in Theorem 2 is proved in the same manner by switching the roles of: (i) X and Y, (ii) a and b, (iii) $q_3(a)$ and $q_4(b)$, and (iv) f(a, 0) and f(0, b). In other words, we first condition on X rather than on Y. This completes the proof of Theorem 2. □
The usefulness of this bound depends on the ability to calculate, or at least to upper bound, the term $\frac{1}{a}E\{X\cdot f(0, E\{Y|X\})\}$. If the function $g(x) = x\cdot f(0, E\{Y|X=x\})$ happens to be concave, we can use Jensen's inequality. If it is convex, we can apply a version of the reverse Jensen inequality for convex functions. If it is neither convex nor concave, we can use the method presented in the previous section to upper bound it. When X and Y are independent, this term simplifies significantly, as $E\{X\cdot f(0, E\{Y|X\})\}$ becomes $\mu_X\cdot f(0,\mu_Y)$, where $\mu_X$ and $\mu_Y$ are the means of X and Y, respectively.
Another case where the lower bound simplifies considerably is the case where f ( a , 0 ) = f ( 0 , b ) = 0 for all a and b, as it becomes
$$E\{f(X,Y)\} \ge \frac{f(a,b)}{ab}\left[E\{XY\} - q_3(a) - q_4(b)\right].$$
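As a concrete instance of this simplified bound (an illustrative setup, not taken from the paper): for X, Y i.i.d. Exp(1) and $f(x,y) = \min\{x,y\}$, we indeed have $f(a,0) = f(0,b) = 0$, $E\{XY\} = 1$, and, by independence, $q_3(a)$ and $q_4(b)$ reduce to $\inf_{s\ge 0} e^{-sa}E\{Xe^{sX}\}$. A grid search over (a, b) then yields a valid, if loose, lower bound on $E\{\min(X,Y)\} = 1/2$.

```python
import numpy as np

# Assumed setup for illustration: X, Y i.i.d. Exp(1), f(x, y) = min(x, y).
# Then E{XY} = 1 and, by independence, q3(a) = q4(a) = inf_{0 <= s < 1} e^{-sa} / (1-s)^2
# (using E{X e^{sX}} = 1/(1-s)^2 and E{Y} = 1).
s = np.linspace(0.0, 0.999, 20001)       # restrict to s < 1, where the MGF exists

def q(a: float) -> float:
    return float(np.min(np.exp(-s * a) / (1.0 - s) ** 2))

grid = np.linspace(0.5, 15.0, 300)
qv = np.array([q(a_) for a_ in grid])    # precompute q on the (a, b) grid

A, B = np.meshgrid(grid, grid)
QA, QB = np.meshgrid(qv, qv)
vals = np.minimum(A, B) / (A * B) * (1.0 - QA - QB)   # min(a,b)/(ab) * [E{XY} - q3 - q4]
lb = float(vals.max())                                # best lower bound over the grid

true_val = 0.5                           # E{min(X, Y)} = 1/2, since min of two Exp(1) is Exp(2)
assert lb <= true_val
print(f"reverse-Jensen lower bound {lb:.3f} <= E min(X,Y) = {true_val}")
```

The optimum here is attained near $a = b \approx 7$ and gives roughly 0.12, well below 1/2; as discussed above, the bound becomes essentially tight only when X and Y concentrate around their means.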
We next provide two application examples where this is the case.
  • Example 7—minimum between two sums of independent random variables. Let $X = \sum_{i=1}^n A_i$ and $Y = \sum_{j=1}^m B_j$, where $\{A_i\}$ and $\{B_j\}$ are all non-negative, independent random variables. Obviously, $\mu_X = \sum_{i=1}^n \mu_{A_i}$ and $\mu_Y = \sum_{j=1}^m \mu_{B_j}$. Further, let $f(x,y) = \min\{x,y\}$, which is concave in x for fixed y and vice versa. Then,
    $$E\left\{\min\left\{\sum_{i=1}^n A_i, \sum_{j=1}^m B_j\right\}\right\} \ge \frac{\min\{a,b\}}{ab}\left[\sum_{i=1}^n\sum_{j=1}^m \mu_{A_i}\mu_{B_j} - q_3(a) - q_4(b)\right],$$
    which is essentially $\min\{\mu_X, \mu_Y\}$ for large n and m, provided that X and Y concentrate around their means, as discussed above. This example has a few applications, all of them relevant in situations where a certain additive cost is associated with a given task, there are two possible routes (or strategies), and the one with the smaller cost is chosen. For example, suppose we are compressing a realization, $u_1,\ldots,u_n$, of a random source vector, $U_1,\ldots,U_n$, and we have two side information sequences (available at both ends), $V_1,\ldots,V_n$ and $W_1,\ldots,W_n$, which are both conditionally independent noisy versions of $u_1,\ldots,u_n$, but for practical reasons, we use only one of them: the one for which the code-length is shorter for the given realization (at the cost of one additional flag bit). In this case, $m = n$, $A_i = -\log P(u_i|V_i)$, and $B_i = -\log P(u_i|W_i)$, which are independent. As a second step, we have, of course, to take the expectation over $(U_1,\ldots,U_n)$. Other examples of costs might be prices, distances, waiting times, bit errors, etc.
  • Example 8—channel capacity revisited. Here, we combine Example 1 (capacity of the Gaussian channel with random SNR) and Example 5 (ergodic capacity of the SIMO channel). Consider the expression
    $$C = E\left\{\log\left(1 + Z\cdot\sum_{\ell=1}^L |h_\ell|^2\right)\right\},$$
    where, as in Example 5, $\{h_\ell\}$ are zero-mean, circularly symmetric, complex Gaussian random variables with variances $\{\sigma_\ell^2\}$, and as in Example 1, Z is an exponentially distributed random variable with parameter θ, independent of $\{h_\ell\}$. In principle, we could have treated this problem in the framework of univariate functions, using the concavity of $f(X) = \ln(1+X)$, where the random variable X is defined as $X = Z\cdot\sum_{\ell=1}^L |h_\ell|^2$. However, the characteristic function of X is not convenient to analyze, since X is a product of two random variables. Instead, we treat it as a bivariate function, $f(X,Y) = \ln(1+XY)$, where $X = Z$ and $Y = \sum_{\ell=1}^L |h_\ell|^2$. Clearly, f is concave in each one of its arguments when the other one is kept fixed. We then have
    $$C \ge \frac{\ln(1+ab)}{ab}\left[\frac{1}{\theta}\cdot\sum_{\ell=1}^L \sigma_\ell^2 - q_3(a) - q_4(b)\right].$$
As in Example 5, instead of the Chernoff-like bound, q 3 ( a ) , we have the exact expression
$$E\{XY\cdot I[X\ge a]\} = E\{Y\}\cdot E\{X\cdot I[X\ge a]\} = \sum_{\ell=1}^L \sigma_\ell^2\cdot\left(a + \frac{1}{\theta}\right)\cdot e^{-\theta a}.$$
The other term can be readily bounded using the Chernoff approach.
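Both examples lend themselves to quick numerical checks (with arbitrarily chosen demo parameters, not from the paper). The first part of the sketch below verifies the closed-form tail moment $E\{X\cdot I[X\ge a]\} = (a + 1/\theta)e^{-\theta a}$ used above for $X \sim \mathrm{Exp}(\theta)$; the second illustrates Example 7's claim that $E\{\min(X,Y)\}$ approaches $\min\{\mu_X,\mu_Y\}$ when the two sums concentrate (uniform summands assumed for the demo).

```python
import numpy as np

rng = np.random.default_rng(1)

# --- Example 8: closed-form tail moment for X ~ Exp(theta) (demo values) ---
theta, a = 2.0, 1.5
x = rng.exponential(scale=1.0 / theta, size=2_000_000)   # Exp(theta) has scale 1/theta
mc = float(np.mean(x * (x >= a)))                        # Monte Carlo E{X * 1[X >= a]}
exact = (a + 1.0 / theta) * np.exp(-theta * a)           # (a + 1/theta) e^{-theta a}
assert abs(mc - exact) < 5e-3

# --- Example 7: E{min(X, Y)} ~= min(mu_X, mu_Y) under concentration ---
# A_i, B_j i.i.d. Uniform[0, 2] (mean 1), n = 500, m = 600 => mu_X = 500, mu_Y = 600.
n, m, trials = 500, 600, 4000
X = rng.uniform(0.0, 2.0, size=(trials, n)).sum(axis=1)
Y = rng.uniform(0.0, 2.0, size=(trials, m)).sum(axis=1)
est = float(np.minimum(X, Y).mean())                     # Monte Carlo E{min(X, Y)}
assert abs(est - 500.0) < 2.0                            # Jensen: E{min} <= 500; concentration: ~= 500

print(f"tail moment: MC {mc:.5f} vs exact {exact:.5f};  E min ~= {est:.1f} (target 500)")
```

Multiplying the verified tail moment by $E\{Y\} = \sum_\ell \sigma_\ell^2$ recovers the exact term that replaces $q_3(a)$ in the capacity bound above.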

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

Interesting discussions with the authors of [9] are acknowledged with thanks.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2006.
  2. Jebara, T.; Pentland, A. On reversing Jensen’s inequality. In Proceedings of the 13th International Conference on Neural Information Processing Systems (NIPS 2000), Denver, CO, USA, 2000; pp. 213–219.
  3. Budimir, I.; Dragomir, S.S.; Pečarić, J. Further reverse results for Jensen’s discrete inequality and applications in information theory. J. Inequal. Pure Appl. Math. 2001, 2, 5.
  4. Simić, S. On a new converse of Jensen’s inequality. Publ. Inst. Math. 2009, 85, 107–110.
  5. Dragomir, S.S. Some reverses of the Jensen inequality for functions of selfadjoint operators in Hilbert spaces. J. Inequal. Appl. 2010, 2010, 496821.
  6. Dragomir, S.S. Some reverses of the Jensen inequality with applications. Bull. Aust. Math. Soc. 2013, 87, 177–194.
  7. Khan, S.; Khan, M.A.; Chu, Y.-M. New converses of Jensen inequality via Green functions with applications. Rev. Real Acad. Cienc. Exactas Fis. Nat. Ser. A Mat. 2020, 114, 114.
  8. Khan, S.; Khan, M.A.; Chu, Y.-M. Converses of Jensen inequality derived from the Green functions with applications in information theory. Math. Methods Appl. Sci. 2020, 43, 2577–2587.
  9. Wunder, G.; Groß, B.; Fritschek, R.; Schaefer, R.F. A reverse Jensen inequality result with application to mutual information estimation. In Proceedings of the 2021 IEEE Information Theory Workshop (ITW 2021), Kanazawa, Japan, 17–21 October 2021. Available online: https://arxiv.org/pdf/2111.06676.pdf (accessed on 12 November 2021).
  10. Ali, M.A.; Budak, H.; Zhang, Z. A new extension of quantum Simpson’s and quantum Newton’s inequalities for quantum differentiable convex functions. Math. Methods Appl. Sci. 2021, 1–19.
  11. Budak, H.; Ali, M.A.; Tarhanaci, M. Some new quantum Hermite–Hadamard like inequalities for co-ordinated convex functions. J. Optim. Theory Appl. 2020, 186, 899–910.
  12. Merhav, N.; Cohen, A. Universal randomized guessing with application to asynchronous decentralized brute-force attacks. IEEE Trans. Inform. Theory 2020, 66, 114–129.
  13. Krichevsky, R.E.; Trofimov, V.K. The performance of universal encoding. IEEE Trans. Inform. Theory 1981, 27, 199–207.
  14. Dong, A.; Zhang, H.; Wu, D.; Yuan, D. Logarithmic expectation of the sum of exponential random variables for wireless communication performance evaluation. In Proceedings of the 2015 IEEE 82nd Vehicular Technology Conference (VTC2015-Fall), Boston, MA, USA, 6–9 September 2015.
  15. Tse, D.; Viswanath, P. Fundamentals of Wireless Communication; Cambridge University Press: Cambridge, UK, 2005.
Figure 1. Capacity upper and lower bounds for g = 5 . The blue curve is the lower bound as a function of θ and the red curve is the ordinary Jensen upper bound, given by ln ( 1 + g / θ ) .
