Some Useful Integral Representations for Information-Theoretic Analyses

This work is an extension of our earlier article, where a well-known integral representation of the logarithmic function was explored and accompanied by demonstrations of its usefulness in obtaining compact, easily calculable, exact formulas for quantities that involve expectations of the logarithm of a positive random variable. Here, in the same spirit, we derive an exact integral representation (in one or two dimensions) of the moment of a nonnegative random variable, or of a sum of such independent random variables, where the moment order is a general positive non-integer real number (such moments are also known as fractional moments). The proposed formula is applied to a variety of examples with an information-theoretic motivation, and it is shown how it facilitates their numerical evaluation. In particular, when applied to the calculation of a moment of the sum of a large number, n, of nonnegative random variables, integration over one or two dimensions, as suggested by our proposed integral representation, is significantly easier than the alternative of integrating over n dimensions, as needed in the direct calculation of the desired moment.

In the case of the logarithmic function, the common practice is either to resort to approximate evaluations, provided by upper and lower bounds on the desired expression (for example, by using Jensen's inequality), or to approximate the calculation by using the Taylor series expansion of the function ln x. More recently, it has also become popular to use the replica trick (see, e.g., Chapter 8 in [23]), a non-rigorous but useful technique borrowed from statistical physics.
In our earlier work [6], we demonstrated how the following well-known integral representation of the logarithmic function,

ln x = ∫_0^∞ (e^{-u} - e^{-ux})/u du,  x > 0,    (1)

can be useful in a variety of application areas in the field of information theory, including both source and channel coding, as well as other aspects of this field. To calculate the expectation, E{ln X}, where X is a positive random variable, the idea is simply to invoke the integral representation (1) and to commute the expectation and integration operators, thereby replacing the calculation of E{ln X} by the calculation of the moment-generating function (MGF), M_X(u) := E{e^{uX}}, for all u ≤ 0, which is often much easier to express in closed form. Moreover, in the frequently encountered situation where X is given by the sum of n independent, identically distributed (i.i.d.) random variables, the MGF of X is given by the n-th power of the MGF of a single random variable in that sum. This reduces the dimension of the integration from n (in the original expression) to the single dimension of the integration over u. Interestingly, this integral representation has also been used in the statistical physics literature (see, e.g., [23] (p. 140) and [24,25]), though not as much as the replica trick.
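As a quick numerical sanity check of this representation (an illustration of ours, not part of the original derivation), one can verify it for a standard exponential X, where M_X(-u) = 1/(1 + u) and E{ln X} is known to equal minus the Euler-Mascheroni constant:

```python
import numpy as np
from scipy.integrate import quad

# E{ln X} = ∫_0^∞ (e^{-u} - M_X(-u))/u du, with X ~ Exp(1), so M_X(-u) = 1/(1+u).
integrand = lambda u: (np.exp(-u) - 1.0 / (1.0 + u)) / u
val, _ = quad(integrand, 0.0, np.inf)

# Known closed form: E{ln X} = -γ for a standard exponential random variable.
print(val, -np.euler_gamma)  # both ≈ -0.5772156649
```

The integrand is well behaved at the origin (it behaves like -u/2 as u → 0), so a standard adaptive quadrature routine handles it directly.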
In this paper, we proceed in the same spirit as in [6] and extend the scope to propose an integral representation of a general moment of a nonnegative random variable, X, namely, the expectation E{X^ρ} for a given real ρ > 0. Obviously, when ρ is an integer, this moment is simply given by the ρ-th order derivative of the MGF of X, calculated at the origin, as is very well known. However, the integral representation we propose in this work applies to any non-integer, positive ρ, and here too, it replaces the direct calculation of E{X^ρ} by integration of an expression that involves the MGF of X. We refer to this representation as an extension of (2), as the latter can be obtained as a special case of the formula for E{X^ρ} by invoking one of the equivalent identities relating the logarithm to a limit of normalized power functions (e.g., ln x = lim_{ρ→0} (x^ρ - 1)/ρ). While the proposed integral representation of E{X^ρ} can be readily obtained from [26] (p. 363, Identity (3.434.1)) in the range ρ ∈ (0, 1), the nontrivial extension we propose for a non-integer, real ρ > 1 is, to the best of our knowledge, new. Fractional moments have been considered in the mathematical literature (see, e.g., [27-30]). A relationship between fractional and integer-order moments was established in [27] by expressing a fractional moment as an infinite series that depends on all the positive integer-order moments, followed by an algorithm for the numerical calculation of fractional moments.
As in [6], the proposed integral representation is applied to a variety of examples with an information-theoretic motivation, and it is shown how it facilitates their numerical evaluation. In particular, similarly to the case of the logarithmic function, when applied to the calculation of a moment of the sum of a large number, n, of nonnegative random variables, integration over one or two dimensions, as suggested by our proposed integral representation, is significantly easier than the alternative of integrating over n dimensions, as needed in the direct calculation of the desired moment. Furthermore, single- or double-dimensional integrals can be calculated instantly and accurately using built-in numerical integration procedures.
The outline of the remainder of this paper is as follows. In Section 2, we provide the mathematical background associated with the integral representation in general. In Section 3, we demonstrate this integral representation in applications, including: moments of guesswork, moments of estimation errors, differential Rényi entropies of generalized multivariate Cauchy distributions, and mutual information calculations for a certain model of a jammed channel. Each of these examples occupies one subsection of Section 3. The integral representations in this paper are not limited to the examples in Section 3, and such representations may prove useful in other information-theoretic problems (see, e.g., [6] and the references therein for the integral representation of the logarithmic expectation and some of its information-theoretic applications).

Statistical Moments of Arbitrary Positive Orders
It is well known that the integer-order moments of a random variable X are calculable from its MGF by using its derivatives at u = 0, i.e.,

E{X^k} = d^k M_X(u)/du^k |_{u=0},  k ∈ N.    (5)

Quite often, however, there is theoretical and practical interest in calculating fractional moments of nonnegative random variables. We next obtain a closed-form integral expression of the ρ-th moment of a nonnegative random variable X, as a functional of its MGF, for any positive real ρ. Before we proceed, it should be noted that, for ρ ∈ (0, 1), such an expression is available in standard tables of integrals, for example, in [26] (p. 363, Identity (3.434.1)). The first innovation here, however, is a nontrivial extension of this formula to all ρ > 0 as an expression that involves a one-dimensional integral. It should be noted that, although the definition of a fractional moment of a random variable (RV) is itself a one-dimensional integral (or a sum, depending on whether the RV is continuous or discrete), the utility of our formula lies, e.g., in expressing the ρ-th moment of a sum of nonnegative, independent random variables as a one-dimensional integral, instead of the n-dimensional integral obtained from the direct definition. This new formula serves as the basic building block of all the information-theoretic applications throughout this paper.

We first define the Beta and Gamma functions (see, e.g., Section 8.3 in [26] and Chapter 5 in [31]):

B(x, y) := ∫_0^1 t^{x-1} (1 - t)^{y-1} dt,  x, y > 0,    (6)
Γ(x) := ∫_0^∞ t^{x-1} e^{-t} dt,  x > 0,    (7)

where these functions are related by the equality

B(x, y) = Γ(x)Γ(y)/Γ(x + y).    (8)

Theorem 1. Let X be a nonnegative random variable, and let ρ > 0 be a non-integer real. Then, E{X^ρ} admits the one-dimensional integral representation in (9), where the quantities in (10) and (11) are defined for all j ∈ {0, 1, . . .}.

Proof. See Appendix A.
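Although the general formula in (9) is not reproduced here, its ρ ∈ (0, 1) special case (Identity (3.434.1) in [26], rewritten in terms of the MGF) reads E{X^ρ} = 1 + (ρ/Γ(1 - ρ)) ∫_0^∞ (e^{-u} - M_X(-u)) u^{-ρ-1} du, and it can be checked numerically. The sketch below (our own illustration, not code from the paper) does so for X ~ Exp(1), where E{X^ρ} = Γ(1 + ρ):

```python
import numpy as np
from scipy.special import gamma
from scipy.integrate import quad

def frac_moment(mgf_neg, rho):
    """E{X^ρ} for ρ in (0,1) via the one-dimensional integral representation.
    mgf_neg(u) must return M_X(-u) = E{exp(-uX)} for u >= 0."""
    integrand = lambda u: (np.exp(-u) - mgf_neg(u)) / u ** (rho + 1.0)
    integral, _ = quad(integrand, 0.0, np.inf)
    return 1.0 + rho / gamma(1.0 - rho) * integral

rho = 0.5
mgf_exp = lambda u: 1.0 / (1.0 + u)  # M_X(-u) for X ~ Exp(1)
print(frac_moment(mgf_exp, rho), gamma(1.0 + rho))  # both ≈ Γ(1.5) ≈ 0.8862
```

Note that only the MGF of X enters the computation, which is precisely what makes the representation attractive for sums of independent random variables.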

Remark 1.
The proof of (9) in Appendix A does not apply to ρ ∈ N (see (A7) and (A8), where the denominators vanish for ρ ∈ N). In the latter case, referring to the second term on the right-hand side of (9), we get sin(πρ) = 0, while the integral diverges (specifically, for ρ ∈ N, the integrand scales like 1/u for u sufficiently close to zero), yielding an expression of the type 0 · ∞. However, taking the limit in (9) as ρ tends to an integer and applying L'Hôpital's rule reproduces the well-known result in (5).

Corollary 2.
[6] Let X be a positive random variable. Then,

E{ln X} = ∫_0^∞ (e^{-u} - M_X(-u))/u du.    (15)

A proof of (15) was presented in Section 2 of [6], based on the integral representation of the logarithmic function in (1) and on interchanging the integration and the expectation (justified by Fubini's theorem). It can alternatively be proven by using Corollary 1, the identity ln x = lim_{ρ→0} (x^ρ - 1)/ρ, and swapping the order of the expectation and the limit by the dominated convergence theorem. Identity (15) has many useful information-theoretic applications in its own right, as demonstrated in [6], and here, we add some more. The current work is an extension and further development of [6]; its main theme is exploiting Theorem 1 and studying its information-theoretic applications, together with some additional applications of the logarithmic expectation.

Applications
In this section, we exemplify the usefulness of the integral representation of the ρ th moment in Theorem 1 and the logarithmic expectation in several problem areas in information theory and statistics. These include analyses of randomized guessing, estimation errors, Rényi entropy of n-dimensional generalized Cauchy distributions, and finally, calculations of the mutual information for channels with a certain jammer model. To demonstrate the direct computability of the relevant quantities, we also present graphs of their numerical calculations.

Moments of Guesswork
Consider the problem of guessing the realization of a random variable, which takes on values in a finite alphabet, using a sequence of yes/no questions of the form "Is X = x_1?", "Is X = x_2?", etc., until a positive response is provided by a party that observes the actual realization of X. Given a distribution of X, a commonly used performance metric for this problem is the expected number of guesses or, more generally, the ρ-th moment of the number of guesses until X is guessed successfully. When it comes to guessing random vectors, say, of length n, minimizing the moments of the number of guesses by different (deterministic or randomized) guessing strategies has several applications and motivations in information theory, such as sequential decoding, guessing passwords, etc., and it is also strongly related to lossless source coding (see, e.g., [9-13,19-22,32-34]). In this vector case, the moments of the number of guesses behave as exponential functions of the vector dimension, n, at least asymptotically, as n grows without bound. For random vectors with i.i.d. components, the best achievable asymptotic exponent of the ρ-th guessing moment is expressed in [9] using the Rényi entropy of X of order 1/(1 + ρ). Arikan assumed in [9] that the distribution of X is known and analyzed the optimal deterministic guessing strategy, which orders the guesses according to nonincreasing probabilities. Refinements of the exponential bounds of [9], with tight upper and lower bounds on the guessing moments for optimal deterministic guessing, were recently derived in [19]. In the sequel, we refer to randomized guessing strategies, rather than deterministic strategies, and we aim to derive exact, calculable expressions for their associated guessing moments (as explained later in this subsection).
Let the random variable X take on values in a finite alphabet X . Consider a random guessing strategy where the guesser sequentially submits a sequence of independently drawn random guesses according to a certain probability distribution, P(·), defined on X . Randomized guessing strategies have the advantage that they can be used by multiple asynchronous agents, which submit their guesses concurrently (see [33,34]).
In this subsection, we consider the setting of randomized guessing and obtain an exact representation of the guessing moment in the form of a one-dimensional integral. Let x ∈ X be any realization of X, and let the guessing distribution, P, be given. The random number, G, of independent guesses until success has the geometric distribution

P{G = k} = P(x) [1 - P(x)]^{k-1},  k = 1, 2, . . . ,    (17)

and so, the corresponding MGF is equal to

M_G(u) = E{e^{uG}} = P(x) e^u / (1 - [1 - P(x)] e^u),  u < ln(1/(1 - P(x))).    (18)

In view of (9)-(11) and (18), the conditional ρ-th guessing moment, for x ∈ X and non-integer ρ > 0, is given by (19), with α_0 := 1 and with the coefficients α_j for all j ∈ N given in (20). In (20), Li_{-j}(·) is a polylogarithm (see, e.g., Section 25.12 in [31]), which is given by

Li_{-j}(x) = (x d/dx)^j [x/(1 - x)],

with (x d/dx)^j denoting differentiation with respect to x and multiplication of the derivative by x, repeated j times. In particular,

Li_{-1}(x) = x/(1 - x)^2,  Li_{-2}(x) = x(1 + x)/(1 - x)^3,

and so on. The function Li_{-j}(x) is a built-in function in the MATLAB and Mathematica software tools, where it is expressed as polylog(-j, x). By Corollary 1, if ρ ∈ (0, 1), then (19) simplifies to (23). Let P_X denote the distribution of X. Averaging over X to get the unconditional ρ-th moment using (23), one obtains (24) for all ρ ∈ (0, 1), where (24) follows by the substitution z := e^{-u}. A similar integral expression is obtained, for all ρ > 0, by averaging (19) over X. In comparison, a direct calculation of the ρ-th moment gives

E{G^ρ} = Σ_{x∈X} P_X(x) Σ_{k=1}^∞ k^ρ P(x) [1 - P(x)]^{k-1}.    (25)

The double sum in (25) involves the numerical computation of an infinite series, where the number of terms required for a good approximation increases with ρ and needs to be determined. The right-hand side of (24), on the other hand, involves integration over [0, 1]. For every practical purpose, definite integrals in one or two dimensions can be calculated instantly using built-in numerical integration procedures in MATLAB, Maple, Mathematica, or any other mathematical software tool, and the computational complexity of the integral in (24) is not affected by ρ.
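For a concrete sanity check (an illustration of ours, not taken from the paper), one can compare the ρ ∈ (0, 1) integral representation against a long truncation of the direct series, for a single fixed realization x with guessing probability P(x) = p:

```python
import numpy as np
from scipy.special import gamma
from scipy.integrate import quad

p, rho = 0.3, 0.5  # guessing probability P(x) and moment order

# MGF of the geometric number of guesses at a negative argument:
# M_G(-u) = p e^{-u} / (1 - (1-p) e^{-u}).
mgf_neg = lambda u: p * np.exp(-u) / (1.0 - (1.0 - p) * np.exp(-u))

# Integral representation of E{G^ρ}, valid for ρ in (0, 1).
integrand = lambda u: (np.exp(-u) - mgf_neg(u)) / u ** (rho + 1.0)
i1, _ = quad(integrand, 0.0, 1.0)   # integrable singularity ~ u^{-ρ} near 0
i2, _ = quad(integrand, 1.0, np.inf)
moment_int = 1.0 + rho / gamma(1.0 - rho) * (i1 + i2)

# Direct (truncated) series: E{G^ρ} = Σ_k k^ρ p (1-p)^{k-1}.
k = np.arange(1, 2000)
moment_sum = np.sum(k ** rho * p * (1.0 - p) ** (k - 1))

print(moment_int, moment_sum)  # should agree to high accuracy
```

The one-dimensional integral requires no truncation analysis, whereas the series must be cut off at a point that depends on p and ρ.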
As a complement to (19) (which applies to non-integer, positive ρ), we obtain that the ρ-th moment of the number of randomized guesses, with ρ ∈ N, is given in closed form in (26), which follows from (20) and the fact that α_0 = 1; averaging over X then yields (27). To conclude, (19) and its simplification in (23) for ρ ∈ (0, 1) give calculable one-dimensional integral expressions for the ρ-th guessing moment for any ρ > 0, referring to a randomized guessing strategy whose practical advantages were explained in [33,34]. This avoids the need for numerical calculations of infinite sums. A further simplification for ρ ∈ N is provided in (26) and (27), expressed in closed form as a function of polylogarithms.
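The polylogarithm expressions behind (26) and (27) can be cross-checked directly: since Li_{-j}(x) = Σ_{k≥1} k^j x^k, the integer-order guessing moments satisfy E{G^j} = (p/(1 - p)) Li_{-j}(1 - p) for a geometric G with success probability p. The snippet below (our own check, using the two closed forms for Li_{-1} and Li_{-2} quoted above) recovers the familiar geometric moments E{G} = 1/p and E{G^2} = (2 - p)/p^2:

```python
# Closed forms of the polylogarithm at negative integer orders:
li_m1 = lambda x: x / (1.0 - x) ** 2              # Li_{-1}(x)
li_m2 = lambda x: x * (1.0 + x) / (1.0 - x) ** 3  # Li_{-2}(x)

p = 0.3
q = 1.0 - p

# E{G^j} = (p/q) * Li_{-j}(q) for a geometric G with success probability p.
m1 = (p / q) * li_m1(q)
m2 = (p / q) * li_m2(q)

print(m1, 1.0 / p)              # first moment: 1/p ≈ 3.3333
print(m2, (2.0 - p) / p ** 2)   # second moment: (2-p)/p^2 ≈ 18.8889
```

In MATLAB or Mathematica, the same check can be performed with the built-in polylog(-j, x).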

Moments of Estimation Errors
Let X_1, . . . , X_n be i.i.d. random variables with an unknown expectation θ to be estimated, and consider the simple estimator

θ̂_n := (1/n) Σ_{i=1}^n X_i.    (28)

For a given ρ > 0, we next derive an easily calculable expression for the ρ-th moment of the estimation error. Let D_n := (θ̂_n - θ)^2 and ρ' := ρ/2. By Theorem 1, if ρ is not an even integer (so that ρ' is a non-integer), then E{|θ̂_n - θ|^ρ} = E{D_n^{ρ'}} is given by the integral expression in (30), where α_0 := 1, and the coefficients α_j for all j ∈ N are as in (11). By Corollary 1 and (29), if in particular ρ ∈ (0, 2), then the right-hand side of (30) is simplified (see (32)-(34)). In view of (29)-(34), obtaining a closed-form expression for the ρ-th moment of the estimation error, for an arbitrary ρ > 0, hinges on the calculation of the right-hand side of (31) for all u ≥ 0. To this end, we invoke the identity in (35), whose right-hand side is the MGF of a zero-mean Gaussian random variable with variance 1/(2u). Together with (31), it gives (36) (see Appendix B.1), where X is a generic random variable with the same distribution as each X_i. The combination of (30)-(34) enables one to calculate the ρ-th moment, E{|θ̂_n - θ|^ρ}, exactly, for any given ρ > 0, in terms of a two-dimensional integral. Combining (33) and (36) yields, for all ρ ∈ (0, 2), the expression in (37), where the identity ∫_{-∞}^∞ (1/2) e^{-|ω|} dω = 1 is used in the derivation of the first term of the integral on the right-hand side of (37).
As an example, consider the case where {X_i}_{i=1}^n are i.i.d. Bernoulli random variables with P{X_i = 1} = θ = 1 - P{X_i = 0}, for which the characteristic function is given by E{e^{jωX_i}} = 1 - θ + θ e^{jω}. Thanks to the availability of the exact expression, we can compare the exact ρ-th moment of the estimation error, |θ̂_n - θ|, with the following closed-form upper bound (see Appendix B.2), given in (40), and thereby assess its tightness; the bound holds for all n ∈ N, ρ > 0, and θ ∈ [0, 1]. Figures 1 and 2 display plots of E{|θ̂_n - θ|} as a function of θ and of n, respectively, in comparison to the upper bound (40). The gap in Figure 1 is significant, except near the boundaries of the interval [0, 1], where both the exact value and the bound vanish. Figure 2 indicates that the exact value of E{|θ̂_n - θ|}, for large n, scales like 1/√n; this is reflected by the apparent parallelism of the curves in both graphs and by the upper bound (40).
To conclude, this subsection provides an exact, double-integral expression for the ρ-th moment of the estimation error in estimating the expectation of n i.i.d. random variables. In particular, the dimension of the integral does not increase with n, and the expression is readily calculable. We further compare our expression with an upper bound that stems from concentration inequalities. Although the scaling of the bound as a polynomial in n is correct, the gap between the exact expression and the bound is significant (see Figures 1 and 2).

Figure 1. The exact value of E{|θ̂_n - θ|} (see (37) and (39)) versus its upper bound in (40), as functions of θ ∈ [0, 1], with n = 1000.
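In the Bernoulli case, the exact values plotted in Figures 1 and 2 can also be obtained by a direct finite sum over the binomial distribution of Σ_i X_i. The sketch below (our own illustration, with an arbitrarily chosen n and θ) computes E{|θ̂_n - θ|} this way and compares it with the asymptotic normal approximation √(2θ(1 - θ)/(πn)), which exhibits the 1/√n scaling:

```python
import math

def exact_mad(n, theta):
    """E{|θ̂_n - θ|} for i.i.d. Bernoulli(θ) samples, via the binomial pmf of Σ X_i."""
    return sum(
        math.comb(n, k) * theta**k * (1.0 - theta)**(n - k) * abs(k / n - theta)
        for k in range(n + 1)
    )

n, theta = 100, 0.5
exact = exact_mad(n, theta)
approx = math.sqrt(2.0 * theta * (1.0 - theta) / (math.pi * n))  # ~ 1/√n scaling
print(exact, approx)  # the two values agree to within about 1%
```

Of course, this direct sum is available only because the Bernoulli case is so simple; the integral representation of the paper applies to general distributions through their MGFs.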

Rényi Entropy of Extended Multivariate Cauchy Distributions
Generalized Cauchy distributions, their mathematical properties, and their applications are of interest (see, e.g., [6,35-37]). The Shannon differential entropy of a family of generalized Cauchy distributions was derived in Proposition 1 of [36], and a lower bound on the differential entropy of a family of extended multivariate Cauchy distributions (cf. Equation (42) in [37]) was derived in Theorem 6 of [37]. Furthermore, an exact single-letter expression for the differential entropy of a different family of extended multivariate Cauchy distributions was recently derived in Section 3.1 of [6]. Motivated by these studies, as well as by the various information-theoretic applications of Rényi information measures, we apply Theorem 1 to obtain the Rényi (differential) entropy of an arbitrary positive order α for the extended multivariate Cauchy distributions of Section 3.1 in [6]. As we shall see in this subsection, the integral representation of the Rényi entropy for this family is two-dimensional, irrespective of the dimension n of the random vector.
Let X^n = (X_1, . . . , X_n) be a random vector whose probability density function is of the form

f(x^n) = C_n / [1 + Σ_{i=1}^n g(x_i)]^q,  x^n ∈ R^n,    (42)

for a certain function g : R → [0, ∞) and a positive constant q such that the right-hand side of (42) is normalizable (with C_n denoting the normalization constant). We refer to this kind of density (see also Section 3.1 in [6]) as a generalized multivariate Cauchy density, because the multivariate Cauchy density function is the special case pertaining to the choices g(x) = x^2 and q = (n + 1)/2. The differential Shannon entropy of the generalized multivariate Cauchy density was derived in Section 3.1 of [6] using the integral representation of the logarithm in (1), where it was presented as a two-dimensional integral.
We next extend the analysis of [6] to differential Rényi entropies of an arbitrary positive order α (recall that the differential Rényi entropy specializes to the differential Shannon entropy at α = 1 [38]). We show that, for the generalized multivariate Cauchy density, the differential Rényi entropy can be presented as a two-dimensional integral, rather than an n-dimensional integral. Defining

Z(t) := ∫_{-∞}^∞ e^{-t g(x)} dx,  t > 0,    (44)

we get from (42) (see Section 3.1 in [6]) the expression in (45). For g(x) = |x|^θ, with a fixed θ > 0, (44) implies that

Z(t) = (2/θ) Γ(1/θ) t^{-1/θ},  t > 0.    (46)

In particular, for θ = 2 and q = (n + 1)/2, we get the multivariate Cauchy density from (42). In this case, it follows from (46) that Z(t) = √(π/t) for t > 0, and (45) then yields (47). For α ∈ (0, 1) ∪ (1, ∞), the (differential) Rényi entropy of order α is given by

h_α(X^n) = (1/(1 - α)) ln ∫_{R^n} f^α(x^n) dx^n.    (48)

Using the Laplace transform relation s^{-q} = (1/Γ(q)) ∫_0^∞ t^{q-1} e^{-st} dt (s > 0), we obtain, for α > 1, the expression in (49) (see Appendix C). If α ∈ (0, 1), we distinguish between two cases, leading to (50)-(54), where β_0 := 1 and the coefficients β_j for all j ∈ N are defined in (54). The proof of the integral expressions of the Rényi entropy of order α ∈ (0, 1), as given in (50)-(54), is provided in Appendix C.
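The function Z(t) in (44) is the only ingredient of these expressions that depends on g, and for g(x) = |x|^θ its closed form in (46) is easy to validate numerically (an illustration of ours, with arbitrarily chosen test points):

```python
import numpy as np
from scipy.special import gamma
from scipy.integrate import quad

def Z_numeric(t, theta):
    """Z(t) = ∫ exp(-t |x|^θ) dx over the real line, by numerical integration."""
    val, _ = quad(lambda x: np.exp(-t * np.abs(x) ** theta), -np.inf, np.inf)
    return val

def Z_closed(t, theta):
    """Closed form (2/θ) Γ(1/θ) t^(-1/θ) from (46)."""
    return (2.0 / theta) * gamma(1.0 / theta) * t ** (-1.0 / theta)

for t, theta in [(0.5, 2.0), (2.0, 1.0), (1.5, 4.0)]:
    print(Z_numeric(t, theta), Z_closed(t, theta))
# For θ = 2, Z(t) reduces to sqrt(π/t), the Gaussian integral.
```

The θ = 2 case recovers the Gaussian integral used for the multivariate Cauchy specialization above.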
Once again, the advantage of these expressions, which may not appear simple at first glance, is that they involve only one- or two-dimensional integrals, rather than an n-dimensional integral (as could have been the case for an n-dimensional density).

Mutual Information Calculations for Communication Channels with Jamming
Consider a channel that is fed by an input vector X^n = (X_1, . . . , X_n) ∈ X^n and generates an output vector Y^n = (Y_1, . . . , Y_n) ∈ Y^n, where X and Y are finite, countably infinite, or continuous alphabets, and X^n and Y^n are their n-th order Cartesian powers. Let the conditional probability distribution of the channel be given by

p_{Y^n|X^n}(y^n|x^n) = (1/n) Σ_{i=1}^n r_{Y|X}(y_i|x_i) Π_{j≠i} q_{Y|X}(y_j|x_j),    (55)

where r_{Y|X}(·|·) and q_{Y|X}(·|·) are given conditional probability distributions of Y given X, x^n = (x_1, . . . , x_n) ∈ X^n, and y^n = (y_1, . . . , y_n) ∈ Y^n. This channel model refers to a discrete memoryless channel (DMC), nominally given by

q_{Y^n|X^n}(y^n|x^n) = Π_{i=1}^n q_{Y|X}(y_i|x_i),    (56)

in which one of the transmitted symbols is jammed at a uniformly distributed random time, i, so that the transition distribution of the jammed symbol is r_{Y|X}(y_i|x_i) instead of q_{Y|X}(y_i|x_i). The restriction to a single jammed symbol is made merely for the sake of simplicity, and it can easily be relaxed. We wish to evaluate how the jamming affects the mutual information I(X^n; Y^n). Intuitively, jamming should decrease the mutual information, but such a decrease is not imposed by the mathematical model, since the relation between r and q has not been specified. Let the input distribution be given by the product form

p_{X^n}(x^n) = Π_{i=1}^n p_X(x_i).    (57)

The mutual information (in nats) is given by

I(X^n; Y^n) = h(Y^n) - h(Y^n|X^n).    (58)

For simplicity of notation, we henceforth omit the domains of integration whenever they are clear from the context. We have

∫∫ p_{X^n,Y^n}(x^n, y^n) ln p_{Y^n|X^n}(y^n|x^n) dx^n dy^n
= ∫∫ p_{X^n,Y^n}(x^n, y^n) ln [p_{Y^n|X^n}(y^n|x^n) / q_{Y^n|X^n}(y^n|x^n)] dx^n dy^n
+ ∫∫ p_{X^n,Y^n}(x^n, y^n) ln q_{Y^n|X^n}(y^n|x^n) dx^n dy^n.
(60)

By using the logarithmic expectation in (15) and the following equality (see (55) and (56)):

p_{Y^n|X^n}(y^n|x^n) / q_{Y^n|X^n}(y^n|x^n) = (1/n) Σ_{i=1}^n r_{Y|X}(y_i|x_i) / q_{Y|X}(y_i|x_i),    (61)

we obtain the expression in (62) (see Appendix D.1), where the auxiliary function in (63) is defined for all u ≥ 0. Moreover, owing to the product form of q_{Y^n|X^n}, it is shown in Appendix D.2 that the integral ∫∫ p_{X^n,Y^n}(x^n, y^n) ln q_{Y^n|X^n}(y^n|x^n) dx^n dy^n is given by the expression in (65). Combining (60), (62), and (65), we express h(Y^n|X^n) in (66) as a double integral over X × Y, independently of n (rather than an integration over X^n × Y^n). We next calculate the differential channel output entropy, h(Y^n), induced by p_{Y^n|X^n}(·|·). From Appendix D.3, p_{Y^n} admits the form in (67), where, for all y ∈ Y, v(y) and w(y) are the output densities induced by q_{Y|X} and r_{Y|X}, respectively, as defined in (68) and (69). By (1), the following identity holds for every positive random variable Z (see Appendix D.3):

E{ln Z} = ∫_0^∞ (e^{-u} - M_Z(-u))/u du,    (70)

where M_Z(u) := E{e^{uZ}}. By setting Z as in Appendix D.3, we obtain the expression for h(Y^n) in (71), with the quantities defined in (72)-(74). Combining (58), (66), and (71), we obtain the mutual information for the channel with jamming, as given in (75).

We next exemplify our results in the case where q is a binary symmetric channel (BSC) with crossover probability δ ∈ (0, 1/2) and r is a BSC with a larger crossover probability, ε ∈ (δ, 1/2]. We assume that the input bits are i.i.d. and equiprobable. The specialization of our analysis to this setup is provided in Appendix D.4, showing that the mutual information of the jammed channel, fed by the binary symmetric source, is given by (75), where h_b(·) ∈ [0, ln 2] is the binary entropy function (in nats), with the convention that 0 ln 0 = 0, and d(·‖·) denotes the binary relative entropy. By the data processing inequality, the mutual information in (75) is smaller than that of the jamming-free BSC with crossover probability δ, namely, I_q(X^n; Y^n) = n(ln 2 - h_b(δ)). Figure 3 refers to the case where δ = 10^{-3} and n = 128. Here, I_q(X^n; Y^n) = 87.71 nats, and I_p(X^n; Y^n) is decreased by 2.88 nats due to the jammer (see Figure 3).

Figure 3. The degradation in mutual information for n = 128. The jammer-free channel q is a binary symmetric channel (BSC) with crossover probability δ = 10^{-3}, and r is a BSC with crossover probability ε ∈ (δ, 1/2]. The input bits are i.i.d. and equiprobable.
The degradation in I(X^n; Y^n) (in nats) is displayed as a function of ε. Figure 4 refers to the case where δ = 10^{-3} and ε = 1/2 (referring to complete jamming of a single symbol, chosen uniformly at random), and it shows the difference in the mutual information I(X^n; Y^n), as a function of the block length n, between the jamming-free BSC with crossover probability δ and the channel with jamming.
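The jammer-free baseline behind Figures 3 and 4 is easy to reproduce: for a memoryless BSC with crossover probability δ and i.i.d. equiprobable inputs, the mutual information is n(ln 2 - h_b(δ)) nats. The sketch below (our own check) recovers the 87.71 nats quoted above for δ = 10^{-3} and n = 128:

```python
import math

def h_b(p):
    """Binary entropy function in nats, with the convention 0 ln 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

delta, n = 1e-3, 128
I_q = n * (math.log(2.0) - h_b(delta))  # jammer-free mutual information, in nats
print(round(I_q, 2))  # → 87.71
```

The degradation caused by the jammer, in contrast, requires the integral expressions in (66) and (71), since the jammed channel (55) is not memoryless.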
To conclude, this subsection studies the change in the mutual information I(X^n; Y^n) due to jamming, relative to the mutual information associated with the nominal channel without jamming. Thanks to the integral representations provided in our analysis, the calculation of the mutual information ultimately relies on one-dimensional integrals, as opposed to the original n-dimensional integrals pertaining to the expressions that define the associated differential entropies.

Figure 4. The degradation in mutual information as a function of n. The jammer-free channel q_{Y|X} is a BSC with crossover probability δ = 10^{-3}, and r_{Y|X} for the jammed symbol is a BSC with crossover probability ε = 1/2. The input bits are i.i.d. and equiprobable.
Author Contributions: Both authors contributed equally to this research work, as well as to the write-up of this article. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix B.2. Derivation of the Upper Bound in (40)
We next use the Chernoff bound to upper bound P{θ̂_n - θ ≥ ε} for all ε > 0 and θ ∈ (0, 1/2] (see (A30)-(A32)). Combining (A30) and (A32) yields (A33). Similarly, it is easy to show that the same Chernoff bound also applies to P{θ̂_n - θ ≤ -ε}, which overall gives (A35). Inequality (A35) is a refined version of Hoeffding's inequality (see Section 2.4.4 in [40]), derived here for the Bernoulli distribution (see (A30)) by invoking the Chernoff bound; moreover, (A35) coincides with Hoeffding's inequality in the special case θ = 1/2 (which, from (A33), yields C(θ) = 1/8). In view of the fact that (A35) is a specialization of Theorem 2.4.7 in [40], it follows that the Bernoulli case is the worst one (in the sense of leading to the loosest upper bound) among all probability distributions whose support is the interval [0, 1] and whose expected value is θ ∈ [0, 1]. However, in the Bernoulli case, a simple symmetry argument can be applied to improve the bound (A35), as follows.
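As a numerical illustration of ours (using the standard Hoeffding form exp(-2nε²) rather than the refined bound (A35)), the bound can be compared with the exact binomial tail in the Bernoulli case, confirming that it indeed dominates the exact probability:

```python
import math

def binom_tail(n, theta, eps):
    """Exact P{θ̂_n - θ >= ε} for θ̂_n = (1/n) Σ X_i with X_i ~ Bernoulli(θ)."""
    k_min = math.ceil(n * (theta + eps))
    return sum(
        math.comb(n, k) * theta**k * (1.0 - theta)**(n - k)
        for k in range(k_min, n + 1)
    )

n, theta, eps = 200, 0.5, 0.1
exact = binom_tail(n, theta, eps)
hoeffding = math.exp(-2.0 * n * eps**2)
print(exact, hoeffding)  # the exact tail lies below the Hoeffding bound
```

The gap between the exact tail and the exponential bound is consistent with the gap between the exact moments and the bound (40) observed in Figures 1 and 2.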

Appendix D.3. Proof of Equations (67)-(73)
p_{Y^n}(y^n) = ∫ p_{Y^n|X^n}(y^n|x^n) p_{X^n}(x^n) dx^n = (1/n) Σ_{i=1}^n w(y_i) Π_{j≠i} v(y_j),    (A64)

where v(·) and w(·) are probability densities on Y, as defined in (68) and (69), respectively. This proves (67). We next prove (70), which is used to calculate the entropy of Y^n with the density p_{Y^n}(·) in (A64). In view of the integral representation of the logarithmic function in (1), and by interchanging the order of the integrations (justified by Fubini's theorem), we get, for a positive random variable Z,

E{ln Z} = ∫_0^∞ (e^{-u} - M_Z(-u))/u du,

which proves (70). Finally, we prove (71). In view of (A64),

h(Y^n) = -∫ p_{Y^n}(y^n) ln p_{Y^n}(y^n) dy^n
= -(1/n) Σ_{i=1}^n ∫ Π_{ℓ≠i} v(y_ℓ) · w(y_i) · [Σ_{j=1}^n ln v(y_j) + ln((1/n) Σ_{k=1}^n w(y_k)/v(y_k))] dy^n.    (A66)

A calculation of the first integral on the right-hand side of (A66) gives (A67). For i = j, the inner integral on the right-hand side of (A67) satisfies

∫ Π_{ℓ≠i} v(y_ℓ) · w(y_i) ln v(y_i) dy^n = Π_{ℓ≠i} [∫ v(y_ℓ) dy_ℓ] · ∫ w(y_i) ln v(y_i) dy_i = ∫ w(y) ln v(y) dy,

and for i ≠ j,

∫ Π_{ℓ≠i} v(y_ℓ) · w(y_i) ln v(y_j) dy^n = Π_{ℓ≠i,j} [∫ v(y_ℓ) dy_ℓ] · ∫ w(y_i) dy_i · ∫ v(y_j) ln v(y_j) dy_j = ∫ v(y) ln v(y) dy.
Therefore, combining (A67)-(A69) gives

(1/n) Σ_{i=1}^n ∫ Π_{ℓ≠i} v(y_ℓ) · w(y_i) · Σ_{j=1}^n ln v(y_j) dy^n = ∫ w(y) ln v(y) dy + (n - 1) ∫ v(y) ln v(y) dy.    (A70)

Finally, we calculate the second integral on the right-hand side of (A66). Let μ_n be the probability density function defined as

μ_n(y^n) := Π_{j=1}^n v(y_j),  y^n ∈ Y^n,    (A71)

and let Z be as defined in (A72), where {V_i}_{i=1}^n are i.i.d. Y-valued random variables with probability density function v. Then, in view of (70), the second integral on the right-hand side of (A66) can be expressed via the MGF of Z (see (A73) and (A74)). The MGF of Z is given in (A75), where

K(u) := ∫_Y v(y) exp(u w(y)/v(y)) dy,  ∀ u ∈ R,

and, consequently, (A75) yields

M_Z(u) = K^{n-1}(u/n) K′(u/n) = [∫ v(y) exp(u w(y)/(n v(y))) dy]^{n-1} · ∫ w(y) exp(u w(y)/(n v(y))) dy,

and M_Z(0) = 1.