An Integral Representation of the Logarithmic Function with Applications in Information Theory

We explore a well-known integral representation of the logarithmic function, and demonstrate its usefulness in obtaining compact, easily computable exact formulas for quantities that involve expectations and higher moments of the logarithm of a positive random variable (or the logarithm of a sum of i.i.d. positive random variables). The integral representation of the logarithm is proved useful in a variety of information-theoretic applications, including universal lossless data compression, entropy and differential entropy evaluations, and the calculation of the ergodic capacity of the single-input, multiple-output (SIMO) Gaussian channel with random parameters (known to both transmitter and receiver). This integral representation and its variants are anticipated to serve as a useful tool in additional applications, as a rigorous alternative to the popular (but non-rigorous) replica method (at least in some situations).


Introduction
In analytic derivations pertaining to many problem areas in information theory, one frequently encounters the need to calculate expectations and higher moments of expressions that involve the logarithm of a positive random variable or, more generally, the logarithm of a sum of several i.i.d. positive random variables. The common practice in such situations is either to resort to upper and lower bounds on the desired expression (e.g., using Jensen's inequality or other well-known inequalities), or to apply the Taylor series expansion of the logarithmic function. A more modern approach is to use the replica method (see, e.g., [1] (Chapter 8)), a popular (but non-rigorous) tool that has been borrowed from the field of statistical physics with considerable success.
The purpose of this work is to point out an alternative approach and to demonstrate its usefulness in some frequently encountered situations. In particular, we consider the following integral representation of the logarithmic function (to be proved in the sequel):

ln x = ∫₀^∞ (e^{−u} − e^{−ux}) du/u,  x > 0. (1)

The immediate use of this representation is in situations where the argument of the logarithmic function is a positive random variable, X, and we wish to calculate the expectation, E{ln X}. By commuting the expectation operator with the integration over u (assuming that this commutation is valid), the calculation of E{ln X} is replaced by the (often easier) calculation of the moment-generating function (MGF) of X, as

E{ln X} = ∫₀^∞ (e^{−u} − E{e^{−uX}}) du/u. (2)

Moreover, if X₁, …, Xₙ are positive i.i.d. random variables, then

E{ln(X₁ + … + Xₙ)} = ∫₀^∞ (e^{−u} − [E{e^{−uX₁}}]ⁿ) du/u. (3)

This simple idea is not quite new. It has been used in the physics literature; see, e.g., [1] (Exercise 7.6, p. 140), [2] (Equation (2.4) and onward), and [3] (Equation (12) and onward). With the exception of [4], however, we are not aware of any work in the information theory literature where it has been used. The purpose of this paper is to demonstrate additional information-theoretic applications, as the need to evaluate logarithmic expectations is not rare at all in many problem areas of information theory. Moreover, the integral representation (1) is useful also for evaluating higher moments of ln X, most notably the second moment or variance, in order to assess the statistical fluctuations around the mean. We demonstrate the usefulness of this approach in several application areas, including entropy and differential entropy evaluations, performance analysis of universal lossless source codes, and calculations of the ergodic capacity of the Rayleigh single-input multiple-output (SIMO) channel. In some of these examples, we also demonstrate the calculation of variances associated with the relevant random variables of interest.
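As a quick sanity check of the representation, it can be evaluated numerically and compared against the logarithm directly. The following sketch (assuming SciPy is available; the helper name log_via_integral is ours) does exactly that:

```python
import math
from scipy.integrate import quad

def log_via_integral(x: float) -> float:
    # ln x = \int_0^\infty (e^{-u} - e^{-ux}) du / u,  x > 0.
    # The integrand tends to the finite limit (x - 1) as u -> 0+.
    integrand = lambda u: (math.exp(-u) - math.exp(-u * x)) / u
    return quad(integrand, 0.0, math.inf)[0]

for x in (0.5, 1.0, 2.0, 10.0):
    print(x, log_via_integral(x), math.log(x))
```

The adaptive quadrature handles both the (removable) singularity at u = 0 and the infinite upper limit without any special treatment.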
As a side remark, in the same spirit of introducing integral representations and applying them, Simon and Divsalar [5,6] brought to the attention of communication theorists useful, definite-integral forms of the Q-function (Craig's formula [7]) and Marcum Q-function, and demonstrated their utility in applications.
It should be pointed out that most of our results remain in the form of a single or double definite integral of certain functions that depend on the parameters of the problem in question. Strictly speaking, such a definite integral may not be considered a closed-form expression but, nevertheless, we can say the following.
(a) In most of our examples, the expression we obtain is more compact, more elegant, and often more insightful than the original quantity. (b) The resulting definite integral can actually be considered a closed-form expression "for every practical purpose" since definite integrals in one or two dimensions can be calculated instantly using built-in numerical integration operations in MATLAB, Maple, Mathematica, or other mathematical software tools. This is largely similar to the case of expressions that include standard functions (e.g., trigonometric, logarithmic, exponential functions, etc.), which are commonly considered to be closed-form expressions. (c) The integrals can also be evaluated by power series expansions of the integrand, followed by term-by-term integration. (d) Owing to Item (c), the asymptotic behavior in the parameters of the model can be evaluated. (e) At least in two of our examples, we show how to pass from an n-dimensional integral (with an arbitrarily large n) to one-or two-dimensional integrals. This passage is in the spirit of the transition from a multiletter expression to a single-letter expression.
To give some preliminary flavor of our message in this work, we conclude this introduction by mentioning a possible use of the integral representation in the context of calculating the entropy of a Poissonian random variable. For a Poissonian random variable, N, with parameter λ, the entropy (in nats) is given by

H(N) = λ − λ ln λ + E{ln N!},

where the nontrivial part of the calculation is associated with the last term, E{ln N!}. In [8], this term was handled by using a nontrivial formula due to Malmstén (see [9] (pp. 20–21)), which represents the logarithm of Euler's Gamma function in an integral form (see also [10]). In Section 2, we derive the relevant quantity using (1), in a simpler and more transparent form, similar to [11] ((2.3)–(2.4)).
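To preview the calculation: E{ln N!} can be written, via the integral representation and the geometric-series summation detailed in Section 2, as a one-dimensional integral involving only E{N} = λ and the Poisson MGF E{e^{−uN}} = exp{λ(e^{−u} − 1)}. An illustrative numerical sketch (SciPy assumed; function names are ours):

```python
import math
from scipy.integrate import quad

def poisson_entropy_nats(lam: float) -> float:
    # H(N) = lam - lam*ln(lam) + E{ln N!}, with E{ln N!} evaluated as a
    # one-dimensional integral involving only E{N} = lam and the MGF
    # E{e^{-uN}} = exp(lam*(e^{-u} - 1)).
    def integrand(u):
        mgf = math.exp(lam * (math.exp(-u) - 1.0))
        return (lam * math.exp(-u)
                - math.exp(-u) * (1.0 - mgf) / (1.0 - math.exp(-u))) / u
    e_log_fact = quad(integrand, 0.0, math.inf, limit=200)[0]
    return lam - lam * math.log(lam) + e_log_fact

def poisson_entropy_direct(lam: float, kmax: int = 300) -> float:
    # Truncated direct evaluation of -sum_n p(n) ln p(n), for comparison.
    h = 0.0
    for n in range(kmax):
        logp = -lam + n * math.log(lam) - math.lgamma(n + 1)
        h -= math.exp(logp) * logp
    return h

print(poisson_entropy_nats(3.0), poisson_entropy_direct(3.0))
```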
The outline of the remaining part of this paper is as follows. In Section 2, we provide some basic mathematical background concerning the integral representation (2) and some of its variants. In Section 3, we present the application examples. Finally, in Section 4, we summarize and provide some outlook.

Mathematical Background
In this section, we present the main mathematical background associated with the integral representation (1), and provide several variants of this relation, most of which are used later in this paper. For reasons that will become apparent shortly, we extend the scope to the complex plane.

Proposition 1. Let z ∈ ℂ have Re(z) > 0. Then,

ln z = ∫₀^∞ (e^{−u} − e^{−zu}) du/u, (5)

where ln z denotes the principal branch of the complex logarithm.

Proof. For real z = x > 0, (5) coincides with (1), which is Frullani's integral for the exponential function. Both sides of (5) are analytic in z on the open right-half plane; since they coincide on the positive real axis, they coincide on the entire open right-half plane.

Remark 1.
In [12] (p. 363, Identity (3.434.2)), it is stated that

∫₀^∞ (e^{−µu} − e^{−νu}) du/u = ln(ν/µ),  Re(µ) > 0, Re(ν) > 0.

Proposition 1 also applies to any purely imaginary number, z, which is of interest too (see Corollary 1 in the sequel, and the identity with the characteristic function in (14)).
Proposition 1 paves the way to obtaining some additional related integral representations of the logarithmic function for the reals.
Corollary 1. For any x ∈ ℝ \ {0},

ln |x| = ∫₀^∞ (cos u − cos(ux)) du/u. (12)

Proof. By Proposition 1 and the identity ln x ≡ Re{ln(ix)} for x > 0 (with i := √−1), we get

ln x = ∫₀^∞ (e^{−u} − cos(ux)) du/u,  x > 0. (13)

Subtracting from both sides the integral in (13) at x = 1 (which, by (13) itself, equals zero) gives (12) for x > 0; since the right-hand side of (12) depends on x only through |x|, the result extends to all x ≠ 0.
Let X be a real-valued random variable, and let Φ_X(ν) := E{e^{iνX}} be the characteristic function of X. Then, by Corollary 1,

E{ln |X|} = ∫₀^∞ (cos u − Re{Φ_X(u)}) du/u, (14)

where we are assuming, here and throughout the sequel, that the expectation operation and the integration over u are commutable, i.e., Fubini's theorem applies. Similarly, by returning to Proposition 1 (confined to a real-valued argument of the logarithm), the calculation of E{ln X} can be replaced by the calculation of the MGF of X, as

E{ln X} = ∫₀^∞ (e^{−u} − M_X(−u)) du/u,  M_X(−u) := E{e^{−uX}}. (15)

In particular, if X₁, …, Xₙ are positive i.i.d. random variables, then

E{ln(X₁ + … + Xₙ)} = ∫₀^∞ (e^{−u} − M_X^n(−u)) du/u. (16)

Remark 2. One may further manipulate (15) and (16) as follows. As ln x ≡ (1/s) ln(x^s) for any s ≠ 0 and x > 0, the expectation of ln X can also be represented as

E{ln X} = (1/s) ∫₀^∞ (e^{−u} − E{e^{−uX^s}}) du/u. (17)

The idea is that if, for some s ∉ {0, 1}, E{e^{−uX^s}} can be expressed in closed form, whereas it cannot for s = 1 (or even E{e^{−uX^s}} < ∞ for some s ∉ {0, 1}, but not for s = 1), then (17) may prove useful. Moreover, if X₁, …, Xₙ are positive i.i.d. random variables, s > 0, and Y := (X₁^s + … + Xₙ^s)^{1/s}, then

E{ln Y} = (1/s) ∫₀^∞ (e^{−u} − [E{e^{−uX^s}}]ⁿ) du/u, (18)

and, from (18) with s = 2, one obtains a single-integral expression for E{ln √(X₁² + … + Xₙ²)}. Note that, according to the pdf of a chi-squared distribution, one can express E{ln Y} as a one-dimensional integral even without using (18). However, for general s > 0, the direct calculation of E{ln(∑_{i=1}^n |X_i|^s)} leads to an n-dimensional integral, whereas (18) provides a one-dimensional integral whose integrand involves, in turn, the calculation of another one-dimensional integral.
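For a concrete instance of (15): if X is exponential with mean 1, then M_X(−u) = 1/(1 + u), and E{ln X} equals −γ, where γ ≈ 0.5772 is the Euler–Mascheroni constant. A short numerical check (SciPy assumed):

```python
import math
from scipy.integrate import quad

# E{ln X} for X ~ Exp(1) via (15): M_X(-u) = 1/(1+u).
# The integrand behaves like -u/2 near u = 0, so it is well defined there.
val = quad(lambda u: (math.exp(-u) - 1.0 / (1.0 + u)) / u,
           0.0, math.inf)[0]
print(val)  # should be close to -gamma = -0.57721...
```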
Identity (1) also proves useful when one is interested not only in the expected value of ln X, but also in its higher moments, in particular, the second moment or variance. In this case, the one-dimensional integral becomes a two-dimensional one. Specifically, for any s > 0,

E{ln² X} = (1/s²) ∫₀^∞ ∫₀^∞ [e^{−(u+v)} − e^{−u} E{e^{−vX^s}} − e^{−v} E{e^{−uX^s}} + E{e^{−(u+v)X^s}}] du dv/(uv).

More generally, for a pair of positive random variables, (X, Y), and for s > 0,

E{ln X ln Y} = (1/s²) ∫₀^∞ ∫₀^∞ [e^{−(u+v)} − e^{−u} E{e^{−vY^s}} − e^{−v} E{e^{−uX^s}} + E{e^{−uX^s − vY^s}}] du dv/(uv).

For later use, we present the following variation of the basic identity.

Proposition 2. Let X be a random variable, and let

M_X(ν) := E{e^{νX}}

be the MGF of X. If X is non-negative, then

E{ln(1 + X)} = ∫₀^∞ (e^{−u}/u) [1 − M_X(−u)] du, (27)

and

Var{ln(1 + X)} = ∫₀^∞ ∫₀^∞ (e^{−(u+v)}/(uv)) [M_X(−u − v) − M_X(−u) M_X(−v)] du dv. (28)

Proof. Equation (27) is a trivial consequence of (15), applied to 1 + X. As for (28), squaring the representation ln(1 + X) = ∫₀^∞ (e^{−u}/u)(1 − e^{−uX}) du, taking expectations, and subtracting the square of (27) gives the result.

The following result relies on the validity of (5) on the right-half complex plane, and its derivation is based on the identity ln(1 + x²) ≡ ln(1 + ix) + ln(1 − ix) for all x ∈ ℝ. In general, it may be used if the characteristic function of a random variable X has a closed-form expression, whereas the MGF of X² does not admit a closed-form expression (see Proposition 2). We introduce the result, although it is not directly used in this paper.

Proposition 3. Let X be a real-valued random variable, and let

Φ_X(ν) := E{e^{iνX}}

be the characteristic function of X. Then,

E{ln(1 + X²)} = 2 ∫₀^∞ (e^{−u}/u) [1 − Re{Φ_X(u)}] du,

and

Var{ln(1 + X²)} = 4 ∫₀^∞ ∫₀^∞ (e^{−(u+v)}/(uv)) [½ Re{Φ_X(u + v)} + ½ Re{Φ_X(u − v)} − Re{Φ_X(u)} Re{Φ_X(v)}] du dv.

As a final note, we point out that the integral representation (2) replaces the expectation of the logarithm of X by the expectation of an exponential function of X; this has an additional interesting consequence: an expression like ln(n!) becomes the integral of the sum of a geometric series, which, in turn, is easy to express in closed form (see [11]):

ln n! = ∑_{k=1}^n ln k = ∫₀^∞ (1/u) [n e^{−u} − e^{−u}(1 − e^{−nu})/(1 − e^{−u})] du. (36)

Thus, for a positive integer-valued random variable, N,

E{ln N!} = ∫₀^∞ (1/u) [E{N} e^{−u} − e^{−u}(1 − E{e^{−uN}})/(1 − e^{−u})] du,

so the calculation of E{ln N!} requires merely the calculation of E{N} and the MGF, E{e^{−uN}}. For example, if N is a Poissonian random variable, as discussed near the end of the Introduction, both E{N} and E{e^{−uN}} are easy to evaluate. This approach is a simple, direct alternative to the one taken in [8] (see also [10]), where Malmstén's nontrivial formula for ln Γ(z) (see [9] (pp. 20–21)) was invoked. (Malmstén's formula for ln Γ(z) applies to a general, complex-valued z with Re(z) > 0; in the present context, however, only positive integer values of z are needed, and this allows the simplification shown in (36).) The above-described idea of the geometric series will also be used in one of our application examples, in Section 3.4.
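The geometric-series identity (36) is straightforward to verify numerically against ln Γ(n + 1) (SciPy assumed for the quadrature; the helper name is ours):

```python
import math
from scipy.integrate import quad

def log_factorial(n: int) -> float:
    # ln n! via (36): the sum of ln k over k = 1..n collapses, inside the
    # integral, to a geometric series with ratio e^{-u}.
    def integrand(u):
        return (n * math.exp(-u)
                - math.exp(-u) * (1.0 - math.exp(-n * u))
                  / (1.0 - math.exp(-u))) / u
    return quad(integrand, 0.0, math.inf, limit=200)[0]

for n in (1, 2, 5, 50):
    print(n, log_factorial(n), math.lgamma(n + 1))
```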

Applications
In this section, we show the usefulness of the integral representation of the logarithmic function in several problem areas in information theory. To demonstrate the direct computability of the relevant quantities, we also present graphs of their numerical calculation. In some of the examples, we also demonstrate calculations of the second moments and variances.

Differential Entropy for Generalized Multivariate Cauchy Densities
Let (X₁, …, Xₙ) be a random vector whose probability density function is of the form

f(x₁, …, xₙ) = Cₙ / [1 + ∑_{i=1}^n g(xᵢ)]^q,  (x₁, …, xₙ) ∈ ℝⁿ, (37)

for a certain non-negative function g and positive constant q such that

Cₙ⁻¹ := ∫_{ℝⁿ} [1 + ∑_{i=1}^n g(xᵢ)]^{−q} dx₁ … dxₙ < ∞. (38)

We refer to this kind of density as a generalized multivariate Cauchy density, because the multivariate Cauchy density is obtained as the special case where g(x) = x² and q = (n + 1)/2. Using the Laplace transform relation

1/s^q = (1/Γ(q)) ∫₀^∞ t^{q−1} e^{−st} dt,  s > 0, (39)

f can be represented as a mixture of product measures:

f(x₁, …, xₙ) = (Cₙ/Γ(q)) ∫₀^∞ t^{q−1} e^{−t} ∏_{i=1}^n e^{−t g(xᵢ)} dt. (40)

Defining

Z(t) := ∫_{−∞}^∞ e^{−t g(x)} dx, (41)

we get from (40)

1 = (Cₙ/Γ(q)) ∫₀^∞ t^{q−1} e^{−t} Zⁿ(t) dt, (42)

and so,

Cₙ = Γ(q) / ∫₀^∞ t^{q−1} e^{−t} Zⁿ(t) dt. (43)

The calculation of the differential entropy of f is associated with the evaluation of the expectation (see (27))

E{ln[1 + ∑_{i=1}^n g(Xᵢ)]} = ∫₀^∞ (e^{−u}/u) [1 − E{e^{−u ∑ᵢ g(Xᵢ)}}] du. (44)

From (40), and by interchanging the order of integration,

E{e^{−u ∑ᵢ g(Xᵢ)}} = (Cₙ/Γ(q)) ∫₀^∞ t^{q−1} e^{−t} Zⁿ(t + u) dt. (45)

In view of (37), (44), and (45), the differential entropy of (X₁, …, Xₙ) is therefore given by

h(X₁, …, Xₙ) = −ln Cₙ + q ∫₀^∞ (e^{−u}/u) [1 − (Cₙ/Γ(q)) ∫₀^∞ t^{q−1} e^{−t} Zⁿ(t + u) dt] du. (46)

For g(x) = |x|^θ, with an arbitrary θ > 0, we obtain from (41) that

Z(t) = (2/θ) Γ(1/θ) t^{−1/θ},  t > 0. (47)

In particular, for θ = 2 and q = (n + 1)/2, we get the multivariate Cauchy density from (37). In this case, as Γ(1/2) = √π, it follows from (47) that Z(t) = √(π/t) for t > 0, and from (43),

Cₙ = Γ((n + 1)/2) / π^{(n+1)/2}. (48)

Combining (46), (47), and (48) gives

h(X₁, …, Xₙ) = ((n + 1)/2) ln π − ln Γ((n + 1)/2) + ((n + 1)/2) ∫₀^∞ (e^{−u}/u) [1 − (1/√π) ∫₀^∞ t^{(n−1)/2} e^{−t} (t + u)^{−n/2} dt] du. (49)

Figure 1 displays the normalized differential entropy, (1/n) h(X₁, …, Xₙ), for 1 ≤ n ≤ 100. We believe the interesting point conveyed in this application example is that (46) provides a kind of "single-letter expression": the n-dimensional integral associated with the original expression of the differential entropy h(X₁, …, Xₙ) is replaced by the two-dimensional integral in (46), independently of n.
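As a check of the differential entropy expression in the simplest case n = 1 (the standard univariate Cauchy density, g(x) = x², q = 1), the two-dimensional integral must reproduce the known value h = ln(4π) nats. An illustrative sketch (SciPy assumed; nested one-dimensional quadratures):

```python
import math
from scipy.integrate import quad

# n = 1: h = ln(pi) + \int_0^inf (e^{-u}/u) (1 - m(u)) du, where
# m(u) = pi^{-1/2} \int_0^inf e^{-t} (t+u)^{-1/2} dt = E{e^{-u X^2}}.
def m(u):
    inner = quad(lambda t: math.exp(-t) / math.sqrt(t + u),
                 0.0, math.inf)[0]
    return inner / math.sqrt(math.pi)

# Split the outer range at u = 1: the integrand has an integrable
# ~u^{-1/2} singularity at the origin (X^2 has no finite mean here).
outer = (quad(lambda u: math.exp(-u) / u * (1.0 - m(u)), 0.0, 1.0)[0]
         + quad(lambda u: math.exp(-u) / u * (1.0 - m(u)), 1.0, math.inf)[0])
h = math.log(math.pi) + outer
print(h, math.log(4.0 * math.pi))
```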
As a final note, we mention that a lower bound on the differential entropy of a different form of extended multivariate Cauchy distributions (cf. [13] (Equation (42))) was derived in [13] (Theorem 6). The latter result relies on obtaining lower bounds on the differential entropy of random vectors whose densities are symmetric log-concave or γ-concave (i.e., densities f for which f^γ is concave for some γ).

Ergodic Capacity of the Fading SIMO Channel
Consider the SIMO channel with L receive antennas, and assume that the channel transfer coefficients, {h_ℓ}_{ℓ=1}^L, are independent, zero-mean, circularly symmetric complex Gaussian random variables with variances {σ_ℓ²}_{ℓ=1}^L. Its ergodic capacity (in nats per channel use) is given by

C = E{ln(1 + ρ ∑_{ℓ=1}^L |h_ℓ|²)} = E{ln(1 + ρ ∑_{ℓ=1}^L (f_ℓ² + g_ℓ²))}, (50)

where f_ℓ := Re{h_ℓ}, g_ℓ := Im{h_ℓ}, and ρ := P/N₀ is the signal-to-noise ratio (SNR) (see, e.g., [14,15]). Paper [14] is devoted, among other things, to the exact evaluation of (50) by finding the density of the random variable ∑_{ℓ=1}^L (f_ℓ² + g_ℓ²), and then taking the expectation w.r.t. that density. Here, we show that the integral representation in (5) suggests a more direct approach to the evaluation of (50). It should also be pointed out that this approach is more flexible than the one in [14], as the latter strongly depends on the assumption that the {h_ℓ} are Gaussian and statistically independent. The integral representation approach also allows other distributions of the channel transfer gains, as well as possible correlations between the coefficients and/or the channel inputs. Moreover, we are also able to calculate the variance of ln(1 + ρ ∑_{ℓ=1}^L |h_ℓ|²), as a measure of the fluctuations around the mean, which is obviously related to the outage.
Specifically, in view of Proposition 2 (see (27)), let

X := ρ ∑_{ℓ=1}^L |h_ℓ|². (51)

For all u > 0,

M_X(−u) = E{e^{−uX}} = ∏_{ℓ=1}^L 1/(1 + ρσ_ℓ²u), (52)

where (52) holds since {|h_ℓ|²} are independent exponential random variables with means {σ_ℓ²}. From (27), (50), and (52), the ergodic capacity (in nats per channel use) is given by

C = ∫₀^∞ (e^{−u}/u) [1 − ∏_{ℓ=1}^L 1/(1 + ρσ_ℓ²u)] du. (54)

A similar approach appears in [4] (Equation (12)). As for the variance, from Proposition 2 (see (28)) and (52),

Var{ln(1 + X)} = ∫₀^∞ ∫₀^∞ (e^{−(u+v)}/(uv)) [∏_{ℓ=1}^L 1/(1 + ρσ_ℓ²(u + v)) − ∏_{ℓ=1}^L 1/((1 + ρσ_ℓ²u)(1 + ρσ_ℓ²v))] du dv. (55)

A similar analysis holds for the multiple-input single-output (MISO) channel. By a partial-fraction decomposition of the product on the right side of (54), the ergodic capacity C can be expressed as a linear combination of integrals of the form

∫₀^∞ e^{−u} du/(u + a) = e^a E₁(a),  a > 0,

where E₁(·) is the (modified) exponential integral function, defined as

E₁(x) := ∫_x^∞ (e^{−t}/t) dt.

A similar representation appears also in [14] (Equation (7)). Consider the example of L = 2, σ₁² = 1/2, and σ₂² = 1. From (54), the ergodic capacity of the SIMO channel is given by

C = ∫₀^∞ (e^{−u}/u) [1 − 1/((1 + ρu/2)(1 + ρu))] du. (58)

The variance in this example (see (55)) is given by

Var{ln(1 + X)} = ∫₀^∞ ∫₀^∞ (e^{−(u+v)}/(uv)) [1/((1 + ρ(u+v)/2)(1 + ρ(u+v))) − 1/((1 + ρu/2)(1 + ρu)(1 + ρv/2)(1 + ρv))] du dv. (59)

Figure 2 depicts the ergodic capacity C as a function of the SNR, ρ, in dB (see (58); divide by ln 2 for conversion to bits per channel use). The same example exactly appears in the lower graph of Figure 1 in [14]. The variance appears in Figure 3 (see (59); similarly, divide by ln² 2). Figures 2 and 3 both refer to the SIMO channel with L = 2 receive antennas and variances σ₁² = 1/2 and σ₂² = 1, as a function of ρ = SNR (in dB).
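The one-dimensional integral (54) is immediate to evaluate numerically, and it can be cross-checked against a Monte Carlo estimate of (50), using the fact that |h_ℓ|² is exponential with mean σ_ℓ². An illustrative sketch (NumPy/SciPy assumed; the parameter values are those of the example above, at ρ = 10):

```python
import numpy as np
from scipy.integrate import quad

rho = 10.0               # SNR (example value)
sigma2 = [0.5, 1.0]      # fading variances, L = 2 (the example above)

def ergodic_capacity_nats(rho, sigma2):
    # C = \int_0^inf (e^{-u}/u) [1 - prod_l (1 + rho*sigma_l^2*u)^{-1}] du
    def integrand(u):
        prod = 1.0
        for s2 in sigma2:
            prod /= 1.0 + rho * s2 * u
        return np.exp(-u) / u * (1.0 - prod)
    return quad(integrand, 0.0, np.inf)[0]

C = ergodic_capacity_nats(rho, sigma2)

# Monte Carlo cross-check: |h_l|^2 ~ Exp(mean sigma_l^2).
rng = np.random.default_rng(0)
gains = sum(rng.exponential(s2, 1_000_000) for s2 in sigma2)
C_mc = float(np.mean(np.log1p(rho * gains)))
print(C, C_mc)
```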

Universal Source Coding for Binary Arbitrarily Varying Sources
Consider a source coding setting with n binary DMS's, and let x_i ∈ [0, 1] denote the Bernoulli parameter of source no. i ∈ {1, …, n}. Assume that a hidden memoryless switch selects uniformly at random one of these sources, and the data are then emitted by the selected source. Since it is unknown a priori which source is selected at each instant, a universal lossless source encoder (e.g., a Shannon or Huffman code) is designed to match a binary DMS whose Bernoulli parameter is given by (1/n) ∑_{i=1}^n x_i. Neglecting integer length constraints, the average redundancy in the compression rate (in nats per source symbol), due to the unknown realization of the hidden switch, is about

h_b((1/n) ∑_{i=1}^n x_i) − (1/n) ∑_{i=1}^n h_b(x_i), (60)

where h_b : [0, 1] → [0, ln 2] is the binary entropy function (defined to the base e). Now, let us assume that the Bernoulli parameters of the n sources are i.i.d. random variables, X₁, …, Xₙ, all having the same density as some generic random variable X, whose support is the interval [0, 1]. We wish to evaluate the expected value of the above-defined redundancy, under the assumption that the realizations of X₁, …, Xₙ are known. We then face the need to evaluate

E{h_b((1/n) ∑_{i=1}^n X_i)} − E{h_b(X)}. (61)

We now express the first and second terms of (61) as functionals of the MGF of X.
In view of (5), the binary entropy function h_b admits the integral representation

h_b(x) = ∫₀^∞ (1/u) [x e^{−ux} + (1 − x) e^{−u(1−x)} − e^{−u}] du,  x ∈ [0, 1], (62)

which implies that the expectations in (61) can be expressed as functionals of the MGF of X, M_X(ν) = E{e^{νX}}, and its derivative M′_X. On substituting these expressions into (62), applied to (1/n) ∑ᵢ Xᵢ, and using the change of integration variable t = u/n, we readily obtain

E{h_b((1/n) ∑_{i=1}^n X_i)} = ∫₀^∞ (1/t) [M_X(−t)^{n−1} M′_X(−t) + e^{−nt} M_X(t)^{n−1}(M_X(t) − M′_X(t)) − e^{−nt}] dt. (68)

Similarly to Section 3.1, here too, we pass from an n-dimensional integral to a one-dimensional integral. In general, similar calculations can be carried out for higher integer moments, thus passing from an n-dimensional integral for a moment of order s to an s-dimensional integral, independently of n. For example, if X₁, …, Xₙ are i.i.d. and uniformly distributed on [0, 1], then the MGF of a generic random variable X distributed like all {Xᵢ} is given by

M_X(ν) = (e^ν − 1)/ν. (69)

From (68), it can be verified numerically that E{h_b((1/n) ∑_{i=1}^n X_i)} is monotonically increasing in n, being equal (in nats) to 0.5, 0.602, 0.634, 0.650, 0.659 for n = 1, …, 5, respectively, with the limit h_b(1/2) = ln 2 ≈ 0.693 as n → ∞ (as expected by the law of large numbers).
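The listed values are easy to corroborate by direct Monte Carlo simulation of E{h_b((1/n) ∑ᵢ Xᵢ)} for uniform Xᵢ (NumPy/SciPy assumed; the sample size is an arbitrary choice of ours):

```python
import numpy as np
from scipy.special import xlogy

def hb(p):
    # binary entropy in nats; xlogy handles the endpoints p = 0 and p = 1
    return -(xlogy(p, p) + xlogy(1.0 - p, 1.0 - p))

rng = np.random.default_rng(1)
reported = {1: 0.5, 2: 0.602, 3: 0.634, 4: 0.650, 5: 0.659}
for n, val in reported.items():
    est = hb(rng.random((1_000_000, n)).mean(axis=1)).mean()
    print(n, round(float(est), 3), val)
```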

Moments of the Empirical Entropy and the Redundancy of K-T Universal Source Coding
Consider a stationary, discrete memoryless source (DMS), P, with a finite alphabet X of size |X| and letter probabilities {P(x), x ∈ X}. Let (X₁, …, Xₙ) be an n-vector emitted from P, and let {P̂(x), x ∈ X} be the empirical distribution associated with (X₁, …, Xₙ), that is, P̂(x) = n(x)/n for all x ∈ X, where n(x) is the number of occurrences of the letter x in (X₁, …, Xₙ).
It is well known that in many universal lossless source codes for the class of memoryless sources, the dominant term of the length function for encoding (X₁, …, Xₙ) is nĤ, where Ĥ is the empirical entropy,

Ĥ := −∑_{x∈X} P̂(x) ln P̂(x). (70)

For code-length performance analysis (as well as for entropy estimation per se), there is therefore interest in calculating the expected value E{Ĥ} as well as Var{Ĥ}. Another motivation comes from the quest to estimate the entropy as an objective in its own right; the expectation and the variance then suffice for the calculation of the mean square error of the estimate, Ĥ. Most of the results available in the literature in this context concern the asymptotic behavior for large n, as well as bounds (see, e.g., [16][17][18][19][20][21][22][23][24][25][26][27][28][29][30], as well as many other related references therein). The integral representation of the logarithm in (5), on the other hand, allows exact calculations of the expectation and the variance. The expected value of the empirical entropy is given by

E{Ĥ} = ln n − (1/n) ∑_{x∈X} E{n(x) ln n(x)}. (71)

For convenience, let us define the function φₙ : X × ℝ → (0, ∞) as

φₙ(x, t) := E{e^{t n(x)/n}} = (1 − P(x) + P(x) e^{t/n})ⁿ, (72)

which yields

E{n(x) ln n(x)} = ∫₀^∞ (1/u) [n P(x) e^{−u} − n φ′ₙ(x, −nu)] du, (73)

where φ′ₙ and φ″ₙ are the first and second order derivatives of φₙ w.r.t. t, respectively. From (71) and (73),

E{Ĥ} = ln n − ∑_{x∈X} ∫₀^∞ (1/t) [P(x) e^{−t/n} − φ′ₙ(x, −t)] dt, (75)

where the integration variable in (75) was changed using a simple scaling by n. Before proceeding with the calculation of the variance of Ĥ, let us first compare the integral representation in (75) to the alternative sum obtained by a direct, straightforward calculation of the expected value of the empirical entropy:

E{Ĥ} = ln n − (1/n) ∑_{x∈X} ∑_{k=1}^n C(n, k) P(x)^k (1 − P(x))^{n−k} k ln k, (77)

where C(n, k) denotes the binomial coefficient. We next compare the computational complexity of implementing (75) to that of (77).
For large n, in order to avoid numerical problems in computing (77) by standard software, one may use the gammaln function in Matlab/Excel, or LogGamma in Mathematica (built-in functions for calculating the natural logarithm of the Gamma function), to obtain

C(n, k) P(x)^k (1 − P(x))^{n−k} = exp{gammaln(n + 1) − gammaln(k + 1) − gammaln(n − k + 1) + k ln P(x) + (n − k) ln(1 − P(x))}.

The right-hand side of (75) is the sum of |X| integrals, and the computational complexity of each integral depends neither on n nor on |X|. Hence, the computational complexity of the right-hand side of (75) scales linearly with |X|. On the other hand, the double sum on the right-hand side of (77) consists of n·|X| terms. Let α := n/|X| be fixed; α is expected to be large (α ≫ 1) if a good estimate of the entropy is sought. The computational complexity of the double sum on the right-hand side of (77) then grows like α|X|², which scales quadratically in |X|. Hence, for a DMS with a large alphabet, or when n ≫ |X|, there is a significant computational reduction in evaluating (75) in comparison to the right-hand side of (77).
We next move on to calculate the variance of Ĥ.

The second term on the right-hand side of (80) has already been calculated. For the first term, let us define, for x ≠ x′,

ψₙ(x, x′, s, t) := E{e^{s n(x)/n + t n(x′)/n}} = (1 − P(x) − P(x′) + P(x) e^{s/n} + P(x′) e^{t/n})ⁿ, (81)

where the second equality follows from the multinomial distribution of the pair (n(x), n(x′)). Observe that the mixed partial derivatives of ψₙ w.r.t. s and t generate the cross moments E{n(x) n(x′) e^{s n(x)/n + t n(x′)/n}} and that, for x = x′, the same role is played by φₙ(x, s + t). Therefore, the first term on the right-hand side of (80) can be expressed as a two-dimensional integral whose integrand involves φₙ, ψₙ, and their partial derivatives.

To obtain numerical results, it is convenient to particularize the analysis now to the binary symmetric source (BSS). From (75),

E{Ĥ} = ln n − ∫₀^∞ (e^{−t/n}/t) [1 − ((1 + e^{−t/n})/2)^{n−1}] dt. (96)

For the variance, it follows from (81) that, for x ≠ x′ with x, x′ ∈ {0, 1} and s, t ∈ ℝ,

ψₙ(x, x′, s, t) = ((e^{s/n} + e^{t/n})/2)ⁿ, (97)

and, from (72), for x ∈ {0, 1} and t ∈ ℝ,

φₙ(x, t) = ((1 + e^{t/n})/2)ⁿ. (99)

Combining these expressions gives a closed-form (two-dimensional integral) expression for the variance of the empirical entropy.

For the BSS, ln 2 − E{Ĥ} = E{D(P̂‖P)} and the standard deviation of Ĥ both decay at the rate 1/n as n grows without bound, as shown in Figure 4. This asymptotic behavior of E{D(P̂‖P)} is supported by the well-known result [31] (see also [18] (Section 3.C) and references therein) that, for the class of discrete memoryless sources {P} with a given finite alphabet X,

2 ln [P̂(X₁, …, Xₙ)/P(X₁, …, Xₙ)] → χ²_d in law, with d := |X| − 1, (106)

where χ²_d is a chi-squared random variable with d degrees of freedom, and P̂(X₁, …, Xₙ) := ∏ᵢ P̂(Xᵢ). The left-hand side of (106) can be rewritten as

2 ln [exp{−nĤ}/exp{−nĤ − nD(P̂‖P)}] = 2nD(P̂‖P), (107)

and so, E{D(P̂‖P)} decays like d/(2n), which is equal to 1/(2n) for the BSS. In Figure 4, the base of the logarithm is 2, and therefore, 1 − E{Ĥ} = log₂e · E{D(P̂‖P)} decays like log₂e/(2n) ≈ 0.7213/n. It can be verified numerically that 1 − E{Ĥ} (in bits) is equal to 7.25·10⁻³ and 7.217·10⁻⁴ for n = 100 and n = 1000, respectively (see Figure 4), which confirms (106) and (107). Furthermore, the exact result here for the standard deviation, which decays like 1/n, scales similarly to the concentration inequality in [32] (Equation (9)).
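For the BSS, the exact values quoted above can be reproduced directly from the binomial-sum form of E{Ĥ} (here in bits; plain Python, no extra dependencies; the helper name is ours):

```python
import math

def expected_empirical_entropy_bits(n: int) -> float:
    # E{H_hat} = sum_k C(n,k) 2^{-n} h_b(k/n) (in bits) for the BSS,
    # with the binomial weights computed in the log domain via lgamma.
    total = 0.0
    for k in range(n + 1):
        logw = (math.lgamma(n + 1) - math.lgamma(k + 1)
                - math.lgamma(n - k + 1) - n * math.log(2.0))
        q = k / n
        if 0.0 < q < 1.0:
            h = -(q * math.log2(q) + (1.0 - q) * math.log2(1.0 - q))
        else:
            h = 0.0  # 0*log 0 := 0 at the endpoints
        total += math.exp(logw) * h
    return total

for n in (100, 1000):
    print(n, 1.0 - expected_empirical_entropy_bits(n))
```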
We conclude this subsection by exploring a quantity related to the empirical entropy: the expected code length associated with the universal lossless source code due to Krichevsky and Trofimov [23]. In a nutshell, this is a predictive universal code which, at each time instant t, sequentially assigns probabilities to the next symbol according to (a biased version of) the empirical distribution pertaining to the data seen thus far, x₁, …, x_t. Specifically, consider the code length function (in nats)

L(x₁, …, xₙ) := −∑_{t=0}^{n−1} ln [(N_t(x_{t+1}) + s)/(t + s|X|)], (108)

where N_t(x) is the number of occurrences of the symbol x ∈ X in (x₁, …, x_t), and s > 0 is a fixed bias parameter needed for the initial coding distribution (t = 0). We now calculate the redundancy of this universal code,

R_n := (1/n) E{L(X₁, …, Xₙ)} − H, (110)

where H is the entropy of the underlying source. From Equations (108)–(110), we can represent R_n as in (111). The expectation on the right-hand side of (111) satisfies (112), which gives, from (111) and (112), that the redundancy is given by (113). Figure 5 displays nR_n as a function of ln n for s = 1/2 in the range 1 ≤ n ≤ 5000. As can be seen, the graph is nearly a straight line with slope 1/2, which is in agreement with the theoretical result that R_n ∼ (ln n)/(2n) (in nats per symbol) for large n (see [23] (Theorem 2)).
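The (ln n)/(2n) behavior is simple to check directly for the BSS with s = 1/2: under the KT sequential assignment, a binary sequence with k ones receives probability Γ(k + 1/2)Γ(n − k + 1/2)/(π·n!), so the expected code length reduces to a single binomial sum. An illustrative sketch (plain Python; helper names are ours):

```python
import math

def kt_redundancy_nats(n: int) -> float:
    # Expected per-symbol redundancy of the KT code (s = 1/2) on the BSS.
    # A binary sequence with k ones has KT probability
    #   Gamma(k+1/2)*Gamma(n-k+1/2) / (pi * n!).
    exp_len = 0.0
    for k in range(n + 1):
        log_p_seq = (math.lgamma(k + 0.5) + math.lgamma(n - k + 0.5)
                     - math.log(math.pi) - math.lgamma(n + 1))
        logw = (math.lgamma(n + 1) - math.lgamma(k + 1)
                - math.lgamma(n - k + 1) - n * math.log(2.0))
        exp_len -= math.exp(logw) * log_p_seq
    return exp_len / n - math.log(2.0)

r1 = 1000 * kt_redundancy_nats(1000)
r2 = 4000 * kt_redundancy_nats(4000)
# n*R_n grows like (ln n)/2, so the increment from n=1000 to n=4000
# should be close to 0.5 * ln 4.
print(r1, r2, r2 - r1, 0.5 * math.log(4.0))
```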

Summary and Outlook
In this work, we have explored a well-known integral representation of the logarithmic function, and demonstrated its applications in obtaining exact formulas for quantities that involve expectations and second order moments of the logarithm of a positive random variable (or the logarithm of a sum of i.i.d. such random variables). We anticipate that this integral representation and its variants can serve as useful tools in many additional applications, representing a rigorous alternative to the replica method in some situations.
Our work in this paper focused on exact results. In future research, it would be interesting to explore whether the integral representation we have used is useful also in obtaining upper and lower bounds on expectations (and higher order moments) of expressions that involve logarithms of positive random variables. In particular, could the integrand of (1) be bounded from below and/or above in a nontrivial manner that would lead to new, interesting bounds? Moreover, it would be even more useful if the corresponding bounds on the integrand lent themselves to closed-form expressions for the resulting definite integrals.
Let ν := 1 and µ := X₁ + … + Xₙ in the identity

∫₀^∞ (e^{−νu} − e^{−µu}) u^{−ρ−1} du = (Γ(1 − ρ)/ρ)(µ^ρ − ν^ρ),  µ, ν > 0, ρ ∈ (0, 1), (114)

where X₁, …, Xₙ are positive i.i.d. random variables. Taking expectations of both sides of (114) and rearranging terms gives

E{(X₁ + … + Xₙ)^ρ} = 1 + (ρ/Γ(1 − ρ)) ∫₀^∞ u^{−ρ−1} [e^{−u} − M_X^n(−u)] du, (115)

where X is a random variable having the same density as the Xᵢ's, and M_X(u) := E{e^{uX}} (for u ∈ ℝ) denotes the MGF of X. Since (x^ρ − 1)/ρ → ln x as ρ → 0, it follows that (115) generalizes (3) for the logarithmic expectation. Identity (115), for the ρ-th moment of a sum of i.i.d. positive random variables with ρ ∈ (0, 1), may be used in some information-theoretic contexts instead of invoking Jensen's inequality.

Funding: This research received no external funding.
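As a quick numerical check of (115): for Xᵢ ~ Exp(1) and n = 3, the sum X₁ + X₂ + X₃ is Gamma(3, 1)-distributed, whose ρ-th moment is Γ(3 + ρ)/Γ(3). An illustrative sketch (SciPy assumed; parameter values are ours):

```python
import math
from scipy.integrate import quad

rho, n = 0.5, 3  # moment order in (0,1) and number of i.i.d. terms

# (115) with X_i ~ Exp(1): M_X(-u) = 1/(1+u), so
# E{(X_1+...+X_n)^rho} = 1 + rho/Gamma(1-rho) *
#                        \int_0^inf u^{-rho-1} (e^{-u} - (1+u)^{-n}) du.
def integrand(u):
    return u ** (-rho - 1.0) * (math.exp(-u) - (1.0 + u) ** (-n))

# Split at u = 1: the integrand has an integrable ~u^{-1/2} singularity
# at the origin.
integral = (quad(integrand, 0.0, 1.0, limit=200)[0]
            + quad(integrand, 1.0, math.inf)[0])
moment = 1.0 + rho / math.gamma(1.0 - rho) * integral

# The sum of three Exp(1) variables is Gamma(3,1)-distributed:
exact = math.gamma(n + rho) / math.gamma(n)
print(moment, exact)
```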