On Relations Between the Relative Entropy and χ²-Divergence, Generalizations and Applications

This paper is focused on a study of integral relations between the relative entropy and the chi-squared divergence, which are two fundamental divergence measures in information theory and statistics, a study of the implications of these relations, their information-theoretic applications, and some generalizations pertaining to the rich class of f-divergences. Applications that are studied in this paper refer to lossless compression, the method of types and large deviations, strong data-processing inequalities, bounds on contraction coefficients and maximal correlation, and the convergence rate to stationarity of a class of discrete-time Markov chains.


Introduction
The relative entropy (also known as the Kullback-Leibler divergence [1]) and the chi-squared divergence [2] are divergence measures which play a key role in information theory, statistics, learning, signal processing, and other theoretical and applied branches of mathematics. These divergence measures are fundamental in problems pertaining to source and channel coding, combinatorics and large deviations theory, goodness-of-fit and independence tests in statistics, expectation-maximization iterative algorithms for estimating a distribution from incomplete data, and other sorts of problems (the reader is referred to the tutorial paper by Csiszár and Shields [3]). They both belong to an important class of divergence measures, defined by means of convex functions f, and named f-divergences [4][5][6][7][8]. In addition to the relative entropy and the chi-squared divergence, this class unifies other useful divergence measures such as the total variation distance in functional analysis, and it is also closely related to the Rényi divergence which generalizes the relative entropy [9,10].
In general, f -divergences (defined in Section 2) are attractive since they satisfy pleasing features such as the data-processing inequality, convexity, (semi)continuity, and duality properties, and they therefore find nice applications in information theory and statistics (see, e.g., [6,8,11,12]).
In this work, we study integral relations between the relative entropy and the chi-squared divergence, implications of these relations, and some of their information-theoretic applications. Some generalizations which apply to the class of f-divergences are also explored in detail. In this context, it should be noted that integral representations of general f-divergences, expressed as a function of either the DeGroot statistical information [13], the E_γ-divergence (a parametric sub-class of f-divergences which generalizes the total variation distance [14] [p. 2314]), or the relative information spectrum, have been derived in [12] [Section 5], [15] [Section 7.B], and [16] [Section 3], respectively.

Preliminaries and Notation
This section provides definitions of the divergence measures which are used in this paper, together with the relevant notation.

Definition 1. [12] [p. 4398] Let P and Q be probability measures, let µ be a dominating measure of P and Q (i.e., P, Q ≪ µ), and let p := dP/dµ and q := dQ/dµ be the densities of P and Q with respect to µ. The f-divergence from P to Q is given by
D_f(P‖Q) := ∫ q f(p/q) dµ, (1)
where, by continuous extension,
f(0) := lim_{t→0⁺} f(t), 0 f(0/0) := 0, 0 f(a/0) := lim_{t→0⁺} t f(a/t) = a lim_{u→∞} f(u)/u for a > 0.
It should be noted that the right side of (1) does not depend on the dominating measure µ.
Throughout the paper, we denote by 1{relation} the indicator function; it is equal to 1 if the relation is true, and it is equal to 0 otherwise. Unless indicated explicitly, logarithms have an arbitrary common base (larger than 1), and exp(·) indicates the inverse function of the logarithm with that base.

Definition 2. [1] The relative entropy is the f-divergence with f(t) := t log t for t > 0:
D(P‖Q) := D_f(P‖Q) (4)
= ∫ p log(p/q) dµ.
Definition 3. The total variation distance between probability measures P and Q is the f-divergence from P to Q with f(t) := |t − 1| for all t ≥ 0. It is a symmetric f-divergence, denoted by |P − Q|, which is given by
|P − Q| := D_f(P‖Q) (6)
= ∫ |p − q| dµ.

By the continuous extension of D_α(P‖Q) at α = 1, D_1(P‖Q) = D(P‖Q). The second-order Rényi divergence and the chi-squared divergence are related as follows:
D_2(P‖Q) = log(1 + χ²(P‖Q)),
and the relative entropy and the chi-squared divergence satisfy (see, e.g., [20] [Theorem 5])
D(P‖Q) ≤ log(1 + χ²(P‖Q)). (16)
In view of (1), (9) and (17), it can be verified that the Györfi-Vajda divergence is related to the chi-squared divergence as follows: for all α ∈ (0, 1],
D_{φ_α}(P‖Q) = (1/α²) χ²(P ‖ (1 − α)P + αQ).
Hence, at α = 1, the Györfi-Vajda divergence coincides with the chi-squared divergence χ²(P‖Q).
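These relations are easy to verify numerically for distributions on a finite alphabet. The following Python sketch (ours, not part of the paper; the helper functions and tolerances are illustrative) checks the identity D_2(P‖Q) = log(1 + χ²(P‖Q)), the inequality (16), and the relation between the Györfi-Vajda divergence and the chi-squared divergence stated above.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """Relative entropy D(P||Q) in nats (assumes full support)."""
    return float(np.sum(p * np.log(p / q)))

def chi2(p, q):
    """Chi-squared divergence chi^2(P||Q)."""
    return float(np.sum((p - q) ** 2 / q))

def gyorfi_vajda(p, q, a):
    """Gyorfi-Vajda divergence D_{phi_a}(P||Q) = sum q*phi_a(p/q),
    with phi_a(t) = (t-1)^2 / (a + (1-a)t), a in (0,1]."""
    t = p / q
    return float(np.sum(q * (t - 1) ** 2 / (a + (1 - a) * t)))

def renyi2(p, q):
    """Renyi divergence of order 2 in nats."""
    return float(np.log(np.sum(p ** 2 / q)))

p = rng.dirichlet(np.ones(6))
q = rng.dirichlet(np.ones(6))

# D_2(P||Q) = log(1 + chi^2(P||Q)) and D(P||Q) <= log(1 + chi^2(P||Q)), cf. (16).
assert np.isclose(renyi2(p, q), np.log1p(chi2(p, q)))
assert kl(p, q) <= np.log1p(chi2(p, q)) + 1e-12

# D_{phi_a}(P||Q) = chi^2(P || (1-a)P + aQ) / a^2 for a in (0,1].
for a in (0.25, 0.5, 1.0):
    mix = (1 - a) * p + a * q
    assert np.isclose(gyorfi_vajda(p, q, a), chi2(p, mix) / a ** 2)
print("all relations verified")
```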

Relations between Divergences
We introduce in this section results on the relations between the relative entropy and the chi-squared divergence, their implications, and generalizations. Information-theoretic applications are studied in the next section.

Relations between the Relative Entropy and the Chi-Squared Divergence
The following result relates the relative entropy and the chi-squared divergence, which are two fundamental divergence measures in information theory and statistics. This result was recently obtained in an equivalent form in [17] [(12)] (it is noted that this identity was also independently derived by the coauthors in two separate unpublished works [24] [(16)] and [25]). It should be noted that these connections between divergences in the quantum setting were originally discovered in [19] [Theorem 6]. Beyond serving as an interesting relation between these two fundamental divergence measures, it is introduced here for the following reasons: (a) new consequences and applications of it are obtained, including new shorter proofs of some known results; (b) an interesting extension provides new relations between f-divergences (see Section 3.3).

Theorem 1. Let P and Q be probability measures defined on a measurable space (X, F), and let
R_λ := (1 − λ)P + λQ, λ ∈ [0, 1], (21)
be the convex combination of P and Q. Then, for all λ ∈ [0, 1],
(1/log e) D(P‖R_λ) = ∫_0^λ (1/s) χ²(P‖R_s) ds.

Proof. See Section 5.1.
A specialization of Theorem 1, by letting λ = 1, gives the following identities.

Corollary 1. Under the assumptions of Theorem 1,
(1/log e) D(P‖Q) = ∫_0^1 (1/λ) χ²(P‖R_λ) dλ = ∫_0^1 (1/λ) χ²(P ‖ (1 − λ)P + λQ) dλ. (24)
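As a numerical sanity check of Corollary 1, the integral on the right side of (24) can be evaluated by quadrature for finite-alphabet distributions; the sketch below (ours, with illustrative names and tolerances) does so in nats. Note that χ²(P‖R_λ)/λ behaves like λ χ²(Q‖P) near λ = 0, so the integrand vanishes at the origin.

```python
import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(8))
q = rng.dirichlet(np.ones(8))

def kl(p, q):
    # relative entropy in nats
    return float(np.sum(p * np.log(p / q)))

def chi2(p, q):
    return float(np.sum((p - q) ** 2 / q))

def integrand(lam):
    # chi^2(P || R_lambda) / lambda with R_lambda = (1 - lambda) P + lambda Q
    if lam == 0.0:
        return 0.0  # the integrand tends to 0 as lambda -> 0
    r = (1 - lam) * p + lam * q
    return chi2(p, r) / lam

integral, _ = quad(integrand, 0.0, 1.0, limit=200)
print(kl(p, q), integral)   # the two values should agree (in nats)
assert np.isclose(kl(p, q), integral, rtol=1e-6)
```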
By Corollary 1, we obtain original and simple proofs of new and old f -divergence inequalities.
Corollary 2. Let P and Q be probability measures defined on a measurable space (X, F). Then:
(a) Pinsker's inequality:
D(P‖Q) ≥ (1/2) |P − Q|² log e;
(b)
(1/log e) D(P‖Q) ≤ (1/3) χ²(P‖Q) + (1/6) χ²(Q‖P). (30)
Furthermore, let {P_n} be a sequence of probability measures that is defined on a measurable space (X, F), and which converges to a probability measure P in the sense that
lim_{n→∞} ess sup (dP_n/dP)(X) = 1, (31)
with X ∼ P. Then, (30) is locally tight in the sense that both of its sides converge to 0, and
lim_{n→∞} [ (1/3) χ²(P_n‖P) + (1/6) χ²(P‖P_n) ] / [ (1/log e) D(P_n‖P) ] = 1; (32)
(c) For all θ ∈ (0, 1), the relative entropy and the Györfi-Vajda divergence D_{φ_θ} satisfy the inequality in (33). Moreover, under the assumption in (31), for all θ ∈ [0, 1],
lim_{n→∞} D(P‖P_n) / D_{φ_θ}(P‖P_n) = (1/2) log e. (34)
Proof. See Section 5.2.

Remark 2. Inequality (30) is locally tight in the sense that (31) yields (32). This property, however, is not satisfied by (16), since the assumption in (31) implies that
lim_{n→∞} log(1 + χ²(P_n‖P)) / D(P_n‖P) = 2.

Remark 3. Inequality (30) readily yields
D(P‖Q) + D(Q‖P) ≤ (1/2) [χ²(P‖Q) + χ²(Q‖P)] log e,
which is proved by a different approach in [27] [Proposition 4]. It is further shown in [15] [Theorem 2 b)] that
sup [D(P‖Q) + D(Q‖P)] / [χ²(P‖Q) + χ²(Q‖P)] = (1/2) log e,
where the supremum is over all P ≪≫ Q with P ≠ Q.

Implications of Theorem 1
We next provide two implications of Theorem 1. The first implication, which relies on the Hammersley-Chapman-Robbins (HCR) bound for the chi-squared divergence [28,29], gives the following tight lower bound on the relative entropy D(P‖Q) as a function of the means and variances under P and Q.

Theorem 2. Let P and Q be probability measures defined on the measurable space (R, B), where R is the real line and B is the Borel σ-algebra of subsets of R. Let m_P, m_Q, σ²_P, and σ²_Q denote the expected values and variances of X ∼ P and Y ∼ Q, i.e.,
E[X] = m_P, Var(X) = σ²_P, E[Y] = m_Q, Var(Y) = σ²_Q. (39)
(a) If m_P ≠ m_Q, then
D(P‖Q) ≥ d(r‖s), (40)
where d(r‖s) := r log(r/s) + (1 − r) log((1 − r)/(1 − s)), for r, s ∈ [0, 1], denotes the binary relative entropy (with the convention that 0 log(0/0) = 0), and r and s are given in (41) and (42) as functions of (m_P, m_Q, σ²_P, σ²_Q);
(b) The lower bound on the right side of (40) is attained for the pair of probability measures P and Q in (46) and (47), which are defined on the two-element set U := {u_1, u_2}, with r and s in (41) and (42), respectively;
(c) For m_P = m_Q,
inf D(P‖Q) = 0, (48)
where the infimum on the left side of (48) is taken over all P and Q which satisfy (39).

Remark 4.
Consider the case of non-equal means in Items (a) and (b) of Theorem 2. If these means are fixed, then the infimum of D(P‖Q) is zero, obtained by choosing arbitrarily large equal variances. Suppose now that the non-equal means m_P and m_Q are fixed, as well as one of the variances (either σ²_P or σ²_Q). Numerical experimentation shows that, in this case, the achievable lower bound in (40) is monotonically decreasing as a function of the other variance, and it tends to zero as we let the free variance tend to infinity. This asymptotic convergence to zero can be justified by assuming, for example, that m_P, m_Q, and σ²_Q are fixed, and m_P > m_Q (the other cases can be justified in a similar way). Then, it can be verified from (41)-(45) that d(r‖s) → 0 as we let σ_P → ∞. The infimum of the relative entropy D(P‖Q) is therefore equal to zero, since the probability measures P and Q in (46) and (47), which are defined on a two-element set and attain the lower bound on the relative entropy under the constraints in (39), have a vanishing relative entropy in this asymptotic case. Moreover, for equal means, one can construct sequences {P_n} and {Q_n} such that the variance under P_n is equal to σ²_P and the variance under Q_n is equal to σ²_Q (independently of n), while the relative entropy D(P_n‖Q_n) vanishes as we let n → ∞.
A second consequence of Theorem 1 gives the following result. Its first part holds due to the concavity of exp(−D(P‖·)) (see [30] [Problem 4.2]). The second part is new, and its proof relies on Theorem 1. As an educational note, we provide an alternative proof of the first part by relying on Theorem 1.
Specifically, let F : [0, 1] → [0, ∞) be the function defined in (50). Then, for all λ ∈ [0, 1], the bounds on F(λ) in (51) and (52) hold, with an equality if λ = 0 or λ = 1. Moreover, F is monotonically increasing and differentiable, and it satisfies the limit in (53); this limit is twice as large as the value of the lower bound on it which follows from the right side of (52).

Remark 6.
By the convexity of the relative entropy, it follows that F(λ) ≤ λ D(P‖Q) for all λ ∈ [0, 1]. It can be verified, however, that the inequality 1 − λ + λ exp(−x) ≥ exp(−λx) holds for all x ≥ 0 and λ ∈ [0, 1]. Letting x := D(P‖Q) implies that the upper bound on F(λ) on the right side of (51) is tighter than or equal to the upper bound λ D(P‖Q) (with an equality if and only if either λ ∈ {0, 1} or P ≡ Q).
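The comparison in Remark 6 can be checked numerically; the sketch below (ours) assumes F(λ) := D(P‖(1 − λ)P + λQ) as the function in (50) (this form is consistent with the concavity argument mentioned before the theorem), and verifies that F(λ) ≤ −log(1 − λ + λ exp(−D(P‖Q))) ≤ λ D(P‖Q) on a grid of λ.

```python
import numpy as np

rng = np.random.default_rng(2)
p = rng.dirichlet(np.ones(5))
q = rng.dirichlet(np.ones(5))

def kl(a, b):
    # relative entropy in nats
    return float(np.sum(a * np.log(a / b)))

D = kl(p, q)
for lam in np.linspace(0.0, 1.0, 11):
    r = (1 - lam) * p + lam * q
    F = kl(p, r)                                     # assumed form of F(lambda) in (50)
    bound_51 = -np.log(1 - lam + lam * np.exp(-D))   # right side of (51), as in Remark 6
    bound_convexity = lam * D
    # F <= bound_51 <= lambda * D, as discussed in Remark 6
    assert F <= bound_51 + 1e-12 <= bound_convexity + 1e-12
```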

Corollary 3.
Let {P_j}_{j=1}^m, with m ∈ N, be probability measures defined on a measurable space (X, F), and let {α_j}_{j=1}^m be a sequence of non-negative numbers that sum to 1. Then, for all i ∈ {1, . . . , m},
D(P_i ‖ ∑_{j=1}^m α_j P_j) ≤ −log( α_i + (1 − α_i) exp( −(1/(1 − α_i)) ∑_{j≠i} α_j D(P_i‖P_j) ) ). (54)

Proof. For an arbitrary i ∈ {1, . . . , m}, apply the upper bound on the right side of (51) with λ := 1 − α_i, P := P_i, and Q := (1/(1 − α_i)) ∑_{j≠i} α_j P_j. The right side of (54) is obtained from (51) by invoking the convexity of the relative entropy, which gives D(P_i‖Q) ≤ (1/(1 − α_i)) ∑_{j≠i} α_j D(P_i‖P_j).

The next result provides an upper bound on the non-negative difference between the entropy of a convex combination of distributions and the respective convex combination of the individual entropies (this difference is also termed the concavity deficit of the entropy function in [17] [Section 3]).
Proof. The lower bound holds due to the concavity of the entropy function. The upper bound readily follows from Corollary 3 and the identity
H(∑_{j=1}^m α_j P_j) = ∑_{j=1}^m α_j H(P_j) + ∑_{j=1}^m α_j D(P_j ‖ ∑_{k=1}^m α_k P_k),
by relying on all the (1/2) m(m − 1) pairwise relative entropies between the individual distributions {P_j}_{j=1}^m. Another refinement of (57), expressed in terms of total variation distances, has been recently provided in [17] [Theorem 3.1].

Monotonic Sequences of f -Divergences and an Extension of Theorem 1
The present subsection generalizes Theorem 1, and it also provides relations between f-divergences which are defined in a recursive way.

Theorem 4. Let P and Q be probability measures defined on a measurable space (X, F). Let R_λ, for λ ∈ [0, 1], be the convex combination of P and Q as in (21). Let f_0 : (0, ∞) → R be a convex function with f_0(1) = 0, and let {f_k(·)}_{k=0}^∞ be a sequence of functions that are defined on (0, ∞) by a recursive equation. Then, {D_{f_k}(P‖R_λ)}_{k=0}^∞ is a non-increasing (and non-negative) sequence of f-divergences.
We next use the polylogarithm functions, which satisfy the recursive equation [32] [Equation (7.2)]
Li_0(x) := x/(1 − x), Li_k(x) := ∫_0^x (Li_{k−1}(s)/s) ds, k ∈ N.
This gives Li_1(x) = −log_e(1 − x), Li_2(x) = ∫_0^x (−log_e(1 − s)/s) ds, and so on, which are real-valued and finite for x < 1.

On Probabilities and f -Divergences
The following result relates probabilities of sets to f -divergences.
Theorem 5. Let (X, F, µ) be a probability space, and let C ∈ F be a measurable set with µ(C) > 0. Define the conditional probability measure
µ_C(E) := µ(E ∩ C)/µ(C), E ∈ F.
Let f : (0, ∞) → R be an arbitrary convex function with f(1) = 0, and assume (by continuous extension of f at zero) that f(0) := lim_{t→0⁺} f(t) < ∞. Furthermore, let f̃ : (0, ∞) → R be the convex function which is given by
f̃(t) := t f(1/t), t > 0.
Then,
D_f(µ_C‖µ) = f̃(µ(C)) + (1 − µ(C)) f(0).

Proof. See Section 5.7.
Connections of probabilities to the relative entropy, and to the chi-squared divergence, are next exemplified as special cases of Theorem 5.

Corollary 6.
In the setting of Theorem 5,
D(µ_C‖µ) = log(1/µ(C)), χ²(µ_C‖µ) = (1 − µ(C))/µ(C),
so (16) is satisfied in this case with equality. More generally, for all α ∈ (0, ∞),
D_α(µ_C‖µ) = log(1/µ(C)).

Proof. See Section 5.7.

Applications
This section provides applications of our results in Section 3. These include universal lossless compression, method of types and large deviations, and strong data-processing inequalities (SDPIs).

Application of Corollary 3: Shannon Code for Universal Lossless Compression
Consider m > 1 discrete, memoryless, and stationary sources with probability mass functions {P_i}_{i=1}^m, and assume that the symbols are emitted by one of these sources with an a priori probability α_i for source no. i, where {α_i}_{i=1}^m are positive and sum to 1. For lossless data compression by a universal source code, suppose that a single source code is designed with respect to the average probability mass function
P := ∑_{j=1}^m α_j P_j.
Assume that the designer uses a Shannon code, where the codeword assigned to a symbol x ∈ X is of length
ℓ(x) = ⌈log(1/P(x))⌉ bits
(logarithms are on base 2). Due to the mismatch in the source distribution, the average codeword length ℓ_avg satisfies (see [38])
∑_{i=1}^m α_i H(P_i) + ∑_{i=1}^m α_i D(P_i‖P) ≤ ℓ_avg ≤ ∑_{i=1}^m α_i H(P_i) + ∑_{i=1}^m α_i D(P_i‖P) + 1. (68)
The fractional penalty in the average codeword length, denoted by ν, is defined to be equal to the ratio of the penalty in the average codeword length as a result of the source mismatch, and the average codeword length in the case of a perfect matching. From (68), it follows that
(∑_{i=1}^m α_i D(P_i‖P) − 1) / (1 + ∑_{i=1}^m α_i H(P_i)) ≤ ν ≤ (1 + ∑_{i=1}^m α_i D(P_i‖P)) / ∑_{i=1}^m α_i H(P_i). (69)
We next rely on Corollary 3 to obtain an upper bound on ν which is expressed as a function of the m(m − 1) relative entropies D(P_i‖P_j) for all i ≠ j in {1, . . . , m}. This is useful if, e.g., the m relative entropies D(P_i‖P) on the left and right sides of (69) do not admit closed-form expressions, in contrast to the m(m − 1) relative entropies D(P_i‖P_j) for i ≠ j. We next exemplify this case.
For i ∈ {1, . . . , m}, let P_i be a Poisson distribution with parameter λ_i > 0. For all i, j ∈ {1, . . . , m}, the relative entropy from P_i to P_j admits the closed-form expression
D(P_i‖P_j) = λ_i log(λ_i/λ_j) + (λ_j − λ_i) log e. (70)
From (54) and (70), it follows that
∑_{i=1}^m α_i D(P_i‖P) ≤ −∑_{i=1}^m α_i log( α_i + (1 − α_i) exp( −f_i(α, λ)/(1 − α_i) ) ), (71)
where
f_i(α, λ) := ∑_{j≠i} α_j D(P_i‖P_j) = ∑_{j≠i} α_j [ λ_i log(λ_i/λ_j) + (λ_j − λ_i) log e ],
with α := (α_1, . . . , α_m) and λ := (λ_1, . . . , λ_m). The entropy of a Poisson distribution with parameter λ_i admits the integral representation in (74) (see [39][40][41]). Combining (69), (71) and (74) finally gives an upper bound on ν in the considered setup.

Example 1.
Consider five discrete memoryless sources, where the probability mass function of source no. i is Poisson with parameter λ_i, i.e., P_i(k) = e^{−λ_i} λ_i^k / k! for k ∈ {0, 1, . . .}, with λ = [16, 20, 24, 28, 32]. Suppose that the symbols are emitted from one of the sources with equal probability, so α = (1/5, 1/5, 1/5, 1/5, 1/5). Let P := (1/5)(P_1 + . . . + P_5) be the average probability mass function of the five sources. The term ∑_i α_i D(P_i‖P), which appears in the numerators of the upper and lower bounds on ν (see (69)), does not lend itself to a closed-form expression, and it is not even an easy task to calculate it numerically due to the need to compute an infinite series which involves factorials. We therefore apply the closed-form upper bound in (71) to get that ∑_i α_i D(P_i‖P) ≤ 1.46 bits, whereas the upper bound which follows from the convexity of the relative entropy (i.e., ∑_i α_i f_i(α, λ)) is equal to 1.99 bits (both upper bounds are smaller than the trivial bound log_2 5 ≈ 2.32 bits). From (69), (74), and the stronger upper bound on ∑_i α_i D(P_i‖P), the improved upper bound on ν is equal to 57.0% (as compared to a looser upper bound of 69.3%, which follows from (69), (74), and the looser upper bound on ∑_i α_i D(P_i‖P) that is equal to 1.99 bits).
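The numbers in Example 1 can be reproduced with a short computation. The sketch below (ours) uses the closed form (70), the reconstructed bound (71), a truncated-series approximation of the Poisson entropies in place of the integral representation (74), and the upper bound on ν from (69).

```python
import numpy as np
from math import lgamma, log

lams = [16.0, 20.0, 24.0, 28.0, 32.0]   # Poisson parameters of the five sources
m = len(lams)
alpha = np.full(m, 1.0 / m)

def kl_poisson(li, lj):
    """Closed-form D(Poisson(li)||Poisson(lj)) in nats, as in (70)."""
    return li * log(li / lj) + (lj - li)

def poisson_entropy(lam, kmax=400):
    """Entropy of Poisson(lam) in nats via a truncated series (approximation)."""
    k = np.arange(kmax)
    logpmf = -lam + k * log(lam) - np.array([lgamma(kk + 1) for kk in k])
    pmf = np.exp(logpmf)
    return float(-np.sum(pmf * logpmf))

pair_kl = np.array([[kl_poisson(li, lj) for lj in lams] for li in lams])
rows = pair_kl @ alpha                       # rows[i] = sum_j alpha_j D(P_i||P_j)

# Upper bound from convexity of the relative entropy: sum_i alpha_i f_i(alpha, lambda).
bound_convexity = float(alpha @ rows)

# Upper bound based on Corollary 3, as reconstructed in (54)/(71).
bound_cor3 = float(np.sum(alpha * (-np.log(alpha + (1 - alpha) * np.exp(-rows / (1 - alpha))))))

avg_entropy = float(np.sum(alpha * np.array([poisson_entropy(l) for l in lams])))

nats_to_bits = 1 / log(2)
print("convexity bound  :", bound_convexity * nats_to_bits, "bits")   # ~1.99
print("Corollary 3 bound:", bound_cor3 * nats_to_bits, "bits")        # ~1.46
print("upper bound on nu:", (1 + bound_cor3 * nats_to_bits) / (avg_entropy * nats_to_bits))  # ~0.57
```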

Application of Theorem 2 in the Context of the Method of Types and Large Deviations Theory
Let X^n = (X_1, . . . , X_n) be a sequence of i.i.d. random variables with X_1 ∼ Q, where Q is a probability measure defined on a finite set X, and Q(x) > 0 for all x ∈ X. Let P be a set of probability measures on X such that Q ∉ P, and suppose that the closure of P coincides with the closure of its interior. Then, by Sanov's theorem (see, e.g., [42] [Theorem 11.4.1] and [43] [Theorem 3.3]), the probability that the empirical distribution P_{X^n} belongs to P vanishes exponentially at the rate
lim_{n→∞} −(1/n) log Pr[P_{X^n} ∈ P] = inf_{P ∈ P} D(P‖Q). (75)
Furthermore, for finite n, the method of types yields the following upper bound on the probability of this rare event:
Pr[P_{X^n} ∈ P] ≤ (n + 1)^{|X|−1} exp( −n inf_{P ∈ P} D(P‖Q) ), (77)
whose exponential decay rate coincides with the exact asymptotic result in (75).
Suppose that Q is not fully known, but its mean m_Q and variance σ²_Q are available. Let m_1 ∈ R and δ_1, ε_1, σ_1 > 0 be fixed, and let P be the set of all probability measures P, defined on the finite set X, whose mean m_P and variance σ²_P satisfy
|m_P − m_1| ≤ δ_1, |σ²_P − σ²_1| ≤ ε_1,
where it is assumed that (m_Q, σ²_Q) violates at least one of these two constraints. Hence, P coincides with the closure of its interior, and Q ∉ P. The lower bound on the relative entropy in Theorem 2, used in conjunction with the upper bound in (77), can serve to obtain an upper bound on the probability of the event that the empirical distribution of X^n belongs to the set P, regardless of the uncertainty in Q. This gives
Pr[P_{X^n} ∈ P] ≤ (n + 1)^{|X|−1} exp(−n d*), (78)
where
d* := inf d(r‖s), (79)
with the infimum in (79) taken over all (m_P, σ²_P) satisfying the constraints that define P, and, for fixed (m_P, m_Q, σ²_P, σ²_Q), the parameters r and s are given in (41) and (42), respectively. Standard algebraic manipulations that rely on (78) lead to the following result, which is expressed as a function of the Lambert W function [44]. This function, which finds applications in various engineering and scientific fields, is a standard built-in function in mathematical software tools such as Mathematica, Matlab, and Maple. Applications of the Lambert W function in information theory and coding are briefly surveyed in [45].

Proposition 2.
For ε ∈ (0, 1), let n* := n*(ε) denote the minimal value of n ∈ N such that the upper bound on the right side of (78) does not exceed ε. Then, n* admits a closed-form expression, given in (80) and (81), in terms of W₋₁(·), the secondary real-valued branch of the Lambert W function (i.e., the branch of the inverse function of y := x e^x with x ≤ −1).

Example 2.
Let Q be an arbitrary probability measure, defined on a finite set X, with mean m_Q = 40 and variance σ²_Q = 20. Let P be the set of all probability measures P, defined on X, whose mean m_P and variance σ²_P lie in the intervals [43, 47] and [18, 22], respectively. Suppose that it is required that, for every probability measure Q as above, the probability that the empirical distribution of the i.i.d. sequence X^n ∼ Q^n be included in the set P is at most ε = 10⁻¹⁰. We rely here on the upper bound in (78), and impose the stronger condition that it should not exceed ε. By this approach, it is obtained numerically from (79) that d* = 0.203 nats. We next examine two cases:
(i) If |X| = 2, then it follows from (80) that n* = 138.
(ii) Consider a richer alphabet size of the i.i.d. samples where, e.g., |X| = 100. By relying on the same universal lower bound d*, which holds independently of the value of |X| (X can possibly be an infinite set), it follows from (80) that n* = 4170 is the minimal value such that the upper bound in (78) does not exceed 10⁻¹⁰.
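For concreteness, the computation of n*(ε) can be carried out as follows (an illustrative sketch, ours; it assumes the prefactor (n + 1)^{|X|−1} in (78), and the Lambert-W expression below is our own inversion of that bound rather than a transcription of (80)). With the rounded value d* = 0.203 nats, it returns n* = 138 for |X| = 2, and a value close to 4170 for |X| = 100 (the small gap is due to the rounding of d*).

```python
import numpy as np
from scipy.special import lambertw

def n_star_scan(d_star, alphabet_size, eps):
    """Smallest n with (n+1)^(|X|-1) * exp(-n*d_star) <= eps (direct search)."""
    k = alphabet_size - 1
    n = 1
    while k * np.log(n + 1) - n * d_star > np.log(eps):
        n += 1
    return n

def n_star_lambert(d_star, alphabet_size, eps):
    """Same threshold via the secondary branch W_{-1} of the Lambert W function."""
    k = alphabet_size - 1
    arg = -(d_star / k) * np.exp((np.log(eps) - d_star) / k)
    m = -(k / d_star) * np.real(lambertw(arg, k=-1))   # m = n + 1 at equality
    return int(np.ceil(m - 1))

for size in (2, 100):
    print(size, n_star_scan(0.203, size, 1e-10), n_star_lambert(0.203, size, 1e-10))
# expected: "2 138 138" and "100 4180 4180" (approximately; 4170 in the text with unrounded d*)
```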
We close this discussion by providing numerical experimentation of the lower bound on the relative entropy in Theorem 2, comparing this attainable lower bound (see Item (b) of Theorem 2) with the following closed-form expressions for relative entropies:
(a) The relative entropy between real-valued Gaussian distributions is given by
D(N(m_P, σ²_P) ‖ N(m_Q, σ²_Q)) = log(σ_Q/σ_P) + [ (σ²_P + (m_P − m_Q)²)/(2σ²_Q) − 1/2 ] log e. (82)
(b) Let E_µ denote a random variable which is exponentially distributed with mean µ > 0; its probability density function is given by
e_µ(x) = (1/µ) exp(−x/µ) 1{x ≥ 0}.
Then, for a_1, a_2 > 0 and d_1, d_2 ∈ R,
D(d_1 + E_{a_1} ‖ d_2 + E_{a_2}) = log(a_2/a_1) + ( (d_1 − d_2 + a_1)/a_2 − 1 ) log e, if d_1 ≥ d_2, and it is equal to ∞ if d_1 < d_2. (84)
In this case, the means under P and Q are m_P = d_1 + a_1 and m_Q = d_2 + a_2, respectively, and the variances are σ²_P = a²_1 and σ²_Q = a²_2. Hence, for obtaining the required means and variances, set
a_1 = σ_P, a_2 = σ_Q, d_1 = m_P − σ_P, d_2 = m_Q − σ_Q.

Example 3. We compare numerically the attainable lower bound on the relative entropy, as it is given in (40), with the two relative entropies in (82) and (84). If (m_P, m_Q, σ²_P, σ²_Q) = (45, 40, 20, 20), then the lower bound in (40) is equal to 0.521 nats, and the two relative entropies in (82) and (84) are equal to 0.625 nats and 1.118 nats, respectively.
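The two closed-form relative entropies in (82) and (84) for the parameters of Example 3 can be evaluated directly; a minimal sketch (ours), in nats:

```python
import numpy as np

def kl_gauss(mp, mq, vp, vq):
    """D(N(mp,vp) || N(mq,vq)) in nats, as in (82)."""
    return np.log(np.sqrt(vq / vp)) + (vp + (mp - mq) ** 2) / (2 * vq) - 0.5

def kl_shifted_exp(d1, a1, d2, a2):
    """D(d1+Exp(mean a1) || d2+Exp(mean a2)) in nats, as in (84); infinite if d1 < d2."""
    if d1 < d2:
        return np.inf
    return np.log(a2 / a1) + (d1 - d2 + a1) / a2 - 1.0

mP, mQ, vP, vQ = 45.0, 40.0, 20.0, 20.0
# Match the required means and variances: a = sigma, d = m - sigma.
a1, a2 = np.sqrt(vP), np.sqrt(vQ)
d1, d2 = mP - a1, mQ - a2

print(kl_gauss(mP, mQ, vP, vQ))          # 0.625 nats
print(kl_shifted_exp(d1, a1, d2, a2))    # ~1.118 nats
# Both exceed the attainable lower bound of 0.521 nats from (40).
```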

Strong Data-Processing Inequalities and Maximal Correlation
Information contraction is a fundamental concept in information theory. The contraction of f-divergences through channels is captured by data-processing inequalities, which can be further tightened by the derivation of SDPIs with channel-dependent or source-channel-dependent contraction coefficients (see, e.g., [26,[46][47][48][49][50][51][52]).
We next provide necessary definitions which are relevant for the presentation in this subsection.

Definition 7.
Let Q_X be a probability distribution which is defined on a set X, and that is not a point mass, and let W_{Y|X} : X → Y be a stochastic transformation. The contraction coefficient for f-divergences is defined as
µ_f(Q_X, W_{Y|X}) := sup { D_f(P_Y‖Q_Y) / D_f(P_X‖Q_X) : P_X with D_f(P_X‖Q_X) ∈ (0, ∞) }, (87)
where, for all y ∈ Y,
P_Y(y) = (P_X W_{Y|X})(y) := ∫ dP_X(x) W_{Y|X}(y|x), Q_Y(y) = (Q_X W_{Y|X})(y) := ∫ dQ_X(x) W_{Y|X}(y|x). (88)
The notation in (87) and (88) is consistent with the standard notation used in information theory (see, e.g., the first displayed equation after (3.2) in [53]).
The derivation of good upper bounds on contraction coefficients for f-divergences, which are strictly smaller than 1, leads to SDPIs. These inequalities find their applications, e.g., in studying the exponential convergence rate of an irreducible, time-homogeneous, and reversible discrete-time Markov chain to its unique invariant distribution over its state space (see, e.g., [49] [Section 2.4.3] and [50] [Section 2]). This is in sharp contrast to DPIs, which do not by themselves yield convergence to stationarity at any rate. We return to this point later in this subsection, and determine the exact convergence rate to stationarity under two parametric families of f-divergences.
We next rely on Theorem 1 to obtain upper bounds on the contraction coefficients for the following f -divergences.
Theorem 6. The f-divergences in (89) and (90) satisfy integral identities which are expressed in terms of the Györfi-Vajda divergence in (17). Moreover, the contraction coefficients for these f-divergences are related to each other, and to µ_χ²(Q_X, W_{Y|X}), the contraction coefficient for the chi-squared divergence, as given in (95) and (96).
Proof. See Section 5.8.

The upper bounds on the contraction coefficients for the f-divergences in (89) and (90) generalize the upper bound on the contraction coefficient for the relative entropy in [51] [Theorem III.6] (recall that K_1(P‖Q) = D(P‖Q) = S_1(P‖Q)), so the upper bounds in Theorem 6 specialize to the latter bound at α = 1.

Corollary 7.
Let
µ_χ²(W_{Y|X}) := sup_{Q_X} µ_χ²(Q_X, W_{Y|X}),
where the supremum on the right side is over all probability measures Q_X defined on X. Then, for all α ∈ (0, 1], the contraction coefficients for the f-divergences in (89) and (90) are lower bounded by µ_χ²(Q_X, W_{Y|X}) and upper bounded by µ_χ²(W_{Y|X}).

Proof. See Section 5.9.
The upper bound on the contraction coefficients in Corollary 7 is given by µ_χ²(W_{Y|X}), whereas the lower bound is given by µ_χ²(Q_X, W_{Y|X}), which depends on the input distribution Q_X. We next provide alternative upper bounds on the contraction coefficients for the considered (parametric) f-divergences which, similarly to the lower bound, scale like µ_χ²(Q_X, W_{Y|X}). Although the upper bound in Corollary 7 may be tighter in some cases than the alternative upper bounds which are next presented in Proposition 3 (in fact, the former upper bound may even be achieved with equality, as in Example 4), the bounds in Proposition 3 are used shortly to determine the exponential rate of the convergence to stationarity of a class of Markov chains.

Proposition 3.
For all α ∈ (0, 1], the contraction coefficients for the f-divergences in (89) and (90) satisfy the upper bounds in (108) and (109), which are expressed in terms of µ_χ²(Q_X, W_{Y|X}) and Q_min, where Q_min denotes the minimal positive mass of the input distribution Q_X.

Remark 11.
In view of (92), at α = 1, (108) and (109) specialize to an upper bound on the contraction coefficient for the relative entropy (KL divergence) as a function of the contraction coefficient for the chi-squared divergence. In this special case, both (108) and (109) give the same upper bound, which coincides with [48] [Theorem 10].
We next apply Proposition 3 to study the convergence rate to stationarity of Markov chains, in terms of the f-divergences introduced in Definition 8. The next result follows [49] [Section 2.4.3], and it provides a generalization of the result there.

Theorem 7.
Consider a time-homogeneous, irreducible, and reversible discrete-time Markov chain with a finite state space X, let W be its probability transition matrix, and let Q_X be its unique stationary distribution (reversibility means that Q_X(x)[W]_{x,y} = Q_X(y)[W]_{y,x} for all x, y ∈ X). Let P_X be an initial probability distribution over X. Then, for all α ∈ (0, 1] and n ∈ N, the divergences from the stationary distribution after n steps of the chain satisfy the upper bounds in (111) and (112), and the contraction coefficients on the right sides of (111) and (112) scale like the n-th power of the contraction coefficient for the chi-squared divergence, as given in (113) and (114).

Proof. Inequalities (111) and (112) hold since Q_X W^n = Q_X for all n ∈ N, and due to Definition 7, (95), and (96). Inequalities (113) and (114) hold by Proposition 3, and due to the reversibility of the Markov chain, which implies that (see [49] [Equation (2.92)])
µ_χ²(Q_X, W^n) = µ_χ²(Q_X, W)^n, n ∈ N.
In view of (113) and (114), Theorem 7 thus yields the exponential decay, at a rate governed by µ_χ²(Q_X, W), of the upper bounds on the divergences on the left sides of (111) and (112).
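The exponential decay described by Theorem 7 is easy to observe numerically. The sketch below (ours) constructs a small reversible chain, computes µ_χ²(Q_X, W) via the standard singular-value characterization of the maximal correlation (assumed here; see also (119) below), and verifies that the chi-squared divergence to stationarity decays at least geometrically with rate µ_χ²(Q_X, W).

```python
import numpy as np

rng = np.random.default_rng(3)

# Reversible chain: lazy random walk on 5 states with a symmetric transition matrix.
n_states = 5
W = np.zeros((n_states, n_states))
for x in range(n_states):
    for y in (x - 1, x + 1):
        if 0 <= y < n_states:
            W[x, y] = 0.25
    W[x, x] = 1.0 - W[x].sum()

# Stationary distribution (uniform for this symmetric chain); reversibility holds.
Q = np.full(n_states, 1.0 / n_states)
assert np.allclose(Q @ W, Q)
assert np.allclose(Q[:, None] * W, (Q[:, None] * W).T)

# Contraction coefficient for chi^2: squared second-largest singular value of
# B[x, y] = sqrt(Q[x]) * W[x, y] / sqrt(Q[y])   (assumed standard characterization).
B = np.sqrt(Q)[:, None] * W / np.sqrt(Q)[None, :]
mu_chi2 = np.linalg.svd(B, compute_uv=False)[1] ** 2

def chi2(p, q):
    return float(np.sum((p - q) ** 2 / q))

P0 = rng.dirichlet(np.ones(n_states))
Pn = P0.copy()
for n in range(1, 11):
    Pn = Pn @ W
    # chi^2 divergence to stationarity decays at least geometrically with rate mu_chi2.
    assert chi2(Pn, Q) <= mu_chi2 ** n * chi2(P0, Q) + 1e-12
print("mu_chi2 =", mu_chi2)
```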
We end this subsection by considering maximal correlations, which are closely related to the contraction coefficient for the chi-squared divergence.

Definition 9. The maximal correlation between two random variables X and Y is defined as
ρ_m(X; Y) := sup_{f, g} E[f(X) g(Y)],
where the supremum is taken over all real-valued functions f and g such that
E[f(X)] = E[g(Y)] = 0, E[f²(X)] ≤ 1, E[g²(Y)] ≤ 1.
It is well known [60] that, if X ∼ Q_X and Y ∼ Q_Y = Q_X W_{Y|X}, then the contraction coefficient for the chi-squared divergence µ_χ²(Q_X, W_{Y|X}) is equal to the square of the maximal correlation between the random variables X and Y, i.e.,
µ_χ²(Q_X, W_{Y|X}) = ρ²_m(X; Y). (119)
A simple application of Corollary 1 and (119) gives the following result.

Proposition 4.
In the setting of Definition 7, for s ∈ [0, 1], let X_s ∼ (1 − s)P_X + sQ_X and Y_s ∼ (1 − s)P_Y + sQ_Y, with P_X ≠ Q_X and P_X ≪ Q_X. Then, the following inequality holds:
sup_{s ∈ [0,1]} ρ²_m(X_s; Y_s) ≥ max{ D(P_Y‖Q_Y)/D(P_X‖Q_X), D(Q_Y‖P_Y)/D(Q_X‖P_X) }. (120)

Proof. See Section 5.11.

Proofs
This section provides proofs of the results in Sections 3 and 4.
We next show the local tightness of inequality (30) by proving that (31) yields (32). Let {P_n} be a sequence of probability measures, defined on a measurable space (X, F), and assume that {P_n} converges to a probability measure P in the sense that (31) holds. In view of [16] [Theorem 7] (see also [15] [Section 4.F] and [63]), it follows that
lim_{n→∞} D(P_n‖P) = lim_{n→∞} χ²(P_n‖P) = 0,
and
lim_{n→∞} D(P_n‖P)/χ²(P_n‖P) = (1/2) log e,
which therefore yields (32).
(c) Proof of (33) and (34): The proof of (33) relies on (28) and the following lemma.

Proof of Theorem 2
We first prove Item (a) in Theorem 2. In view of the Hammersley-Chapman-Robbins lower bound on the χ² divergence, for all λ ∈ [0, 1],
χ²(P‖R_λ) ≥ (E[X] − E[Z_λ])² / Var(Z_λ),
where X ∼ P, Y ∼ Q, and Z_λ ∼ R_λ := (1 − λ)P + λQ. For λ ∈ [0, 1],
E[Z_λ] = (1 − λ) m_P + λ m_Q,
and it can be verified that
Var(Z_λ) = (1 − λ)(σ²_P + m²_P) + λ(σ²_Q + m²_Q) − ((1 − λ) m_P + λ m_Q)².
We now rely on (24), which gives
(1/log e) D(P‖Q) = ∫_0^1 (1/λ) χ²(P‖R_λ) dλ ≥ ∫_0^1 λ (m_P − m_Q)² / Var(Z_λ) dλ.
From (43) and (44), we get (185) and (186), where the constants in these expressions are defined in (187). By using the partial fraction decomposition of the integrand on the right side of (186), we get (after multiplying both sides of (185) by log e) the bounds in (188) and (189), where (189) holds by integration since α − aλ and β + aλ are both non-negative for all λ ∈ [0, 1].

Lemma 2.
Let r and s be given in (41)-(45), and let u_{1,2} be given in (47).

We next prove Item (b) in Theorem 2. Let X ∼ P and Y ∼ Q be defined on a set U = {u_1, u_2} (for the moment, the values of u_1 and u_2 are not yet specified), with P[X = u_1] = r, P[X = u_2] = 1 − r, Q[Y = u_1] = s, and Q[Y = u_2] = 1 − s. We now calculate u_1 and u_2 such that E[X] = m_P and Var(X) = σ²_P. This is equivalent to the pair of equations in (196) and (197). Substituting (196) into the right side of (197) gives (198), which, by rearranging terms, also gives (199). Solving (196) and (199) simultaneously gives u_1 and u_2. We next verify that, by setting u_{1,2} as in (47), one also gets (as desired) that E[Y] = m_Q and Var(Y) = σ²_Q. From Lemma 2, and from (196) and (197), we obtain the relations in (204)-(209). By combining (204) and (209), we obtain Var(Y) = σ²_Q. Hence, the probability mass functions P and Q defined on U = {u_1, u_2} (with u_1 and u_2 in (47)), such that P(u_1) = r and Q(u_1) = s, satisfy the equality constraints in (39), while also achieving the lower bound on D(P‖Q), which is equal to d(r‖s). It can be also verified that the second possible choice of (u_1, u_2) does not yield the satisfiability of the conditions E[Y] = m_Q and Var(Y) = σ²_Q, so there is only a unique pair of probability measures P and Q, defined on a two-element set, that achieves the lower bound in (40) under the equality constraints in (39).
We finally prove Item (c) in Theorem 2. Let m ∈ R, σ²_P, and σ²_Q be selected arbitrarily such that σ²_Q ≥ σ²_P. We construct probability measures P_ε and Q_ε, depending on a free parameter ε > 0, with means m_P = m_Q := m and variances σ²_P and σ²_Q, respectively (the means and variances are independent of ε), and which are defined on a three-element set U := {u_1, u_2, u_3} as in (212) and (213). We aim to set the parameters r, s, u_1, u_2, and u_3 (as a function of m, σ_P, σ_Q, and ε) such that (214) holds. Proving (214) yields (48), while it also follows that the infimum on the left side of (48) can be restricted to probability measures which are defined on a three-element set.
In view of the constraints on the means and variances in (39), with equal means m, we get from (212) and (213) the set of equations in (215). The first and second equations in (215) refer to the equal means under P and Q, and the third and fourth equations in (215) refer to the second moments in (39). Furthermore, in view of (212) and (213), the relative entropy D(P‖Q_ε) is given by (216). Subtracting the square of the first equation in (215) from its third equation gives the equivalent set of equations in (217). We next select u_1 and u_2 such that u_1 − u_2 := 2σ_P. Then, the third equation in (217) gives r(1 − r) = 1/4, so r = 1/2. Furthermore, the first equation in (217) gives (218) and (219). Since r, u_1, and u_2 are independent of ε, so is the probability measure P_ε := P. Combining the second equation in (217) with (218) and (219) gives (220). Substituting (218)-(220) into the fourth equation of (217) gives a quadratic equation for s, whose selected solution (such that s is close to r = 1/2 for small ε > 0) satisfies s = 1/2 + O(√ε); this implies that s ∈ (0, 1 − ε) for sufficiently small ε > 0 (as required in (213)). In view of (216), it also follows that D(P‖Q_ε) vanishes as we let ε tend to zero.
We finally outline an alternative proof, which refers to the case of equal means with arbitrarily selected σ²_P and σ²_Q. Let (σ²_P, σ²_Q) ∈ (0, ∞)². We next construct a sequence of pairs of probability measures {(P_n, Q_n)} with zero mean and respective variances (σ²_P, σ²_Q) for which D(P_n‖Q_n) → 0 as n → ∞ (without any loss of generality, one can assume that the equal means are equal to zero). We start by assuming (σ²_P, σ²_Q) ∈ (1, ∞)². Define a sequence of quaternary real-valued random variables with probability mass functions chosen such that, for all n ∈ N, Q_n has zero mean and variance σ²_Q. Furthermore, let ξ be a suitably defined (finite) index; if ξ > 1, then, for n = 1, . . . , ξ, we choose P_n arbitrarily with mean 0 and variance σ²_P. Then, D(P_n‖Q_n) → 0 as n → ∞. Next, suppose that min{σ²_P, σ²_Q} := σ² < 1; then, construct P_n and Q_n as before with variances 2σ²_P/σ² > 1 and 2σ²_Q/σ² > 1, respectively. If P'_n and Q'_n denote the random variables P_n and Q_n scaled by a factor of σ/√2, then their variances are σ²_P and σ²_Q, respectively, and D(P'_n‖Q'_n) = D(P_n‖Q_n) → 0 as we let n → ∞. To conclude, it should be noted that the sequences of probability measures in the latter proof are defined on a four-element set. Recall that, in the earlier proof, specialized to the case of (equal means with) σ²_P ≤ σ²_Q, the introduced probability measures are defined on a three-element set, and the reference probability measure P is fixed while referring to an equiprobable binary random variable.
We next prove (53), and the conclusion which appears after it. In view of [16] [Theorem 8], applied to f(t) := −log t for all t > 0, we get (231) (it should be noted that, by the definition of F in (50), the result in [16] [(195)-(196)] is used here by swapping P and Q). Since lim_{λ→0⁺} F(λ) = 0, it follows by L'Hôpital's rule that the limit in (53) holds. A comparison of the limit in (53) with a lower bound which follows from (52) gives (236), where (236) relies on (231). Hence, the limit in (53) is twice as large as its lower bound on the right side of (236). This proves the conclusion which comes right after (53). We finally prove the known result in (51) by giving an alternative proof which is based on (52). The function F is non-negative on [0, 1], and it is strictly positive on (0, 1] if P ≠ Q. Let P ≠ Q (otherwise, (51) is trivial). Rearranging terms in (52), and integrating both sides over the interval [λ, 1] for λ ∈ (0, 1], gives (237). The left side of (237) satisfies a chain of relations in which (241) holds since F(1) = D(P‖Q) (see (50)). Combining (237)-(241), and using the non-negativity of F, gives the right-side inequality in (51). The remaining assertion follows by relying on Theorem 1 (with the convex combination R_λ in (21)) and by interchanging the order of the integrations.
In view of (119), the distributions of X_s and Y_s, and since ((1 − s)P_X + sQ_X) W_{Y|X} = (1 − s)P_Y + sQ_Y holds for all s ∈ [0, 1], it follows that
ρ²_m(X_s; Y_s) = µ_χ²((1 − s)P_X + sQ_X, W_{Y|X}), s ∈ [0, 1],
and, by Corollary 1 and the definition of the contraction coefficient,
sup_{s ∈ [0,1]} ρ²_m(X_s; Y_s) ≥ D(P_Y‖Q_Y) / D(P_X‖Q_X).
Switching P_X and Q_X in (292)-(294), and using the mapping s → 1 − s in (294), gives (due to the symmetry of the maximal correlation) the corresponding lower bound in (297); finally, taking the maximal lower bound among those in (296) and (297) gives (120).
Author Contributions: Both coauthors contributed to this research work, and to the writing and proofreading of this article. The starting point of this work was in independent derivations of preliminary versions of Theorems 1 and 2 in two separate unpublished works [24,25]. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.