On f-Divergences: Integral Representations, Local Behavior, and Inequalities

This paper focuses on f-divergences and consists of three main contributions. The first introduces integral representations of a general f-divergence by means of the relative information spectrum. The second provides a new approach for the derivation of f-divergence inequalities, and exemplifies their utility in the setup of Bayesian binary hypothesis testing. The last part further studies the local behavior of f-divergences.

Integral representations of f -divergences serve to study properties of these information measures, and they are also used to establish relations among these divergences. An integral representation of f -divergences, expressed by means of the DeGroot statistical information, was provided in [3] with a simplified proof in [15]. The importance of this integral representation stems from the operational meaning of the DeGroot statistical information [16], which is strongly linked to Bayesian binary hypothesis testing. Some earlier specialized versions of this integral representation were introduced in [17][18][19][20][21], and a variation of it also appears in [22] Section 5.B. Implications of the integral representation of f -divergences, by means of the DeGroot statistical information, include an alternative proof of the data processing inequality, and a study of conditions for the sufficiency or ε-deficiency of observation channels [3,15].
Following earlier studies of the local behavior of f-divergences and their asymptotic properties (see related results by Csiszár and Shields [55] Theorem 4.1, Pardo and Vajda [56] Section 3, and Sason and Verdú [22] Section 3.F), it is known that f-divergences locally scale like the chi-square divergence (up to a scaling factor which depends on f), provided that the first probability measure approaches the reference measure in a certain strong sense. The study of the local behavior of f-divergences is an important aspect of their properties, and we further study it in this work.
This paper considers properties of f-divergences, first introducing in Section 2 the basic definitions and notation, and in particular the various measures of dissimilarity between probability measures used throughout this paper. The presentation of our new results is then structured as follows: Section 3 is focused on the derivation of new integral representations of f-divergences, expressed as a function of the relative information spectrum of the pair of probability measures and the convex function f. The novelty of Section 3 is in the unified approach which leads to integral representations of f-divergences by means of the relative information spectrum, where the latter cumulative distribution function plays an important role in information theory and statistical decision theory (see, e.g., [7,54]). Particular integral representations of the type introduced in Section 3 have recently been derived by Sason and Verdú on a case-by-case basis for some f-divergences (see [22] Theorems 13 and 32), lacking the approach which is developed in Section 3 for general f-divergences. In essence, an f-divergence D_f(P‖Q) is expressed in Section 3 as an inner product of a simple function of the relative information spectrum (depending only on the probability measures P and Q) and a non-negative weight function w_f : (0, ∞) → [0, ∞) which only depends on f. This kind of representation, followed by a generalized result, serves to provide new integral representations of various useful f-divergences. It also enables us in Section 3 to characterize the interplay between the DeGroot statistical information (or between another useful family of f-divergences, namely the E_γ divergence with γ ≥ 1) and the relative information spectrum.
Section 4 provides a new approach for the derivation of f-divergence inequalities, where an arbitrary f-divergence is lower bounded by means of the E_γ divergence [57] or the DeGroot statistical information [16]. The approach used in Section 4 yields several generalizations of the Bretagnolle-Huber inequality [58], which provides a closed-form and simple upper bound on the total variation distance as a function of the relative entropy; the Bretagnolle-Huber inequality has been proved to be useful, e.g., in the context of lower bounding the minimax risk in non-parametric estimation (see, e.g., [5] pp. 89-90, 94), and in the problem of density estimation (see, e.g., [6] Section 1.6). Although Vajda's tight lower bound in [59] is slightly tighter everywhere than the Bretagnolle-Huber inequality, our motivation for generalizing the latter bound is justified later in this paper. The utility of the new inequalities is exemplified in the setup of Bayesian binary hypothesis testing.
Section 5 finally derives new results on the local behavior of f-divergences, i.e., the characterization of their scaling when the two probability measures are sufficiently close to each other. The starting point of our analysis in Section 5 relies on the analysis in [56] Section 3, regarding the asymptotic properties of f-divergences.
Sections 3-5 can be read in any order since their analyses are independent of each other.

Preliminaries and Notation
We assume throughout that the probability measures P and Q are defined on a common measurable space (A, F), and P ≪ Q denotes that P is absolutely continuous with respect to Q, namely there is no event F ∈ F such that P(F) > 0 = Q(F).
Definition 1. The relative information provided by a ∈ A according to (P, Q), where P ≪ Q, is given by
ı_{P‖Q}(a) := log (dP/dQ)(a).   (1)
More generally, even if P ≪ Q does not hold, let R be an arbitrary dominating probability measure such that P, Q ≪ R (e.g., R = ½(P + Q)); irrespective of the choice of R, the relative information is defined to be
ı_{P‖Q}(a) := ı_{P‖R}(a) − ı_{Q‖R}(a).   (2)
The following asymmetry property follows from (2):
ı_{P‖Q} = −ı_{Q‖P}.   (3)
Definition 2. The relative information spectrum is the cumulative distribution function
F_{P‖Q}(x) := P[ı_{P‖Q}(X) ≤ x], x ∈ R, X ∼ P.   (4)
The relative entropy is the expected value of the relative information when it is distributed according to P:
D(P‖Q) := E[ı_{P‖Q}(X)], X ∼ P.   (5)
Throughout this paper, C denotes the set of convex functions f : (0, ∞) → R with f(1) = 0. Hence, the function f ≡ 0 is in C; if f ∈ C, then a f ∈ C for all a > 0; and if f, g ∈ C, then f + g ∈ C.
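To make Definitions 1 and 2 concrete, the following minimal Python sketch (the two finite distributions and the helper names are illustrative choices, not taken from the paper) evaluates the relative information, the relative information spectrum, and the relative entropy for a pair of probability mass functions with P ≪ Q.

```python
import numpy as np

# Two probability mass functions on the alphabet {0, 1, 2}; P << Q holds.
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.25, 0.25, 0.5])

# Relative information i_{P||Q}(a) = log dP/dQ(a), here in nats.
rel_info = np.log(P / Q)

# Relative information spectrum F_{P||Q}(x) = Pr[i_{P||Q}(X) <= x] with X ~ P.
def F(x):
    return P[rel_info <= x].sum()

# Relative entropy D(P||Q) = E_P[i_{P||Q}(X)] (in nats).
D = float(np.sum(P * rel_info))

print([F(x) for x in (-0.5, 0.0, 0.5, 1.0)])
print("D(P||Q) =", D)
```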
We next provide a general definition for the family of f-divergences (see [3] p. 4398).
Definition 3 (f-divergence [9,10,12]). Let P and Q be probability measures, let µ be a dominating measure of P and Q (i.e., P, Q ≪ µ; e.g., µ = P + Q), and let p := dP/dµ and q := dQ/dµ. The f-divergence from P to Q is given, independently of µ, by
D_f(P‖Q) := ∫ q f(p/q) dµ,
where
f(0) := lim_{t↓0} f(t), 0 f(0/0) := 0, 0 f(a/0) := a lim_{u→∞} f(u)/u for a > 0.
We rely in this paper on the following properties of f-divergences:
Proposition 1. Let f, g ∈ C. The following conditions are equivalent:
(1) D_f(P‖Q) = D_g(P‖Q) for every pair of probability measures (P, Q);
(2) there exists a constant c ∈ R such that
f(t) − g(t) = c (t − 1), t ∈ (0, ∞).   (11)
Proposition 2. Let f ∈ C, and let f* : (0, ∞) → R be the conjugate function, given by
f*(t) := t f(1/t)   (12)
for t > 0. Then, f* ∈ C; f** = f, and for every pair of probability measures (P, Q),
D_{f*}(P‖Q) = D_f(Q‖P).   (13)
By an analytic extension of f* in (12) at t = 0, let f*(0) := lim_{t↓0} f*(t). Note that the convexity of f* implies that f*(0) ∈ (−∞, ∞]. In continuation to Definition 3, we get, with the convention in (16) that 0 · ∞ = 0,
D_f(P‖Q) = ∫_{q>0} q f(p/q) dµ + f*(0) P[q = 0].   (16)
We refer in this paper to the following f-divergences:
(1) Relative entropy:
D(P‖Q) = D_f(P‖Q) with f(t) = t log t.
(2) Jeffrey's divergence [60]:
J(P‖Q) := D(P‖Q) + D(Q‖P) = D_f(P‖Q) with f(t) = (t − 1) log t.
(3) Hellinger divergence of order α ∈ (0, 1) ∪ (1, ∞) [2] Definition 2.10:
H_α(P‖Q) = D_{f_α}(P‖Q) with f_α(t) = (t^α − 1)/(α − 1).
Some of the significance of the Hellinger divergence stems from the following facts:
- The analytic extension of H_α(P‖Q) at α = 1 yields the relative entropy (up to a scaling factor which depends on the base of the logarithm).
- The chi-squared divergence [61] is the second-order Hellinger divergence (see, e.g., [62] p. 48), i.e.,
χ²(P‖Q) = H_2(P‖Q).   (25)
Note that, due to Proposition 1, χ²(P‖Q) = D_f(P‖Q)   (26)
where f : (0, ∞) → R can be defined as
f(t) = t² − 1, t > 0.   (27)
- The squared Hellinger distance (see, e.g., [62] p. 47), denoted by H²(P‖Q), and the Bhattacharyya distance [63], denoted by B(P‖Q), are both one-to-one transformations of the Hellinger divergence of order ½.
- The Rényi divergence of order α ∈ (0, 1) ∪ (1, ∞) is a one-to-one transformation of the Hellinger divergence of the same order [11] (14):
D_α(P‖Q) = (1/(α − 1)) log(1 + (α − 1) H_α(P‖Q)).
- The Alpha-divergence of order α, as defined in [64] and [65] (4), is a generalized relative entropy which (up to a scaling factor) is equal to the Hellinger divergence of the same order α, where D_A^{(α)}(·‖·) denotes the Alpha-divergence of order α. Note, however, that the Beta and Gamma-divergences in [65], as well as the generalized divergences in [66,67], are not f-divergences in general.
(4) χ^s divergence for s ≥ 1 [2] (2.31), and the total variation distance: The function
f_s(t) = |t − 1|^s, t > 0,   (32)
results in
χ^s(P‖Q) := D_{f_s}(P‖Q) = ∫ |p − q|^s q^{1−s} dµ.   (33)
Specifically, for s = 1, let f(t) := |t − 1|, and the total variation distance is expressed as an f-divergence:
|P − Q| := D_f(P‖Q) = ∫ |p − q| dµ.
(5) Triangular Discrimination [39] (a.k.a. Vincze-Le Cam distance):
Δ(P‖Q) := D_f(P‖Q) with f(t) = (t − 1)²/(t + 1).
Note that Δ(P‖Q) = 2 χ²(P ‖ ½(P + Q)) = 2 χ²(Q ‖ ½(P + Q)).
(6) Lin's measure [68] (4.1):
L_θ(P‖Q) := θ D(P ‖ θP + (1 − θ)Q) + (1 − θ) D(Q ‖ θP + (1 − θ)Q)   (41)
for θ ∈ [0, 1]. This measure can be expressed by the following f-divergence:
L_θ(P‖Q) = D_{f_θ}(P‖Q) with f_θ(t) = θ t log t − (θ t + 1 − θ) log(θ t + 1 − θ).
The special case of (41) with θ = ½ gives the Jensen-Shannon divergence (a.k.a. capacitory discrimination):
JS(P‖Q) := ½ D(P ‖ ½(P + Q)) + ½ D(Q ‖ ½(P + Q)).
(7) E_γ divergence: for γ ≥ 1,
E_γ(P‖Q) := max_{F ∈ F} (P(F) − γ Q(F))   (45)
= P[ı_{P‖Q}(X) > log γ] − γ P[ı_{P‖Q}(Y) > log γ]   (46)
with X ∼ P and Y ∼ Q, and where (46) follows from the Neyman-Pearson lemma. The E_γ divergence can be identified as an f-divergence:
E_γ(P‖Q) = D_{f_γ}(P‖Q) with f_γ(t) := (t − γ)^+,   (47)
where (x)^+ := max{x, 0}. The following relation to the total variation distance holds:
E_1(P‖Q) = ½ |P − Q|.   (49)
(8) DeGroot statistical information [3,16]: For ω ∈ (0, 1),
I_ω(P‖Q) = D_{φ_ω}(P‖Q) with φ_ω(t) := min(ω, 1 − ω) − min(ω t, 1 − ω).   (50)
The following relation to the total variation distance holds:
I_{1/2}(P‖Q) = ¼ |P − Q|,
and the DeGroot statistical information and the E_γ divergence are related as follows [22] (384):
I_ω(P‖Q) = ω E_{(1−ω)/ω}(P‖Q) for ω ∈ (0, ½], and I_ω(P‖Q) = (1 − ω) E_{ω/(1−ω)}(Q‖P) for ω ∈ [½, 1).   (53)
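As a numerical companion to Definition 3 and the list above, the following illustrative sketch computes D_f(P‖Q) for finite distributions with full support (so the conventions for vanishing masses are not needed); the functions f are the standard choices recalled above, the helper names and the two distributions are arbitrary, and the printout checks the relation E_1(P‖Q) = ½|P − Q|.

```python
import numpy as np

def f_divergence(f, P, Q):
    """D_f(P||Q) = sum_a Q(a) f(P(a)/Q(a)) for strictly positive pmfs."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return float(np.sum(Q * f(P / Q)))

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.25, 0.25, 0.5])

kl   = f_divergence(lambda t: t * np.log(t), P, Q)           # relative entropy (nats)
chi2 = f_divergence(lambda t: (t - 1) ** 2, P, Q)            # chi-squared divergence
tv   = f_divergence(lambda t: np.abs(t - 1), P, Q)           # total variation |P - Q|
e_gamma = lambda g: f_divergence(lambda t: np.maximum(t - g, 0.0), P, Q)  # E_gamma

print(kl, chi2, tv)
print(e_gamma(1.0), 0.5 * tv)   # E_1 coincides with |P - Q| / 2
```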

New Integral Representations of f -Divergences
The main result in this section provides new integral representations of f -divergences as a function of the relative information spectrum (see Definition 2). The reader is referred to other integral representations (see [15] Section 2, [4] Section 5, [22] Section 5.B, and references therein), expressing a general f -divergence by means of the DeGroot statistical information or the E γ divergence.
Lemma 1. Let f ∈ C be strictly convex at 1. Let g : R → R be defined as
g(x) := exp(−x) f(exp(x)) + f'_+(1) (exp(−x) − 1), x ∈ R,   (54)
where f'_+(1) denotes the right-hand derivative of f at 1 (due to the convexity of f on (0, ∞), it exists and it is finite). Then, the function g is non-negative, it is strictly monotonically decreasing on (−∞, 0], and it is strictly monotonically increasing on [0, ∞) with g(0) = 0.
Proof. For any function u ∈ C, let ū ∈ C be given by
ū(t) := u(t) − u'_+(1)(t − 1), t ∈ (0, ∞),   (55)
and let ū* ∈ C be its conjugate function, as given in (12). The function g in (54) can be expressed in the form
g(x) = (f̄)*(exp(−x)), x ∈ R,   (56)
as is next verified. For t > 0, we get from (12) and (55)
(f̄)*(t) = t f̄(1/t) = t f(1/t) + f'_+(1)(t − 1) = f*(t) + f'_+(1)(t − 1),
and the substitution t := exp(−x) for x ∈ R yields (56) in view of (54).
By assumption, f ∈ C is strictly convex at 1, and these properties are inherited by f̄. Since also f(1) = f̄(1) = 0, it follows from [3] Theorem 3 that both f̄ and (f̄)* are non-negative on (0, ∞), and they are also strictly monotonically decreasing on (0, 1]. Hence, from (12), it follows that the function (f̄)* is strictly monotonically increasing on [1, ∞). Finally, the claimed properties of the function g follow from (56), in view of the fact that the function (f̄)* is non-negative with (f̄)*(1) = 0, strictly monotonically decreasing on (0, 1] and strictly monotonically increasing on [1, ∞).
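As a sanity check on Lemma 1, the following short sketch evaluates g numerically for the relative entropy, i.e., f(t) = t log t with f'_+(1) = 1, under our reading of (54) with natural logarithms (an assumption, since the original display was not preserved), and verifies the claimed non-negativity and monotonicity on a grid.

```python
import numpy as np

def g(x, f, f_prime_at_1):
    # Our reading of (54): g(x) = exp(-x) f(exp(x)) + f'_+(1) (exp(-x) - 1).
    return np.exp(-x) * f(np.exp(x)) + f_prime_at_1 * (np.exp(-x) - 1.0)

f = lambda t: t * np.log(t)          # relative entropy: f(t) = t log t, f'(1) = 1
x = np.linspace(-5.0, 5.0, 1001)
vals = g(x, f, 1.0)

assert vals.min() >= -1e-12 and abs(g(0.0, f, 1.0)) < 1e-12
assert np.all(np.diff(vals[x >= 0]) > 0) and np.all(np.diff(vals[x <= 0]) < 0)
print("g is non-negative, decreasing on (-inf, 0], increasing on [0, inf)")
```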

Lemma 2.
Let f ∈ C be strictly convex at 1, and let g : R → R be as in (54). Let a := lim_{x→∞} g(x) and b := lim_{x→−∞} g(x), and let
ℓ_1 : [0, a) → [0, ∞)   (58)
and
ℓ_2 : [0, b) → (−∞, 0]   (59)
be the two inverse functions of g, obtained by restricting g to [0, ∞) and to (−∞, 0], respectively. Then, the representation in (60) holds.
Proof. In view of Lemma 1, it follows that ℓ_1 : [0, a) → [0, ∞) is strictly monotonically increasing and ℓ_2 : [0, b) → (−∞, 0] is strictly monotonically decreasing. Let X ∼ P, and let V := exp(ı_{P‖Q}(X)). Then, we have the chain of equalities (61)-(70), where (61) holds by (58) and (59), and by expressing the event {g(V) > t} as a union of two disjoint events; (69) holds again by the monotonicity properties of g in Lemma 1, and by the definition of its two inverse functions ℓ_1 and ℓ_2 as above; in (67)-(69) we are free to substitute > by ≥, and < by ≤; finally, (70) holds by the definition of the relative information spectrum in (4).
Remark 1. The function g in (54) is invariant to the replacement of f(t) with f(t) + c(t − 1), with an arbitrary c ∈ R. This invariance of g (and, hence, also the invariance of its inverse functions ℓ_1 and ℓ_2) is well expected in view of Proposition 1 and Lemma 2.

Example 1. For the chi-squared divergence in (26), letting f be as in (27), it follows from (54) that
g(x) = 4 sinh²(½ x log_e 2), x ∈ R,   (71)
which yields, from (58) and (59), a = b = ∞. Calculation of the two inverse functions of g, as defined in Lemma 2, yields closed-form expressions for ℓ_1 and ℓ_2.

Remark 2.
Unlike Example 1, in general, the inverse functions ℓ_1 and ℓ_2 in Lemma 2 are not expressible in closed form, which motivates the integral representation in Theorem 1 below.
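In the spirit of Remark 2, the two inverse functions can still be evaluated numerically on their respective monotonicity branches; the following illustrative sketch does so with a bisection solver for the same f(t) = t log t as above. The function g and the branch handling follow our reading of (54) and Lemma 2, and are only a sketch under those assumptions.

```python
import numpy as np
from scipy.optimize import brentq

f = lambda t: t * np.log(t)                       # relative entropy, f'(1) = 1
g = lambda x: np.exp(-x) * f(np.exp(x)) + (np.exp(-x) - 1.0)

def inverse_on_branch(y, increasing=True, span=50.0):
    """Solve g(x) = y on [0, span] (increasing branch) or [-span, 0] (decreasing branch)."""
    lo, hi = (0.0, span) if increasing else (-span, 0.0)
    return brentq(lambda x: g(x) - y, lo, hi)

y = 0.3
x1 = inverse_on_branch(y, increasing=True)    # branch on [0, infinity)
x2 = inverse_on_branch(y, increasing=False)   # branch on (-infinity, 0]
print(x1, x2, g(x1), g(x2))                   # g(x1) = g(x2) = 0.3 up to the solver tolerance
```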
The following theorem provides our main result in this section.

Theorem 1. Let f ∈ C be differentiable on (0, ∞). The following integral representations of an f-divergence, by means of the relative information spectrum, hold:
(1) Let w_f : (0, ∞) → [0, ∞) be the non-negative weight function given, for β > 0, by (79), and let the function G_{P‖Q} : (0, ∞) → [0, 1] be given by (80). Then, the representation in (81) holds, expressing D_f(P‖Q) as the inner product ⟨w_f, G_{P‖Q}⟩.
(2) More generally, for an arbitrary c ∈ R, let w_{f,c} : (0, ∞) → R be a modified real-valued weight function defined as in (82). Then, the representation in (83) holds.

Proof. We start by proving the special integral representation in (81), and then extend our proof to the general representation in (83).
We now extend the result in (81) to the case where f ∈ C is differentiable on (0, ∞), but not necessarily strictly convex at 1. To that end, let s : (0, ∞) → R be defined as in (90). This implies that s ∈ C is differentiable on (0, ∞), and it is also strictly convex at 1. In view of the proof of (81) under the strict convexity of f at 1, the application of this result to the function s in (90) yields (91). In view of (6), (22), (23), (25) and (90), from (79), (89), (90) and the convexity and differentiability of f ∈ C, it follows that the weight functions satisfy w_s(β) = w_f(β) + w_{f_2}(β) for β > 0. Furthermore, by applying the result in (81) to the chi-squared divergence χ²(P‖Q) in (25), whose corresponding function f_2(t) := t² − 1 for t > 0 is strictly convex at 1, we obtain (94). Finally, the combination of (91)-(94) yields D_f(P‖Q) = ⟨w_f, G_{P‖Q}⟩; this asserts that (81) also holds when the condition that f is strictly convex at 1 is relaxed.
(2) In view of (80)-(82), in order to prove (83) for an arbitrary c ∈ R, it is required to prove the identity (95). Equality (95) can be verified by Lemma 3: by rearranging terms in (95), we get the identity in (73).

Remark 4.
The weight function w_f only depends on f, and the function G_{P‖Q} only depends on the pair of probability measures P and Q. In view of Proposition 1, it follows that, for f, g ∈ C, the equality w_f = w_g holds on (0, ∞) if and only if (11) is satisfied for some constant c ∈ R. It is indeed easy to verify that (11) yields w_f = w_g on (0, ∞).

Remark 5.
An equivalent way to write G_{P‖Q} in (80) is in terms of the distribution of the relative information ı_{P‖Q}(X) where X ∼ P. Hence, the function G_{P‖Q} : (0, ∞) → [0, 1] is monotonically increasing on (0, 1), and it is monotonically decreasing on [1, ∞); note that this function is in general discontinuous at 1. If P = Q, then G_{P‖Q} is identically zero, which is consistent with the fact that D_f(P‖Q) = 0.

Remark 6.
In the proof of Theorem 1-(1), the relaxation of the condition of strict convexity at 1 for a differentiable function f ∈ C is crucial, e.g., for the χ^s divergence with s > 2 (see the function in (32)). Remark 7. Theorem 1-(1) with c = 0 enables one, in some cases, to simplify integral representations of f-divergences. This is next exemplified in the proof of Theorem 2.
Theorem 1 yields integral representations for various f-divergences and related measures; some of these representations were previously derived by Sason and Verdú in [22] on a case-by-case basis, without the unified approach of Theorem 1. We next provide such integral representations. Note that, for some f-divergences, the function f ∈ C is not differentiable on (0, ∞); hence, Theorem 1 is not necessarily directly applicable.
An application of (112) yields the following interplay between the E γ divergence and the relative information spectrum.
Theorem 3. Let X ∼ P, and let the random variable ı_{P‖Q}(X) have no probability masses. Denote by A_1 and A_2 the corresponding sets of values of E_γ(P‖Q) and E_γ(Q‖P), respectively. Then,
• E_γ(P‖Q) is a continuously differentiable function of γ on (1, ∞), and E'_γ(P‖Q) ≤ 0;
• the sets A_1 and A_2 determine, respectively, the relative information spectrum F_{P‖Q}(·) on [0, ∞) and (−∞, 0);
• for γ > 1, the identities in (115) and (116) hold.
Proof. We start by proving the first item. By our assumption, F_{P‖Q}(·) is continuous on R. Hence, it follows from (112) that E_γ(P‖Q) is continuously differentiable in γ ∈ (1, ∞); furthermore, (45) implies that E_γ(P‖Q) is monotonically decreasing in γ, which yields E'_γ(P‖Q) ≤ 0. We next prove the second and third items together. Let X ∼ P and Y ∼ Q. From (112), for γ > 1, differentiation with respect to γ yields (115). Due to the continuity of F_{P‖Q}(·), it follows that the set A_1 determines the relative information spectrum on [0, ∞).
To prove (116), consider the chain (120)-(124), where (120) holds by switching P and Q in (46); (121) holds since Y ∼ Q; (122) holds by switching P and Q in (115) (correspondingly, also X ∼ P and Y ∼ Q are switched); (123) holds since ı_{Q‖P} = −ı_{P‖Q}; (124) holds by the assumption that dP/dQ(X) has no probability masses, which implies that the sign < can be replaced with ≤ at the term P[ı_{P‖Q}(X) < −log γ] on the right side of (123). Finally, (116) readily follows from (120)-(124), which implies that the set A_2 determines F_{P‖Q}(·) on (−∞, 0).
A similar application of (107) yields an interplay between the DeGroot statistical information and the relative information spectrum.

Theorem 4. Let X ∼ P, and let the random variable ı_{P‖Q}(X) have no probability masses. Denote by B_1 and B_2 the corresponding sets of values of the DeGroot statistical information I_ω(P‖Q). Then,
• I_ω(P‖Q) is a continuously differentiable function of ω on (0, ½) ∪ (½, 1), and I'_ω(P‖Q) is, respectively, non-negative or non-positive on (0, ½) and (½, 1);
• the sets B_1 and B_2 determine, respectively, the relative information spectrum F_{P‖Q}(·) on [0, ∞) and (−∞, 0);
• for ω ∈ (0, ½) and ω ∈ (½, 1), respectively, the relative information spectrum is determined explicitly by the derivative I'_ω(P‖Q).

Remark 8. By relaxing the condition in Theorems 3 and 4 that ı_{P‖Q}(X) has no probability masses with X ∼ P, it follows from the proof of Theorem 3 that each one of the sets determines F_{P‖Q}(·) at every point on R where this relative information spectrum is continuous. Note that, as a cumulative distribution function, F_{P‖Q}(·) is discontinuous at a countable number of points at most.
Consequently, under the condition that f ∈ C is differentiable on (0, ∞), the integral representations of D f (P Q) in Theorem 1 are not affected by the countable number of discontinuities for F P Q (·).
In view of Theorems 1, 3 and 4 and Remark 8, we get the following result.
Corollary 1. Let f ∈ C be a differentiable function on (0, ∞), and let P Q be probability measures. Then, each one of the sets A and B in (131) and (132), respectively, determines D f (P Q).
A different representation of f-divergences, as an integral of DeGroot statistical informations, is due to [3]:
D_f(P‖Q) = ∫_{(0,1)} I_ω(P‖Q) dΓ_f(ω),
where Γ_f is a certain σ-finite measure defined on the Borel subsets of (0, 1); it is also shown in [3] (80) that if f ∈ C is twice differentiable on (0, ∞), then Γ_f is absolutely continuous with a density that is expressed in terms of the second derivative of f.

New f -Divergence Inequalities
Various approaches for the derivation of f-divergence inequalities have been studied in the literature (see Section 1 for references). This section suggests a new approach, leading to a lower bound on an arbitrary f-divergence by means of the E_γ divergence of an arbitrary order γ ≥ 1 (see (45)) or the DeGroot statistical information (see (50)). This approach leads to generalizations of the Bretagnolle-Huber inequality [58], whose generalizations are motivated later in this section. The utility of the f-divergence inequalities in this section is exemplified in the setup of Bayesian binary hypothesis testing.
In the following, we provide the first main result in this section for the derivation of new f -divergence inequalities by means of the E γ divergence. Generalizing the total variation distance, the E γ divergence in (45)-(47) is an f -divergence whose utility in information theory has been exemplified in [17] Chapter 3, [54], [57] p. 2314 and [69]; the properties of this measure were studied in [22] Section 7 and [54] Section 2.B.
Theorem 5. Let f ∈ C, and let f* ∈ C be the conjugate convex function as defined in (12). Let P and Q be probability measures. Then, the lower bound in (135) holds for all γ ∈ [1, ∞).
Proof. Let p = dP/dµ and q = dQ/dµ be the densities of P and Q with respect to a dominating measure µ (P, Q ≪ µ). Then, for an arbitrary a ∈ R, the chain of inequalities leading to (139) holds, where (139) follows from the convexity of f* and by invoking Jensen's inequality.
An application of Theorem 5 gives the following lower bounds on the Hellinger and Rényi divergences with arbitrary positive orders, expressed as a function of the E γ divergence with an arbitrary order γ ≥ 1.

Corollary 3.
For γ ∈ [1, ∞), upper bounds on the E_γ divergence hold as a function of the relative entropy and of the χ² divergence; see (149) and (150). Remark 10. From [4] (58), the bound in (151) is a tight lower bound on the chi-squared divergence as a function of the total variation distance. In view of (49), we compare (151) with the specialized version of (149) when γ = 1. The latter bound is expected to be looser than the tight bound in (151), as a result of the use of Jensen's inequality in the proof of Theorem 5; however, it is interesting to examine how much tightness is lost in this specialized bound with γ = 1. From (49), the substitution of γ = 1 in (149) gives (152), and it can be easily verified that:
• if |P − Q| ∈ [0, 1), then the lower bound on the right side of (152) is smaller than the tight lower bound on the right side of (151) by at most a factor of 2;
• if |P − Q| ∈ [1, 2), then the lower bound on the right side of (152) is smaller than the tight lower bound on the right side of (151) by at most a factor of 3/2.

Remark 12.
In [59] (8), Vajda introduced a lower bound on the relative entropy as a function of the total variation distance:
D(P‖Q) ≥ log_e((2 + |P − Q|)/(2 − |P − Q|)) − 2|P − Q|/(2 + |P − Q|).   (155)
The lower bound on the right side of (155) is asymptotically tight in the sense that it tends to ∞ if |P − Q| ↑ 2, and the difference between D(P‖Q) and this lower bound is uniformly upper bounded ([59] (9)). The Bretagnolle-Huber inequality in (153), on the other hand, is equivalent to
D(P‖Q) ≥ log_e(1/(1 − ¼|P − Q|²)).   (156)
Although it can be verified numerically that the lower bound on the relative entropy in (155) is everywhere slightly tighter than the lower bound in (156) (for |P − Q| ∈ [0, 2)), both lower bounds on D(P‖Q) have the same asymptotic tightness in the sense that they both tend to ∞ as |P − Q| ↑ 2 and their ratio tends to 1. Apart from their asymptotic tightness, the Bretagnolle-Huber inequality in (156) is appealing since it provides a closed-form, simple upper bound on |P − Q| as a function of D(P‖Q) (see (153)), whereas such a closed-form simple upper bound cannot be obtained from (155). In fact, by the substitution v := −(2 − |P − Q|)/(2 + |P − Q|) and the exponentiation of both sides of (155), we get the inequality v e^v ≤ −(1/e) exp(−D(P‖Q)), whose solution is expressed by the Lambert W function [72]; it can be verified that (155) is equivalent to the following upper bound on the total variation distance as a function of the relative entropy:
|P − Q| ≤ 2 (1 + W(−exp(−1 − D(P‖Q)))) / (1 − W(−exp(−1 − D(P‖Q)))),   (157)
where W on the right side of (157) denotes the principal real branch of the Lambert W function. The difference between the upper bounds in (153) and (157) can be verified to be marginal if D(P‖Q) is large (e.g., if D(P‖Q) = 4 nats, then the upper bounds on |P − Q| are respectively equal to 1.982 and 1.973), though the former upper bound in (153) is clearly simpler and more amenable to analysis. The Bretagnolle-Huber inequality in (153) is proved to be useful in the context of lower bounding the minimax risk (see, e.g., [5] pp. 89-90, 94), and in the problem of density estimation (see, e.g., [6] Section 1.6). The utility of this inequality motivates its generalization in this section (see Corollaries 2 and 3, and also Theorem 7 followed by Example 2).
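To make the comparison in Remark 12 concrete, the following illustrative sketch evaluates, for a few values of the relative entropy (in nats), the Bretagnolle-Huber upper bound on |P − Q| in (153) and the bound obtained by numerically inverting Vajda's inequality (155); the code and the tolerance choices are arbitrary.

```python
import numpy as np
from scipy.optimize import brentq

def bretagnolle_huber(D):
    # |P - Q| <= 2 * sqrt(1 - exp(-D)), with the relative entropy D in nats
    return 2.0 * np.sqrt(1.0 - np.exp(-D))

def vajda_lower_bound(V):
    # Vajda: D(P||Q) >= log((2+V)/(2-V)) - 2V/(2+V), with V = |P - Q| in [0, 2)
    return np.log((2.0 + V) / (2.0 - V)) - 2.0 * V / (2.0 + V)

def vajda_tv_bound(D):
    # Largest V in [0, 2) consistent with Vajda's inequality, found numerically
    return brentq(lambda V: vajda_lower_bound(V) - D, 0.0, 2.0 - 1e-12)

for D in (0.5, 1.0, 2.0, 4.0):
    print(D, bretagnolle_huber(D), vajda_tv_bound(D))
# At D = 4 nats the two bounds are roughly 1.982 and 1.973, as noted above.
```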
In [22] Section 7.C, Sason and Verdú generalized Pinsker's inequality by providing an upper bound on the E γ divergence, for γ > 1, as a function of the relative entropy. In view of (49) and the optimality of the constant in Pinsker's inequality (154), it follows that the minimum achievable D(P Q) is quadratic in E 1 (P Q) for small values of E 1 (P Q). It has been proved in [22] Section 7.C that this situation ceases to be the case for γ > 1, in which case it is possible to upper bound E γ (P Q) as a constant times D(P Q) where this constant tends to infinity as we let γ ↓ 1. We next cite the result in [22] Theorem 30, extending (154) by means of the E γ divergence for γ > 1, and compare it numerically to the bound in (150).

Theorem 6 ([22] Theorem 30). For every γ > 1,
sup E_γ(P‖Q) / D(P‖Q) = c_γ,   (159)
where the supremum is over P ≪ Q, P ≠ Q, and c_γ is a universal function (independent of (P, Q)), given by (160) and (161), where W_{−1} in (161) denotes the secondary real branch of the Lambert W function [72].
As an immediate consequence of (159), it follows that
E_γ(P‖Q) ≤ c_γ D(P‖Q),   (162)
which forms a straight-line bound on the E_γ divergence as a function of the relative entropy for γ > 1.
Similarly to the comparison of the Bretagnolle-Huber inequality (153) and Pinsker's inequality (154), we exemplify numerically that the extension of Pinsker's inequality to the E_γ divergence in (162) forms a counterpart to the generalized version of the Bretagnolle-Huber inequality in (150). Figure 1 plots an upper bound on the E_γ divergence, for γ ∈ {1.1, 2.0, 3.0, 4.0}, as a function of the relative entropy (or, alternatively, a lower bound on the relative entropy as a function of the E_γ divergence). The upper bound on E_γ(P‖Q) for γ > 1, as a function of D(P‖Q), is composed of the following two components:
• the straight-line bound, which refers to the right side of (162), is tighter than the bound on the right side of (150) if the relative entropy is below a certain value, denoted by d(γ) in nats (it depends on γ);
• the curvy line, which refers to the bound on the right side of (150), is tighter than the straight-line bound on the right side of (162) for larger values of the relative entropy.
This behavior is illustrated in Figure 1.

Bayesian Binary Hypothesis Testing
The DeGroot statistical information [16] has the following operational meaning: consider two hypotheses H_0 and H_1, and let P[H_0] = ω and P[H_1] = 1 − ω with ω ∈ (0, 1). Let P and Q be probability measures, and consider an observation Y where Y|H_0 ∼ P and Y|H_1 ∼ Q. Suppose that one wishes to decide which hypothesis is more likely given the observation Y. The operational meaning of the DeGroot statistical information, denoted by I_ω(P‖Q), is that this measure is equal to the difference between the a-priori error probability (without side information) and the minimum a-posteriori error probability (given the observation Y); see the computational sketch below. This measure was later identified as an f-divergence by Liese and Vajda [3] (see (50) here).

Theorem 7. The DeGroot statistical information satisfies the upper bound in (163) as a function of the chi-squared divergence, for ω ∈ (0, ½), and the following bounds as a function of the relative entropy:
(1) the bound in (164), where c_γ for γ > 1 is introduced in (160);
(2) the bound in (165), for ω ∈ (½, 1).
Proof. Inequalities (167) and (168) follow by combining (135) and (53).
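The operational meaning recalled above translates directly into a small computation on finite alphabets; in the following illustrative sketch (the function name and the toy distributions are arbitrary), the returned value is the difference between the a-priori error probability and the minimum a-posteriori error probability, i.e., I_ω(P‖Q).

```python
import numpy as np

def degroot_information(omega, P, Q):
    """I_omega(P||Q) = min(omega, 1-omega) - sum_a min(omega*P(a), (1-omega)*Q(a))."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    prior_error = min(omega, 1.0 - omega)
    posterior_error = float(np.sum(np.minimum(omega * P, (1.0 - omega) * Q)))
    return prior_error - posterior_error

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.25, 0.25, 0.5])
print(degroot_information(0.5, P, Q))   # equals |P - Q|/4 = 0.15 for this pair
print(degroot_information(0.7, P, Q))
```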
We end this section by exemplifying the utility of the bounds in Theorem 7.
Example 2. Let P[H_0] = ω and P[H_1] = 1 − ω with ω ∈ (0, 1), and assume that the observation Y, given that the hypothesis is H_0 or H_1, is Poisson distributed with the positive parameter µ or λ, respectively:
P_µ(k) = e^{−µ} µ^k / k!, P_λ(k) = e^{−λ} λ^k / k!, k ∈ {0, 1, 2, ...}.
Without any loss of generality, let ω ∈ (0, ½]. The bounds on the DeGroot statistical information I_ω(P_µ‖P_λ) in Theorem 7 can be expressed in closed form by relying on the identities
D(P_µ‖P_λ) = µ log_e(µ/λ) − (µ − λ) [nats], χ²(P_µ‖P_λ) = exp((µ − λ)²/λ) − 1.
In this example, we compare the simple closed-form bounds on I_ω(P_µ‖P_λ) in (163)-(165) with its exact value in (174). To simplify the right side of (174), let µ > λ, and define
k_0 := ⌊(log_e((1 − ω)/ω) + µ − λ) / log_e(µ/λ)⌋,   (175)
where, for x ∈ R, ⌊x⌋ denotes the largest integer that is smaller than or equal to x. It can be verified that ω P_µ(k) > (1 − ω) P_λ(k) if and only if k > k_0. Hence, from (174)-(176),
I_ω(P_µ‖P_λ) = ω P[Y_µ > k_0] − (1 − ω) P[Y_λ > k_0],   (178)
where Y_µ ∼ P_µ and Y_λ ∼ P_λ. To exemplify the utility of the bounds in Theorem 7, suppose that µ and λ are close, and we wish to obtain a guarantee on how small I_ω(P_µ‖P_λ) is. For example, let λ = 99, µ = 101, and ω = 1/10. The upper bounds on I_ω(P_µ‖P_λ) in (163)-(165) are, respectively, equal to 4.6 · 10^{−4}, 5.8 · 10^{−4} and 2.2 · 10^{−3}; we therefore get an informative guarantee by easily calculable bounds. The exact value of I_ω(P_µ‖P_λ) is, on the other hand, hard to compute since k_0 = 209 (see (175)), and the calculation of the right side of (178) appears to be sensitive to the selected parameters in this setting.
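For the parameters of Example 2, the exact value in (178) can also be evaluated with standard Poisson routines; the following illustrative sketch uses the positive-part form of the DeGroot statistical information (valid for ω ≤ ½) and shows how small I_ω(P_µ‖P_λ) actually is for λ = 99, µ = 101, ω = 1/10, in line with the closed-form guarantees quoted above. The function name and truncation level are arbitrary choices.

```python
import numpy as np
from scipy.stats import poisson

def degroot_poisson(omega, mu, lam, k_max=2000):
    """I_omega(P_mu || P_lam) = sum_k (omega*p_mu(k) - (1-omega)*p_lam(k))^+ , for omega <= 1/2."""
    k = np.arange(k_max + 1)
    diff = omega * poisson.pmf(k, mu) - (1.0 - omega) * poisson.pmf(k, lam)
    return float(np.sum(np.maximum(diff, 0.0)))

omega, lam, mu = 0.1, 99.0, 101.0
value = degroot_poisson(omega, mu, lam)
print(value)   # many orders of magnitude below the closed-form guarantee of about 4.6e-4
```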

Local Behavior of f -Divergences
This section studies the local behavior of f -divergences; the starting point relies on [56] Section 3 which studies the asymptotic properties of f -divergences. The reader is also referred to a related study in [22] Section 4.F.

Lemma 4. Let
• {P_n} be a sequence of probability measures on a measurable space (A, F);
• the sequence {P_n} converge to a probability measure Q in the sense that
lim_{n→∞} ess sup (dP_n/dQ)(Y) = 1, Y ∼ Q,   (179)
where P_n ≪ Q for all sufficiently large n;
• f, g ∈ C have continuous second derivatives at 1 and g''(1) > 0.
Then,
lim_{n→∞} D_f(P_n‖Q) / D_g(P_n‖Q) = f''(1) / g''(1).   (180)
Proof. The result in (180) follows from [56] Theorem 3, even without the additional restriction in [56] Section 3 which would require that the second derivatives of f and g are locally Lipschitz in a neighborhood of 1. More explicitly, in view of the analysis in [56] p. 1863, we get by relaxing the latter restriction that (cf. [56] (31)) the ratio D_f(P_n‖Q)/χ²(P_n‖Q) lies within ε_n of ½ f''(1) (181), where ε_n ↓ 0 as we let n → ∞, and also lim_{n→∞} χ²(P_n‖Q) = 0 (182).
By our assumption, due to the continuity of f'' and g'' at 1, it follows from (181) and (182) that
lim_{n→∞} D_f(P_n‖Q) / D_g(P_n‖Q) = (½ f''(1)) / (½ g''(1)),
which yields (180) (recall that, by assumption, g''(1) > 0).

Remark 17.
Since f and g in Lemma 4 are assumed to have continuous second derivatives at 1, the left and right derivatives of the weight function w_f in (79) at 1 satisfy, in view of Remark 3, the relations in (184) and (185). Hence, the limit on the right side of (180) can be expressed in terms of these one-sided derivatives of w_f and w_g at 1.

Lemma 5. Let P ≪ Q. Then, for all λ ∈ [0, 1],
χ²(λP + (1 − λ)Q ‖ Q) = λ² χ²(P‖Q).   (186)
Proof. Let p = dP/dµ and q = dQ/dµ be the densities of P and Q with respect to an arbitrary probability measure µ such that P, Q ≪ µ. Then,
χ²(λP + (1 − λ)Q ‖ Q) = ∫ (λp + (1 − λ)q − q)² / q dµ = λ² ∫ (p − q)² / q dµ = λ² χ²(P‖Q).

Remark 18. The result in Lemma 5, for the chi-squared divergence, is generalized to the identity
χ^s(λP + (1 − λ)Q ‖ Q) = λ^s χ^s(P‖Q), λ ∈ [0, 1],
for all s ≥ 1 (see (33)). The special case of s = 2 is required in the continuation of this section.
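The identities in Lemma 5 and Remark 18 are easy to confirm numerically on finite alphabets; the following illustrative check (with arbitrary distributions) does so for s = 2 and s = 3.

```python
import numpy as np

def chi_s(P, Q, s):
    """chi^s divergence: sum_a |P(a) - Q(a)|^s / Q(a)^(s-1); s = 2 gives the chi-squared divergence."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return float(np.sum(np.abs(P - Q) ** s / Q ** (s - 1)))

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.25, 0.25, 0.5])

for s in (2, 3):
    for lam in (0.01, 0.1, 0.5, 1.0):
        mix = lam * P + (1.0 - lam) * Q
        # Lemma 5 (s = 2) and Remark 18 (general s): chi^s(mix || Q) = lam^s * chi^s(P || Q)
        assert np.isclose(chi_s(mix, Q, s), lam ** s * chi_s(P, Q, s))
print("identities verified on this example")
```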

Remark 19.
The result in Lemma 5 can be generalized as follows: let P, Q, R be probability measures, and λ ∈ [0, 1]. Let P, Q, R ≪ µ for an arbitrary probability measure µ, and let p := dP/dµ, q := dQ/dµ, and r := dR/dµ be the corresponding densities with respect to µ. A calculation shows that (191) holds with a constant c given in (192). If Q = R, then c = 0 in (192), and (191) is specialized to (186). However, if Q ≠ R, then c may be non-zero. This shows that, for small λ ∈ [0, 1], the left side of (191) scales linearly in λ if c ≠ 0, and it has a quadratic scaling in λ if c = 0 and χ²(P‖R) ≠ χ²(Q‖R) (e.g., if Q = R, as in Lemma 5). The identity in (191) also yields an expression for the derivative with respect to λ in (193). We next state the main result in this section.

Theorem 8. Let
• P and Q be probability measures defined on a measurable space (A, F), Y ∼ Q, and suppose that
ess sup (dP/dQ)(Y) < ∞;   (194)
• f ∈ C, and let f'' be continuous at 1.
Then,
lim_{λ↓0} (1/λ²) D_f(λP + (1 − λ)Q ‖ Q) = lim_{λ↓0} (1/λ²) D_f(Q ‖ λP + (1 − λ)Q)   (195)
= ½ f''(1) χ²(P‖Q).   (196)
Proof. Let {λ_n}_{n∈N} be a sequence in [0, 1] which tends to zero, and define the sequence of probability measures
R_n := λ_n P + (1 − λ_n) Q, n ∈ N.   (197)
Note that P ≪ Q implies that R_n ≪ Q for all n ∈ N. Since, by (194), the sequence {R_n} converges to Q in the sense required by Lemma 4, applying Lemma 4 with g(t) := (t − 1)² yields (201), and, by combining (186) and (201), we get
lim_{n→∞} (1/λ_n²) D_f(R_n‖Q) = ½ f''(1) χ²(P‖Q).   (202)
We next prove the result for the limit on the right side of (195). Let f* : (0, ∞) → R be the conjugate function of f, which is given in (12). By the assumption that f has a continuous second derivative at 1, so does f*, and it is easy to verify that the second derivatives of f and f* coincide at 1. Hence, from (13) and (202), the same limit is obtained for (1/λ_n²) D_f(Q‖R_n), which completes the proof of (195) and (196).

Remark 20. Although an f-divergence is in general not symmetric, in the sense that the equality D_f(P‖Q) = D_f(Q‖P) does not necessarily hold for all pairs of probability measures (P, Q), the equality in (195) stems from the fact that the second derivatives of f and f* coincide at 1 when f is twice differentiable.
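Under the reading of (195)-(196) given above, the limit can be probed numerically; the following illustrative sketch does so for the relative entropy (f(t) = t log t, f''(1) = 1, natural logarithms), for which (1/λ²) D(λP + (1 − λ)Q ‖ Q) should approach ½ χ²(P‖Q) as λ ↓ 0. The distributions are arbitrary.

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.25, 0.25, 0.5])

def kl(P, Q):
    return float(np.sum(P * np.log(P / Q)))       # nats; both supports are full here

def chi2(P, Q):
    return float(np.sum((P - Q) ** 2 / Q))

target = 0.5 * chi2(P, Q)                          # (1/2) f''(1) chi^2(P||Q) with f''(1) = 1
for lam in (1e-1, 1e-2, 1e-3, 1e-4):
    mix = lam * P + (1.0 - lam) * Q
    print(lam, kl(mix, Q) / lam ** 2, target)      # the ratio approaches the target as lam -> 0
```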

Remark 21.
Under the conditions in Theorem 8, it follows from (196) that the chain of equalities (203)-(206) holds, where (206) relies on L'Hôpital's rule. The convexity of D_f(P‖Q) in the pair (P, Q) also implies that, for all λ ∈ [0, 1],
D_f(λP + (1 − λ)Q ‖ Q) ≤ λ D_f(P‖Q).   (207)
The following result refers to the local behavior of Rényi divergences of an arbitrary non-negative order.