Strongly Convex Divergences

We consider a sub-class of the f-divergences satisfying a stronger convexity property, which we refer to as strongly convex, or κ-convex divergences. We derive new and old relationships, based on convexity arguments, between popular f-divergences.


Notation
Throughout, f denotes a convex function f : (0, ∞) → R ∪ {∞} such that f(1) = 0. For a convex function defined on (0, ∞), we define f(0) := lim_{x→0} f(x). We denote by f* the convex function f* : (0, ∞) → R ∪ {∞} defined by f*(x) = x f(x⁻¹). We consider Borel probability measures P and Q on a Polish space X and define the f-divergence from P to Q, via densities p for P and q for Q with respect to a common reference measure µ, as

D_f(P||Q) := ∫_X f(p(x)/q(x)) q(x) dµ(x).   (1)

We note that this representation is independent of µ, and such a reference measure always exists; take µ = P + Q, for example.
For t, s ∈ [0, 1], define the binary f-divergence

D_f(t||s) := s f(t/s) + (1 − s) f((1 − t)/(1 − s)),

with the conventions f(0) = lim_{t→0⁺} f(t), 0 f(0/0) = 0, and 0 f(a/0) = a lim_{t→∞} f(t)/t. For a random variable X and a set A, we denote the probability that X takes a value in A by P(X ∈ A), the expectation of the random variable by EX, and the variance by Var(X) := E|X − EX|². For a probability measure µ satisfying µ(A) = P(X ∈ A) for all Borel A, we write X ∼ µ, and, when there exists a probability density function f such that P(X ∈ A) = ∫_A f(x)dγ(x) for a reference measure γ, we write X ∼ f. For a probability measure µ on X and an L² function f : X → R, we denote Var_µ(f) := Var(f(X)) for X ∼ µ.
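The discrete case of (1), together with the conventions above, is straightforward to implement. The following Python sketch (illustrative code, not part of the original text; the function names are ours) computes D_f(P||Q) for distributions on a finite set:

```python
import numpy as np

def f_divergence(p, q, f, f_star_at_0):
    """D_f(P||Q) = sum_x q(x) f(p(x)/q(x)) over a finite alphabet, with the
    conventions 0 f(0/0) = 0 and 0 f(a/0) = a lim_{t->inf} f(t)/t = a f*(0)."""
    total = 0.0
    for pi, qi in zip(p, q):
        if qi > 0:
            total += qi * f(pi / qi)
        elif pi > 0:
            total += pi * f_star_at_0
    return total

# Relative entropy: f(t) = t log t, with f(0) = 0 and f*(0) = lim f(t)/t = infinity
kl_generator = lambda t: t * np.log(t) if t > 0 else 0.0
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])
print(f_divergence(p, q, kl_generator, np.inf))
```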

Strongly Convex Divergences
Definition 1. A R ∪ {∞}-valued function f on a convex set K ⊆ R is κ-convex when, for x, y ∈ K and t ∈ [0, 1],

f(tx + (1 − t)y) ≤ t f(x) + (1 − t) f(y) − (κ/2) t(1 − t)(x − y)².   (3)

For example, when f is twice differentiable, (3) is equivalent to f''(x) ≥ κ for x ∈ K. Note that the case κ = 0 is just usual convexity.

Proposition 1. For f κ-convex on K and a K-valued random variable X with finite second moment,

E f(X) ≥ f(EX) + (κ/2) Var(X).

Proof.
Observe that it is enough to prove the result when κ = 0, where the proposition is reduced to the classical Jensen inequality for convex functions. Indeed, applying that case to the convex function φ(x) := f(x) − κx²/2 and using Var(X) = EX² − (EX)² gives the general statement.
Definition 2. An f-divergence D_f is κ-convex on an interval K for κ ≥ 0 when the function f is κ-convex on K. Table 1 lists some κ-convex f-divergences of interest to this article.
Observe that we have taken the normalization convention on the total variation (the total variation for a signed measure µ on a space X can be defined through the Hahn–Jordan decomposition of the measure into non-negative measures µ⁺ and µ⁻ such that µ = µ⁺ − µ⁻, as ‖µ‖_TV = µ⁺(X) + µ⁻(X) (see [18]); in our notation, |µ|_TV = ‖µ‖_TV/2), which we denote by |P − Q|_TV, such that |P − Q|_TV = sup_A |P(A) − Q(A)| ≤ 1. In addition, note that the α-divergence interpolates Pearson's χ²-divergence when α = 3, one half Neyman's χ²-divergence when α = −3, the squared Hellinger divergence when α = 0, and has as limiting cases the relative entropy when α = 1 and the reverse relative entropy when α = −1. Recall that f* satisfies the equality D_f*(P||Q) = D_f(Q||P). For brevity, we use χ²-divergence to refer to the Pearson χ²-divergence, and we articulate Neyman's χ² explicitly when necessary.
The next lemma is a restatement of Jensen's inequality.

Lemma 1. For f κ-convex on K, µ a probability measure on X, and g : X → K with g ∈ L²(µ),

∫_X f(g) dµ ≥ f( ∫_X g dµ ) + (κ/2) Var_µ(g).

For a convex function f such that f(1) = 0 and c ∈ R, the function f̃(t) = f(t) + c(t − 1) remains a convex function, and what is more satisfies

D_f̃(P||Q) = D_f(P||Q),

since ∫(p − q) dµ = 0. We pursue a generalization of the following bound on the total variation by the χ²-divergence [19–21].
Theorem 1 ([19–21]). For measures P and Q,

χ²(P||Q) ≥ 4|P − Q|²_TV.

We mention the work of Harremoës and Vajda [20], in which it is shown, through a characterization of the extreme points of the joint range associated to a pair of f-divergences (valid in general), that this inequality characterizes the "joint range", that is, the range of the function (P, Q) → (|P − Q|_TV, χ²(P||Q)). We use the following lemma, which shows that every strongly convex divergence can be lower bounded, up to its convexity constant κ > 0, by the χ²-divergence.

Lemma 2. For a κ-convex divergence D_f,

D_f(P||Q) ≥ (κ/2) χ²(P||Q).

Proof. Taking c = −f'_+(1), the function f̃(t) := f(t) + c(t − 1) defines the same κ-convex divergence as f. Thus, we may assume without loss of generality that f'_+ is uniquely zero when t = 1. Since f is κ-convex, φ : t ↦ f(t) − κ(t − 1)²/2 is convex, and, by f'_+(1) = 0, φ'_+(1) = 0 as well. Thus, φ takes its minimum when t = 1, and hence φ ≥ 0, so that f(t) ≥ κ(t − 1)²/2. Computing,

D_f(P||Q) = ∫ f(p/q) q dµ ≥ (κ/2) ∫ (p/q − 1)² q dµ = (κ/2) χ²(P||Q).

Based on a Taylor series expansion of f about 1, Nielsen and Nock ([22], Corollary 1) gave the estimate

D_f(P||Q) ≈ (f''(1)/2) χ²(P||Q)   (5)

for divergences with a non-zero second derivative and P close to Q. Lemma 2 complements this estimate with a lower bound when f is κ-convex. In particular, if f''(1) = κ, it shows that the approximation in (5) is an underestimate.
Theorem 2. For measures P and Q, and a κ-convex divergence D_f,

D_f(P||Q) ≥ 2κ |P − Q|²_TV.

Proof. By Lemma 2 and then Theorem 1,

D_f(P||Q) ≥ (κ/2) χ²(P||Q) ≥ (κ/2) · 4|P − Q|²_TV = 2κ |P − Q|²_TV.

The proof of Lemma 2 uses a pointwise inequality between convex functions to derive an inequality between their respective divergences. This simple technique was shown to have useful implications by Sason and Verdú in [6], where it appears as Theorem 1 and is used to give sharp comparisons in several f-divergence inequalities.
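As a numerical sanity check of Lemma 2 and Theorems 1 and 2 (a minimal sketch of our own, not from the original text), one can test random discrete distributions. Here we work in nats (log e = 1) and take f(t) = t log t with κ = 1/b, where b bounds the density ratio, since f''(t) = 1/t ≥ 1/b on (0, b]:

```python
import numpy as np

rng = np.random.default_rng(0)

def tv(p, q):                 # |P - Q|_TV = (1/2) sum |p - q| <= 1
    return 0.5 * np.abs(p - q).sum()

def chi2(p, q):               # Pearson chi^2(P||Q) = sum (p - q)^2 / q
    return ((p - q) ** 2 / q).sum()

def kl(p, q):                 # relative entropy, in nats
    m = p > 0
    return (p[m] * np.log(p[m] / q[m])).sum()

for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    kappa = 1.0 / (p / q).max()        # t log t is kappa-convex on (0, b]
    assert chi2(p, q) >= 4 * tv(p, q) ** 2 - 1e-12        # Theorem 1
    assert kl(p, q) >= 0.5 * kappa * chi2(p, q) - 1e-12   # Lemma 2 on (0, b]
    assert kl(p, q) >= 2 * kappa * tv(p, q) ** 2 - 1e-12  # Theorem 2
```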
Theorem 3 (Sason–Verdú [6]). For divergences defined by g and f with c f(t) ≥ g(t) for all t,

D_g(P||Q) ≤ c D_f(P||Q).

Corollary 1.
For a smooth κ-convex divergence f, the inequality

D_f(P||Q) ≥ (κ/2) χ²(P||Q)

is sharp multiplicatively, in the sense that

inf_{P≠Q} D_f(P||Q)/χ²(P||Q) = κ/2

if f''(1) = κ.
In information geometry, a standard f-divergence is defined as an f-divergence satisfying the normalization f(1) = f'(1) = 0, f''(1) = 1 (see [23]). Thus, Corollary 1 shows that χ²/2 provides a sharp lower bound on every standard f-divergence that is 1-convex. In particular, the lower bound in Lemma 2 complementing the estimate (5) is shown to be sharp.

Proposition 2.
Let D_f be an f-divergence such that f is κ-convex on [a, b], and let P_θ and Q_θ be probability measures indexed by a set Θ, with densities p_θ and q_θ with respect to a common reference measure γ, such that

a ≤ dP_θ/dQ_θ(x) ≤ b

holds for all θ. Then, for P := ∫_Θ P_θ dµ(θ) and Q := ∫_Θ Q_θ dµ(θ), with densities p and q,

∫_Θ D_f(P_θ||Q_θ) dµ(θ) ≥ D_f(P||Q) + (κ/2) ∫_X Var_{µ̃_x}(p_θ/q_θ) q(x) dγ(x),

where dµ̃_x(θ) := (q_θ(x)/q(x)) dµ(θ). In particular, when Q_θ = Q for all θ,

∫_Θ D_f(P_θ||Q) dµ(θ) ≥ D_f(P||Q) + (κ/2) ∫_X Var_µ(p_θ/q) q dγ.

Proof. Let dθ denote a reference measure dominating µ, so that dµ = ϕ(θ)dθ, and write

∫_Θ D_f(P_θ||Q_θ) dµ(θ) = ∫_X ( ∫_Θ f(p_θ/q_θ) dµ̃_x(θ) ) q(x) dγ(x).   (14)

By Jensen's inequality, as in Lemma 1, for each fixed x,

∫_Θ f(p_θ/q_θ) dµ̃_x(θ) ≥ f( ∫_Θ (p_θ/q_θ) dµ̃_x(θ) ) + (κ/2) Var_{µ̃_x}(p_θ/q_θ).

Note that

∫_Θ (p_θ/q_θ)(x) dµ̃_x(θ) = (1/q(x)) ∫_Θ p_θ(x) dµ(θ) = p(x)/q(x).

Inserting these equalities into (14) gives the result.
To obtain the total variation bound, one needs only to apply Jensen's inequality, in the form of Cauchy–Schwarz,

∫_X Var_µ(p_θ/q) q dγ = ∫_Θ ∫_X (p_θ − p)²/q dγ dµ(θ) ≥ ∫_Θ ( ∫_X |p_θ − p| dγ )² dµ(θ) = 4 ∫_Θ |P_θ − P|²_TV dµ(θ).

Observe that, taking Q_θ = Q = P = ∫_Θ P_θ dµ(θ) in Proposition 2, one obtains a lower bound for the average f-divergence from a set of distributions to their barycenter, by the mean square total variation of the set of distributions to the barycenter,

∫_Θ D_f(P_θ||P) dµ(θ) ≥ 2κ ∫_Θ |P_θ − P|²_TV dµ(θ).

An alternative proof of this can be obtained by applying |P_θ − P|²_TV ≤ D_f(P_θ||P)/2κ from Theorem 2 pointwise.
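The barycenter bound is easy to test numerically; the sketch below (our own illustration, not from the paper) checks it for the relative entropy in nats, with κ = 1/b for b bounding the ratios p_θ/p:

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    m = p > 0
    return (p[m] * np.log(p[m] / q[m])).sum()

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

components = rng.dirichlet(np.ones(4), size=6)   # P_theta, uniform mixing measure mu
p_bar = components.mean(axis=0)                  # barycenter P

kappa = 1.0 / (components / p_bar).max()         # t log t is kappa-convex where the ratios live
avg_div = np.mean([kl(pt, p_bar) for pt in components])
avg_tv2 = np.mean([tv(pt, p_bar) ** 2 for pt in components])
assert avg_div >= 2 * kappa * avg_tv2 - 1e-12
```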
The next result shows that, for f strongly convex, Pinsker type inequalities can never be reversed.

Proposition 3. Given f strongly convex and M > 0, there exist measures P, Q such that

D_f(P||Q) > M |P − Q|_TV.

In fact, building on the work of Basu–Shioya–Park [24] and Vajda [25], Sason and Verdú proved in [6] that, for any f-divergence,

sup_{P≠Q} D_f(P||Q)/|P − Q|_TV = f(0) + f*(0).

Thus, an f-divergence can be bounded above by a constant multiple of the total variation if and only if f(0) + f*(0) < ∞. From this perspective, Proposition 3 is simply the obvious fact that strongly convex functions have super-linear (at least quadratic) growth at infinity.
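To illustrate, f(t) = (t − 1)² has f(0) + f*(0) = ∞, and the ratio χ²/|P − Q|_TV can be made arbitrarily large already for two-point measures. A small numerical illustration (our own sketch):

```python
import numpy as np

def chi2(p, q):
    return ((p - q) ** 2 / q).sum()

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

# P = (1/2, 1/2) fixed; Q = (q0, 1 - q0) with q0 -> 0. The total variation stays
# bounded while chi^2 blows up like 1/(4 q0): no reverse Pinsker inequality.
for q0 in [1e-1, 1e-2, 1e-3, 1e-4]:
    p = np.array([0.5, 0.5])
    q = np.array([q0, 1 - q0])
    print(q0, chi2(p, q) / tv(p, q))
```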

Skew Divergences
If we denote by Cvx(0, ∞) the quotient of the cone of convex functions f on (0, ∞) such that f(1) = 0, under the identification of f with f̃(t) = f(t) + c(t − 1), then the map f ↦ D_f gives a linear isomorphism between Cvx(0, ∞) and the space of all f-divergences. The mapping T : f ↦ f* induces, through this isomorphism, the involution D_f ↦ D_f* of the space of f-divergences that reverses the arguments of a divergence. Mathematically, skew divergences give an interpolation of this involution, as

(P, Q) ↦ D_f((1 − t)P + tQ || (1 − s)P + sQ)

gives D_f(P||Q) by taking s = 1 and t = 0, or yields D_f*(P||Q) by taking s = 0 and t = 1.
Moreover, as mentioned in the Introduction, skewing imposes boundedness of the Radon–Nikodym derivative dP/dQ, which allows us to constrain the domain of f-divergences and leverage κ-convexity to obtain f-divergence inequalities in this section.
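A quick illustration of this boundedness (our own sketch, not from the original text): the ratio d((1 − t)P + tQ)/d((1 − s)P + sQ) is at most max{(1 − t)/(1 − s), t/s}, so skewing with s in the interior of [0, 1] renders an otherwise infinite divergence finite.

```python
import numpy as np

def kl(p, q):
    m = p > 0
    return (p[m] * np.log(p[m] / q[m])).sum() if np.all(q[m] > 0) else np.inf

def skew_kl(p, q, t, s):
    """D((1 - t)P + tQ || (1 - s)P + sQ), an f-divergence in (P, Q) by Theorem 4."""
    return kl((1 - t) * p + t * q, (1 - s) * p + s * q)

p = np.array([0.9, 0.1, 0.0])
q = np.array([0.1, 0.1, 0.8])

print(kl(q, p))                  # inf: Q charges a point that P does not
print(skew_kl(q, p, 0.0, 0.5))   # finite: dQ/d((P + Q)/2) <= 2
```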
The following appears as Theorem III.1 in the preprint [26]. It states that skewing an f-divergence preserves its status as such. This guarantees that the generalized skew divergences of this section are indeed f-divergences. A proof is given in Appendix A for the convenience of the reader.

Theorem 4 (Melbourne et al. [26]). For t, s ∈ [0, 1] and a divergence D_f, the map

(P, Q) ↦ D_f((1 − t)P + tQ || (1 − s)P + sQ)

is an f-divergence as well.

Definition 4.
For an f-divergence D_f, its skew symmetrization ∆_f is determined by the convex function

g(t) = ((1 + t)/4) [ f(2t/(1 + t)) + f(2/(1 + t)) ],

so that

∆_f(P||Q) = (1/2) D_f(P||(P + Q)/2) + (1/2) D_f(Q||(P + Q)/2).

Observe that ∆_f(P||Q) = ∆_f(Q||P). When f(x) = x log x, the relative entropy's skew symmetrization is the Jensen–Shannon divergence. When f(x) = (x − 1)², up to a normalization constant, the χ²-divergence's skew symmetrization is the Vincze–Le Cam divergence, which we state below for emphasis. The work of Topsøe [11] provides more background on this divergence, where it is referred to as the triangular discrimination.

Definition 5 (Vincze–Le Cam). For probability measures P and Q with densities p and q with respect to a common reference measure µ,

∆(P||Q) := ∫ (p − q)²/(p + q) dµ.

Corollary 2.
For an f-divergence such that f is κ-convex on (0, 2),

∆_f(P||Q) ≥ (κ/4) ∆(P||Q),

with equality when f(t) = (t − 1)², corresponding to the χ²-divergence, where ∆_f denotes the skew symmetrized divergence associated to f and ∆ is the Vincze–Le Cam divergence.
Proof. Applying Proposition 2 with Θ = {1, 2}, µ uniform, P_1 = P, P_2 = Q, and Q_θ = M := (P + Q)/2, the left hand side is ∆_f(P||Q), the divergence term vanishes since D_f(M||M) = 0, and the variance term computes, for fixed x, to (p − q)²/4m², so that its integral against m equals ∆(P||Q)/2 and the bound follows.

Applied to f(x) = x log x, which is (log e)/2-convex on (0, 2), this demonstrates that, up to the constant (log e)/8, the Jensen–Shannon divergence bounds the Vincze–Le Cam divergence,

JS(P||Q) ≥ (log e/8) ∆(P||Q)

(see [11] for improvement of the inequality in the case of the Jensen–Shannon divergence, called the "capacitory discrimination" in the reference, by a factor of 2).
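In nats (log e = 1), the bound reads JS(P||Q) ≥ ∆(P||Q)/8, which is simple to check numerically (an illustrative sketch with our own helper names):

```python
import numpy as np

rng = np.random.default_rng(2)

def kl(p, q):
    m = p > 0
    return (p[m] * np.log(p[m] / q[m])).sum()

def js(p, q):                     # Jensen-Shannon divergence, in nats
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def vincze_le_cam(p, q):          # Delta(P||Q) = sum (p - q)^2 / (p + q)
    return ((p - q) ** 2 / (p + q)).sum()

# f(x) = x log x has f''(x) = 1/x >= 1/2 on (0, 2), so kappa = 1/2 and
# Corollary 2 gives JS >= (kappa/4) Delta = Delta/8.
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    assert js(p, q) >= vincze_le_cam(p, q) / 8 - 1e-12
```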
Definition 6 (Nielsen [17]). For p and q densities with respect to a reference measure µ, w ∈ (0, 1)ⁿ with ∑_{i=1}ⁿ w_i = 1, and α ∈ [0, 1]ⁿ,

JS^{α,w}(p||q) := ∑_{i=1}ⁿ w_i D((1 − α_i)p + α_i q || (1 − ᾱ)p + ᾱq),

where ∑_{i=1}ⁿ w_i α_i = ᾱ.
Theorem 5. For p and q densities with respect to a reference measure µ, α ∈ [0, 1]ⁿ, and w_i > 0 such that ∑_{i=1}ⁿ w_i = 1,

2 log e ( ∑_i w_i (α_i − ᾱ)² ) |p − q|²_TV ≤ JS^{α,w}(p||q) ≤ H(w) A |p − q|_TV,

where H(w) := −∑_i w_i log w_i, ᾱ_i := ∑_{j≠i} w_j α_j/(1 − w_i), and A := max_i |α_i − ᾱ_i|. Note that, since ᾱ_i is the w-average of the α_j terms with α_i removed, ᾱ_i ∈ [0, 1] and thus A ≤ 1. We need the following theorem from Melbourne et al. [26] for the upper bound.

Theorem 6 (Melbourne et al. [26]). For densities f_i with respect to a common reference measure and λ_i > 0 with ∑_{i=1}ⁿ λ_i = 1,

∑_i λ_i D(f_i || ∑_j λ_j f_j) ≤ H(λ) T,

where T := max_i |f_i − g_i|_TV and g_i := ∑_{j≠i} λ_j f_j/(1 − λ_i).

Proof of Theorem 5. We apply Theorem 6 with f_i = (1 − α_i)p + α_i q and λ_i = w_i, noticing that in general we have

f_i − g_i = (α_i − ᾱ_i)(q − p).

Thus, T = max_i |α_i − ᾱ_i| |p − q|_TV = A|p − q|_TV, and the proof of the upper bound is complete.
To prove the lower bound, we apply Pinsker's inequality, 2 log e|P − Q|²_TV ≤ D(P||Q), to each term, using |(1 − α_i)p + α_i q − ((1 − ᾱ)p + ᾱq)|_TV = |α_i − ᾱ| |p − q|_TV.

Definition 7. Given an f-divergence D_f, densities p and q with respect to a common reference measure, α ∈ [0, 1]ⁿ, and w ∈ (0, 1)ⁿ such that ∑_i w_i = 1, define its generalized skew divergence

D_f^{α,w}(p||q) := ∑_{i=1}ⁿ w_i D_f((1 − α_i)p + α_i q || (1 − ᾱ)p + ᾱq).
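The two bounds of Theorem 5 can be compared numerically, again in nats; the lower bound follows from Pinsker's inequality alone, while the upper bound uses Theorem 6. A sketch with our own variable names:

```python
import numpy as np

rng = np.random.default_rng(3)

def kl(p, q):
    m = p > 0
    return (p[m] * np.log(p[m] / q[m])).sum()

p = rng.dirichlet(np.ones(4))
q = rng.dirichlet(np.ones(4))
alpha = np.array([0.1, 0.5, 0.9])
w = np.array([0.2, 0.3, 0.5])
a_bar = w @ alpha

# Nielsen's generalized Jensen-Shannon divergence (Definition 6)
js_gen = sum(wi * kl((1 - ai) * p + ai * q, (1 - a_bar) * p + a_bar * q)
             for wi, ai in zip(w, alpha))

V = 0.5 * np.abs(p - q).sum()
lower = 2 * (w * (alpha - a_bar) ** 2).sum() * V ** 2   # Pinsker, term by term
a_bar_i = (w @ alpha - w * alpha) / (1 - w)             # leave-one-out averages
upper = -(w * np.log(w)).sum() * np.abs(alpha - a_bar_i).max() * V
assert lower <= js_gen + 1e-12
print(lower, js_gen, upper)      # lower <= js_gen <= upper, as in Theorem 5
```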

Note that, by Theorem 4, D_f^{α,w} is an f-divergence. The generalized skew divergence of the relative entropy is the generalized Jensen–Shannon divergence JS^{α,w}. We denote the generalized skew divergence of the χ²-divergence from p to q by χ²_{α,w}(p||q). Note that, when n = 2, α_1 = 0, α_2 = 1, and w_i = 1/2, we recover the skew symmetrized divergence of Definition 4. The following theorem shows that the usual upper bound of the relative entropy by the χ²-divergence can be reversed, up to a factor, in the skewed case.

Theorem 7. For densities p and q, α ∈ [0, 1]ⁿ, and w ∈ (0, 1)ⁿ with ∑_i w_i = 1,

χ²_{α,w}(p||q) ≤ (2b/log e) JS^{α,w}(p||q),

where b := max_i max{α_i/ᾱ, (1 − α_i)/(1 − ᾱ)}.

Proof. By definition, JS^{α,w}(p||q) = ∑_i w_i D(P_i||Q), taking P_i to be the measure associated to (1 − α_i)p + α_i q and Q given by (1 − ᾱ)p + ᾱq, so that dP_i/dQ ≤ b. Since f(x) = x log x, the convex function associated to the usual KL divergence, satisfies f''(x) = log e/x ≥ (log e)/b on (0, b], Proposition 2 gives

∑_i w_i D(P_i||Q) − D(∑_i w_i P_i||Q) ≥ (log e/2b) ∑_i w_i χ²(P_i||Q).   (42)

Since Q = ∑_i w_i P_i, the left hand side of (42) is JS^{α,w}(p||q), while the right hand side equals (log e/2b) χ²_{α,w}(p||q). Rearranging gives

χ²_{α,w}(p||q) ≤ (2b/log e) JS^{α,w}(p||q),

which is our conclusion.

Total Variation Bounds and Bayes Risk
In this section, we derive bounds on the Bayes risk associated to a family of probability measures with a prior distribution λ. Let us state definitions and recall basic relationships. Given probability densities {p_i}_{i=1}ⁿ on a space X with respect to a reference measure µ and λ_i ≥ 0 such that ∑_{i=1}ⁿ λ_i = 1, define the Bayes risk

R := 1 − ∫_X max_i λ_i p_i(x) dµ(x).

If ℓ(x, y) = 1 − δ_x(y) and we define T(x) := arg max_i λ_i p_i(x), then observe that this definition is consistent with the usual definition of the Bayes risk associated to the loss function ℓ. Below, we consider θ to be a random variable on {1, 2, . . . , n} such that P(θ = i) = λ_i, and X to be a variable with conditional distribution P(X ∈ A|θ = i) = ∫_A p_i(x)dµ(x). The following result shows that the Bayes risk gives the probability of the categorization error, under an optimal estimator.

Proposition 5. The Bayes risk satisfies

R = min_{θ̂} P(θ̂(X) ≠ θ),

where the minimum is defined over measurable θ̂ : X → {1, 2, . . . , n}.
Proof. For any θ̂,

P(θ̂(X) = θ) = ∫_X ∑_i λ_i p_i(x) 1{θ̂(x) = i} dµ(x) ≤ ∫_X max_i λ_i p_i(x) dµ(x),

with equality for θ̂ = T, which gives our conclusion.
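Proposition 5 can be illustrated by simulation: drawing θ from the prior and X from p_θ, the empirical error of the rule T matches the Bayes risk. A minimal sketch (our own construction, not from the text):

```python
import numpy as np

rng = np.random.default_rng(4)

lam = np.array([0.2, 0.3, 0.5])          # prior lambda
P = rng.dirichlet(np.ones(8), size=3)    # densities p_1, p_2, p_3 on 8 points

bayes_risk = 1.0 - (lam[:, None] * P).max(axis=0).sum()
decode = (lam[:, None] * P).argmax(axis=0)   # T(x) = argmax_i lambda_i p_i(x)

n = 200_000
theta = rng.choice(3, size=n, p=lam)
u = rng.random(n)
# X ~ p_theta via inverse CDF sampling (clip guards against rounding at the top)
x = (u[:, None] > P.cumsum(axis=1)[theta]).sum(axis=1).clip(max=7)
print(bayes_risk, (decode[x] != theta).mean())   # empirical error approximates R
```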
It is known (see, for example, [9,31]) that the Bayes risk can also be tied directly to the total variation in the following special case, whose proof we include for completeness.

Proposition 6. When n = 2 and λ_1 = λ_2 = 1/2, the Bayes risk associated to the densities p_1 and p_2 satisfies

R = (1 − |p_1 − p_2|_TV)/2.

Proof. Since p_T = (|p_1 − p_2| + p_1 + p_2)/2, integrating gives ∫_X p_T(x)dµ(x) = |p_1 − p_2|_TV + 1, from which the equality follows.
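Proposition 6 is likewise immediate to verify on random discrete densities (our sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
p1, p2 = rng.dirichlet(np.ones(6), size=2)

bayes_risk = 1.0 - np.maximum(0.5 * p1, 0.5 * p2).sum()
tv = 0.5 * np.abs(p1 - p2).sum()
assert np.isclose(bayes_risk, (1 - tv) / 2)   # Proposition 6
```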
Information theoretic bounds to control the Bayes and minimax risk have an extensive literature (see, for example, [9,32–35]). Fano's inequality is the seminal result in this direction, and we direct the reader to a survey of such techniques in statistical estimation in [36]. What follows can be understood as a sharpening of the work of Guntuboyina [9] under the assumption of κ-convexity.
The function T(x) = arg max_i {λ_i p_i(x)} induces the following convex decompositions of our densities. The density q can be realized as the convex combination q = (1 − Q)q_1 + Q q_2 of q_1 := λ_T q/(1 − Q) and q_2 := (1 − λ_T)q/Q, where Q := 1 − ∫ λ_T q dµ. If we take p := ∑_i λ_i p_i, then p can be decomposed similarly as p = (1 − R)ρ_1 + R ρ_2, with ρ_1 := λ_T p_T/(1 − R) and ρ_2 := (p − λ_T p_T)/R, so that the term W_0 below can be expressed explicitly as an integral of the conditional variance Var_{λ,i≠T}(p_i/q), where, for fixed x, we consider Var_{λ,i≠T}(p_i/q) to be the variance of a random variable taking the values p_i(x)/q(x) with probability λ_i/(1 − λ_{T(x)}) for i ≠ T(x). Note this term is non-zero only when n > 2.

Theorem 8. For D_f κ-convex on [a, b], a prior λ, and densities p_1, . . . , p_n and q such that a ≤ p_i(x)/q(x) ≤ b,

∑_i λ_i D_f(p_i||q) ≥ D_f(1 − R||1 − Q) + (κ/2) W,

where R is the Bayes risk, Q = 1 − ∫ λ_T q dµ, and W = W_0 + W_1 + W_2 collects the variance terms appearing in the proof below.
Proof. For a fixed x, we apply Lemma 1 with the probability weights (λ_i/(1 − λ_{T(x)}))_{i≠T(x)}, giving

∑_{i≠T} (λ_i/(1 − λ_T)) f(p_i/q) ≥ f( (p − λ_T p_T)/((1 − λ_T)q) ) + (κ/2) Var_{λ,i≠T}(p_i/q).

Integrating against (1 − λ_T)q dµ, the variance term produces W_0. Applying the κ-convexity of f once more, via Lemma 1, to the remaining terms, with respect to the decompositions q = (1 − Q)q_1 + Q q_2 and p = (1 − R)ρ_1 + R ρ_2, produces the binary divergence D_f(1 − R||1 − Q) together with two further variance terms, W_1 and W_2. Writing W = W_0 + W_1 + W_2, we have our result.

Corollary 4. When λ_i = 1/n,

(1/n) ∑_i D_f(p_i||q) ≥ D_f(1 − R||1/n) + (κ/2) W;

further, when n = 2,

(1/2)(D_f(p_1||q) + D_f(p_2||q)) ≥ D_f((1 + |p_1 − p_2|_TV)/2 || 1/2) + (κ/2) W.

Proof. Note that q_1 = q_2 = q, since λ_i = 1/n implies λ_T = 1/n as well. In addition, Q = 1 − ∫ λ_T q dµ = (n − 1)/n, so that applying Theorem 8 gives the first inequality. The term W can be simplified as well, in the notation of the proof of Theorem 8. For the special case, one needs only to recall that R = (1 − |p_1 − p_2|_TV)/2, by Proposition 6, while inserting 2 for n.

Corollary 5.
When p_i ≤ q/t* holds for all i, with t* := min_i λ_i,

∑_i λ_i D(p_i||q) ≥ D(1 − R||1 − Q) + (t* log e/2) W,

for D(p_i||q) the relative entropy. In particular,

∑_i λ_i D(p_i||q) ≥ D(1 − R||1 − P),

where P = 1 − ∫ λ_T p dµ for p = ∑_i λ_i p_i and t* = min λ_i.

Proof. For the relative entropy, f(x) = x log x satisfies f''(x) = log e/x ≥ t* log e on (0, 1/t*]. When p_i ≤ q/t* holds for all i, we can thus apply Theorem 8 with b = 1/t*. For the second inequality, recall the compensation identity, ∑_i λ_i D(p_i||q) = ∑_i λ_i D(p_i||p) + D(p||q), note that p_i ≤ p/t* always holds, and apply the first inequality, with the non-negative term W discarded, to ∑_i λ_i D(p_i||p) for the result.

Corollary 6. For densities p_1 and p_2 with V := |p_1 − p_2|_TV,

JS(p_1, p_2) ≥ D((1 + V)/2 || 1/2).

Proof. Since p_1, p_2 ≤ 2m for m = (p_1 + p_2)/2, the relative entropy is (log e)/2-convex on the relevant domain, and the result can be obtained from Corollary 4, with n = 2 and q = m, for V = |p_1 − p_2|_TV.

On Topsøe's Sharpening of Pinsker's Inequality
For P_i, Q probability measures with densities p_i and q with respect to a common reference measure, and ∑_{i=1}ⁿ t_i = 1 with t_i > 0, denote P = ∑_i t_i P_i, with density p = ∑_i t_i p_i; the compensation identity is

∑_i t_i D(P_i||Q) = D(P||Q) + ∑_i t_i D(P_i||P).   (61)

Theorem 9. For P_1 and P_2, denote M_k = 2⁻ᵏP_1 + (1 − 2⁻ᵏ)P_2 and V = |P_1 − P_2|_TV; then the following sharpening of Pinsker's inequality can be derived:

D(P_1||P_2) = ∑_{k=0}^∞ 2^{k+1} JS(M_k, P_2) ≥ ∑_{k=0}^∞ 2^{k+1} D((1 + 2⁻ᵏV)/2 || 1/2) ≥ 2 log e V².

Proof. When n = 2 and t_1 = t_2 = 1/2, if we denote M = (P_1 + P_2)/2, then (61) reads as

(1/2) D(P_1||Q) + (1/2) D(P_2||Q) = D(M||Q) + JS(P_1, P_2).

Taking Q = P_2, we arrive at

D(P_1||P_2) = 2 D(M||P_2) + 2 JS(P_1, P_2).

Iterating, and writing M_k = 2⁻ᵏP_1 + (1 − 2⁻ᵏ)P_2, so that (M_k + P_2)/2 = M_{k+1}, we have

D(P_1||P_2) = ∑_{k=0}^{n−1} 2^{k+1} JS(M_k, P_2) + 2ⁿ D(M_n||P_2).

It can be shown (see [11]) that 2ⁿ D(M_n||P_2) → 0 with n → ∞, giving the series representation

D(P_1||P_2) = ∑_{k=0}^∞ 2^{k+1} JS(M_k, P_2).

Note that the ρ-decomposition of M_k is exactly the one induced by T, and |M_k − P_2|_TV = 2⁻ᵏV; thus, by Corollary 6,

JS(M_k, P_2) ≥ D((1 + 2⁻ᵏV)/2 || 1/2)

for each k, while the binary Pinsker inequality applied termwise sums to exactly 2 log e V². Thus, we arrive at the desired sharpening of Pinsker's inequality.
Observe that the k = 0 term in the above series is equivalent to

2 JS(P_1, P_2) ≥ 2 D((1 + V)/2 || 1/2),

where ρ_i is the convex decomposition of p = (p_1 + p_2)/2 in terms of T(x) = arg max{p_1(x), p_2(x)}.
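Both the compensation identity (61) and the series representation in the proof of Theorem 9 are easy to confirm numerically, in nats (a sketch with our own helpers, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(6)

def kl(p, q):
    m = p > 0
    return (p[m] * np.log(p[m] / q[m])).sum()

def js(p, q):
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p1, p2 = rng.dirichlet(np.ones(4), size=2)

# Compensation identity (61) with n = 2, t_1 = t_2 = 1/2
mix = 0.5 * p1 + 0.5 * p2
lhs = 0.5 * kl(p1, p2) + 0.5 * kl(p2, p2)
rhs = kl(mix, p2) + 0.5 * kl(p1, mix) + 0.5 * kl(p2, mix)
assert np.isclose(lhs, rhs)

# Partial sums of D(P1||P2) = sum_k 2^{k+1} JS(M_k, P_2), with M_k -> P_2
total = sum(2.0 ** (k + 1) * js(2.0 ** -k * p1 + (1 - 2.0 ** -k) * p2, p2)
            for k in range(30))
assert np.isclose(total, kl(p1, p2))
```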

Conclusions
In this article, we begin a systematic study of strongly convex divergences and how the strength of convexity of a divergence generator f, quantified by the parameter κ, influences the behavior of the divergence D_f. We prove that every strongly convex divergence dominates the square of the total variation, extending the classical bound provided by the χ²-divergence. We also study a general notion of skew divergence, providing new bounds, in particular for the generalized skew divergence of Nielsen. Finally, we show how κ-convexity can be leveraged to yield improvements of Bayes risk f-divergence inequalities, and as a consequence achieve a sharpening of Pinsker's inequality.