1. Introduction
The concept of an $f$-divergence, introduced independently by Ali-Silvey [1], Morimoto [2], and Csiszár [3], unifies several important information measures between probability distributions, as integrals of a convex function $f$, composed with the Radon–Nikodym derivative of the two probability distributions. (An additional assumption can be made that $f$ is strictly convex at 1, to ensure that $D_f(\mu\|\nu) > 0$ for $\mu \neq \nu$. This obviously holds for any $f$ with $f''(1) > 0$, and can hold for some $f$-divergences without classical derivatives at 1; for instance, the total variation is strictly convex at 1. An example of an $f$-divergence not strictly convex is provided by the so-called “hockey-stick” divergence, where $f(x) = (x - \gamma)_+$, see [4,5,6].) For a convex function $f : (0,\infty) \to \mathbb{R}$ such that $f(1) = 0$, and measures $P$ and $Q$ such that $P \ll Q$, the $f$-divergence from $P$ to $Q$ is given by
$$D_f(P\|Q) := \int f\left(\frac{dP}{dQ}\right) dQ.$$
The canonical example of an $f$-divergence, realized by taking $f(x) = x \log x$, is the relative entropy (often called the KL-divergence), which we denote with the subscript $f$ omitted.
$f$-divergences inherit many properties enjoyed by this special case: non-negativity, joint convexity of arguments, and a data processing inequality. Other important examples include the total variation, the $\chi^2$-divergence, and the squared Hellinger distance. The reader is directed to Chapters 6 and 7 of [7] for more background.
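For finite distributions the defining integral reduces to a sum, which makes the examples above easy to compare. The following is a minimal sketch, assuming $D_f(P\|Q) = \sum_i q_i f(p_i/q_i)$ with every $q_i > 0$; the helper `f_divergence`, the dictionary `GENERATORS`, the sample distributions, and the normalizations chosen for the total variation and the squared Hellinger distance are illustrative assumptions, not taken from the text.

```python
import math


def f_divergence(f, p, q):
    """Return D_f(P||Q) = sum_i q_i * f(p_i / q_i) for finite distributions.

    Assumes every q_i > 0, so that P is absolutely continuous w.r.t. Q.
    """
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))


# Convex generators with f(1) = 0 (the normalizations are assumptions).
GENERATORS = {
    "relative entropy": lambda x: x * math.log(x) if x > 0 else 0.0,
    "total variation": lambda x: abs(x - 1) / 2,
    "chi-squared": lambda x: (x - 1) ** 2,
    "squared Hellinger": lambda x: (math.sqrt(x) - 1) ** 2,
}

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

divergences = {name: f_divergence(f, p, q) for name, f in GENERATORS.items()}
for name, value in divergences.items():
    print(f"{name}: {value:.6f}")
```

Each generator vanishes at 1, so every divergence of a distribution from itself is zero, and convexity of the generator gives non-negativity through Jensen's inequality.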
We are interested in how stronger convexity properties of $f$ give improvements of classical $f$-divergence inequalities. More explicitly, we consider consequences of $f$ being $\kappa$-convex, in the sense that the map $x \mapsto f(x) - \kappa x^2/2$ is convex. This is in part inspired by the work of Sason [8], who demonstrated that divergences that are $\kappa$-convex satisfy “stronger than $\chi^2$” data-processing inequalities.
Perhaps the best-known example of an $f$-divergence inequality is Pinsker’s inequality, which bounds the square of the total variation above by a constant multiple of the relative entropy. That is, for probability measures $P$ and $Q$, $|P - Q|_{TV}^2 \le c\, D(P\|Q)$. The optimal constant is achieved for Bernoulli measures, and under our conventions for total variation, $c = \frac{1}{2 \log e}$ (that is, $c = \frac{1}{2}$ when the relative entropy is measured in nats). Many extensions and sharpenings of Pinsker’s inequality exist (for example, see [9,10,11]). Building on the work of Guntuboyina [9] and Topsøe [11], we achieve a further sharpening of Pinsker’s inequality in Theorem 9.
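The role of Bernoulli measures in Pinsker's inequality can be probed numerically. The sketch below assumes the normalization $|P-Q|_{TV} = \sup_A |P(A) - Q(A)|$ (so that the total variation between two Bernoulli measures with parameters $t$ and $s$ is $|t - s|$) and natural logarithms; the function `binary_kl` and the grid are illustrative choices.

```python
import math


def binary_kl(t, s):
    """Relative entropy D(Bern(t) || Bern(s)) in nats, for 0 < s < 1."""
    total = 0.0
    if t > 0:
        total += t * math.log(t / s)
    if t < 1:
        total += (1 - t) * math.log((1 - t) / (1 - s))
    return total


# Ratio |t - s|^2 / D(t||s) over a grid of distinct Bernoulli parameters.
grid = [i / 200 for i in range(1, 200)]
worst_ratio = max(
    (t - s) ** 2 / binary_kl(t, s) for t in grid for s in grid if t != s
)
print(f"largest observed ratio: {worst_ratio:.6f}")
```

On this grid the ratio stays below $\frac{1}{2}$ and comes close to it for nearby parameters around $\frac{1}{2}$, which is consistent with the constant $c = \frac{1}{2}$ in nats.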
Aside from the total variation, most divergences of interest have stronger than affine convexity, at least when $f$ is restricted to a sub-interval of the real line. This observation is especially relevant to the situation in which one wishes to study $D_f(P\|Q)$ in the presence of a bounded Radon–Nikodym derivative $\frac{dP}{dQ} \in (a,b) \subsetneq (0,\infty)$. One naturally obtains such bounds for skew divergences, that is, divergences of the form $(P,Q) \mapsto D_f((1-t)P + tQ \,\|\, (1-s)P + sQ)$ for $t, s \in [0,1]$, as in this case, $\frac{(1-t)P + tQ}{(1-s)P + sQ} \le \max\left\{\frac{1-t}{1-s}, \frac{t}{s}\right\}$. Important examples of skew divergences include the skew divergence [12] based on the relative entropy and the Vincze–Le Cam divergence [13,14], called the triangular discrimination in [11], and its generalization due to Györfi and Vajda [15] based on the $\chi^2$-divergence. The Jensen–Shannon divergence [16] and its recent generalization [17] give examples of $f$-divergences realized as linear combinations of skewed divergences.
Let us outline the paper. In Section 2, we derive elementary results of $\kappa$-convex divergences and give a table of examples of $\kappa$-convex divergences. We demonstrate that $\kappa$-convex divergences can be lower bounded by the $\chi^2$-divergence, and that the joint convexity of the map $(P,Q) \mapsto D_f(P\|Q)$ can be sharpened under $\kappa$-convexity conditions on $f$. As a consequence, we obtain bounds between the mean square total variation distance of a set of distributions from its barycenter, and the average $f$-divergence from the set to the barycenter.
In Section 3, we investigate general skewing of $f$-divergences. In particular, we introduce the skew-symmetrization of an $f$-divergence, which recovers the Jensen–Shannon divergence and the Vincze–Le Cam divergences as special cases. We also show that a scaling of the Vincze–Le Cam divergence is minimal among skew-symmetrizations of $\kappa$-convex divergences on $(0,2)$. We then consider linear combinations of skew divergences and show that a generalized Vincze–Le Cam divergence (based on skewing the $\chi^2$-divergence) can be upper bounded by the generalized Jensen–Shannon divergence introduced recently by Nielsen [17] (based on skewing the relative entropy), reversing the classical convexity bounds $D(P\|Q) \le \log(1 + \chi^2(P\|Q)) \le \log e \, \chi^2(P\|Q)$. We also derive upper and lower total variation bounds for Nielsen’s generalized Jensen–Shannon divergence.
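In Section 4, we consider a family of densities $\{p_i\}_{i=1}^n$ weighted by $\lambda_i$, and a density $q$. We use the Bayes estimator $T(x) = \arg\max_i \lambda_i p_i(x)$ to derive a convex decomposition of the barycenter $p = \sum_i \lambda_i p_i$ and of $q$, each into two auxiliary densities. (Recall, a Bayes estimator is one that minimizes the expected value of a loss function. By the assumptions of our model, that $\mathbb{P}(\theta = i) = \lambda_i$, and $\mathbb{P}(X \in A \mid \theta = i) = \int_A p_i \, d\mu$, we have $\mathbb{E}\,\ell(\theta, \hat\theta(X)) = 1 - \int \lambda_{\hat\theta(x)} p_{\hat\theta(x)}(x) \, d\mu(x)$ for the loss function $\ell(i,j) = 1 - \delta_i(j)$ and any estimator $\hat\theta$. It follows that $\mathbb{E}\,\ell(\theta, \hat\theta(X)) \ge \mathbb{E}\,\ell(\theta, T(X))$ by $\lambda_{\hat\theta(x)} p_{\hat\theta(x)}(x) \le \lambda_{T(x)} p_{T(x)}(x)$. Thus, $T$ is a Bayes estimator associated to $\ell$.) We use this decomposition to sharpen, for $\kappa$-convex divergences, an elegant theorem of Guntuboyina [9] that generalizes the Fano and Pinsker inequalities to $f$-divergences. We then demonstrate explicitly, using an argument of Topsøe, how our sharpening of Guntuboyina’s inequality gives a new sharpening of Pinsker’s inequality in terms of the convex decomposition induced by the Bayes estimator.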
Notation
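Throughout, $f$ denotes a convex function $f : (0,\infty) \to \mathbb{R} \cup \{\infty\}$, such that $f(1) = 0$. For a convex function defined on $(0,\infty)$, we define $f(0) := \lim_{x \to 0} f(x)$. We denote by $f^*$ the convex function $f^* : (0,\infty) \to \mathbb{R} \cup \{\infty\}$ defined by $f^*(x) = x f(x^{-1})$. We consider Borel probability measures $P$ and $Q$ on a Polish space $\mathcal{X}$ and define the $f$-divergence from $P$ to $Q$, via densities $p$ for $P$ and $q$ for $Q$ with respect to a common reference measure $\mu$, as
$$D_f(p\|q) := \int_{\mathcal{X}} f\left(\frac{p}{q}\right) q \, d\mu.$$
We note that this representation is independent of $\mu$, and such a reference measure always exists; take $\mu = P + Q$, for example.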
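For $t, s \in [0,1]$, define the binary $f$-divergence
$$D_f(t\|s) := s f\left(\frac{t}{s}\right) + (1-s) f\left(\frac{1-t}{1-s}\right)$$
with the conventions $f(0) = \lim_{t \to 0^+} f(t)$, $0 f(0/0) = 0$, and $0 f(a/0) = a \lim_{t \to \infty} f(t)/t$. For a random variable $X$ and a set $A$, we denote the probability that $X$ takes a value in $A$ by $\mathbb{P}(X \in A)$, the expectation of the random variable by $\mathbb{E}X$, and the variance by $\mathrm{Var}(X) := \mathbb{E}|X - \mathbb{E}X|^2$. For a probability measure $\mu$ satisfying $\mu(A) = \mathbb{P}(X \in A)$ for all Borel $A$, we write $X \sim \mu$, and, when there exists a probability density function $p$ such that $\mathbb{P}(X \in A) = \int_A p \, d\gamma$ for a reference measure $\gamma$, we write $X \sim p$. For a probability measure $\mu$ on $\mathcal{X}$, and an $L^2$ function $g : \mathcal{X} \to \mathbb{R}$, we denote $\mathrm{Var}_\mu(g) := \mathbb{E}|g(X) - \mathbb{E}g(X)|^2$ for $X \sim \mu$.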
2. Strongly Convex Divergences
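Definition 1. An $\mathbb{R} \cup \{\infty\}$-valued function $f$ on a convex set $K \subseteq \mathbb{R}$ is $\kappa$-convex when $x, y \in K$ and $t \in [0,1]$ implies
$$f((1-t)x + ty) \le (1-t) f(x) + t f(y) - \kappa \, t(1-t) \frac{(x-y)^2}{2}. \qquad (3)$$
For example, when $f$ is twice differentiable, (3) is equivalent to $f''(x) \ge \kappa$ for $x \in K$. Note that the case $\kappa = 0$ is just usual convexity.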
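The second-derivative criterion can be checked against a direct test of the defining inequality. The sketch below assumes the standard strong-convexity inequality $f((1-t)x + ty) \le (1-t)f(x) + tf(y) - \kappa t(1-t)(x-y)^2/2$ and tests it on a finite grid for $f(x) = x \log x$ on $(0,1]$, where $f''(x) = 1/x \ge 1$; the function `is_kappa_convex` and the grid are hypothetical helpers for illustration, and a grid test can refute but not prove $\kappa$-convexity.

```python
import math


def xlogx(x):
    """Generator of the relative entropy, f(x) = x log x for x > 0."""
    return x * math.log(x)


def is_kappa_convex(f, kappa, points, weights, tol=1e-12):
    """Test the kappa-convexity inequality for all grid points and weights."""
    for x in points:
        for y in points:
            for t in weights:
                lhs = f((1 - t) * x + t * y)
                rhs = ((1 - t) * f(x) + t * f(y)
                       - kappa * t * (1 - t) * (x - y) ** 2 / 2)
                if lhs > rhs + tol:
                    return False
    return True


points = [i / 100 for i in range(1, 101)]   # grid on (0, 1]
weights = [j / 10 for j in range(11)]       # t in [0, 1]

holds_for_one = is_kappa_convex(xlogx, 1.0, points, weights)
holds_for_three_halves = is_kappa_convex(xlogx, 1.5, points, weights)
print(holds_for_one, holds_for_three_halves)
```

The inequality holds on the grid for $\kappa = 1$ and fails for $\kappa = 1.5$, in line with $f''(1) = 1$ being the smallest value of the second derivative on $(0,1]$.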
Proposition 1. For and , the following are equivalent:
Proof. Observe that it is enough to prove the result when $\kappa = 0$, where the proposition is reduced to the classical result for convex functions. □
Definition 2. An $f$-divergence $D_f$ is $\kappa$-convex on an interval $K$ for $\kappa \ge 0$ when the function $f$ is $\kappa$-convex on $K$.
Table 1 lists some $\kappa$-convex $f$-divergences of interest to this article.
Observe that we have taken the normalization convention on the total variation (the total variation for a signed measure $\mu$ on a space $X$ can be defined through the Hahn–Jordan decomposition of the measure into non-negative measures $\mu^+$ and $\mu^-$ such that $\mu = \mu^+ - \mu^-$, as $\|\mu\| = \mu^+(X) + \mu^-(X)$ (see [18]); in our notation, $|\mu|_{TV} = \|\mu\|/2$) which we denote by $|P - Q|_{TV}$, such that $|P - Q|_{TV} = \sup_A |P(A) - Q(A)| \le 1$. In addition, note that the $\alpha$-divergence interpolates Pearson’s $\chi^2$-divergence when
, one half Neyman’s
$\chi^2$-divergence when
, the squared Hellinger divergence when
, and has limiting cases, the relative entropy when
and the reverse relative entropy when
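. If $f$ is $\kappa$-convex on $[a,b] \subseteq (0,\infty)$, then its dual divergence, recalling $f^*(x) := x f(x^{-1})$, is $\kappa a^3$-convex on $[\frac{1}{b}, \frac{1}{a}]$, since $(f^*)''(x) = f''(x^{-1}) / x^3$ when $f$ is twice differentiable. Recall that $f^*$ satisfies the equality $D_{f^*}(P\|Q) = D_f(Q\|P)$. For brevity, we use $\chi^2$-divergence to refer to the Pearson $\chi^2$-divergence, and we articulate Neyman’s $\chi^2$ explicitly when necessary.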
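The next lemma is a restatement of Jensen’s inequality.
Lemma 1. If $f$ is $\kappa$-convex on the range of $X$,
$$\mathbb{E} f(X) \ge f(\mathbb{E} X) + \frac{\kappa}{2} \mathrm{Var}(X).$$
Proof. Apply Jensen’s inequality to $f(x) - \kappa x^2/2$. □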
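As a numerical illustration of this variance-corrected Jensen inequality, the sketch below assumes the bound reads $\mathbb{E} f(X) \ge f(\mathbb{E}X) + \frac{\kappa}{2}\mathrm{Var}(X)$ and evaluates both sides for $f(x) = x \log x$, which is 1-convex on $(0,1]$ because $f''(x) = 1/x \ge 1$ there. The random variable and the helper `expectation` are illustrative assumptions.

```python
import math


def expectation(values, probs, g=lambda x: x):
    """E[g(X)] for a finitely supported random variable X."""
    return sum(pr * g(v) for v, pr in zip(values, probs))


def xlogx(x):
    return x * math.log(x)


# X takes values in (0, 1], where x log x is 1-convex.
values = [0.1, 0.4, 0.7, 1.0]
probs = [0.1, 0.2, 0.3, 0.4]
kappa = 1.0

mean = expectation(values, probs)
variance = expectation(values, probs, lambda x: (x - mean) ** 2)
lhs = expectation(values, probs, xlogx)        # E f(X)
rhs = xlogx(mean) + kappa / 2 * variance       # f(E X) + (kappa / 2) Var(X)
print(f"E f(X) = {lhs:.6f}, f(E X) + kappa/2 Var(X) = {rhs:.6f}")
```

The gap between the two sides is what the strengthened convexity buys over plain Jensen, which would only give $\mathbb{E}f(X) \ge f(\mathbb{E}X)$.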
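For a convex function $f$ such that $f(1) = 0$ and $c \in \mathbb{R}$, the function $\tilde{f}(t) = f(t) + c(t-1)$ remains a convex function, and what is more satisfies $D_f(P\|Q) = D_{\tilde{f}}(P\|Q)$, since $\int c\left(\frac{p}{q} - 1\right) q \, d\mu = 0$.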
Definition 3 ($\chi^2$-divergence). For $f(t) = (t-1)^2$, we write $\chi^2(P\|Q) := D_f(P\|Q)$.
We pursue a generalization of the following bound on the total variation by the $\chi^2$-divergence [19,20,21].
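Theorem 1 ([19,20,21]). For measures $P$ and $Q$,
$$\chi^2(P\|Q) \ge \begin{cases} 4|P-Q|_{TV}^2, & |P-Q|_{TV} \le \frac{1}{2}, \\ \dfrac{|P-Q|_{TV}}{1 - |P-Q|_{TV}}, & |P-Q|_{TV} > \frac{1}{2}. \end{cases}$$
We mention the work of Harremoës and Vajda [20], in which it is shown, through a characterization of the extreme points of the joint range associated to a pair of $f$-divergences (valid in general), that the inequality characterizes the “joint range”, that is, the range of the function $(P,Q) \mapsto (|P-Q|_{TV}, \chi^2(P\|Q))$. We use the following lemma, which shows that every strongly convex divergence can be lower bounded, up to its convexity constant $\kappa$, by the $\chi^2$-divergence.
Lemma 2. For a $\kappa$-convex $f$,
$$D_f(P\|Q) \ge \frac{\kappa}{2} \chi^2(P\|Q).$$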
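Proof. Define $\tilde{f}(t) = f(t) - f'_+(1)(t-1)$ and note that $\tilde{f}$ defines the same $\kappa$-convex divergence as $f$. Thus, we may assume without loss of generality that $f'_+(1) = 0$. Since $f$ is $\kappa$-convex, $\varphi : t \mapsto f(t) - \kappa (t-1)^2/2$ is convex, and, by $f'_+(1) = 0$, $\varphi'_+(1) = 0$ as well. Thus, $\varphi$ takes its minimum when $t = 1$ and hence $\varphi \ge 0$, so that $f(t) \ge \kappa (t-1)^2/2$. Computing,
$$D_f(P\|Q) = \int f\left(\frac{dP}{dQ}\right) dQ \ge \frac{\kappa}{2} \int \left(\frac{dP}{dQ} - 1\right)^2 dQ = \frac{\kappa}{2} \chi^2(P\|Q).$$
□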
Based on a Taylor series expansion of $f$ about 1, Nielsen and Nock ([22], [Corollary 1]) gave the estimate
$$D_f(P\|Q) \approx \frac{f''(1)}{2} \chi^2(P\|Q) \qquad (5)$$
for divergences with a non-zero second derivative and $P$ close to $Q$. Lemma 2 complements this estimate with a lower bound, when $f$ is $\kappa$-convex. In particular, if $f''(1) = \kappa$, it shows that the approximation in (5) is an underestimate.
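Theorem 2. For measures $P$ and $Q$, and a $\kappa$-convex divergence $D_f$,
$$D_f(P\|Q) \ge \frac{\kappa}{2} \begin{cases} 4|P-Q|_{TV}^2, & |P-Q|_{TV} \le \frac{1}{2}, \\ \dfrac{|P-Q|_{TV}}{1 - |P-Q|_{TV}}, & |P-Q|_{TV} > \frac{1}{2}. \end{cases}$$
Proof. By Lemma 2 and then Theorem 1, $D_f(P\|Q) \ge \frac{\kappa}{2} \chi^2(P\|Q)$, and the $\chi^2$-divergence is bounded below by the stated piecewise function of $|P-Q|_{TV}$.
□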
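The chain of bounds used in this proof can be checked on a small example. The sketch below makes several assumptions that are not spelled out in the extracted text: that Lemma 2 reads $D_f \ge \frac{\kappa}{2}\chi^2$, that the $\chi^2$ bound is $4|P-Q|_{TV}^2$ for $|P-Q|_{TV} \le \frac{1}{2}$ and $|P-Q|_{TV}/(1-|P-Q|_{TV})$ otherwise, and that the relative entropy (in nats) is treated as $\kappa$-convex with $\kappa = 1/b$ whenever the ratio $p/q$ is bounded above by $b$, because $(x \log x)'' = 1/x$. All function names are illustrative.

```python
import math


def kl(p, q):
    """Relative entropy in nats between finite distributions (q_i > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def chi2(p, q):
    """Pearson chi-squared divergence, generator (t - 1)^2."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))


def tv(p, q):
    """Total variation normalized to lie in [0, 1]."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def chi2_lower_bound(v):
    """Assumed piecewise lower bound on chi-squared in terms of TV = v."""
    if v >= 1:
        return math.inf
    return 4 * v ** 2 if v <= 0.5 else v / (1 - v)


p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

b = max(pi / qi for pi, qi in zip(p, q))  # upper bound on the density ratio
kappa = 1 / b                             # x log x is (1/b)-convex on (0, b]

print(kl(p, q), kappa / 2 * chi2(p, q), kappa / 2 * chi2_lower_bound(tv(p, q)))
```

For this pair the three printed numbers are decreasing, as the two-step argument predicts.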
The proof of Lemma 2 uses a pointwise inequality between convex functions to derive an inequality between their respective divergences. This simple technique was shown to have useful implications by Sason and Verdú in [6], where it appears as Theorem 1 and is used to give sharp comparisons in several $f$-divergence inequalities.
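Theorem 3 (Sason–Verdú [6]). For divergences defined by $g$ and $f$ with $c f(t) \ge g(t)$ for all $t$, then
$$D_g(P\|Q) \le c \, D_f(P\|Q).$$
Moreover, if $f'(1) = g'(1) = 0$, then
$$\sup_{P \ne Q} \frac{D_g(P\|Q)}{D_f(P\|Q)} = \sup_{t \ne 1} \frac{g(t)}{f(t)}.$$
Corollary 1. For a smooth $\kappa$-convex divergence $f$, the inequality
$$D_f(P\|Q) \ge \frac{\kappa}{2} \chi^2(P\|Q)$$
is sharp multiplicatively in the sense that
$$\inf_{P \ne Q} \frac{D_f(P\|Q)}{\chi^2(P\|Q)} = \frac{\kappa}{2}$$
if $f''(1) = \kappa$.
In information geometry, a standard $f$-divergence is defined as an $f$-divergence satisfying the normalization $f(1) = f'(1) = 0$, $f''(1) = 1$ (see [23]). Thus, Corollary 1 shows that $\frac{1}{2}\chi^2$ provides a sharp lower bound on every standard $f$-divergence that is 1-convex. In particular, the lower bound in Lemma 2 complementing the estimate (5) is shown to be sharp.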
Proof. Without loss of generality, we assume that
. If
for some
, then taking
and applying Theorem 3 and Lemma 2
Observe that, after two applications of L’Hospital,
Proposition 2. When is an f divergence such that f is κ-convex on and that and are probability measures indexed by a set Θ such that , holds for all θ and and for a probability measure μ on Θ, then In particular, when for all θ Proof. Let
denote a reference measure dominating
so that
then write
.
By Jensen’s inequality, as in Lemma 1
Integrating this inequality gives
Note that
and
Inserting these equalities into (
14) gives the result.
To obtain the total variation bound, one needs only to apply Jensen’s inequality,
□
Observe that, taking
in Proposition 2, one obtains a lower bound for the average
$f$-divergence from the set of distributions to their barycenter, by the mean square total variation of the set of distributions to the barycenter,
An alternative proof of this can be obtained by applying the bound $D_f \ge 2\kappa |\cdot|_{TV}^2$, which follows from Theorem 2, pointwise.
The next result shows that, for $f$ strongly convex, Pinsker-type inequalities can never be reversed.
Proposition 3. Given f strongly convex and , there exists P, Q measures such that Proof. By -convexity is a convex function. Thus, and hence Taking measures on the two points space and gives which tends to infinity with , while . □
In fact, building on the work of Basu-Shioya-Park [24] and Vajda [25], Sason and Verdú proved [6] that, for any $f$-divergence, $\sup_{P \ne Q} \frac{D_f(P\|Q)}{|P-Q|_{TV}} = f(0) + f^*(0)$. Thus, an $f$-divergence can be bounded above by a constant multiple of the total variation if and only if $f(0) + f^*(0) < \infty$. From this perspective, Proposition 3 is simply the obvious fact that strongly convex functions have superlinear (at least quadratic) growth at infinity.
3. Skew Divergences
If we denote
to be the quotient of the cone of convex functions
f on
such that
under the equivalence relation
when
for
, then the map
gives a linear isomorphism between
and the space of all
f-divergences. The mapping
defined by
, where we recall
, gives an involution of
. Indeed,
, so that
. Mathematically, skew divergences give an interpolation of this involution as
gives
by taking
and
or yields
by taking
and
.
Moreover, as mentioned in the Introduction, skewing imposes boundedness of the Radon–Nikodym derivative $\frac{dP}{dQ}$, which allows us to constrain the domain of $f$-divergences and leverage $\kappa$-convexity to obtain $f$-divergence inequalities in this section.
The following appears as Theorem III.1 in the preprint [
26]. It states that skewing an
f-divergence preserves its status as such. This guarantees that the generalized skew divergences of this section are indeed
f-divergences. A proof is given in the
Appendix A for the convenience of the reader.
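Theorem 4 (Melbourne et al. [26]). For $t, s \in [0,1]$ and a divergence $D_f$, then $(P,Q) \mapsto D_f((1-t)P + tQ \,\|\, (1-s)P + sQ)$ is an $f$-divergence as well.
Definition 4. For an $f$-divergence, its skew symmetrization,
$$\Delta_f(P\|Q) := \frac{1}{2} D_f\left(P \,\Big\|\, \frac{P+Q}{2}\right) + \frac{1}{2} D_f\left(Q \,\Big\|\, \frac{P+Q}{2}\right),$$
is determined by the convex function
$$x \mapsto \frac{1+x}{4}\left(f\left(\frac{2x}{1+x}\right) + f\left(\frac{2}{1+x}\right)\right). \qquad (20)$$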
Observe that
, and when
,
for all
since
,
. When $f(x) = x \log x$, the relative entropy’s skew symmetrization is the Jensen–Shannon divergence. When $f(x) = (x-1)^2$, up to a normalization constant, the $\chi^2$-divergence’s skew symmetrization is the Vincze–Le Cam divergence, which we state below for emphasis. The work of Topsøe [11] provides more background on this divergence, where it is referred to as the triangular discrimination.
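Definition 5. When $f(x) = \frac{(x-1)^2}{x+1}$, denote the Vincze–Le Cam divergence by
$$\Delta(P\|Q) := D_f(P\|Q) = \int \frac{(p-q)^2}{p+q} \, d\mu.$$
If one denotes the skew symmetrization of the $\chi^2$-divergence by $\Delta_{\chi^2}$, one can compute easily from (20) that $\Delta_{\chi^2}(P\|Q) = \Delta(P\|Q)/2$. We note that although skewing preserves 0-convexity, by the above example, it does not preserve $\kappa$-convexity in general. The $\chi^2$-divergence is a 2-convex divergence, while $f(x) = \frac{(x-1)^2}{x+1}$, corresponding to the Vincze–Le Cam divergence (a multiple of the skew symmetrization of the $\chi^2$-divergence), satisfies $f''(x) = \frac{8}{(x+1)^3}$, which cannot be bounded away from zero on $(0,\infty)$.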
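The relationship between skew symmetrizations and the Vincze–Le Cam divergence can be illustrated on finite distributions. The sketch below assumes the skew symmetrization is $\frac{1}{2}D_f(P\|M) + \frac{1}{2}D_f(Q\|M)$ with $M = (P+Q)/2$ and that the Vincze–Le Cam divergence is $\sum_i (p_i - q_i)^2/(p_i + q_i)$; both normalizations, and all function names, are assumptions made for illustration. Logarithms are natural.

```python
import math


def f_divergence(f, p, q):
    """D_f(P||Q) = sum_i q_i f(p_i / q_i), assuming every q_i > 0."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))


def skew_symmetrization(f, p, q):
    """(1/2) D_f(P||M) + (1/2) D_f(Q||M) with M the midpoint (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * f_divergence(f, p, m) + 0.5 * f_divergence(f, q, m)


def vincze_le_cam(p, q):
    """Triangular discrimination sum_i (p_i - q_i)^2 / (p_i + q_i)."""
    return sum((pi - qi) ** 2 / (pi + qi) for pi, qi in zip(p, q))


def xlogx(x):
    return x * math.log(x) if x > 0 else 0.0


def square(x):
    return (x - 1) ** 2


p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

jsd = skew_symmetrization(xlogx, p, q)        # Jensen-Shannon divergence
delta_chi2 = skew_symmetrization(square, p, q)
delta = vincze_le_cam(p, q)
print(f"JSD = {jsd:.6f}, Delta_chi2 = {delta_chi2:.6f}, Delta = {delta:.6f}")
```

Under these normalizations the skew symmetrized $\chi^2$-divergence is exactly half of the Vincze–Le Cam divergence. Since $x \log x$ has second derivative at least $\frac{1}{2}$ on $(0,2)$, the Jensen–Shannon divergence is at least $\Delta/8$ by the $\chi^2$ lower bound for $\kappa$-convex divergences.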
Corollary 2. For an $f$-divergence such that $f$ is $\kappa$-convex on $(0,2)$,
$$\Delta_f(P\|Q) \ge \frac{\kappa}{4} \Delta(P\|Q),$$
with equality when $f(t) = (t-1)^2$, corresponding to the $\chi^2$-divergence, where $\Delta_f$ denotes the skew symmetrized divergence associated to $f$ and $\Delta$ is the Vincze–Le Cam divergence.
When $f(x) = x \log x$, we have $f''(x) \ge \frac{1}{2}$ on $(0,2)$ (with logarithms taken in base $e$), which demonstrates that up to a constant $\frac{1}{8}$ the Jensen–Shannon divergence bounds the Vincze–Le Cam divergence (see [11] for improvement of the inequality in the case of the Jensen–Shannon divergence, called the “capacitory discrimination” in the reference, by a factor of 2).
We now investigate more general, non-symmetric skewing in what follows.
Proposition 4. For , define Then, where is the binary ∞-Rényi divergence [27]. We need the following lemma originally proved by Audenaert in the quantum setting [
28]. It is based on a differential relationship between the skew divergence [
12] and the skew $\chi^2$-divergence [
15] (see [
29,
30]).
Lemma 3 (Theorem III.1 [
26])
. For P and Q probability measures and , Proof of Proposition 4. If
, then
and
. In addition,
with
, thus
where the inequality follows from Lemma 3. Following the same argument for
, so that
,
, and
for
completes the proof. Indeed,
□
We recover the classical bound [
11,
16] of the Jensen–Shannon divergence by the total variation.
Corollary 3. For probability measure P and Q, Proof. Since . □
Proposition 4 gives a sharpening of Lemma 1 of Nielsen [
17], who proved
, and used the result to establish the boundedness of a generalization of the Jensen–Shannon Divergence.
Definition 6 (Nielsen [
17])
. For p and q densities with respect to a reference measure μ, , such that and , definewhere . Note that, when , , and , , the usual Jensen–Shannon divergence. We now demonstrate that Nielsen’s generalized Jensen–Shannon Divergence can be bounded by the total variation distance just as the ordinary Jensen–Shannon Divergence.
Theorem 5. For p and q densities with respect to a reference measure μ, , such that and ,where and with . Note that, since
is the
w average of the
terms with
removed,
and thus
. We need the following Theorem from Melbourne et al. [
26] for the upper bound.
Theorem 6 ([
26] Theorem 1.1)
. For densities with respect to a common reference measure γ and such that ,where and with . Proof of Theorem 5. We apply Theorem 6 with
,
, and noticing that in general
we have
It remains to determine
,
Thus,
and the proof of the upper bound is complete.
To prove the lower bound, we apply Pinsker’s inequality,
□
Definition 7. Given an f-divergence, densities p and q with respect to common reference measure, and such that define its generalized skew divergencewhere . Note that, by Theorem 4,
is an
f-divergence. The generalized skew divergence of the relative entropy is the generalized Jensen–Shannon divergence
. We denote the generalized skew divergence of the
-divergence from
p to
q by
Note that, when
and
,
and
, we recover the skew symmetrized divergence in Definition 4
The following theorem shows that the usual upper bound for the relative entropy by the $\chi^2$-divergence can be reversed up to a factor in the skewed case.
Theorem 7. For p and q with a common dominating measure μ, Writing . For and such that , we use the notation where .
Proof. By definition,
Taking
to be the measure associated to
and
Q given by
, then
Since
, the convex function associated to the usual KL divergence, satisfies
,
f is
-convex on
, applying Proposition 2, we obtain
Since
, the left hand side of (
42) is zero, while
Rearranging gives,
which is our conclusion. □
4. Total Variation Bounds and Bayes Risk
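In this section, we derive bounds on the Bayes risk associated to a family of probability measures with a prior distribution $\lambda$. Let us state definitions and recall basic relationships. Given probability densities $\{p_i\}_{i=1}^n$ on a space $\mathcal{X}$ with respect to a reference measure $\mu$ and $\lambda_i \ge 0$ such that $\sum_{i=1}^n \lambda_i = 1$, define the Bayes risk,
$$R := R_\lambda(p) := 1 - \int_{\mathcal{X}} \max_i \{\lambda_i p_i(x)\} \, d\mu(x).$$
If $\ell(i,j) = 1 - \delta_i(j)$, and we define $T(x) := \arg\max_i \lambda_i p_i(x)$, then observe that this definition is consistent with the usual definition of the Bayes risk associated to the loss function $\ell$. Below, we consider $\theta$ to be a random variable on $\{1, \dots, n\}$ such that $\mathbb{P}(\theta = i) = \lambda_i$, and $x$ to be a variable with conditional distribution $\mathbb{P}(X \in A \mid \theta = i) = \int_A p_i \, d\mu$. The following result shows that the Bayes risk gives the probability of the categorization error, under an optimal estimator.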
Proposition 5. The Bayes risk satisfieswhere the minimum is defined over . Proof. Observe that
. Similarly,
which gives our conclusion. □
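It is known (see, for example, [9,31]) that the Bayes risk can also be tied directly to the total variation in the following special case, whose proof we include for completeness.
Proposition 6. When $n = 2$ and $\lambda_1 = \lambda_2 = \frac{1}{2}$, the Bayes risk associated to the densities $p_1$ and $p_2$ satisfies
$$2R = 1 - |p_1 - p_2|_{TV}.$$
Proof. Since $\max\{p_1, p_2\} = \frac{|p_1 - p_2| + p_1 + p_2}{2}$, integrating gives $\int \max\{p_1, p_2\} \, d\mu = |p_1 - p_2|_{TV} + 1$, from which the equality follows. □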
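Both descriptions of the Bayes risk can be compared on a finite space. The sketch below assumes the Bayes risk is $1 - \sum_x \max_i \lambda_i p_i(x)$, the total variation is normalized to lie in $[0,1]$, and the error probability of an estimator $\hat\theta$ is $1 - \sum_x \lambda_{\hat\theta(x)} p_{\hat\theta(x)}(x)$; the function names and the example densities are illustrative.

```python
import itertools


def bayes_risk(weights, densities):
    """1 - sum_x max_i lambda_i p_i(x) on a finite space."""
    n_points = len(densities[0])
    return 1.0 - sum(
        max(w * p[x] for w, p in zip(weights, densities))
        for x in range(n_points)
    )


def error_probability(weights, densities, estimator):
    """P(theta != estimator(X)), where estimator[x] is the guessed index."""
    return 1.0 - sum(
        weights[estimator[x]] * densities[estimator[x]][x]
        for x in range(len(estimator))
    )


p1 = [0.5, 0.3, 0.2]
p2 = [0.2, 0.5, 0.3]
weights = [0.5, 0.5]

risk = bayes_risk(weights, [p1, p2])

# Brute force over all 2^3 estimators from the 3-point space to {0, 1}.
brute_force = min(
    error_probability(weights, [p1, p2], est)
    for est in itertools.product(range(2), repeat=3)
)

total_variation = 0.5 * sum(abs(a - b) for a, b in zip(p1, p2))
print(risk, brute_force, (1 - total_variation) / 2)
```

The three printed values agree: the closed form matches the smallest error probability over all estimators, and for two equally weighted hypotheses it reduces to $\frac{1}{2}(1 - |p_1 - p_2|_{TV})$.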
Information theoretic bounds to control the Bayes and minimax risk have an extensive literature (see, for example, [
9,
32,
33,
34,
35]). Fano’s inequality is the seminal result in this direction, and we direct the reader to a survey of such techniques in statistical estimation (see [
36]). What follows can be understood as a sharpening of the work of Guntuboyina [
9] under the assumption of $\kappa$-convexity.
The function
induces the following convex decompositions of our densities. The density
q can be realized as a convex combination of
where
and
,
If we take
, then
p can be decomposed as
and
so that
Theorem 8. When f is κ-convex, on with and wherefor . can be expressed explicitly as
where for fixed
x, we consider the variance
to be the variance of a random variable taking values
with probability
for
. Note this term is a non-zero term only when
.
Proof. For a fixed
x, we apply Lemma 1
Integrating,
where
Applying the
-convexity of
f,
with
Similarly,
where
Writing
, we have our result. □
Corollary 4. When , and f is κ-convex on further when , Proof. Note that
, since
implies
as well. In addition,
so that applying Theorem 8 gives
The term
W can be simplified as well. In the notation of the proof of Theorem 8,
For the special case, one needs only to recall
while inserting 2 for
n. □
Corollary 5. When for , and for the relative entropy. In particular,where for and . Proof. For the relative entropy, is -convex on since . When holds for all i, then we can apply Theorem 8 with . For the second inequality, recall the compensation identity, , and apply the first inequality to for the result. □
This gives an upper bound on the Jensen–Shannon divergence, defined as
. Let us also note that through the compensation identity
,
where
. In the case that
Corollary 6. For two densities and , the Jensen–Shannon divergence satisfies the following,with defined above and . Proof. Since and satisfies on . Taking , in the example of Corollary 4 with yields the result. □
Note that
, we see that a further bound,
can be obtained for
.
On Topsøe’s Sharpening of Pinsker’s Inequality
For
probability measures with densities
and
q with respect to a common reference measure,
, with
, denote
, with density
, the compensation identity is
Theorem 9. For and , denote , and definethen the following sharpening of Pinsker’s inequality can be derived, Proof. When
and
, if we denote
, then (
61) reads as
Taking
, we arrive at
Iterating and writing
, we have
It can be shown (see [
11]) that
with
, giving the following series representation,
Note that the
-decomposition of
is exactly
, thus, by Corollary 6,
Thus, we arrive at the desired sharpening of Pinsker’s inequality. □
Observe that the
term in the above series is equivalent to
where
is the convex decomposition of
in terms of
.