1. Introduction
The starting point of this work is the introduction of the minmax divergence between two finite categorical distributions. It is a special case of a natural quantity arising in certain constructions from computational geometry and topology. Specifically, it coincides with the smallest radius at which two appropriately defined balls intersect. However, since our results promise to be useful also outside geometry and topology, we tailor the exposition accordingly.
The main result is a pair of tight bounds between the minmax divergence and the standard Jensen–Shannon divergence (as well as its generalizations). Given these bounds and the well-known fact that the square root of the Jensen–Shannon divergence is a metric, we ask whether the same can be proven about the minmax divergence. A supplementary result is a proof that in one dimension its square root is indeed a metric. The general case is left as an open question.
The basic definition of the minmax divergence is based on the classic notion of Shannon entropy and has an intuitive information-theoretic interpretation, as will be explained shortly. The definition coincides with an interpretation of the Chernoff information recently provided by Nielsen in [1]. In Section 2, we generalize this definition slightly, allowing us to work with the entire positive orthant of $\mathbb{R}^n$. Later, in Section 5, we generalize further using the language of Bregman divergences, which allows us to work with other notions of entropy, such as the Burg entropy. Before we proceed, we mention that the following exposition is tailored towards this generalization. In particular, we prefer to work with the negative Shannon entropy. Being a strictly convex function, it allows for a smoother transition to the case of Bregman divergences, which are defined via such functions.
Shannon entropy and related concepts. We start from basic definitions and review their information-theoretical interpretations. Consider a random variable that takes one of
n values. Letting
be the vector of probabilities, we can encode the outcome with
expected efficiency , in which
is the (negative)
Shannon entropy of
x. We remark that if a binary logarithm was used, this quantity would be expressed in bits. We use natural logarithms to simplify calculations. Suppose we mistakenly assume that the underlying probability vector is
, and we encode the random variable based on this faulty assumption. The expected efficiency is then
. Comparing this with
, we get
as a measure of the efficiency loss due to the faulty assumption. This quantity is often referred to as the
relative entropy, which we note is not symmetric. Next assume we know that the random variable is either chosen according to the distribution
x or according to the distribution
y, each with
likelihood. Our best bet is to encode the result as if the underlying probability distribution were the average,
. The expected efficiency is
, which we compare with
, the expected efficiency assuming we know from which distribution the random variable is chosen. The difference,
is the
Jensen–Shannon divergence. This can be generalized to non-negative likelihoods
, for which our best bet is to encode using
, with expected divergence
. There are unique likelihoods,
, for which the expected divergence is maximized. Because
E is convex, this is also the situation in which the maximum divergence is minimized. We therefore consider this solution our best bet if we do not know how
x and
y split the likelihood. Setting
, the expected efficiency is
, which we compare with
. The difference,
minimizes the maximum divergence over all possible choices of likelihoods
. We therefore call
the
minmax divergence between the two probability distributions. In the next section, we provide an alternative definition that is easier to work with.
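To make the preceding discussion concrete, the following minimal Python sketch computes the relative entropy, the Jensen–Shannon divergence, and the minmax divergence for two categorical distributions. The formulas are the standard ones described above, and the minmax divergence is approximated by a grid search over convex combinations of the two inputs, which suffices because, as discussed later, the optimal third point lies on the segment between them; the specific vectors are purely illustrative.
```python
import numpy as np

def relative_entropy(x, y):
    # Kullback-Leibler divergence from x to y (x, y probability vectors)
    return float(np.sum(x * np.log(x / y)))

def jensen_shannon(x, y):
    # average relative entropy from x and from y to their average
    m = 0.5 * (x + y)
    return 0.5 * relative_entropy(x, m) + 0.5 * relative_entropy(y, m)

def minmax_divergence(x, y, steps=2000):
    # grid search over candidate centers z on the segment between x and y;
    # at each z take the larger of the two relative entropies, then minimize
    best = np.inf
    for lam in np.linspace(0.0, 1.0, steps + 1)[1:-1]:
        z = lam * x + (1.0 - lam) * y
        best = min(best, max(relative_entropy(x, z), relative_entropy(y, z)))
    return best

if __name__ == "__main__":
    x = np.array([0.7, 0.2, 0.1])
    y = np.array([0.1, 0.3, 0.6])
    js = jensen_shannon(x, y)
    mm = minmax_divergence(x, y)
    print(f"JS(x,y)     = {js:.6f}")
    print(f"minmax(x,y) = {mm:.6f}")
    print(f"ratio JS/mm = {js / mm:.6f}")  # Theorem 1: this ratio lies between a constant and 1
```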
Main result. Our main result is a comparison of the minmax divergence with the standard Jensen–Shannon divergence. Specifically, we prove
as tight bounds. We remark that the final result uses a generalized form of the above concepts, as defined in the next section.
Given that
, the two divergences are very close. (As we will argue shortly, it makes sense to consider the square roots of these divergences, in which case the constant is approximately
, which is even closer to 1.) Additionally, the Jensen–Shannon divergence and the minmax divergence also yield the same intrinsic metric, which is
times the intrinsic metric defined by the relative entropy. In the literature, the relative entropy is also known as the Kullback–Leibler divergence [2], and the corresponding intrinsic metric is known as the Fisher information metric [3]. Neither, however, is a length metric, which means that integrating infinitesimal steps yields a corresponding intrinsic metric that differs from the original one.
In light of these similarities, we ask whether they share more properties. In particular, Endres and Schindelin proved that the square root of the Jensen–Shannon divergence is a metric [4]. We were able to prove a similar result for the minmax divergence in dimension one (in a slightly generalized setting). Namely, we prove that the square root of the minmax divergence is a metric in one dimension. The result in higher dimensions turns out to be challenging and remains open. We do believe that the geometric proof techniques we introduce are worth sharing, and are likely to be useful in the high-dimensional case (in combination with some other techniques).
Applications in computational geometry. We briefly explain how the above concepts can be used in computational geometry and topology and, in particular, why the main result is useful. Our main motivation comes from the field of topological data analysis, a subfield of computational geometry and topology. In short, the idea is to characterize the shape of data or, in other words, its geometric-topological structure. We briefly describe a simple case to which our main result is immediately applicable. For a comprehensive treatment of topological data analysis in the context of arbitrary Bregman divergences, which includes our setup with relative entropy, see [5].
In the simplest case—which is also most relevant here—the input data is a finite collection of points. We consider the union of the balls centered at these points and increase their common radius from zero to infinity. As the radius changes, so does the connectivity (or topology) of the union of balls. One basic topological property is the structure of the connected components. At radius zero, each point constitutes its own connected component, and these components may merge as the radius increases. Specifically, two components may merge when two balls develop a nonempty intersection for the first time. (We remark that higher-degree topological features can also be considered, but this requires tools from algebraic topology, which are beyond the scope of this paper.)
This setup can be easily implemented in Euclidean space. However, in the more interesting situation in which each point is a finite categorical distribution, the balls are better defined using the relative entropy (and not the Euclidean distance), as it allows the outcome to have an information-theoretic interpretation, as outlined above. The radius at which two balls intersect coincides with the minmax divergence [6], whose direct computation (especially for many pairs of points) can be slow. The proposed inequality suggests we compute the approximating Jensen–Shannon divergence instead. It remains fairly accurate while being significantly faster, simpler, and more robust. In particular, this allows for a simple computation of a weighted undirected graph that describes the connected structure of data measured with the relative entropy. In the language of topological data analysis, this graph is the 1-skeleton of the Čech complex (the full complex would be a weighted simplicial complex that encodes intersections between arbitrary tuples of balls and captures higher-degree topological information).
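To illustrate the construction, the sketch below takes a small, randomly generated collection of categorical distributions, assigns every pair an edge weight, and reports the order in which connected components merge as the radius grows. Following the suggestion above, the weight is the Jensen–Shannon divergence, used as a stand-in for the exact intersection radius (the minmax divergence); the random input and the union-find bookkeeping are illustrative choices rather than part of the original construction.
```python
import itertools
import numpy as np

def relative_entropy(x, y):
    return float(np.sum(x * np.log(x / y)))

def jensen_shannon(x, y):
    m = 0.5 * (x + y)
    return 0.5 * relative_entropy(x, m) + 0.5 * relative_entropy(y, m)

def merge_order(points):
    # Weighted 1-skeleton: every pair gets the radius at which we declare
    # their balls to intersect (here approximated by the JS divergence).
    edges = sorted(
        (jensen_shannon(p, q), i, j)
        for (i, p), (j, q) in itertools.combinations(enumerate(points), 2)
    )
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    merges = []
    for radius, i, j in edges:      # increase the radius step by step
        ri, rj = find(i), find(j)
        if ri != rj:                # two components merge at this radius
            parent[ri] = rj
            merges.append((radius, i, j))
    return merges

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.random((6, 4)) + 0.05         # positive vectors ...
    pts /= pts.sum(axis=1, keepdims=True)   # ... normalized to distributions
    for radius, i, j in merge_order(pts):
        print(f"components of points {i} and {j} merge at radius {radius:.4f}")
```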
In summary, our results simplify a fundamental computation in topological data analysis for an important class of inputs. This is often the first step towards computing a topological descriptor, which has proven useful in a variety of fields [7], ranging from biology to astronomy to materials science. Moreover, if the square root of the minmax divergence is indeed a metric (a fact we were unable to prove beyond dimension one), it would allow one to use existing theoretical and computational tools. Here we mention stability results in topological data analysis that exploit the metric structure of data [8], and classical nearest neighbour search tools that focus on metric spaces [9].
Outline. Section 2 introduces the main concepts in their generalized forms and proves some of their fundamental properties.
Section 3 shows that the Jensen–Shannon divergence approximates the minmax divergence to within a constant factor.
Section 4 proves that in one dimension the square root of the minmax divergence is a metric.
Section 5 extends the results beyond the Shannon entropy using the framework of Bregman divergences.
Section 6 concludes the paper.
Related work. Many concepts used in this paper lead back to the seminal work of Claude Shannon [10] on information theory and, in particular, the notion of Shannon entropy. This notion was extended to a dissimilarity measure between two probability distributions by Kullback and Leibler [2], often referred to as the relative (Shannon) entropy or the Kullback–Leibler divergence. The best known metric derived from relative Shannon entropy is based on the Jensen–Shannon divergence, defined by Lin in 1991 [11]. Interestingly, its more general form was introduced a decade earlier by Burbea and Rao [12]. Lew Bregman introduced a notion of Bregman divergence [13], which generalizes the relative entropy. Various other metrics derived from Bregman divergences were studied in [14,15]. More recently, Bregman divergences were further generalized by Nielsen in various ways [16,17,18]. Our work is motivated by results at the intersection of Bregman geometry and computational geometry. The starting point for this direction is the work of Boissonnat, Nielsen and Nock [19,20,21] on computational geometry in the Bregman context.
2. Generalized Definitions
In the Introduction, we described basic concepts (such as the relative entropy) along with their information-theoretical interpretation. In this section, we introduce more general versions of these concepts, allowing us to work in the entire positive orthant. We remark that the definitions change in subtle ways, and that applying the usual definitions in this extended setup may on occasion be nonsensical.
For the remainder of this paper, we are exclusively concerned with two spaces: the n-dimensional positive orthant, denoted , which consists of all points with for , and the open -dimensional standard simplex, denoted , which consists of the points that satisfy . A point is really a finite probability distribution, which leads us to believe that is the more important setting. However, is often easier to work with, and we can restrict results to .
Shannon entropy and relative entropy. Writing for the natural logarithm of , the (negative) Shannon entropy is defined by . (Subtracting the extra term is a standard trick to simplify computations while not affecting the interpretation of the resulting relative entropy. However, the interpretation of the Shannon entropy mentioned in the Introduction holds only up to a constant.) We write for its restriction to the standard simplex.
As mentioned in Section 1, the (negative) Shannon entropy pertains to the expected efficiency of encoding a random variable that distributes according to x. If we encode the values assuming the random variable distributes according to y, the expected efficiency is the negative of the best linear approximation of E at y evaluated at x. The relative entropy can be viewed as the non-negative difference between the above approximation and the Shannon entropy of x; see Figure 1.
Going back to the positive orthant, we consider a generalized form of the relative entropy from x to y:
Note the extra additive terms, which are absent in the usual form and ensure that the result is nonnegative. When restricted to the standard simplex, it simplifies to its more standard form, namely the sum of the terms $x_i \ln (x_i / y_i)$.
We say the relative entropy is decomposable because it satisfies (11). Observe that the notation separates the points x and y by a double bar as opposed to a comma to remind us that the measure is generally not symmetric. The relative entropy is also known as the Kullback divergence, the Kullback–Leibler divergence, or simply the divergence; see [3] (page 57). It measures the divergence in encoding efficiency due to assuming a different distribution.
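The following sketch spells out one concrete reading of the generalized relative entropy on the positive orthant, with the extra additive terms mentioned above, and checks that it reduces to the standard form on the simplex. The explicit formula is an assumption consistent with the description in the text, namely the Bregman divergence of the (negative) Shannon entropy with the subtracted linear term.
```python
import numpy as np

def gen_relative_entropy(x, y):
    # generalized relative entropy on the positive orthant (assumed form);
    # the extra terms -x_i + y_i vanish when both x and y sum to 1
    return float(np.sum(x * np.log(x / y) - x + y))

def simplex_relative_entropy(x, y):
    # standard form, valid on the open standard simplex
    return float(np.sum(x * np.log(x / y)))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # arbitrary positive vectors: the generalized form is still nonnegative
    a, b = rng.random(5) + 0.1, 3.0 * rng.random(5) + 0.1
    print("positive orthant:    ", gen_relative_entropy(a, b))
    # probability vectors: the two forms agree
    p, q = a / a.sum(), b / b.sum()
    print("simplex, generalized:", gen_relative_entropy(p, q))
    print("simplex, standard:   ", simplex_relative_entropy(p, q))
```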
Jensen–Shannon divergence. Similar to the relative entropy, the Jensen–Shannon divergence generalized to the positive orthant is a function of two points, but it is symmetric: it takes the average of the relative entropies from x and from y to their average; see Figure 2:
in which we get (13) by noting . Similar to the relative entropy, the Jensen–Shannon divergence is decomposable (15). As pointed out in [4], it measures the divergence of expected efficiency when we encode a random variable that distributes half of the time according to x and the other half of the time according to y, using the average of x and y.
It is not difficult to prove that if we substitute any other point for , then the average relative entropy and therefore the divergence in efficiency increases. For reasons that will become obvious shortly, we prove this optimality result for a weighted version of this divergence. Let and set . The corresponding weighted Jensen–Shannon divergence is
To prove optimality, we set , noting that . The following lemma and proof can also be found in [22].
Lemma 1
(Optimality of Weighted JS). Let , , and . Then for every , with equality iff .
Proof. Computing the difference,
, most terms cancel and we get
in which we use
. In other words, the difference is equal to
, which is non-negative and zero iff
by the strict convexity of
E. □
Remark 1.
To get an information theoretic interpretation of the result, we assume and suppose a random variable that distributes according to x with likelihood and according to y with likelihood . Lemma 1 says that our best bet is to encode with . In words, w minimizes the weighted Jensen–Shannon divergence.
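As a numerical illustration of Lemma 1, the sketch below evaluates the weighted average of the relative entropies to an arbitrary candidate point and confirms that the weighted combination w of x and y is never beaten. The formulas are the assumed generalized ones from this section.
```python
import numpy as np

def D(x, y):
    # generalized relative entropy on the positive orthant (assumed form)
    return float(np.sum(x * np.log(x / y) - x + y))

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    x, y = rng.random(4) + 0.1, rng.random(4) + 0.1
    lam = 0.3
    w = lam * x + (1.0 - lam) * y
    at_w = lam * D(x, w) + (1.0 - lam) * D(y, w)   # weighted JS divergence
    # Lemma 1: any other candidate z gives a larger weighted average
    worst_violation = 0.0
    for _ in range(1000):
        z = rng.random(4) + 0.05
        at_z = lam * D(x, z) + (1.0 - lam) * D(y, z)
        worst_violation = min(worst_violation, at_z - at_w)
    print("weighted JS at w:", at_w)
    print("largest violation found (should be 0):", worst_violation)
```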
Minmax divergence. Similar to the Jensen–Shannon divergence, the minmax divergence is a symmetric function of two points. It is defined by mapping a pair to the larger of the two relative entropies to a third point, in which the third point z is selected so as to minimize this maximum:
We call the point z that gives the infimum the minmax divergence center of the pair. We remark that the general form of the minmax divergence introduced in Section 5 generalizes the Chernoff information [1], which is popular in statistics.
As proved for example in [6], z is a convex combination of x and y. We strengthen this result by proving that it is the particular convex combination that maximizes the weighted Jensen–Shannon divergence, and that this weighted Jensen–Shannon divergence equals the minmax divergence.
Lemma 2 (Minmax-Maxweight). Let and such that is the minmax divergence center. Then for all ξ.
Proof. Let and write for a general affine combination of x and y. The restriction of E to the line of such points is a strictly convex function. It follows that there is a unique affine combination, , such that . The weighted Jensen–Shannon divergence at is , which is equal to , as claimed.
By convexity of the restriction of E, the maximum relative entropy is larger than this shared value for every affine combination . Similarly, the weighted Jensen–Shannon divergence is smaller than at every affine combination . □
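A quick numerical check of Lemma 2: scanning convex combinations of x and y, the largest weighted Jensen–Shannon divergence and the smallest maximum relative entropy agree up to the resolution of the scan. This is only an illustration under the assumed generalized formulas, not part of the proof.
```python
import numpy as np

def D(x, y):
    # generalized relative entropy on the positive orthant (assumed form)
    return float(np.sum(x * np.log(x / y) - x + y))

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    x, y = rng.random(4) + 0.1, rng.random(4) + 0.1
    max_weighted_js = -np.inf
    min_max_relent = np.inf
    for lam in np.linspace(0.0, 1.0, 10_001)[1:-1]:
        z = lam * x + (1.0 - lam) * y
        max_weighted_js = max(max_weighted_js, lam * D(x, z) + (1.0 - lam) * D(y, z))
        min_max_relent = min(min_max_relent, max(D(x, z), D(y, z)))
    print("max weighted JS over the segment:", max_weighted_js)
    print("min of max relative entropy     :", min_max_relent)
```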
Remark 2.
In contrast to the other measures discussed so far, the minmax divergence is not decomposable.
Remark 3.
Since the minmax divergence center of x and y lies between these two points, it is constrained to a compact set, so we can replace the infimum in (22) by a minimum. This justifies the name of the corresponding divergence. Lemma 2 says that the minimum of the maximum relative entropy is equal to the maximum weighted Jensen–Shannon divergence, which justifies the name of the lemma.
Explicit formula. While the minmax divergence is defined as an infimum, it is possible to compute it explicitly. To state the formula, we introduce
defined by
It is well defined at all positive
, and we get
in the limit because
and
by l’Hôpital’s rule.
Lemma 3
(Minmax divergence Formula). For , we have .
Proof. For
both sides vanish, so the relation holds. We therefore assume
for the remainder of the proof. Letting
z be the minmax divergence center of
x and
y, we recall that the derivative at
z is the slope of the line that passes through
and
:
. We express this equation in terms of
:
Write
for the first term on the right-hand side of (26). Recall that the minmax divergence of
x and
y is the vertical distance between the point
and the line that passes through
and
. By construction, the slope of this line is
. We express the vertical distance as the sum of two vertical distances, which we then express in terms of
A and
:
in which we use (26) to get (29), we cancel
and replace
z to get (30), and finally use
to get (31). □
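In practice, the minmax divergence center can also be located by bisection along the segment between the two points: moving the candidate towards x decreases the relative entropy from x and increases the one from y, so the optimum is where the two agree. The sketch below relies on this balance condition, which is an assumption consistent with Lemma 2 and with the center lying strictly between the two points; it is not the explicit formula of Lemma 3.
```python
import numpy as np

def D(x, y):
    # generalized relative entropy on the positive orthant (assumed form)
    return float(np.sum(x * np.log(x / y) - x + y))

def minmax_center(x, y, iters=60):
    # bisection on the weight lam of x in z = lam*x + (1-lam)*y,
    # driving D(x||z) - D(y||z) to zero (balance at the optimum)
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        z = lam * x + (1.0 - lam) * y
        if D(x, z) > D(y, z):
            lo = lam      # still too close to y: move the center towards x
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    return lam * x + (1.0 - lam) * y

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    x, y = rng.random(3) + 0.1, rng.random(3) + 0.1
    z = minmax_center(x, y)
    print("D(x||z) =", D(x, z))
    print("D(y||z) =", D(y, z))   # the two values agree at the center
    print("minmax divergence ~", max(D(x, z), D(y, z)))
```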
Bregman divergences. The relative entropy can be viewed from the perspective of Bregman divergences [13]. Indeed, it is an instance of a Bregman divergence generated by the negative Shannon entropy. We briefly introduce the setup for general Bregman divergences, generated by arbitrary convex functions or, more technically, functions of Legendre type. Given an open convex set, a function on it is of Legendre type if it is (1) differentiable, (2) strictly convex, and (3) the magnitude of its gradient diverges to positive infinity when evaluated at points converging to the boundary of the domain. Given a function F of Legendre type, the Bregman divergence [13] generated by F is defined as
We will use this concept to generalize our main result in Section 5.
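The sketch below implements the Bregman divergence for an arbitrary Legendre-type generator given its gradient, and instantiates it with the (negative) Shannon entropy, recovering the generalized relative entropy used throughout, and with the squared Euclidean norm as a second, purely illustrative generator.
```python
import numpy as np

def bregman(F, gradF, x, y):
    # D_F(x || y) = F(x) - F(y) - <grad F(y), x - y>
    return F(x) - F(y) - float(np.dot(gradF(y), x - y))

def E(x):
    # (negative) Shannon entropy with the subtracted linear term (assumed form)
    return float(np.sum(x * np.log(x) - x))

def gradE(x):
    return np.log(x)

def Q(x):
    # squared Euclidean norm; its Bregman divergence is the squared distance
    return float(np.dot(x, x))

def gradQ(x):
    return 2.0 * x

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    x, y = rng.random(4) + 0.1, rng.random(4) + 0.1
    print("relative entropy via Bregman:", bregman(E, gradE, x, y))
    print("direct generalized KL       :", float(np.sum(x * np.log(x / y) - x + y)))
    print("squared distance via Bregman:", bregman(Q, gradQ, x, y))
    print("direct squared distance     :", float(np.dot(x - y, x - y)))
```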
3. Comparison of Divergences
We think of the Jensen–Shannon divergence as a readily computed approximation of the minmax divergence. The approximation is very close, and we prove in this section that the Jensen–Shannon divergence is always between and 1 times the minmax divergence. We prove this first in one dimension and then generalize the result to n dimensions.
Approximation with ellipse. The main tool in proving the relation between the minmax divergence and the Jensen–Shannon divergence is the—surprisingly close—approximation of the graph of the Shannon entropy defined over with an arc of an ellipse. We are interested in the interval , so we choose the ellipse to
pass through the points and ;
have its minimum at the point ;
have the same curvature at as the graph of the Shannon entropy.
Writing the ellipse as the zero-set of a function, we introduce
defined by
It is not difficult to check that
satisfies the above three properties. We also note that
is negative for points inside the ellipse and positive for points outside the ellipse. Within
, the approximation of
E by the lower portion of the ellipse is astonishingly close, and we exaggerate the difference in
Figure 3 to make it visible. We prove that from 0 to 1 the graph of
E is below the ellipse, and from 1 to
e it is above the ellipse.
Lemma 4 (Below-Above)
. Let be the 1-dimensional Shannon entropy and the function defined in (32). Then
Proof. Write
, in which
We have
by construction of
, but not necessarily
because we divided by
x. It suffices to prove that
is positive for
and negative for
. Next we compute the first two derivatives, again after removing a monotonic factor to simplify the computations. Specifically, we write
, in which
The derivative of
g is quadratic in
, with zeros at
and
. Hence,
is negative for
and positive outside the corresponding closed interval. Returning to
g, we note that
g is zero, negative, positive at
:
Our analysis of
implies that
g increases from 0 to 1, it decreases from 1 to
, and finally it increases again from
to
e, with a zero at some value
between
and
e. Hence,
f decreases from 0 to
, with
, and it increases from
to
e. Since
, this implies that
f is positive from 0 to 1 and negative from 1 to
e. The claimed inequalities for
follow. □
Midpoint lines. We are interested in the relative position of two midpoint lines. The first is defined by the Shannon entropy, E, and the second by the function whose graph is the portion of the ellipse on and below the horizontal coordinate axis. Both lines consist of points with , in which the points of the first line satisfy and the points of the second line satisfy .
The ellipse can be obtained by shearing a circle, and since this operation takes straight lines to straight lines, it follows that the midpoint line of G is a straight line segment. Its endpoints are and . By Lemma 4, the midpoint line of E is a curve that connects the same two endpoints but lies otherwise to the left of the line segment; see Figure 3.
Inequalities. Recall the definition of the weighted Jensen–Shannon divergence for a real parameter and points:
in which and . The (unweighted) Jensen–Shannon divergence is , and from Lemma 2 we know that the minmax divergence can be written as . We prove that the two measures of information divergence are good approximations of each other.
Theorem 1
(Loss Comparison). Letting , we have .
Proof. The upper bound on
follows from Lemma 2, so we focus on proving the lower bound. We begin with the 1-dimensional case, when
. Recall that
for every
. Hence,
which implies that the ratio is independent of
C. Given
, we can find
such that
, so we assume
for the remainder of the 1-dimensional argument. This implies that the minmax divergence center is
. Setting
and
, we have
, and the ratio is
Observe that this ratio is the length of the left edge divided by the length of the right edge of the trapezoid
in
Figure 3. For a general pair
with
, we represent the ratio by the left and right edges of a similar trapezoid,
. Importantly, the two trapezoids share
as their common lower left corner, and the bottom edge of
has smaller positive slope than the bottom edge of
. The height of
is 1 and that of
is
. To compare the two ratios, we thus consider
, which is the scaled version of
whose lower left corner is
and whose height is 1. The upper right corner of
lies on the midpoint line of
E, which by Lemma 4 implies that the upper right corner of the scaled trapezoid lies to the left of
on the horizontal coordinate axis. It follows that the width of the scaled trapezoid is smaller than the width of
. Hence,
We get equality for
and for
, which implies the claimed lower bound for
dimension. Moving on to
dimensions, we recall that the minmax divergence center is a convex combination of the two points [
6]. Specifically, the center satisfies
with
and
. Using Lemma 2, the decomposability of the weighted Jensen–Shannon divergence (19), and the claimed inequality in
dimension—in this sequence—we get
as claimed. □
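As an empirical companion to Theorem 1, the sketch below samples random pairs in the positive orthant and reports the observed range of the ratio of the Jensen–Shannon divergence to the minmax divergence; the latter is computed by the segment bisection sketched earlier, which assumes the center lies between the two points, consistent with [6] and Lemma 2.
```python
import numpy as np

def D(x, y):
    return float(np.sum(x * np.log(x / y) - x + y))

def js(x, y):
    m = 0.5 * (x + y)
    return 0.5 * D(x, m) + 0.5 * D(y, m)

def minmax(x, y, iters=40):
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        z = lam * x + (1.0 - lam) * y
        if D(x, z) > D(y, z):
            lo = lam
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    z = lam * x + (1.0 - lam) * y
    return max(D(x, z), D(y, z))

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    ratios = []
    for _ in range(500):
        x = 10.0 ** rng.uniform(-3, 1, size=3)   # spread over several scales
        y = 10.0 ** rng.uniform(-3, 1, size=3)
        ratios.append(js(x, y) / minmax(x, y))
    print("observed ratio JS/minmax: min =", min(ratios), " max =", max(ratios))
```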
Remark 4.
The bounds in Theorem 1 are tight. To see this for the lower bound, we note that and . While 0 is formally not part of the domain, we can take points arbitrarily close to 0 and thus get the bound in the limit. To see that the upper bound is tight, we let be small and set and . We can see the two entropies geometrically as the vertical distance of two points on the graph of E below the line that passes through and ; see Figure 2. For this point is , and for this point is , in which z is the minmax divergence center of x and y. To determine z, we note that . After some computations, including the Taylor expansions of around and of around , we find that z is plus a fourth-order term in . In words, z approaches much faster than a and b. It follows that in the limit, the two entropies are the same, as required.
4. Proof of Metric in Dimension One
As proved in [4], the square root of the Jensen–Shannon divergence is a metric in . Using the decomposability of the Jensen–Shannon divergence together with the Minkowski inequality, it is then easy to prove that this square root is also a metric in . We prove that the square root of the minmax divergence is a metric in . Since the minmax divergence is not decomposable, we do not know yet whether its square root is a metric in .
The ratio method. Suppose
satisfies the triangle inequality and
is another function on the product. To prove that
B also satisfies the triangle inequality, we may consider the ratio,
and prove its
monotonicity, that is:
whenever
.
Lemma 5
(Triangle Inequality). Let in which A satisfies the triangle inequality and is monotonic. Then B satisfies the triangle inequality.
Proof. Let
. Then
in which we get (49) using the monotonicity of
f and the triangle inequality for
A. □
To apply the lemma, we set and . Clearly, A is a metric in , so it will suffice to show that the ratio is monotonic.
A first application. As a warm-up exercise, we use the ratio method expressed in Lemma 5 to re-prove the main result of [4].
Theorem 2
(JS Revisited). Let be the Shannon entropy and map every pair to the Jensen–Shannon divergence. Then is a metric in .
Proof. We begin with the one-dimensional case,
. It is clear that
is non-negative, zero iff
, and symmetric. It remains to prove that its square root satisfies the triangle inequality. Setting
we will prove that
f is monotonic as defined in (
47). Recall that
for every
. It follows that
, which allows us to assume
. To simplify the notation, we set
assuming
. Writing
, we aim at proving that
g is monotonically decreasing, that is:
for all
. Recall from (5) that
The derivative of
g is
, in which
For
, both the numerator and the denominator vanish:
and
in which
. Applying l’Hôpital’s rule three times, we get
with
. However,
and
. Hence,
and it suffices to prove
for
. Since the denominator of
is positive, this is equivalent to
for
. But this follows from
and
for
, which can be seen from (57). Hence
f is monotonic and since the denominator in (
51) satisfies the triangle inequality, Lemma 5 implies that
also satisfies the triangle inequality and therefore is a metric.
Having established the claim in one dimension, we get the
n-dimensional result using the Minkowski inequality, which for non-negative real numbers
and
implies
Recall that the Jensen–Shannon divergence is
decomposable:
for
. Letting
be a third point, we set
and
for all
i. Since
satisfies the triangle inequality in one dimension, we have
for
. It follows that the left-hand side of (
59) is larger than or equal to
. The right-hand side of (
59) is equal to
, which implies the triangle inequality for the square root of the Jensen–Shannon divergence in
n dimensions. □
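The following sketch is a numerical companion to Theorem 2: for random triples in the positive orthant it evaluates the triangle slack of the square root of the generalized Jensen–Shannon divergence and reports the smallest value encountered, which should never be negative.
```python
import numpy as np

def D(x, y):
    return float(np.sum(x * np.log(x / y) - x + y))

def sqrt_js(x, y):
    m = 0.5 * (x + y)
    return (0.5 * D(x, m) + 0.5 * D(y, m)) ** 0.5

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    worst = np.inf
    for _ in range(5000):
        x, y, z = (10.0 ** rng.uniform(-2, 1, size=4) for _ in range(3))
        slack = sqrt_js(x, y) + sqrt_js(y, z) - sqrt_js(x, z)
        worst = min(worst, slack)
    print("smallest triangle slack (should be >= 0):", worst)
```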
Further preparations. Recall that the Shannon entropy is defined by
. The related function,
, defined by
will play a crucial role in the proof of our next theorem; see
Figure 4. The derivative has a discontinuity at
, but if we reflect the preceding branch to get
defined by
for
and
for
, we obtain a convex function; see again
Figure 4 on the left.
Appendix A proves that
is convex and everywhere differentiable, and that its derivative,
, is concave; see again
Figure 4.
One dimension. We are now ready to prove that the square root of the minmax divergence is a metric in one dimension.
Theorem 3
(1D Metric). Let be the Shannon entropy and map every pair to its minmax divergence. Then is a metric in .
Proof. It is clear that
is non-negative, zero iff
, and symmetric. It thus remains to prove that it satisfies the triangle inequality. Setting
we will prove shortly that
f is monotonic as defined in (
47). Lemma 5 then implies that
satisfies the triangle inequality. It thus remains to prove
whenever
. We begin by noting that we may assume these intervals are
canonical, by which we mean that
and
. Indeed, if
, then we can find
such that
, which then implies
.
We now use the function
defined by
to draw a geometric picture of the situation; see
Figure 5.
To prove monotonicity, that is:
, we consider the linear functions
that satisfy
, for
, and
, for
. The goal is to show that the aspect ratio of the rectangle defined by
is less than that defined by
. Equivalently, we show that the point at which the lines
L and
R meet has positive second coordinate. By the differentiability of
F and the transitivity of order along the real line, it suffices to show this in the limit case, when
and
. In this case,
L and
R are the tangent lines of
F at
and
, as shown in
Figure 5. Let
and
be the coordinates of the points at which
L and
R intersect the vertical line
, and note that
by convexity of
. The convexity of this function furthermore implies
To compare the absolute values of
ℓ and
r, we express both as integrals:
see
Figure 4 on the right. The concavity of
together with (
63) implies
, and since
L has smaller absolute slope than
R (64), it follows that the two lines intersect above the zero line, as required. □
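As a numerical counterpart to Theorem 3, the sketch below checks the triangle inequality for the square root of the one-dimensional minmax divergence on random positive triples, with the center located by bisection under the balance assumption used earlier.
```python
import math
import numpy as np

def D(x, y):
    # one-dimensional generalized relative entropy (assumed form)
    return x * math.log(x / y) - x + y

def sqrt_minmax(x, y, iters=40):
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        z = lam * x + (1.0 - lam) * y
        if D(x, z) > D(y, z):
            lo = lam
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    z = lam * x + (1.0 - lam) * y
    return max(D(x, z), D(y, z)) ** 0.5

if __name__ == "__main__":
    rng = np.random.default_rng(8)
    worst = math.inf
    for _ in range(1000):
        a, b, c = 10.0 ** rng.uniform(-2, 1, size=3)
        slack = sqrt_minmax(a, b) + sqrt_minmax(b, c) - sqrt_minmax(a, c)
        worst = min(worst, slack)
    print("smallest triangle slack (should be >= 0):", worst)
```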
This concludes the proof in dimension one. The main obstacle to extending the proof to arbitrary dimension is the lack of decomposability (separability) of the minmax divergence. Still, we decided to present the partial results and techniques, as they may help other researchers complete the proof in the future. In particular, techniques for extending results from one to arbitrary dimensions are present in the information theory literature; see, for example, the work on Pinsker’s inequality [23], which compares the relative entropy with another distance. This gives us hope that researchers in information theory may be well equipped to extend the result to arbitrary dimension.
5. Extensions to Other Bregman Divergences
In this section, we provide a broader perspective on our results by extending them beyond the Shannon entropy. Specifically, we weaken the bounds (while keeping them tight) to generalize Theorem 1 to Bregman divergences, and we prove that the square root of the minmax divergence for the Burg entropy is a metric in one dimension. This way, the inequality can be used in other applied contexts. For example, the Burg entropy (and the Itakura–Saito divergence it induces) is used in speech recognition [24].
Burbea–Rao divergence. The Burbea–Rao divergence, also called the Jensen–Bregman divergence [25], is a straightforward generalization of the Jensen–Shannon divergence. Specifically, the underlying divergence is generalized from the relative entropy to a Bregman divergence. Letting F be a function of Legendre type generating a Bregman divergence and , we define
in which . Lemma 1 generalizes with a verbatim proof in which we substitute F for E. As done in [1], we also generalize the minmax divergence from the Kullback–Leibler divergence to a general underlying Bregman divergence:
When we compare the two, we get bounds that are considerably worse than for the special case of the Shannon entropy.
Theorem 4
(Burbea–Rao divergence Comparison). Let be a Legendre type function, and let be points in . Then .
Proof. Let
be the point that minimizes the right-hand side of (
69). Using the generalization of Lemma 1 to the Burbea–Rao divergence, we get
Since
, the right-hand side of (
70) is equal to
, which implies the claimed upper bound on the Burbea–Rao divergence. To prove the lower bound, we note that the larger of the two divergences to
is at least as large as
. Hence,
is at most the sum:
and the claimed lower bound follows because the right-hand side of (
71) evaluates to
. □
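To connect Theorem 4 with a concrete generator, the sketch below evaluates the Burbea–Rao divergence and the one-dimensional Bregman minmax divergence for the Burg entropy. The formula assumed for the Burg entropy is the usual negative sum of logarithms, whose Bregman divergence is the Itakura–Saito divergence; in one dimension both divergences to a candidate point are unimodal, so restricting the search to the segment between the inputs is safe.
```python
import numpy as np

def burg(x):
    # Burg entropy (assumed form)
    return float(-np.sum(np.log(x)))

def itakura_saito(x, y):
    # Bregman divergence generated by the Burg entropy
    return float(np.sum(x / y - np.log(x / y) - 1.0))

def burbea_rao(x, y, lam=0.5):
    # Jensen-Bregman divergence: the gap in Jensen's inequality for the generator
    w = lam * x + (1.0 - lam) * y
    return lam * burg(x) + (1.0 - lam) * burg(y) - burg(w)

def minmax_burg_1d(x, y, steps=100_000):
    # 1D Bregman minmax divergence for the Burg entropy, by scanning the segment
    lams = np.linspace(0.0, 1.0, steps + 1)[1:-1]
    zs = lams * x + (1.0 - lams) * y
    return float(np.min(np.maximum(x / zs - np.log(x / zs) - 1.0,
                                   y / zs - np.log(y / zs) - 1.0)))

if __name__ == "__main__":
    x, y = 0.8, 2.5
    print("Itakura-Saito D(x||y):", itakura_saito(np.array([x]), np.array([y])))
    br = burbea_rao(np.array([x]), np.array([y]))
    mm = minmax_burg_1d(x, y)
    print("Burbea-Rao (Burg)    :", br)
    print("minmax (Burg, 1D)    :", mm)
    print("ratio                :", br / mm)  # Theorem 4 bounds this ratio from both sides
```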
Remark 5.
The bounds in Theorem 4 are tight. To see this for the lower bound, we consider , which is strictly convex, differentiable, and with minimum at . Setting , for , we have , which implies that minimizes the maximum divergence from x and y. Some computations show that the divergences are
For , the ratio of over goes to , as required. To see the upper bound, we choose for which for all .
Minmax divergence for Burg entropy. Another significant entropy appearing in this context is the Burg entropy: , for . Let us denote the corresponding minmax divergence by .
Theorem 5
(Metric for Burg Entropy). Let be the Burg entropy. Then is a metric in .
Proof. The proof follows a similar structure as the proof of Theorem 3. Setting up the ratio method, we define
with the denominator being the induced path metric. By Lemma 5, we have to prove that
f is monotonic.
Next we simplify. Specifically, because , we have for all . We may therefore assume all intervals considered by the ratio method to be canonical, i.e., . In this case, we have .
To prepare the geometric picture sketched in
Figure 6, we define
for
. Note that
is the aspect ratio of the rectangle in this picture. As in the proof of Theorem 3, the monotonicity of
f is established by proving that the tangent lines to
H at
and
intersect above the horizontal axis for all canonical intervals
. To this end, it suffices to show that
, in which
Reflecting the left part of
H, we obtain an injective function
defined by
Note that this does not change the intersections mentioned above. Letting
K be the inverse of
, we consider the graph of
K; see
Figure 7.
Expressing this with integrals, and setting
, we can compare the two coordinates:
By Lemmas A3 and A4,
is convex and decreasing. Using the integral representation above, we interpret
and
as areas; see
Figure 3. It follows that
and consequently
is a metric. □