Article

Revisiting Chernoff Information with Likelihood Ratio Exponential Families

Sony Computer Science Laboratories, Tokyo 141-0022, Japan
Entropy 2022, 24(10), 1400; https://doi.org/10.3390/e24101400
Submission received: 31 July 2022 / Revised: 23 September 2022 / Accepted: 28 September 2022 / Published: 1 October 2022

Abstract

The Chernoff information between two probability measures is a statistical divergence measuring their deviation defined as their maximally skewed Bhattacharyya distance. Although the Chernoff information was originally introduced for bounding the Bayes error in statistical hypothesis testing, the divergence found many other applications due to its empirical robustness property found in applications ranging from information fusion to quantum information. From the viewpoint of information theory, the Chernoff information can also be interpreted as a minmax symmetrization of the Kullback–Leibler divergence. In this paper, we first revisit the Chernoff information between two densities of a measurable Lebesgue space by considering the exponential families induced by their geometric mixtures: The so-called likelihood ratio exponential families. Second, we show how to (i) solve exactly the Chernoff information between any two univariate Gaussian distributions or get a closed-form formula using symbolic computing, (ii) report a closed-form formula of the Chernoff information of centered Gaussians with scaled covariance matrices and (iii) use a fast numerical scheme to approximate the Chernoff information between any two multivariate Gaussian distributions.


1. Introduction

1.1. Chernoff Information: Definition and Related Statistical Divergences

Let (𝒳, A) denote a measurable space [1] with sample space 𝒳 and finite σ-algebra A of events. A measure P is absolutely continuous with respect to another measure Q if P(A) = 0 whenever Q(A) = 0: P is said to be dominated by Q, written notationally for short as P ≪ Q. We shall write P ⋠ Q when P is not dominated by Q. When P ≪ Q, we denote by dP/dQ the Radon–Nikodym density [1] of P with respect to Q.
The Chernoff information [2], also called Chernoff information number [3,4] or the Chernoff divergence [5,6], is the following symmetric measure of dissimilarity (see Appendix A for some background on statistical divergences) between any two comparable probability measures P and Q dominated by μ :
D_C[P,Q] := max_{α∈(0,1)} −log ρ_α[P:Q] = D_C[Q,P],
where
ρ_α[P:Q] := ∫ p^α q^{1−α} dμ = ρ_{1−α}[Q:P]
is the α -skewed Bhattacharyya affinity coefficient [7] (a coefficient measuring the similarity of two densities). In the remainder, we shall use the following conventions: When a (dis)similarity is asymmetric (e.g., ρ α [ P : Q ] ), we use the colon notation “:” to separate its arguments. When the (dis)similarity is symmetric (e.g., D C [ P , Q ] ), we use the comma notation “,” to separate its arguments.
The α-skewed Bhattacharyya coefficients are always upper bounded by 1 and are strictly greater than zero for non-empty intersecting supports (non-singular PMs):
0 < ρ_α[P:Q] ≤ 1.
A proof can be obtained by applying Hölder’s inequality (see also Appendix A for an alternative proof).
Since the affinity coefficient ρ_α[P:Q] does not depend on the underlying dominating measure μ [4], we shall write D_C[p,q] instead of D_C[P,Q] in the remainder.
Let D B , α [ p : q ] denote the α -skewed Bhattacharyya distance [7,8]:
D_{B,α}[p:q] := −log ρ_α[p:q] = D_{B,1−α}[q:p].
The α -skewed Bhattacharyya distances are not metric distances since they can be asymmetric and do not satisfy the triangle inequality even when α = 1 2 .
Thus, the Chernoff information is defined as the maximal skewed Bhattacharyya distance:
D C [ p , q ] = max α ( 0 , 1 ) D B , α [ p : q ] .
Grünwald [9,10] called the skewed Bhattacharyya coefficients and distances the α -Rényi affinity and the unnormalized Rényi divergence, respectively, (see Section 19.6 of [9]) since the Rényi divergence [11,12] is defined by
D_{R,α}[P:Q] = 1/(α−1) log ∫ p^α q^{1−α} dμ = 1/(1−α) D_{B,α}[P:Q].
Thus D_{B,α}[P:Q] = (1−α) D_{R,α}[P:Q] can be interpreted as the unnormalized Rényi divergence in [9]. However, let us notice that the Rényi α-divergences are defined in general for a wider range α ∈ [0,∞] \ {1}, with lim_{α→1} D_{R,α}[P:Q] = D_KL[P:Q], whereas the skew Bhattacharyya distances are defined for α ∈ (0,1) in general.
The Chernoff information was originally introduced to upper bound the probability of misclassification error in Bayesian binary hypothesis testing [2], where the optimal skewing parameter α* such that D_C[p,q] = D_{B,α*}[p:q] is referred to in the statistical literature as the Chernoff error exponent [13,14,15] or Chernoff exponent [16,17] for short. The Chernoff information has found many other fruitful applications beyond its original statistical hypothesis testing scope, like in computer vision [18], information fusion [19], time-series clustering [20], and more generally in machine learning [21] (just to cite a few use cases). It has been observed empirically that the Chernoff information exhibits superior robustness [22] compared to the Kullback–Leibler divergence in distributed fusion of Gaussian Mixture Models [19] (GMMs) or in target detection in radar sensor networks [23]. The Chernoff information has also been used for analyzing the deepfake detection performance of Generative Adversarial Networks [22] (GANs).

1.2. Prior Work and Contributions

The Chernoff information between any two categorical distributions (multinomial distributions with one trial, also called "multinoulli" distributions since they extend the Bernoulli distributions) has been very well studied and described in many reference textbooks of information theory or statistics (e.g., see Section 12.9 of [13]). The Chernoff information between two probability distributions of an exponential family was considered from the viewpoint of information geometry in [24], and in the general case from the viewpoint of unnormalized Rényi divergences in [11] (Theorem 32). By replacing the weighted geometric mean in the definition of the Bhattacharyya coefficient ρ_α of Equation (2) by an arbitrary weighted mean, the generalized Bhattacharyya coefficient and its associated divergences, including the Chernoff information, were generalized in [25]. The geometry of the Chernoff error exponent was studied in [26,27] when dealing with a finite set of mutually absolutely continuous probability distributions P_1, …, P_n. In this case, the Chernoff information amounts to the minimum pairwise Chernoff information of the probability distributions [28]:
D_C[P_1, …, P_n] := min_{i ≠ j, i,j ∈ {1,…,n}} D_C[P_i, P_j].
We summarize our contributions as follows: In Section 2, we study the Chernoff information between two given mutually non-singular probability measures P and Q by considering their “exponential arc” [29] as a special 1D exponential family termed a Likelihood Ratio Exponential Family (LREF) in [10]. We show that the optimal skewing value (Chernoff exponent) defining their Chernoff information is unique (Proposition 1) and can be characterized geometrically on the Banach vector space L 1 ( μ ) of equivalence classes of measurable functions (i.e., two functions f 1 and f 2 are said equivalent in L 1 ( μ ) if they are equal μ -almost everywhere, abbreviated as μ -a.e. in the remainder) for which their absolute value is Lebesgue integrable (Proposition 4). This geometric characterization allows us to design a generic dichotomic search algorithm (Algorithm 1) to approximate the Chernoff optimal skewing parameter, generalizing the prior work [24]. When P and Q belong to a same exponential family, we recover in Section 3 the results of [24]. This geometric characterization also allows us to reinterpret the Chernoff information as a minmax symmetrization of the Kullback–Leibler divergence, and we define by analogy the forward and reverse Chernoff–Bregman divergences in Section 4 (Definition 2). In Section 5, we consider the Chernoff information between Gaussian distributions: We show that the optimality condition for the Chernoff information between univariate Gaussian distributions can be solved exactly and report a closed-form formula for the Chernoff information between any two univariate Gaussian distributions (Proposition 10). For multivariate Gaussian distributions, we show how to implement the dichotomic search algorithms to approximate the Chernoff information, and report a closed-form formula for the Chernoff information between two centered multivariate Gaussian distributions with scaled covariance matrices (Proposition 11). Finally, we conclude in Section 7.

2. Chernoff Information from the Viewpoint of Likelihood Ratio Exponential Families

2.1. LREFs and the Chernoff Information

Recall that L 1 ( μ ) denotes the Lebesgue vector space of measurable functions f such that X | f | d μ < . Given two prescribed densities p and q of L 1 ( μ ) , consider building a uniparametric exponential family [30] E p q which consists of the weighted geometric mixtures of p and q:
E_pq := { (pq)_α^G(x) := p(x)^α q(x)^{1−α} / Z_pq(α) : α ∈ Θ },
where
Z_pq(α) = ∫_𝒳 p(x)^α q(x)^{1−α} dμ(x) = ρ_α[p:q]
denotes the normalizer (or partition function) of the geometric mixture
(pq)_α^G(x) ∝ p(x)^α q(x)^{1−α},
so that ∫_𝒳 (pq)_α^G dμ = 1. The parameter space Θ is defined as the set of α values which yield convergence of the definite integral Z_pq(α):
Θ := { α ∈ ℝ : Z_pq(α) < ∞ }.
Let us express the density ( p q ) α G in the canonical form (∗) of exponential families [30]:
(pq)_α^G(x) = exp( α log(p(x)/q(x)) − log Z_pq(α) ) q(x)
=:^{(∗)} exp( α t(x) − F_pq(α) + k(x) ).
It follows from this decomposition that α Θ R is the scalar natural parameter, t ( x ) = log p ( x ) q ( x ) denotes the sufficient statistic (minimal when p ( x ) q ( x ) μ -a.e.), k ( x ) = log q ( x ) is an auxiliary carrier term wrt. measure μ (i.e., measure d ν ( x ) = q ( x ) d μ ( x ) ), and
F_pq(α) = log Z_pq(α) = −D_{B,α}[p:q] ≤ 0
is the log-normalizer (or log-partition or cumulant function). Since the sufficient statistic is the logarithm of the likelihood ratio of p ( x ) and q ( x ) , Grünwald [9] (Section 19.6) termed E p q a Likelihood Ratio Exponential Family (LREF). See also [31] for applications of LREFs to Markov chain Monte Carlo (McMC) methods.
We have p = ( p q ) 1 G and q = ( p q ) 0 G . Thus, let α p = 1 and α q = 0 , and let us interpret geometrically { ( p q ) α G , α Θ } as a maximal exponential arc [29,32,33] where Θ R is an interval. We denote by E p q ¯ the open exponential arc with extremities p and q.
Since the log-normalizers F(θ) of exponential families are always strictly convex and real analytic [30] (i.e., F(θ) ∈ C^ω(ℝ)), we deduce that D_{B,α}[p:q] = −F_pq(α) is strictly concave and real analytic. Moreover, we have D_{B,0}[p:q] = D_{B,1}[p:q] = 0. Hence, the Chernoff optimal skewing parameter α* is unique when p ≠ q μ-a.e., and we get the Chernoff information calculated as
D_C[p,q] = D_{B,α*}[p:q].
See Figure 1 for a plot of the strictly concave function D_{B,α}[p:q] and the strictly convex function F_pq(α) = −D_{B,α}[p:q] when p = p_{0,1} is the standard normal density and q = p_{1,2} is a normal density of mean 1 and variance 2.
Consider the full natural parameter space Θ p q of E p q :
Θ_pq = { α ∈ ℝ : ρ_α(p:q) < +∞ }.
The natural parameter space Θ_pq is always convex [30] and since ρ_0(p:q) = ρ_1(p:q) = 1, we necessarily have (0,1) ⊂ Θ_pq but not necessarily [0,1] ⊂ Θ_pq, as detailed in the following remark:
Remark 1.
In order to be an exponential family, the densities (pq)_α^G shall have the same coinciding support for all values of α belonging to the natural parameter space. The support of the geometric mixture density (pq)_α^G is
supp((pq)_α^G) = supp(p) ∩ supp(q) for α ∈ Θ_pq \ {0,1}, supp(p) for α = 1, and supp(q) for α = 0.
This condition is trivially satisfied when the supports of p and q coincide, and therefore [0,1] ⊂ Θ_pq in that case. Otherwise, we may consider the common support 𝒳_pq = supp(p) ∩ supp(q) for α ∈ (0,1). In this latter case, we restrict the natural parameter space to Θ_pq = (0,1) even if ρ_α(p:q) < ∞ for some α outside that range.
To emphasize that α* depends on p and q, we shall use the notation α*(p:q) whenever necessary. We have α*(q:p) = 1 − α*(p:q) since D_{B,α}(p:q) = D_{B,1−α}(q:p), and we check that
D_C[p,q] = D_{B,α*(p:q)}(p:q) = D_{B,α*(q:p)}(q:p) = D_C[q,p].
Thus the skewing value α * ( q : p ) may be called the conjugate Chernoff exponent (i.e., depends on the convention chosen for interpolating on the exponential arc).
However, since the Chernoff information does not satisfy the triangle inequality, it is not a metric distance; it is thus called a quasi-distance.
Proposition 1
(Uniqueness of the Chernoff information optimal skewing parameter [11,12]). Let P and Q be two probability measures dominated by a positive measure μ with corresponding Radon–Nikodym densities p and q, respectively. The Chernoff information optimal skewing parameter α * ( p : q ) is unique when p q  μ-almost everywhere, and
D C [ p , q ] = D B , α * ( p : q ) ( p : q ) = D B , α * ( q : p ) ( q : p ) = D C [ q , p ] .
When p = q  μ-a.e., we have D C [ p : q ] = 0 and α * is undefined since it can range in [ 0 , 1 ] .
Definition 1.
An exponential family is called regular [30] when the natural parameter space Θ is open, i.e., Θ = int(Θ), where int(Θ) denotes the interior of Θ (here, an open interval).
Proposition 2
(Finite sided Kullback–Leibler divergences). When the LREF E_pq is a regular exponential family with natural parameter space Θ ⊇ [0,1], both the forward Kullback–Leibler divergence D_KL[p:q] and the reverse Kullback–Leibler divergence D_KL[q:p] are finite.
Proof. 
A reverse parameter divergence D * ( θ 1 : θ 2 ) is a parameter divergence on the swapped parameter order: D * ( θ 1 : θ 2 ) : = D ( θ 2 : θ 1 ) . Similarly, a reverse statistical divergence D * [ p : q ] is a statistical divergence on the swapped parameter order: D * [ p : q ] : = D [ q : p ] . We shall use the result pioneered in [34,35] that the KLD between two densities p θ 1 and p θ 2 of a regular exponential family E = { p θ : θ Θ } amounts to a reverse Bregman divergence ( B F ) * (i.e., a Bregman divergence on swapped parameter order) induced by the log-normalizer of the family:
D KL [ p θ 1 : p θ 2 ] = ( B F ) * ( θ 1 : θ 2 ) = B F ( θ 2 : θ 1 ) ,
where B F is the Bregman divergence defined on domain D = dom ( F ) (see Definition 1 of [36]):
B_F : D × ri(D) → [0,∞), (θ1, θ2) ↦ B_F(θ1:θ2) = F(θ1) − F(θ2) − (θ1 − θ2)ᵀ ∇F(θ2) < +∞,
where ri ( D ) denotes the relative interior of domain D. Bregman divergences are always finite and the only symmetric Bregman divergences are squared Mahalanobis distances [37] (i.e., with corresponding Bregman generators defining quadratic forms).
For completeness, we recall the proof as follows: We have
log( p_{θ1}(x) / p_{θ2}(x) ) = (θ1 − θ2)ᵀ t(x) − F(θ1) + F(θ2).
Thus we get
D_KL[p_{θ1} : p_{θ2}] = E_{p_{θ1}}[ log(p_{θ1}/p_{θ2}) ] = F(θ2) − F(θ1) + (θ1 − θ2)ᵀ E_{p_{θ1}}[t(x)],
using the linearity property of the expectation operator. When E is regular, we also have E_{p_{θ1}}[t(x)] = ∇F(θ1) (see [38]), and therefore we get
D_KL[p_{θ1} : p_{θ2}] = F(θ2) − F(θ1) − (θ2 − θ1)ᵀ ∇F(θ1) =: B_F(θ2 : θ1) = (B_F)*(θ1 : θ2).
In our LREF setting, we thus have:
D KL [ p : q ] = ( B F ) * ( α p : α q ) = B F p q ( α q : α p ) = B F p q ( 0 : 1 ) ,
and D KL [ q : p ] = B F p q ( α p : α q ) = B F p q ( 1 : 0 ) where B F p q ( α 1 : α 2 ) denotes the following scalar Bregman divergence:
B_{F_pq}(α1 : α2) = F_pq(α1) − F_pq(α2) − (α1 − α2) F'_pq(α2).
Since F_pq(0) = F_pq(1) = 0 and B_{F_pq} : Θ × ri(Θ) → [0,∞), we have
D_KL[p:q] = B_{F_pq}(α_q : α_p) = B_{F_pq}(0:1) = F'_pq(1) < ∞.
Similarly,
D_KL[q:p] = B_{F_pq}(α_p : α_q) = B_{F_pq}(1:0) = −F'_pq(0) < ∞.
Notice that since B_{F_pq}(α1:α2) > 0, we have F'_pq(1) > 0 and F'_pq(0) < 0 when p ≠ q μ-almost everywhere (a.e.). Moreover, since F_pq(α) is strictly convex, F'_pq(α) is strictly monotonically increasing, and therefore there exists a unique α* ∈ (0,1) such that F'_pq(α*) = 0.    □
Example 1.
When p and q belong to the same regular exponential family E (e.g., p and q are two normal densities), their sided KLDs [37] are both finite. The LREF induced by two Cauchy distributions p_{l1,s1} and p_{l2,s2} is such that [0,1] ⊂ Θ since the skewed Bhattacharyya distance is defined and finite for all α ∈ ℝ [39]. Therefore the KLDs between two Cauchy distributions are always finite [39]; see the closed-form formula in [40].
Remark 2.
If 0 lies in the interior of Θ, then B_{F_pq}(1:0) < ∞ and therefore D_KL[q:p] < ∞. Since the KLD between a standard Cauchy distribution p and a standard normal distribution q is +∞, we deduce that D_KL[p:q] = B_{F_pq}(0:1) cannot be finite, and therefore 1 does not lie in the interior of Θ in that case. Similarly, when 1 lies in the interior of Θ, we have B_{F_pq}(0:1) < ∞ and therefore D_KL[p:q] < ∞.
Proposition 3
(Chernoff information expressed as KLDs). (see also Theorem 32 of [11]) We have at the Chernoff information optimal skewing value α * ( 0 , 1 ) the following identities:
D C [ p : q ] = D KL [ ( p q ) α * G : p ] = D KL [ ( p q ) α * G : q ] .
Proof. 
Since the skewed Bhattacharyya distance between two densities p θ 1 and p θ 2 of an exponential family with log-normalizer F amounts to a skew Jensen divergence for the log-normalizer [8,41], we have:
D B , α ( p θ 1 : p θ 2 ) = J F , α ( θ 1 : θ 2 ) ,
where the skew Jensen divergence [42] is given by
J F , α ( θ 1 : θ 2 ) = α F ( θ 1 ) + ( 1 α ) F ( θ 2 ) F ( α θ 1 + ( 1 α ) θ 2 ) .
In the setting of the LREF, we have
D B , α ( ( p q ) α 1 G : ( p q ) α 2 G ) = J F p q , α ( α 1 : α 2 ) , = α F p q ( α 1 ) + ( 1 α ) F p q ( α 2 ) F p q ( α α 1 + ( 1 α ) α 2 ) .
At the optimal value α*, we have F'_pq(α*) = 0. Since D_KL[(pq)_{α*}^G : p] = B_{F_pq}(1:α*) = −F_pq(α*) and D_KL[(pq)_{α*}^G : q] = B_{F_pq}(0:α*) = −F_pq(α*), and D_C[p:q] = −log ρ_{α*}(p:q) = J_{F_pq,α*}(1:0) = −F_pq(α*), we get
D C [ p : q ] = D KL [ ( p q ) α * G : p ] = D KL [ ( p q ) α * G : q ] .
Figure 2 illustrates the proposition on the plot of the scalar function F p q ( α ) .    □
Corollary 1.
The Chernoff information optimal skewing value α * ( p : q ) ( 0 , 1 ) can be used to calculate the Chernoff information D C [ p , q ] as a Bregman divergence induced by the LREF:
D C [ p : q ] = B F p q [ 1 : α * ] = B F p q [ 0 : α * ] = J F p q , α * ( 1 : 0 ) .
In general, the divergence J F C ( θ 1 , θ 2 ) = max α ( 0 , 1 ) J F , α ( θ 1 : θ 2 ) is called a Jensen–Chernoff divergence.
Proposition 3 lets us interpret the Chernoff information as a special symmetrization of the Kullback–Leibler divergence [43], different from the Jeffreys divergence or the Jensen–Shannon divergence [44]. Indeed, the Chernoff information can be rewritten as
D_C[p,q] = min_{r ∈ Ē_pq} max{ D_KL[r:p], D_KL[r:q] }.
As such, we can interpret the Chernoff information as the radius of a minimum enclosing left-sided Kullback–Leibler ball on the space L 1 ( μ ) . A related concept is the radius [12] of two densities p and q with respect to Rényi divergences of order α (see Equation (2) of [12]):
r α ( p , q ) : = inf c max { D R , α [ p : c ] , D R , α [ q : c ] } .
When α = 1 , the radius is called the Shannon radius [12] since the Rényi divergences of order 1 corresponds to the Kullback–Leibler divergence (relative entropy).

2.2. Geometric Characterization of the Chernoff Information and the Chernoff Information Distribution

Let us term the probability distribution ( P Q ) α * G μ with corresponding density ( p q ) α * G the Chernoff information distribution to avoid confusion with another concept of Chernoff distributions [45] used in statistics. We can characterize geometrically the Chernoff information distribution ( p q ) α * G on L 1 ( μ ) as the intersection of a left-sided Kullback–Leibler divergence bisector:
Bi KL left ( p , q ) : = r L 1 ( μ ) : D KL [ r : p ] = D KL [ r : q ] ,
with an exponential arc [29]
γ G ( p , q ) : = ( p q ) α G : α [ 0 , 1 ] .
We thus interpret Proposition 3 geometrically by the following proposition (see Figure 3):
Proposition 4
(Geometric characterization of the Chernoff information). On the vector space L 1 ( μ ) , the Chernoff information distribution is the unique distribution
(pq)_{α*}^G = γ_G(p,q) ∩ Bi_KL^left(p,q).
The point ( p q ) α * G has been called the Chernoff point in [24].
Proposition 4 allows us to design a dichotomic search to numerically approximate α * as reported in pseudo-code in Algorithm 1 (see also the illustration in Figure 4).
Algorithm 1 Dichotomic search for approximating the Chernoff information by approximating the optimal skewing parameter value α̃ ≈ α* and reporting D_C[p:q] ≈ D_KL[(pq)_{α̃}^G : p]. The search requires ⌈log_2(1/ϵ)⌉ iterations to guarantee |α* − α̃| ≤ ϵ.
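Since the pseudocode of Algorithm 1 is given as a figure, we include a minimal numerical sketch of the dichotomic search (our own function names; SciPy quadrature over a finite window assumed suitable for well-localized univariate densities):

```python
import numpy as np
from scipy.integrate import quad

def geometric_mixture(p, q, alpha, lo=-30.0, hi=30.0):
    """Normalized geometric mixture (p q)_alpha^G; Z_pq(alpha) = rho_alpha[p:q]."""
    unnorm = lambda x: p(x) ** alpha * q(x) ** (1.0 - alpha)
    Z, _ = quad(unnorm, lo, hi)
    return (lambda x: unnorm(x) / Z), Z

def kl(r, s, lo=-30.0, hi=30.0):
    """Kullback-Leibler divergence D_KL[r : s] by quadrature."""
    val, _ = quad(lambda x: r(x) * np.log(r(x) / s(x)), lo, hi)
    return val

def chernoff_alpha(p, q, eps=1e-6):
    """Bisection on alpha using the geometric characterization of Proposition 4."""
    a, b = 0.0, 1.0
    while b - a > eps:
        alpha = 0.5 * (a + b)
        mix, _ = geometric_mixture(p, q, alpha)
        # (pq)_1^G = p: if the mixture is KL-farther from p than from q, increase alpha.
        if kl(mix, p) > kl(mix, q):
            a = alpha
        else:
            b = alpha
    return 0.5 * (a + b)

# Example 4 of Section 5.2: p = N(0,1), q = N(1,2); expect alpha* ~ 0.42156, D_C ~ 0.11554.
p = lambda x: np.exp(-x ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
q = lambda x: np.exp(-(x - 1.0) ** 2 / 4.0) / np.sqrt(4.0 * np.pi)
a_star = chernoff_alpha(p, q)
mix, _ = geometric_mixture(p, q, a_star)
print(a_star, kl(mix, p))   # approximates (alpha*, D_C[p, q])
```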
Remark 3.
We do not need to necessarily handle normalized densities p and q since we have for α R { 0 , 1 } :
( p q ) α G = ( p ˜ q ˜ ) α G ,
where p(x) = p̃(x)/Z_p and q(x) = q̃(x)/Z_q, with p̃ and q̃ denoting the computationally-friendly unnormalized positive densities. This property of geometric mixtures is used in Annealed Importance Sampling [46,47] (AIS), and for designing an asymptotically efficient estimator for computationally-intractable parametric densities [48] q̃_θ (e.g., distributions learned by Boltzmann machines).

2.3. Dual Parameterization of LREFs

The densities ( p q ) α G of a LREF can also be parameterized by their dual moment parameter [30] (or mean parameter):
β = β(α) := E_{(pq)_α^G}[t(x)] = E_{(pq)_α^G}[ log(p(x)/q(x)) ].
When the LREF is regular (and therefore steep [38]), we have β(α) = F'_pq(α) and α = (F*_pq)'(β), where F*_pq denotes the Legendre transform of F_pq. At the optimal value α*, we have F'_pq(α*) = 0. Therefore an equivalent condition of optimality is
β(α*) = F'_pq(α*) = 0 = E_{(pq)_{α*}^G}[ log(p(x)/q(x)) ].
Notice that when [0,1] ⊂ Θ, we have finite forward and reverse Kullback–Leibler divergences:
  • When α = 1, we have (pq)_1^G = p and
    β(1) = E_p[ log(p(x)/q(x)) ] = D_KL[p:q] = F'_pq(1) > 0.
  • When α = 0, we have (pq)_0^G = q and
    β(0) = E_q[ log(p(x)/q(x)) ] = −D_KL[q:p] = F'_pq(0) < 0.
Since F_pq(α) is strictly convex, we have F''_pq(α) > 0 and F'_pq is strictly increasing, with F'_pq(0) = −D_KL[q:p] < 0 and F'_pq(1) = D_KL[p:q] > 0. The value α* is thus the unique value such that F'_pq(α*) = 0.
Proposition 5
(Dual optimality condition for the Chernoff information). The unique Chernoff information optimal skewing parameter α * is such that
OC α : D KL [ ( p q ) α * G : p ] = D KL [ ( p q ) α * G : q ] OC β : β ( α * ) = E ( p q ) α * G log p ( x ) q ( x ) = 0 .
One can understand why the Chernoff information is more robust or stable than a skewed Bhattacharyya distance by considering the derivative of the skewed Bhattacharyya distance with respect to the skewing parameter. Consider without loss of generality densities p_{θ1} and p_{θ2} of a 1D exponential family. Their skewed Bhattacharyya distances amount to skew Jensen divergences, and we have:
J'_{F,α}(θ1:θ2) := d/dα J_{F,α}(θ1:θ2) = F(θ1) − F(θ2) − (θ1 − θ2) F'(αθ1 + (1−α)θ2).
Since J_{F,α*} is by definition maximal, we have J'_{F,α*}(θ1:θ2) = 0, and therefore |J'_{F,α}(θ1:θ2) − J'_{F,α*}(θ1:θ2)| = |(θ1 − θ2) ( F'(α*θ1 + (1−α*)θ2) − F'(αθ1 + (1−α)θ2) )| > 0 for α ≠ α*. Further assuming without loss of generality that θ2 − θ1 = 1, we get |J'_{F,α}(θ1:θ2)| = |F'(αθ1 + (1−α)θ2) − F'(α*θ1 + (1−α*)θ2)| > 0.
As a side remark, let us notice that the Fisher information of a likelihood ratio exponential family E_pq is
I_pq(α) = −E_{(pq)_α^G}[ (log (pq)_α^G(x))'' ] = F''_pq(α) > 0,
and we have F''_pq(α) (F*_pq)''(β) = 1.

3. Chernoff Information between Densities of an Exponential Family

3.1. General Case

We shall now consider that the densities p and q (with respect to measure μ) belong to the same exponential family [30]:
E = { P_λ : dP_λ/dμ = p_λ(x) = exp( θ(λ)ᵀ t(x) − F(θ(λ)) ), λ ∈ Λ },
where θ ( λ ) denotes the natural parameter associated with the ordinary parameter λ , t ( x ) the sufficient statistic vector and F ( θ ( λ ) ) the log-normalizer. When θ ( λ ) = λ and t ( x ) = x , the exponential family is called a natural exponential family (NEF). The exponential family E is defined by μ and t ( x ) , hence we may write when necessary E = E μ , t .
Example 2.
The set of univariate Gaussian distributions
N = { p μ , σ 2 ( x ) : λ = ( μ , σ 2 ) Λ = R × R + + }
forms an exponential family with the following decomposition terms:
λ = (μ, σ²) ∈ Λ = ℝ × ℝ_{++}, θ(λ) = (θ1, θ2) = ( μ/σ², −1/(2σ²) ) ∈ Θ = ℝ × ℝ_{−−}, t(x) = (x, x²), F(θ) = −θ1²/(4θ2) + ½ log(−π/θ2),
where ℝ_{++} = {x ∈ ℝ : x > 0} and ℝ_{−−} = {x ∈ ℝ : x < 0} denote the set of positive real numbers and the set of negative real numbers, respectively. Letting v = σ² be the variance parameter, we get the equivalent natural parameters (μ/v, −1/(2v)). The log-normalizer can be written using the (μ,v)-parameterization as F(μ,v) = ½ log(2πv) + μ²/(2v), with θ = (θ1, θ2) = (μ/v, −1/(2v)). See Appendix B for further details concerning this normal exponential family.
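As a quick numerical sanity check of this decomposition (with helper names of our own choosing, not taken from the paper), one may verify that the log-normalizer expressed in the natural parameters agrees with its (μ,v)-parameterization:

```python
import numpy as np

def natural_params(mu, v):
    """theta(lambda) = (mu/sigma^2, -1/(2 sigma^2)) with v = sigma^2."""
    return mu / v, -1.0 / (2.0 * v)

def log_normalizer(theta1, theta2):
    """F(theta) = -theta1^2/(4 theta2) + (1/2) log(-pi/theta2)."""
    return -theta1 ** 2 / (4.0 * theta2) + 0.5 * np.log(-np.pi / theta2)

mu, v = 1.5, 2.0
print(log_normalizer(*natural_params(mu, v)),
      0.5 * np.log(2.0 * np.pi * v) + mu ** 2 / (2.0 * v))   # both print F(mu, v)
```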
Notice that we can check easily that the LREF induced by two densities of an exponential family forms a 1D sub-exponential family of that exponential family:
p_{θ1}(x)^α p_{θ2}(x)^{1−α} = exp( ⟨αθ1 + (1−α)θ2, t(x)⟩ − αF(θ1) − (1−α)F(θ2) ) = p_{αθ1+(1−α)θ2}(x) exp( F(αθ1+(1−α)θ2) − αF(θ1) − (1−α)F(θ2) ) = p_{αθ1+(1−α)θ2}(x) exp( −J_{F,α}(θ1:θ2) ),
where J_{F,α} denotes the skew Jensen divergence induced by F.
The optimal skewing value condition of the Chernoff information between two categorical distributions [13] was extended to densities p_{θ1} and p_{θ2} of an exponential family in [24]. The family of categorical distributions with d choices forms an exponential family with natural parameter of dimension d − 1. Thus, Proposition 7 generalizes the analysis in [13].
Let p = p θ 1 and q = p θ 2 . Then we have the property that exponential families are closed under geometric mixtures:
( p θ 1 p θ 2 ) α G = p α θ 1 + ( 1 α ) θ 2 .
Since the natural parameter space Θ is convex, we have αθ1 + (1−α)θ2 ∈ Θ.
The KLD between two densities p θ 1 and p θ 2 of a regular exponential family E amounts to a reverse Bregman divergence for the log-normalizer of E :
D KL [ p θ 1 : p θ 2 ] = B F ( θ 2 : θ 1 ) ,
where B F ( θ 2 : θ 1 ) denotes the Bregman divergence:
B_F(θ2 : θ1) = F(θ2) − F(θ1) − (θ2 − θ1)ᵀ ∇F(θ1).
Thus, when the exponential family E is regular, both the forward and reverse KLD are finite, and we can rewrite Proposition 3 to characterize α * as follows:
OC EF : B F ( θ 1 : θ α * ) = B F ( θ 2 : θ α * ) ,
where θ α * = α * θ 1 + ( 1 α * ) θ 2 .
The Legendre–Fenchel transform of F ( θ ) yields the convex conjugate
F * ( η ) = sup θ Θ { θ η F ( θ ) }
with η ( θ ) = F ( θ ) . Let H = { η ( θ ) : θ Θ } denote the dual moment parameter space also called domain of means. The Legendre transform associates to ( Θ , F ( θ ) ) the convex conjugate ( H , F * ( η ) ) . In order for ( H , F * ( η ) ) to be of the same well-behaved type of ( Θ , F ( θ ) ) , we shall consider convex functions F ( θ ) which are steep, meaning that their gradient diverges when nearing the boundary bd ( Θ )  [49] and thus ensures that domain H is also convex. Steep convex functions are said of Legendre-type, and ( ( Θ , F ( θ ) ) * ) * = ( Θ , F ( θ ) ) (Moreau biconjugation theorem which shows that the Legendre transform is involutive). For Legendre-type functions, there is a one-to-one mapping between parameters θ ( η ) and parameters η ( θ ) as follows:
θ(η) = ∇F*(η) = (∇F)^{−1}(η),
and
η(θ) = ∇F(θ) = (∇F*)^{−1}(θ).
Exponential families with log-normalizers of Legendre-type are called steep exponential families [30]. All regular exponential families are steep, and the maximum likelihood estimator in steep exponential families exists and is unique [38] (with the likelihood equations corresponding to the method of moments for the sufficient statistics). The set of inverse Gaussian distributions form a non-regular but steep exponential family, and the set of singly truncated normal distributions form a non-regular and non-steep exponential family [50] (but the exponential family of doubly truncated normal distributions is regular and hence steep).
For Legendre-type convex generators F(θ), we can express the Bregman divergence B_F(θ1:θ2) using the dual Bregman divergence: B_F(θ1:θ2) = B_{F*}(η2:η1), since there is a one-to-one correspondence between η = ∇F(θ) and θ = ∇F*(η).
For Legendre-type generators F ( θ ) , the Bregman divergence B F ( θ 1 : θ 2 ) can be rewritten as the following Fenchel–Young divergence:
B_F(θ1 : θ2) = F(θ1) + F*(η2) − θ1ᵀ η2 =: Y_{F,F*}(θ1 : η2).
Proposition 6.
(KLD between densities of a regular (and steep) exponential family). The KLD between two densities p θ 1 and p θ 2 of a regular and steep exponential family can be obtained equivalently as
D KL [ p θ 1 : p θ 2 ] = B F ( θ 2 : θ 1 ) = Y F , F * ( θ 2 : η 1 ) = Y F * , F ( η 1 : θ 2 ) = B F * ( η 1 : η 2 ) ,
where F ( θ ) and its convex conjugate F * ( η ) are Legendre-type functions.
Figure 5 illustrates the taxonomy of regularity and steepness of exponential families by a Venn diagram.
It follows that the optimal condition of Equation (18) can be restated as
OC YF : Y F , F * ( θ 1 : η α * ) = Y F , F * ( θ 2 : η α * ) ,
where η_{α*} = ∇F(α*θ1 + (1−α*)θ2). From the equality of Equation (22), we get the following simplified optimality condition:
OC_SEF: (θ2 − θ1)ᵀ η_{α*} = F(θ2) − F(θ1),
where η_{α*} = ∇F(α*θ1 + (1−α*)θ2).
Remark 4.
We can recover (OC_SEF) by instantiating the equivalent condition E_{p_{θ̄_{α*}}}[ log(p_{θ1}/p_{θ2}) ] = 0. Indeed, since log(p_{θ1}/p_{θ2}) = (θ1 − θ2)ᵀ t(x) − F(θ1) + F(θ2), we get
E_{p_{θ̄_{α*}}}[ (θ1 − θ2)ᵀ t(x) − F(θ1) + F(θ2) ] = 0, i.e., (θ1 − θ2)ᵀ η̄_{α*} = F(θ1) − F(θ2).
Since the α-skewed Bhattacharyya distance amounts to an α-skewed Jensen divergence [8], we get the Chernoff information as
D C [ p λ 1 : p λ 2 ] = J F , α * ( θ ( λ 1 ) : θ ( λ 2 ) ) , = B F ( θ 1 : θ α * ) = B F ( θ 2 : θ α * ) ,
where J F , α ( θ 1 : θ 2 ) is the Jensen divergence:
J F , α ( θ 1 : θ 2 ) = α F ( θ 1 ) + ( 1 α ) F ( θ 2 ) F ( α θ 1 + ( 1 α ) θ 2 ) .
Notice that the induced LREF has its log-normalizer expressed as the negative skew Jensen divergence induced by the log-normalizer of E:
F_{p_{θ1} p_{θ2}}(α) = log ρ_α[p_{θ1} : p_{θ2}] = −J_{F,α}(θ1 : θ2).
We summarize the result in the following proposition:
Proposition 7.
Let p λ 1 and p λ 2 be two densities of a regular exponential family E with natural parameter θ ( λ ) and log-normalizer F ( θ ) . Then the Chernoff information is
D C [ p λ 1 : p λ 2 ] = J F , α * ( θ ( λ 1 ) : θ ( λ 2 ) ) = B F ( θ 1 : θ α * ) = B F ( θ 2 : θ α * ) ,
where θ 1 = θ ( λ 1 ) , θ 2 = θ ( λ 2 ) , and the optimal skewing parameter α * is unique and satisfies the following optimality condition:
OC EF : ( θ 2 θ 1 ) η α * = F ( θ 2 ) F ( θ 1 ) ,
where η_{α*} = ∇F(α*θ1 + (1−α*)θ2) = E_{p_{α*θ1+(1−α*)θ2}}[t(x)].
Figure 6 illustrates geometrically the Chernoff point [24] which is the geometric mixture ( p θ 1 p θ 2 ) α * induced by two comparable probability measures P θ 1 , P θ 2 μ .
In information geometry [51], the manifold of densities M = {p_θ : θ ∈ Θ} of this exponential family is a dually flat space [51] M = ({p_θ}, g_F(θ) = ∇²F(θ), ∇^m, ∇^e) with respect to the exponential connection ∇^e and the mixture connection ∇^m, where g_F(θ) is the Fisher information metric expressed in the θ-coordinate system as ∇²F(θ) (and in the dual moment parameter η as g_F(η) = ∇²F*(η)). The exponential geodesic is ∇^e-flat and corresponds to the exponential arc of geometric mixtures when parameterized with the ∇^e-affine coordinate system θ.
The left-sided Kullback–Leibler bisector
Bi_KL^left(p_{θ1}, p_{θ2}) = { p_θ : D_KL[p_θ : p_{θ1}] = D_KL[p_θ : p_{θ2}] }
corresponds to a right-sided Bregman bisector [52] and is ∇^m-flat (i.e., an affine subspace in the η-coordinate system):
Bi_F^right(θ1, θ2) = { θ ∈ Θ : B_F(θ1 : θ) = B_F(θ2 : θ) }.
The Chernoff information distribution ( p θ 1 p θ 2 ) α * G is called the Chernoff point on this exponential family manifold (see Figure 6). Since the Chernoff point is unique and since in general statistical manifolds ( M , g , , * ) can be realized by statistical models [53], we deduce the following proposition of interest for information geometry [51]:
Proposition 8.
Let ( M , g , , * ) be a dually flat space with corresponding canonical divergence a Bregman divergence B F . Let γ p q e ( α ) and γ p q m ( α ) be a e-geodesic and m-geodesic passing through the points p and q of M , respectively. Let Bi m ( p , q ) and Bi e ( p , q ) be the right-sided m -flat and left-sided e -flat Bregman bisectors, respectively. Then the intersection of γ p q e ( α ) with Bi m ( p , q ) and the intersection of γ p q m ( α ) with Bi e ( p , q ) are unique. The point γ p q e ( α ) Bi m ( p , q ) is called the Chernoff point and the point γ p q m ( α ) Bi e ( p , q ) is termed the reverse or dual Chernoff point.

3.2. Case of One-Dimensional Parameters

When the exponential family has a one-dimensional natural parameter (Θ ⊂ ℝ), denoting by α1 and α2 the natural parameters of the two densities, we thus get from OC_SEF:
η_{α*} = ( F(α2) − F(α1) ) / ( α2 − α1 ).
That is, α* can be obtained as the following closed-form formula:
α* = ( (F')^{−1}( (F(α2) − F(α1)) / (α2 − α1) ) − α2 ) / ( α1 − α2 ).
Example 3.
Consider the exponential family { p v ( x ) : v > 0 } of 0-centered scale univariate normal distributions with variance v = σ 2 and density
p_v(x) = 1/√(2πv) exp( −x²/(2v) ).
The natural parameter corresponding to the sufficient statistic t(x) = x² is θ = −1/(2v). The log-normalizer is F(θ) = ½ log(−π/θ). We have η = F'(θ) = −1/(2θ) and (F')^{−1}(η) = −1/(2η). It follows that
α*(p_{v1} : p_{v2}) = ( v1 log(v1/v2) + v2 − v1 ) / ( (v2 − v1) log(v2/v1) ).
Let s = v2/v1. Then we can rewrite α* as
α*(p_{v1} : p_{v2}) = ( s − 1 − log s ) / ( (s − 1) log s ).
The Chernoff information is D_C[p_{v1}, p_{v2}] = −log ρ_{α*}[p_{v1}, p_{v2}], with
ρ_α[p_{v1}, p_{v2}] = σ1^{1−α} σ2^{α} / √( (1−α)σ1² + ασ2² ).
This result will be generalized in Proposition 11 to multivariate centered Gaussians with scaled covariance matrices.
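A minimal numerical check of these closed forms (with function names of our own choosing), cross-validated against the KLD from the Chernoff distribution to p_{v1}:

```python
import numpy as np

def chernoff_scale_normals(v1, v2):
    """Closed-form alpha* and Chernoff information between N(0, v1) and N(0, v2)."""
    s = v2 / v1
    alpha = (s - 1.0 - np.log(s)) / ((s - 1.0) * np.log(s))
    # rho_alpha[p_{v1}, p_{v2}] = sigma1^(1-alpha) sigma2^alpha / sqrt((1-alpha) v1 + alpha v2)
    rho = (np.sqrt(v1) ** (1.0 - alpha)) * (np.sqrt(v2) ** alpha) \
          / np.sqrt((1.0 - alpha) * v1 + alpha * v2)
    # cross-check: D_C = KL between the Chernoff distribution N(0, v*) and N(0, v1)
    v_star = v1 * v2 / (alpha * v2 + (1.0 - alpha) * v1)
    kl_check = 0.5 * (v_star / v1 - np.log(v_star / v1) - 1.0)
    return alpha, -np.log(rho), kl_check

print(chernoff_scale_normals(1.0, 4.0))   # alpha* ~ 0.388; the two D_C values agree (~0.117)
```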
For multi-dimensional parameters θ, we may consider the one-dimensional LREF E_{p_{θ1} p_{θ2}} induced by p_{θ1} and p_{θ2} with F_{θ1,θ2}(α) = F((1−α)θ1 + αθ2), and write F'_pq(α) as the following directional derivative:
∇_{θ2−θ1} F_{θ1,θ2}(α) := lim_{ϵ→0} ( F(θ1 + (ϵ+α)(θ2−θ1)) − F(θ1 + α(θ2−θ1)) ) / ϵ
= (θ2 − θ1)ᵀ ∇F(θ1 + α(θ2−θ1)),
using a first-order Taylor expansion. Thus, the optimality condition
OC_SEF: F'_{θ1,θ2}(α) = 0
amounts to
OC_SEF: (θ2 − θ1)ᵀ ∇F(θ1 + α*(θ2−θ1)) = F(θ2) − F(θ1).
This is equivalent to Equation (8) of [24].
Remark 5.
In general, we may consider multivariate Bregman divergences as univariate Bregman divergences: We have
B F ( θ 1 : θ 2 ) = B F θ 1 , θ 2 ( 0 : 1 ) , θ 1 , θ 2 Θ
where
F θ 1 , θ 2 ( u ) : = F ( θ 1 + u ( θ 2 θ 1 ) ) .
The functions F_{θ1,θ2} are 1D Bregman generators (i.e., strictly convex and C¹), and we have the directional derivative
∇_{θ2−θ1} F_{θ1,θ2}(u) = lim_{ϵ→0} ( F(θ1 + (ϵ+u)(θ2−θ1)) − F(θ1 + u(θ2−θ1)) ) / ϵ = (θ2 − θ1)ᵀ ∇F(θ1 + u(θ2−θ1)).
Since F_{θ1,θ2}(0) = F(θ1), F_{θ1,θ2}(1) = F(θ2), and F'_{θ1,θ2}(u) = ∇_{θ2−θ1} F_{θ1,θ2}(u), it follows that
B_{F_{θ1,θ2}}(0:1) = F_{θ1,θ2}(0) − F_{θ1,θ2}(1) − (0 − 1) ∇_{θ2−θ1} F_{θ1,θ2}(1) = F(θ1) − F(θ2) + (θ2 − θ1)ᵀ ∇F(θ2) = B_F(θ1 : θ2).
Similarly, we can reparameterize Bregman divergences on a k-dimensional simplex by k-dimensional Bregman divergences.
Remark 6.
Closing the loop: although the Chernoff information was obtained here from one-dimensional likelihood ratio exponential families, it yields as a corollary the formula for general multi-parametric exponential families, which include as special instances the one-dimensional exponential families (e.g., LREFs!).

4. Forward and Reverse Chernoff–Bregman Divergences

In this section, we shall define Chernoff-type symmetrizations of Bregman divergences inspired by the study of Chernoff information, and briefly mention applications of these Chernoff–Bregman divergences in information theory.

4.1. Chernoff–Bregman Divergence

Let us define a Chernoff-like symmetrization of Bregman divergences [43] different from the traditional Jeffreys–Bregman symmetrization:
B_F^J(θ1:θ2) = B_F(θ1:θ2) + B_F(θ2:θ1) = (θ1 − θ2)ᵀ ( ∇F(θ1) − ∇F(θ2) ),
or Jensen–Shannon-type symmetrization [44,54] which yields a Jensen divergence [42]:
B_F^JS(θ1:θ2) = ½ ( B_F(θ1 : (θ1+θ2)/2) + B_F(θ2 : (θ1+θ2)/2) ) = ( F(θ1) + F(θ2) )/2 − F( (θ1+θ2)/2 ) =: J_F(θ1, θ2).
Definition 2
(Chernoff–Bregman divergence). Let the Chernoff symmetrization of Bregman divergence B F ( θ 1 ; θ 2 ) be the forward Chernoff–Bregman divergence C F ( θ 1 , θ 2 ) defined by
C F ( θ 1 , θ 2 ) = max α ( 0 , 1 ) J F , α ( θ 1 : θ 2 ) ,
where J F , α is the α-skewed Jensen divergence.
The optimization problem in Equation (31) may be equivalently rewritten [43] as minimizing R over (θ, R) such that both B_F(θ1:θ) ≤ R and B_F(θ2:θ) ≤ R. Thus, the optimal value of α defines the circumcenter θ* = αθ1 + (1−α)θ2 of the minimum enclosing right-sided Bregman sphere [55,56], and the Chernoff–Bregman divergence:
C_F(θ1,θ2) = min_θ max{ B_F(θ1:θ), B_F(θ2:θ) },
corresponds to the radius of a minimum enclosing Bregman ball. To summarize, this Chernoff symmetrization is a min–max symmetrization, and we have the following identities:
C_F(θ1,θ2) = min_θ max{ B_F(θ1:θ), B_F(θ2:θ) } = min_{θ∈Θ} max_{α∈[0,1]} { α B_F(θ1:θ) + (1−α) B_F(θ2:θ) } = max_{α∈(0,1)} { α B_F(θ1 : αθ1+(1−α)θ2) + (1−α) B_F(θ2 : αθ1+(1−α)θ2) } = max_{α∈(0,1)} J_{F,α}(θ1:θ2).
The second identity shows that the Chernoff symmetrization can be interpreted as a variational Jensen–Shannon-type divergence [54].
Notice that in general C_F(θ1,θ2) ≠ C_{F*}(η1,η2) because the primal and dual geodesics do not coincide. Those geodesics coincide only for symmetric Bregman divergences, which are squared Mahalanobis divergences [52].
When F ( θ ) = F Shannon ( θ ) = i = 1 D θ i log θ i (discrete Shannon negentropy), the Chernoff–Bregman divergence is related to the capacity of a discrete memoryless channel in information theory [13,43].
Conditions for which C_F(θ1,θ2)^a (with a > 0) becomes a metric have been studied in [43]: For example, C_{F_Shannon}^{1/e} is a metric distance [43] (i.e., a = 1/e ≈ 0.36787944117). It is also known that the square root of the Chernoff distance between two univariate normal distributions is a metric distance [57].
We can thus use the Bregman generalization of the Badoiu–Clarkson (BC) algorithm [55] (Algorithm 2) to compute an approximation of the smallest enclosing Bregman ball which in turn yields an approximation of the Chernoff–Bregman divergence:    
Algorithm 2 Approximating the circumcenter of the Bregman smallest enclosing ball of two parameters θ 1 and θ 2 .
    Notice that when there are only two points to compute their smallest enclosing Bregman ball, all the arcs ( θ ( i 1 ) θ f i ) G are sub-arcs of the exponential arc ( θ 1 θ 2 ) G . See [55] for convergence results of this iterative algorithm. Let us notice that Algorithm 1 approximates α * while the Bregman BC Algorithm 2 approximates in spirit D C ( θ 1 , θ 2 ) (and as a byproduct  α * ).
Remark 7.
To compute the farthest point to the current circumcenter with respect to Bregman divergence, we need to find the sign of
B_F(θ2:θ) − B_F(θ1:θ) = F(θ2) − F(θ1) − (θ2 − θ1)ᵀ ∇F(θ).
Thus we need to pre-calculate only once F ( θ 1 ) and F ( θ 2 ) which can be costly (e.g., log det Σ functions need to be calculated only once when approximating the Chernoff information between Gaussians).
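Since the pseudocode of Algorithm 2 is given as a figure, here is a hedged sketch of one Bregman Badoiu–Clarkson iteration for two parameters, under the assumption (one standard reading of [55]) that the center walks a 1/(i+1) fraction toward the current farthest point along the θ-segment; the farthest point is determined with the sign trick of Remark 7, and F, ∇F are user-supplied:

```python
import numpy as np

def bregman(F, gradF, t1, t2):
    """B_F(t1 : t2) = F(t1) - F(t2) - <t1 - t2, gradF(t2)>."""
    return F(t1) - F(t2) - np.dot(t1 - t2, gradF(t2))

def bregman_bc_two_points(F, gradF, theta1, theta2, iters=2000):
    """Approximate circumcenter and radius of the smallest enclosing right-sided Bregman ball."""
    c = np.array(theta1, dtype=float)
    for i in range(1, iters + 1):
        # Remark 7: sign of B_F(theta2:c) - B_F(theta1:c) = F(theta2) - F(theta1) - <theta2 - theta1, gradF(c)>
        gap = F(theta2) - F(theta1) - np.dot(theta2 - theta1, gradF(c))
        far = theta2 if gap > 0 else theta1
        c = c + (far - c) / (i + 1.0)      # walk along the theta-segment (exponential arc)
    return c, max(bregman(F, gradF, theta1, c), bregman(F, gradF, theta2, c))

# Sanity check with F(t) = 0.5 ||t||^2 (B_F is then half the squared Euclidean distance):
# the center tends to the midpoint of the two points.
F = lambda t: 0.5 * float(np.dot(t, t))
gradF = lambda t: t
print(bregman_bc_two_points(F, gradF, np.array([0.0, 0.0]), np.array([2.0, 0.0])))
```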

4.2. Reverse Chernoff–Bregman Divergence and Universal Coding

Similarly, we may define the reverse Chernoff–Bregman divergence by considering the minimum enclosing left-sided Bregman ball:
C_F^R(θ1,θ2) = min_θ max{ B_F(θ:θ1), B_F(θ:θ2) }.
Thus the reverse Bregman Chernoff divergence D C R [ θ 1 , θ 2 ] = R * is the radius of a minimum enclosing left-sided Bregman ball.
This reverse Chernoff–Bregman divergence finds application in universal coding in information theory (Chapter 13 of [13], pp. 428–433): Let 𝒳 = {A_1, …, A_d} be a finite discrete alphabet of d letters, and let X be a random variable with probability mass function p on 𝒳. Let p_λ(x) denote the categorical distribution of X so that Pr(X = A_i) = p_λ(A_i), with λ = (λ_1, …, λ_d) ∈ ℝ_{++}^d and Σ_{i=1}^d λ_i = 1. The Huffman codeword for x ∈ 𝒳 has length l(x) = −log p(x) (ignoring the integer ceiling rounding), and the expected codeword length of X is thus given by Shannon's entropy H(X) = −Σ_x p(x) log p(x).
If we code according to a distribution p_{λ'} instead of the true distribution p_λ, the code is not optimal, and the redundancy R(p_λ, p_{λ'}) is defined as the difference between the expected codeword lengths under p_{λ'} and under p_λ:
R(p_λ, p_{λ'}) = ( −E_{p_λ}[log p_{λ'}(x)] ) − ( −E_{p_λ}[log p_λ(x)] ) = D_KL[p_λ : p_{λ'}] ≥ 0,
where D_KL is the Kullback–Leibler divergence.
Now, suppose that the true distribution p_λ is one of two prescribed distributions, but we do not know which one: p_λ ∈ P = {p_{λ_1}, p_{λ_2}}. Then we seek the minimax redundancy:
R* = min_{p_{λ'}} max_{i∈{1,2}} D_KL[p_{λ_i} : p_{λ'}].
The distribution p λ * achieving the minimax redundancy is the circumcenter of the right-centered KL ball enclosing the distributions P .
Using the natural coordinates θ = (θ_1, …, θ_D) ∈ ℝ^D with θ_i = log(λ_i/λ_d) of the categorical distributions (an exponential family of order D = d − 1), we end up with calculating the smallest enclosing left-sided Bregman ball for the Bregman generator [58] F_Categorical(θ) = log(1 + Σ_{i=1}^D exp θ_i):
R* = min_{θ∈ℝ^D} max_{i∈{1,2}} B_{F_Categorical}(θ : θ_i).
This latter minimax problem is unconstrained since θ R D = R d 1 .
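As an illustration (a sketch with our own function names, not code from the paper), the minimax redundancy for two known categorical distributions can be computed by a simple bisection: the optimal coding distribution is the KL circumcenter, which equalizes the two redundancies and lies on the mixture segment between the two probability vectors (i.e., on the m-geodesic in the mean parameters):

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def minimax_redundancy(p1, p2, eps=1e-10):
    """R* = min_q max_i KL[p_i : q]; the optimal q is the equalizing mixture of p1 and p2."""
    lo, hi = 0.0, 1.0                    # weight of p1 in the mixture q_w = w p1 + (1-w) p2
    while hi - lo > eps:
        w = 0.5 * (lo + hi)
        qw = w * p1 + (1.0 - w) * p2
        # KL[p1:q_w] decreases in w while KL[p2:q_w] increases: bisect on their difference.
        if kl(p1, qw) > kl(p2, qw):
            lo = w
        else:
            hi = w
    w = 0.5 * (lo + hi)
    qw = w * p1 + (1.0 - w) * p2
    return max(kl(p1, qw), kl(p2, qw)), qw

p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.1, 0.3, 0.6])
print(minimax_redundancy(p1, p2))   # minimax redundancy R* and the optimal coding distribution
```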

5. Chernoff Information between Gaussian Distributions

5.1. Invariance of Chernoff Information under the Action of the Affine Group

The d-variate Gaussian density p λ ( x ) with parameter λ = ( λ v = μ , λ M = Σ ) where μ R d denotes the mean ( μ = E p λ [ x ] ) and Σ is a positive-definite covariance matrix ( Σ = Cov p λ [ X ] for X p λ ) is given by
p_λ(x; λ) = 1/( (2π)^{d/2} √|λ_M| ) exp( −½ (x − λ_v)ᵀ λ_M^{−1} (x − λ_v) ),
where |·| denotes the matrix determinant. The set of d-variate Gaussian distributions forms a regular (and hence steep) exponential family with natural parameters θ(λ) = ( λ_M^{−1} λ_v, −½ λ_M^{−1} ) and sufficient statistics t(x) = (x, x xᵀ).
The Bhattacharyya distance between two multivariate Gaussian distributions p_{μ1,Σ1} and p_{μ2,Σ2} is
D_{B,α}[p_{μ1,Σ1}, p_{μ2,Σ2}] = ½ ( α μ1ᵀ Σ1^{−1} μ1 + (1−α) μ2ᵀ Σ2^{−1} μ2 − μ_αᵀ Σ_α^{−1} μ_α + log( |Σ1|^α |Σ2|^{1−α} / |Σ_α| ) ),
where
Σ_α = ( α Σ1^{−1} + (1−α) Σ2^{−1} )^{−1}, μ_α = Σ_α ( α Σ1^{−1} μ1 + (1−α) Σ2^{−1} μ2 ).
The Gaussian density can be rewritten as a multivariate location-scale family:
p_λ(x; λ) = |λ_M|^{−1/2} p_std( λ_M^{−1/2} (x − λ_v) ),
where
p_std(x) = 1/(2π)^{d/2} exp( −½ xᵀ x ) = p_{(0,I)}
denotes the standard multivariate Gaussian density. The matrix λ_M^{1/2} is the unique symmetric square-root matrix which is positive-definite when λ_M is positive-definite.
Remark 8.
Notice that the product of two symmetric positive-definite matrices P_1 and P_2 may not be symmetric, but P_1^{1/2} P_2 P_1^{1/2} is always symmetric positive-definite, and the eigenvalues of P_1^{1/2} P_2 P_1^{1/2} coincide with the eigenvalues of P_1 P_2. Hence, we have λ_sp(P_1^{−1/2} P_2 P_1^{−1/2}) = λ_sp(P_1^{−1} P_2), where λ_sp(M) denotes the eigenspectrum of matrix M.
We may interpret the Gaussian family as obtained by the action of the affine group Aff ( R d ) = R d GL d ( R ) on the standard density p std : Let the dot symbol “.” denotes the group action. The affine group is equipped with the following (outer) semidirect product:
( l 1 , A 1 ) . ( l 2 , A 2 ) = ( l 1 + A 1 l 2 , A 1 A 2 ) ,
and this group can be handled as a matrix group with the following mapping of its elements to matrices:
(l, A) ↦ [ A  l ; 0  1 ].
Then we have
p ( μ , Σ ) ( x ) = ( μ , Σ 1 2 ) . p std ( x ) = ( μ , Σ 1 2 ) . p 0 , I ( x ) = p ( μ , Σ 1 2 ) . ( 0 , I ) ( x ) .
We can show the following invariance of the skewed Bhattacharyya divergences:
Proposition 9
(Invariance of the Bhattacharyya divergence and f-divergences under the action of the affine group (Equation (35))). We have
D_{B,α}[ (μ,Σ^{1/2}).p_{μ1,Σ1} : (μ,Σ^{1/2}).p_{μ2,Σ2} ] := D_{B,α}[ p_{(μ,Σ^{1/2}).(μ1,Σ1)} : p_{(μ,Σ^{1/2}).(μ2,Σ2)} ] = D_{B,α}[ p_{Σ^{−1/2}(μ1−μ), Σ^{−1/2}Σ1Σ^{−1/2}} : p_{Σ^{−1/2}(μ2−μ), Σ^{−1/2}Σ2Σ^{−1/2}} ] = D_{B,α}[ p_{μ1,Σ1} : p_{μ2,Σ2} ].
Proof. 
The proof follows from the (f,g)-form of Ali and Silvey's divergences [59]. We can express D_{B,α}[p:q] = g(I_{h_α}[p:q]) where h_α(u) = −u^α (convex for α ∈ (0,1)) and g(v) = −log(−v). Then we rely on the proof of invariance of f-divergences under the action of the affine group (see Proposition 3 of [60], relying on a change of variable in the integral):
I_f[p_{μ1,Σ1} : p_{μ2,Σ2}] = I_f[ p_{0,I} : p_{Σ1^{−1/2}(μ2−μ1), Σ1^{−1/2}Σ2Σ1^{−1/2}} ] = I_f[ p_{Σ2^{−1/2}(μ1−μ2), Σ2^{−1/2}Σ1Σ2^{−1/2}} : p_{0,I} ],
where I denotes the identity matrix.    □
Thus, by choosing ( μ , Σ ) = ( μ 1 , Σ 1 ) and ( μ , Σ ) = ( μ 2 , Σ 2 ) , we obtain the following corollary:
Corollary 2
(Bhattacharyya divergence from canonical Bhattacharyya divergences). We have
D_{B,α}[p_{μ1,Σ1} : p_{μ2,Σ2}] = D_{B,α}[ p_{0,I} : p_{Σ1^{−1/2}(μ2−μ1), Σ1^{−1/2}Σ2Σ1^{−1/2}} ] = D_{B,α}[ p_{Σ2^{−1/2}(μ1−μ2), Σ2^{−1/2}Σ1Σ2^{−1/2}} : p_{0,I} ].
It follows that the Chernoff optimal skewing parameter enjoys the same invariance property:
α*(p_{μ1,Σ1} : p_{μ2,Σ2}) = α*( p_{0,I} : p_{Σ1^{−1/2}(μ2−μ1), Σ1^{−1/2}Σ2Σ1^{−1/2}} ) = α*( p_{Σ2^{−1/2}(μ1−μ2), Σ2^{−1/2}Σ1Σ2^{−1/2}} : p_{0,I} ).
As a byproduct, we get the invariance of the Chernoff information under the action of the affine group:
Corollary 3
(Invariance of the Chernoff information under the action of the affine group). We have:
D_C[p_{μ1,Σ1}, p_{μ2,Σ2}] = D_C[ p_{0,I}, p_{Σ1^{−1/2}(μ2−μ1), Σ1^{−1/2}Σ2Σ1^{−1/2}} ] = D_C[ p_{Σ2^{−1/2}(μ1−μ2), Σ2^{−1/2}Σ1Σ2^{−1/2}}, p_{0,I} ].
Thus, the formula for the Chernoff information between two Gaussians
D C ( μ 1 , Σ 1 , μ 2 , Σ 2 ) : = D C [ p μ 1 , Σ 1 , p μ 2 , Σ 2 ] = D C ( μ 12 , Σ 12 )
can be written as a function of the two terms μ12 = Σ1^{−1/2}(μ2 − μ1) and Σ12 = Σ1^{−1/2} Σ2 Σ1^{−1/2}.

5.2. Closed-Form Formula for the Chernoff Information between Univariate Gaussian Distributions

We shall report the exact solution for the Chernoff information between univariate Gaussian distributions by solving a quadratic equation. We can also report a complex closed-form formula by using symbolic computing because the calculations are lengthy and thus prone to human error.
Instantiating Equation (24) for the case of univariate Gaussian distributions parameterized by (μ, σ²), we get the following equation for the optimality condition of α*:
⟨θ2 − θ1, η_{α*}⟩ = F(θ2) − F(θ1),
⟨( μ2/σ2² − μ1/σ1², 1/(2σ1²) − 1/(2σ2²) ), ( m_{α*}, v_{α*} + m_{α*}² )⟩ = ½ log(σ2²/σ1²) + μ2²/(2σ2²) − μ1²/(2σ1²),
where ⟨·,·⟩ denotes the scalar product, (m_α, v_α + m_α²) is the moment parameter of the geometric mixture, and the interpolated mean and variance along the exponential arc {(m_α, v_α)}_{α∈(0,1)} passing through (μ1, σ1²) when α = 1 and (μ2, σ2²) when α = 0 are given by
m_α = ( α μ1 σ2² + (1−α) μ2 σ1² ) / ( (1−α) σ1² + α σ2² ) = ( α (μ1 σ2² − μ2 σ1²) + μ2 σ1² ) / ( σ1² + α (σ2² − σ1²) ),
v_α = σ1² σ2² / ( (1−α) σ1² + α σ2² ) = σ1² σ2² / ( σ1² + α (σ2² − σ1²) ).
That is, for p = p μ 1 , σ 1 2 and q = p μ 2 , σ 2 2 , we have the weighted geometric mixture ( p q ) α G = p m α , v α .
Thus, the optimality condition for the Chernoff optimal skewing parameter is given by:
OC_Gaussian: ( μ2/σ2² − μ1/σ1² ) m_α + ( 1/(2σ1²) − 1/(2σ2²) ) ( v_α + m_α² ) = ½ log(σ2²/σ1²) + μ2²/(2σ2²) − μ1²/(2σ1²).
Let us rewrite compactly Equation (40) as
OC_Gaussian: a_{12} m_α + b_{12} ( v_α + m_α² ) + c_{12} = 0,
with the following coefficients:
a_{12} = μ2/σ2² − μ1/σ1²,
b_{12} = 1/(2σ1²) − 1/(2σ2²),
c_{12} = ½ log(σ1²/σ2²) + μ1²/(2σ1²) − μ2²/(2σ2²).
By multiplying both sides of Equation (41) by (σ1² + α Δv)², where Δv := σ2² − σ1², and rearranging terms, we get a quadratic equation in α whose unique root in (0,1) is α*.
Using the computer algebra system (CAS) Maxima, we can also solve exactly this quadratic equation in α as a function of μ 1 , σ 1 2 , μ 2 , and σ 2 2 : See listing in Appendix C.
Once we get the optimal value α* = α*(μ1, σ1², μ2, σ2²), we get the Chernoff information as
D_C[p_{μ1,σ1²}, p_{μ2,σ2²}] = D_KL[ p_{m_{α*}, v_{α*}} : p_{μ1,σ1²} ],
with the Kullback–Leibler divergence between two univariate Gaussian distributions p_{μ1,σ1²} and p_{μ2,σ2²} given by
D_KL[p_{μ1,σ1²} : p_{μ2,σ2²}] = ½ ( (μ2 − μ1)²/σ2² + σ1²/σ2² − log(σ1²/σ2²) − 1 ).
Notice that from the invariance of Proposition 9, we have for any (μ, σ²) ∈ ℝ × ℝ_{++}:
D_KL[p_{μ1,σ1²} : p_{μ2,σ2²}] = D_KL[ p_{(μ1−μ)/σ, σ1²/σ²} : p_{(μ2−μ)/σ, σ2²/σ²} ],
and therefore by choosing (μ, σ²) = (μ1, σ1²), we have
D_KL[p_{μ1,σ1²} : p_{μ2,σ2²}] = D_KL[ p_{0,1} : p_{(μ2−μ1)/σ1, σ2²/σ1²} ].
Proposition 10.
The Chernoff information between two univariate Gaussian distributions can be calculated exactly in closed form.
One can also program these closed-form solutions in Python using the SymPy package (https://www.sympy.org/en/index.html (accessed on 30 July 2022)) for performing symbolic computations.
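For instance, the following small SymPy sketch (with our own variable names, offered as an alternative to the Maxima listing of Appendix C) solves the quadratic optimality condition symbolically and evaluates α* and the Chernoff information on Example 4 below:

```python
import sympy as sp

mu1, mu2 = sp.symbols('mu1 mu2', real=True)
v1, v2 = sp.symbols('v1 v2', positive=True)
a = sp.symbols('alpha', real=True)

den = (1 - a) * v1 + a * v2
m = (a * mu1 * v2 + (1 - a) * mu2 * v1) / den    # mean of the geometric mixture
v = v1 * v2 / den                                # variance of the geometric mixture

a12 = mu2 / v2 - mu1 / v1
b12 = sp.Rational(1, 2) / v1 - sp.Rational(1, 2) / v2
c12 = sp.log(v1 / v2) / 2 + mu1**2 / (2 * v1) - mu2**2 / (2 * v2)

# OC_Gaussian: a12*m + b12*(v + m^2) + c12 = 0 becomes quadratic after clearing denominators.
num, _ = sp.fraction(sp.together(a12 * m + b12 * (v + m**2) + c12))
sols = sp.solve(sp.Eq(sp.expand(num), 0), a)

# Example 4: N(0,1) versus N(1,2); keep the root lying in (0,1).
vals = {mu1: 0, v1: 1, mu2: 1, v2: 2}
root = [s.subs(vals) for s in sols if 0 < float(s.subs(vals)) < 1][0]
m_o, v_o = m.subs(vals).subs(a, root), v.subs(vals).subs(a, root)
DC = sp.Rational(1, 2) * (m_o**2 + v_o - sp.log(v_o) - 1)   # KLD to N(0,1)
print(float(root), float(DC))   # ~0.421558..., ~0.115543...
```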
Let us report special cases with some illustrating examples.
  • First, let us consider the Gaussian subfamily with prescribed variance. When σ1² = σ2² = σ², we always have α* = ½, and the Chernoff information is
    D_C[p_{μ1,σ²} : p_{μ2,σ²}] = (μ2 − μ1)² / (8σ²).
    Notice that it amounts to one eighth of the squared Mahalanobis distance (see [60] for a detailed explanation).
  • Second, let us consider the Gaussian subfamily with prescribed mean. When μ1 = μ2 = μ, we get the optimal skewing value independent of the mean μ:
    α* = ( v2 − v1 − v1 log(v2/v1) ) / ( (v2 − v1) log(v2/v1) ),
    where v1 = σ1² and v2 = σ2² (the same expression as in Example 3, by translation invariance). The Chernoff information is
    D_C[p_{μ,v1} : p_{μ,v2}] = ½ ( log( (v2 − v1) / (v1 log(v2/v1)) ) − (v2 − v1 − v1 log(v2/v1)) / (v2 − v1) ).
  • Third, consider the Chernoff information between the standard normal distribution and another normal distribution. When (μ1, σ1²) = (0, 1) and (μ2, σ2²) = (μ, v), the optimality condition reduces, with w := v − 1, to the quadratic equation
    ( μ²w + μ²w² + v w² log v ) α² + ( 2μ² + 2μ²w + 2 v w log v − v w² ) α − ( μ² + μ²w + v w − v log v ) = 0,
    whose unique root in (0,1) is α*; solving it by the quadratic formula with a computer algebra system yields an explicit but lengthy closed-form expression of α* in μ and v.
Example 4.
Let us consider N ( μ 1 = 0 , σ 1 2 = 1 ) and N ( μ 2 = 1 , σ 2 2 = 2 ) . The Chernoff exponent is
α* = ( √(8 log 4 − 8 log 2 + 9) − 2 log 4 + 2 log 2 − 1 ) / ( 2 log 4 − 2 log 2 + 2 ) ≈ 0.4215580558605244,
and the Chernoff information admits a lengthy closed-form expression in log 2 and log 4 (obtained by symbolic computing) which evaluates numerically to
D_C[p_{0,1}, p_{1,2}] ≈ 0.1155433222682347.
Using the bisection search of [24] with ϵ = 10 8 takes 28 iterations, and we get
α * 0.42155805602669716 ,
and the Chernoff information is approximately 0.11554332226823472. Now, if we swap p_{μ1,σ1²} ↔ p_{μ2,σ2²}, we find α* ≈ 0.5784419439733028 (and 0.5784419439733028 + 0.42155805602669716 ≃ 1).
Notice that in general, we may evaluate how good an approximation α̃ of α* is by evaluating the deficiency of the optimality condition:
(θ2 − θ1)ᵀ η_{α̃} − F(θ2) + F(θ1).
Example 5.
Let us consider μ 1 = 1 , σ 1 2 = 3 and μ 2 = 5 and σ 2 2 = 5 . We get
α* = ( √(120 log 10 − 120 log 6 + 961) − 3 log 10 + 3 log 6 − 23 ) / ( 2 log 10 − 2 log 6 + 16 ) ≈ 0.4371453168322306
and the Chernoff information is reported in closed form and evaluated numerically as
0.5242883659200144 .
In comparison, the bisection algorithm of [24] with ϵ = 10 8 takes 28 iterations, and reports α * 0.43714531883597374 and the Chernoff information about
0.5242883659200137 .
Corollary 4.
The smallest enclosing left-sided Kullback–Leibler disk of n univariate Gaussian distributions can be calculated exactly in randomized linear time [56].

5.3. Fast Approximation of the Chernoff Information of Multivariate Gaussian Distributions

In general, the Chernoff information between d-variate Gaussians distributions is not known in closed-form formula when d > 1 , see for example [61,62,63]. We shall consider below some special cases:
  • When the Gaussians have the same covariance matrix Σ, the Chernoff information optimal skewing parameter is α* = ½ and the Chernoff information is
    D_C[p_{μ1,Σ}, p_{μ2,Σ}] = ⅛ Δ_Σ²(μ1, μ2),
    where Δ_Σ²(μ1, μ2) = (μ2 − μ1)ᵀ Σ^{−1} (μ2 − μ1) is the squared Mahalanobis distance. The Mahalanobis distance enjoys the following invariance under congruence transformations:
    Δ_Σ(μ1, μ2) = Δ_{AΣAᵀ}(Aμ1, Aμ2), ∀ A ∈ GL(d).
    Notice that we can rewrite the squared Mahalanobis distance as
    Δ_Σ²(μ1, μ2) = tr( Σ^{−1} (μ2 − μ1)(μ2 − μ1)ᵀ )
    using the cyclic property of the matrix trace. Then we check that
    Δ_{AΣAᵀ}²(Aμ1, Aμ2) = tr( A^{−ᵀ} Σ^{−1} A^{−1} A (μ2 − μ1)(μ2 − μ1)ᵀ Aᵀ ) = tr( Σ^{−1} (μ2 − μ1)(μ2 − μ1)ᵀ ) = Δ_Σ²(μ1, μ2).
  • The Chernoff information for the special case of centered multivariate Gaussian distributions was studied in [62]. The KLD between two centered Gaussians p_{μ,Σ1} and p_{μ,Σ2} is half of the matrix Burg distance:
    D_KL[p_{μ,Σ1} : p_{μ,Σ2}] = ½ ( log( det Σ2 / det Σ1 ) + tr( Σ2^{−1} Σ1 ) − d ) =: ½ D_Burg[Σ1 : Σ2].
    When d = 1, the Burg distance corresponds to the well-known Itakura–Saito divergence. The matrix Burg distance is a matrix spectral distance [62]:
    D_Burg[Σ1 : Σ2] = Σ_{i=1}^d ( λ_i − log λ_i − 1 ),
    where the λ_i's are the eigenvalues of Σ2^{−1} Σ1. The reverse KLD D_KL[p_{μ,Σ2} : p_{μ,Σ1}] = ½ D_Burg[Σ2 : Σ1] is obtained by replacing λ_i ← 1/λ_i:
    D_KL[p_{μ,Σ2} : p_{μ,Σ1}] = ½ Σ_{i=1}^d ( 1/λ_i + log λ_i − 1 ).
    More generally, the f-divergences between centered Gaussian distributions are always matrix spectral divergences [60] (see the numerical check after this list).
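The spectral formula of the second item can be checked numerically as follows (a small sketch with our own function names, assuming symmetric positive-definite covariance matrices given as NumPy arrays):

```python
import numpy as np

def kl_centered_gaussians(S1, S2):
    """D_KL[p_{0,S1} : p_{0,S2}] = (1/2) D_Burg[S1 : S2] from the spectrum of S2^{-1} S1."""
    lam = np.real(np.linalg.eigvals(np.linalg.solve(S2, S1)))
    return 0.5 * float(np.sum(lam - np.log(lam) - 1.0))

S1 = np.array([[2.0, 0.3], [0.3, 1.0]])
S2 = np.array([[1.0, 0.0], [0.0, 3.0]])
d = S1.shape[0]
# cross-check against the trace/log-det expression of the KLD
direct = 0.5 * (np.trace(np.linalg.solve(S2, S1))
                + np.log(np.linalg.det(S2) / np.linalg.det(S1)) - d)
print(kl_centered_gaussians(S1, S2), direct)   # the two values coincide
```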
Otherwise, for the general multivariate case, we implement the dichotomic search of Algorithm 1 in Algorithm 3 with the KLD between two multivariate Gaussian distributions expressed as
D_KL[p_{μ1,Σ1} : p_{μ2,Σ2}] = ½ Δ_{Σ2}²(μ1, μ2) + ½ D_Burg[Σ1 : Σ2]
= ½ ( tr( Σ2^{−1} Σ1 ) + log( det Σ2 / det Σ1 ) − d + (μ2 − μ1)ᵀ Σ2^{−1} (μ2 − μ1) ).
Algorithm 3 Dichotomic search for approximating the Chernoff information between two multivariate normal distributions p μ 1 , Σ 1 and p μ 2 , Σ 2 by approximating the optimal skewing parameter value α α * .
Entropy 24 01400 i003
Example 6.
Let d = 2, p_{μ1,Σ1} = p_{0,I} be the standard bivariate Gaussian distribution and p_{μ2,Σ2} be the bivariate Gaussian distribution with mean μ2 = [1 2]^⊤ and covariance matrix Σ2 = [[1, 1], [1, 2]]. Setting the numerical precision threshold to ϵ = 10^{−8}, the dichotomic search performs 28 split iterations and approximates α* by
α * 0.5825489424169064 .
The Chernoff information D C [ p 0 , I , p μ 2 , Σ 2 ] is approximated by 0.8827640697808525 .
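To make Algorithm 3 concrete, the following Maxima snippet is a minimal sketch of such a dichotomic search written for this text (it is not the article's reference implementation, whose code is available online; the helper names kldmvn, egeo and chernoffmvn are ad hoc). It assumes the convention where the skewing parameter α weights the first normal distribution along the exponential arc, so the reported α may correspond to 1 − α* depending on the convention used in Algorithm 3, while the Chernoff information value itself is unaffected.
/* Illustrative dichotomic search for the Chernoff information between two multivariate normals */
mattrace(M):=block([s:0], for i:1 thru length(M) do s:s+M[i,i], s)$
/* KLD between N(mu1,S1) and N(mu2,S2) */
kldmvn(mu1,S1,mu2,S2):=block([P2:invert(S2), dm:mu2-mu1, q],
  q: transpose(dm) . P2 . dm,
  (1/2)*(mattrace(P2 . S1) - log(determinant(S1)/determinant(S2)) - length(S1) + q[1,1]))$
/* e-geodesic point (pq)_alpha^G: linear interpolation of the natural parameters,
   with weight alpha on the first normal */
egeo(alpha,mu1,S1,mu2,S2):=block([P1:invert(S1), P2:invert(S2), Sa],
  Sa: invert(alpha*P1 + (1-alpha)*P2),
  [Sa . (alpha*(P1 . mu1) + (1-alpha)*(P2 . mu2)), Sa])$
chernoffmvn(mu1,S1,mu2,S2,eps):=block([lo:0.0, hi:1.0, a, g, gap],
  while hi-lo > eps do (
    a:(lo+hi)/2, g:egeo(a,mu1,S1,mu2,S2),
    gap: kldmvn(g[1],g[2],mu1,S1) - kldmvn(g[1],g[2],mu2,S2),
    /* if the mixture is KL-closer to the second normal, increase alpha (more weight on the first) */
    if gap > 0 then lo:a else hi:a),
  a:(lo+hi)/2, g:egeo(a,mu1,S1,mu2,S2),
  [float(a), float(kldmvn(g[1],g[2],mu1,S1))])$
/* Example 6: */
chernoffmvn(matrix([0],[0]), ident(2), matrix([1],[2]), matrix([1,1],[1,2]), 1e-8);
/* Chernoff information approximately 0.8828 (cf. Example 6); the returned alpha is about
   0.58 or about 0.42 depending on the skewing convention */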
The m-interpolation of multivariate Gaussian distributions p μ 1 , Σ 1 and p μ 2 , Σ 2 with respect to the mixture connection m is given by
γ p μ 1 , Σ 1 , p μ 2 , Σ 2 m ( α ) = p μ α m , Σ α m ,
where
μ α m = ( 1 α ) μ 1 + α μ 2 = : μ ¯ α , Σ α m = ( 1 α ) Σ 1 + α Σ 2 + ( 1 α ) μ 1 μ 1 + α μ 2 μ 2 μ ¯ α μ ¯ α .
The e-interpolation of multivariate Gaussian distributions p μ 1 , Σ 1 and p μ 2 , Σ 2 with respect to the exponential connection e is given by
γ p μ 1 , Σ 1 , p μ 2 , Σ 2 e ( α ) = p μ α e , Σ α e ,
where
μ α e = Σ α e ( 1 α ) Σ 1 1 μ 1 + α Σ 2 1 μ 2 , Σ α e = ( 1 α ) Σ 1 1 + α Σ 2 1 1 .
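One can check these formulas as follows (a short derivation sketch added here for clarity): the mixture connection interpolates linearly the moment parameters η = (E[x], E[x x^⊤]) = (μ, Σ + μμ^⊤), so that η_α = (1 − α) η1 + α η2 yields μ_α^m = μ̄_α and a second moment (1 − α)(Σ1 + μ1 μ1^⊤) + α (Σ2 + μ2 μ2^⊤), from which Σ_α^m is recovered by subtracting μ̄_α μ̄_α^⊤. Dually, the exponential connection interpolates linearly the natural parameters θ = (Σ^{-1} μ, −(1/2) Σ^{-1}), so that the precision matrices average as (Σ_α^e)^{-1} = (1 − α) Σ1^{-1} + α Σ2^{-1}, and the first component gives (Σ_α^e)^{-1} μ_α^e = (1 − α) Σ1^{-1} μ1 + α Σ2^{-1} μ2.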
In information geometry, both these e- and m-connections defined with respect to an exponential family are shown to be flat. These geodesics correspond to linear interpolations in the e -affine coordinate system θ and in the dual m coordinate system η , respectively.
Figure 7 displays the e-geodesic and the m-geodesic between two multivariate normal distributions. Notice that the Riemannian geodesic induced by the Levi–Civita metric connection (the average of the e- and m-connections) is not known in closed form for boundary value conditions. The expression of the Riemannian geodesic is known only for initial value conditions [64] (i.e., a starting point with a given vector direction).

5.4. Chernoff Information between Centered Multivariate Normal Distributions

The set
N_0 = { p_Σ(x) = (1/√(det(2πΣ))) exp( −(1/2) x^⊤ Σ^{-1} x ) : Σ ≻ 0 }
of centered multivariate normal distributions is a regular exponential family with natural parameter θ = Σ^{-1}, sufficient statistic t(x) = −(1/2) x x^⊤, log-normalizer F(θ) = −(1/2) log det θ, and auxiliary carrier term k(x) = −(d/2) log(2π). Family N_0 is also a multivariate scale family with scale matrices Σ^{1/2} (standard deviation σ in 1D).
Let ⟨A, B⟩ = tr(AB) denote the inner product between two symmetric matrices A and B. Then we can write the centered Gaussian density p_Σ(x) in the canonical form of exponential families:
p θ ( x ) = exp θ , t ( x ) F ( θ ) + k ( x ) .
The function log det of a positive-definite matrix is strictly concave [65], and hence we check that F(θ) is strictly convex. Furthermore, we have ∇_X log det X = X^{-1}, so that ∇_θ F(θ) = −(1/2) θ^{-1}.
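As a quick symbolic sanity check of this gradient identity (a small snippet written for this text, not from the original appendix), one can differentiate log det of a generic 2×2 symmetric positive-definite matrix in Maxima; the diagonal partial derivatives match the entries of the inverse, while the off-diagonal one picks up the usual factor 2 coming from the symmetric parameterization:
/* Check d/dX log det X = X^(-1) on a symmetric 2x2 matrix (illustrative snippet) */
X: matrix([a,b],[b,c])$
Xinv: invert(X)$
ratsimp(diff(log(determinant(X)), a) - Xinv[1,1]);   /* 0 */
ratsimp(diff(log(determinant(X)), c) - Xinv[2,2]);   /* 0 */
ratsimp(diff(log(determinant(X)), b) - 2*Xinv[1,2]); /* 0: factor 2 for the off-diagonal entry */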
The optimality condition equation of Chernoff best skewing parameter α * becomes:
θ 2 θ 1 , F ( θ 1 + α * ( θ 2 θ 1 ) ) = F ( θ 2 ) F ( θ 1 ) ,
1 2 tr ( ( θ 2 θ 1 ) ( θ 1 + α * ( θ 2 θ 1 ) ) 1 ) = 1 2 log det θ 2 det θ 1 ,
tr ( ( θ 2 θ 1 ) ( θ 1 + α * ( θ 2 θ 1 ) ) 1 ) = log det θ 2 det θ 1 ,
tr ( ( Σ 2 1 Σ 1 1 ) ( Σ 1 1 + α * ( Σ 2 1 Σ 1 1 ) ) 1 ) = log det Σ 1 det Σ 2 = log det Σ 1 Σ 2 1 .
When Σ 2 = s Σ 1 (and Σ 2 1 = 1 s Σ 1 1 ) for s > 0 and s 1 , we get a closed-form for α * using the fact that det I s = 1 s d and tr ( I ) = d for d-dimensional identity matrix I. Solving Equation (54) yields
α * ( s ) = s 1 log s ( s 1 ) log s ( 0 , 1 ) .
Therefore the Chernoff information between two scaled centered Gaussian distributions p μ , Σ and p μ , s Σ is available in closed form.
Proposition 11.
The Chernoff information between two scaled d-dimensional centered Gaussian distributions p_{μ,Σ} and p_{μ,sΣ} of N_μ (for s > 0, s ≠ 1) is available in closed form:
D_C[p_{μ,Σ}, p_{μ,sΣ}] = D_{B,α*}[p_{μ,Σ}, p_{μ,sΣ}] = (d/2) ( (s log s − s + 1)/(s − 1) + log( (s − 1)/(s log s) ) ),
where α* = (s − 1 − log s)/((s − 1) log s) ∈ (0, 1).
Notice that α*(p_{μ,Σ} : p_{μ,sΣ}) = 1 − α*(p_{μ,Σ} : p_{μ,(1/s)Σ}) and D_C[p_{μ,Σ}, p_{μ,sΣ}] = D_C[p_{μ,Σ}, p_{μ,(1/s)Σ}].
Example 7.
Consider μ1 = μ2 = 0 and Σ1 = I, Σ2 = (1/2) I. We find that α* = (2 log 2 − 1)/log 2, which is independent of the dimension of the matrices. The Chernoff information depends on the dimension:
D_C[p_{0,I}, p_{0,(1/2)I}] = d ( log 2 − log log 2 − 1 )/2.
Notice that when d = 1 , we have s = σ 2 2 σ 1 2 , and we recover a special case of the closed-form formula for the Chernoff information between univariate Gaussians.
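A quick numerical check of Proposition 11 and Example 7 in dimension d = 1 can be done in Maxima (an illustrative snippet added for this text, not from the article; the helper KLD is ad hoc and the skewing parameter weights the first density):
/* p = N(0,1), q = N(0,s) with s = 1/2 */
s: 1/2$
alphastar: (s-1-log(s))/((s-1)*log(s))$
/* geometric mixture (pq)_alpha^G: the precisions interpolate linearly */
va: 1/(alphastar*1 + (1-alphastar)/s)$
KLD(v1,v2):=(1/2)*(v1/v2 - log(v1/v2) - 1)$   /* KLD between N(0,v1) and N(0,v2) */
float([alphastar, KLD(va,1), KLD(va,s), (log(2)-log(log(2))-1)/2]);
/* approximately [0.5573, 0.0298, 0.0298, 0.0298]: the two KLDs coincide and match Example 7 */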
In [62], the following equation is reported for finding α * based on Equation (54):
OC_{Centered Gaussians}: ∑_{i=1}^d [ (1 − λ_i)/( α* + (1 − α*) λ_i ) + log λ_i ] = 0,
where the λ i ’s are generalized eigenvalues of Σ 1 Σ 2 1 (this excludes the case of all λ i ’s equal to one). The value of α * satisfying Equation (57) is unique. Let us notice that the product of two symmetric positive-definite matrices is not necessarily symmetric anymore. We can derive Equation (57) by expressing Equation (54) using the identity matrix I and matrix  Σ 2 1 2 Σ 1 Σ 2 1 2 .
Remark 9.
We can get closed-form solutions for α* and the corresponding Chernoff information in some particular cases. For example, when the dimension d = 2, we need to solve a quadratic equation to get α*. Thus, for d ≤ 4, we get a closed-form solution for α* by solving a polynomial equation characterizing the optimality condition, and obtain the Chernoff information in closed form as a byproduct.
Example 8.
Consider the Chernoff information between p 0 , I and p 0 , Λ with Λ = diag ( 1 , 2 , 3 , 4 ) . We get the exact Chernoff exponent value α * by taking the root of a quartic polynomial equation falling in ( 0 , 1 ) . By evaluating numerically this root, we find that α * 0.59694 and that the Chernoff information is D C [ p 0 , I , p 0 , Λ ] 0.22076 . See Appendix C for some symbolic computation code.

6. Chernoff Information between Densities of Different Exponential Families

Let
E 1 = { p θ = exp ( θ , t 1 ( x ) F 1 ( θ ) ) : θ Θ } ,
and
E 2 = { q θ = exp ( θ , t 2 ( x ) F 2 ( θ ) : θ Θ } ,
be two distinct exponential families, and consider the Chernoff information between the densities p θ 1 and q θ 2 . The exponential arc induced by p θ 1 and q θ 2 is
{ ( p θ 1 q θ 2 ) α G p θ 1 α q θ 2 1 α : α ( 0 , 1 ) } .
Let E 12 denote the exponential family with sufficient statistics ( t 1 ( x ) , t 2 ( x ) ), log-normalizer F 12 ( θ , θ′ ), and denote by Θ 12 its natural parameter space. Family E 12 is a product exponential family, and is therefore itself an exponential family. We have
( p θ 1 q θ 2 ) α G = exp ( t 1 ( x ) , t 2 ( x ) ) , ( α θ 1 , ( 1 α ) θ 2 ) F 12 ( α θ 1 , ( 1 α ) θ 2 ) .
Thus the induced LREF E p θ 1 q θ 2 with natural parameter space Θ p θ 1 q θ 2 can be interpreted as a 1D curved exponential family of the product exponential family E 12 .
The optimal skewing parameter α * is found by setting the derivative of F 12 ( α θ 1 , ( 1 − α ) θ 2 ) with respect to α to zero:
d d α F 12 ( α θ 1 , ( 1 α ) θ 2 ) = 0 .
Example 9.
Let E1 be the exponential family of exponential distributions
E1 = { e_λ(x) = λ exp(−λx) : λ ∈ (0, +∞) }
defined on the support X1 = (0, ∞), and let E2 be the exponential family of half-normal distributions
E2 = { h_σ(x) = √(2/(πσ²)) exp( −x²/(2σ²) ) : σ² > 0 }
with support X2 = (0, ∞).
The product exponential family corresponds to the singly truncated normal family [50], which is non-regular (i.e., its natural parameter space is not topologically an open set):
Θ12 = ( R × R_{++} ) ∪ Θ0,
with Θ 0 = { ( θ , 0 ) : θ < 0 } (the part corresponding to the exponential family of exponential distributions). This exponential family E 12 = { p θ 1 , θ 2 } of singly truncated normal distributions is also non-steep [50]. The log-normalizer is
F12(θ1, θ2) = (1/2) log( π/θ2 ) + log Φ( θ1/√(2θ2) ) + θ1²/(4θ2),
where θ1 = μ/σ², θ2 = 1/(2σ²), and Φ denotes the cumulative distribution function of the standard normal distribution. Function F12 is of class C¹ on Θ12 (see Proposition 3.1 of [50]), with F12(θ, 0) = −log(−θ) for θ < 0.
Notice that the KLD between an exponential distribution and a half-normal distribution is +∞ since the corresponding definite integral diverges (hence D_KL[e_λ : h_σ] is not equivalent to a Bregman divergence, and Θ_{e_{θ1} h_{θ2}} is not open at 1), but the reverse KLD between a half-normal distribution and an exponential distribution is available in closed form (obtained using symbolic computing):
D_KL[h_σ : e_λ] = ( √8 σλ − √π ( 1 + log( πλ²σ²/2 ) ) ) / ( 2√π ).
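This expression can be cross-checked against the expectation E_{h_σ}[ log( h_σ/e_λ ) ] = λσ√(2/π) − 1/2 − log λ + (1/2) log( 2/(πσ²) ) (a simplification worked out here for illustration; see also Listing A5 in Appendix C), for instance at σ = λ = 1:
/* Illustrative numerical cross-check of the closed form for KL[half-normal : exponential] */
closedform(sigma,lam):=(sqrt(8)*sigma*lam - sqrt(%pi)*(1+log(%pi*lam**2*sigma**2/2)))/(2*sqrt(%pi))$
simplified(sigma,lam):=lam*sigma*sqrt(2/%pi) - 1/2 - log(lam) + (1/2)*log(2/(%pi*sigma**2))$
float([closedform(1,1), simplified(1,1)]);  /* both approximately 0.0721 */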
Figure 8 illustrates the domain of the singly truncated normal distributions and displays an exponential arc between an exponential distribution and a half-normal distribution. Notice that we could also have considered a similar but different example by taking the exponential family of Rayleigh distributions, which exhibits an additional carrier term k(x).
The Bhattacharyya α-skewed coefficient calculated using symbolic computing (see Appendix C) is
ρ_α[h_σ : e_λ] = ρ_{1−α}[e_λ : h_σ] = (2/(πσ²))^{α/2} λ^{1−α} (σ/√α) √(π/2) exp( (1−α)² λ² σ² / (2α) ) erfc( (1−α) λ σ / √(2α) ),
where erfc = 1 − erf denotes the complementary error function (erf being the error function).

7. Conclusions

In this work, we revisited the Chernoff information [2] (1952), which was originally introduced to upper bound the Bayes error in binary hypothesis testing. A general characterization of the Chernoff information between two arbitrary probability measures was given in [11] (Theorem 32) by considering Rényi divergences, which can be interpreted as scaled skewed Bhattacharyya divergences. Since its inception, the Chernoff information has proven useful as a statistical divergence (the Chernoff divergence) in many applications ranging from information fusion to quantum metrology due to its empirical robustness property [19]. Informally, one observes empirically that in practice the skewed Bhattacharyya divergence is more stable around the Chernoff exponent α* than in other parts of the range (0, 1). By considering the maximal extension of the exponential arc joining two densities p and q of a Lebesgue space L¹(μ), we built full likelihood ratio exponential families [10] E_pq (LREFs) in Section 2. When the LREF E_pq is a regular exponential family (with coinciding supports of p and q), both the forward and reverse Kullback–Leibler divergences are finite and can be rewritten as finite Bregman divergences induced by the log-normalizer F_pq of E_pq, which amounts to minus the skewed Bhattacharyya divergences. Since log-normalizers of exponential families are strictly convex, we deduced that the skewed Bhattacharyya divergences are strictly concave in the skewing parameter, and hence that their maximization yielding the Chernoff information admits a unique maximizer. As a byproduct, this geometric characterization in L¹(μ) allowed us to prove that the intersection of an e-geodesic with an m-bisector is unique in dually flat subspaces of L¹(μ), and similarly that the intersection of an m-geodesic with an e-bisector is unique (Proposition 8). We then considered the exponential families of univariate and multivariate normal distributions: We reported closed-form solutions for the Chernoff information between univariate normal distributions and between centered normal distributions with scaled covariance matrices, and showed how to implement efficiently a dichotomic search for approximating the Chernoff information between two multivariate normal distributions (Algorithm 3). Table 1 summarizes the various optimality conditions characterizing the Chernoff exponent. Finally, inspired by this study, we defined in Section 4 the forward and reverse Bregman–Chernoff divergences [66], and showed how these divergences are related to the capacity of a discrete memoryless channel and to the minimax redundancy of universal coding in information theory [13].
Additional material including Maxima and Java® snippet codes is available online at https://franknielsen.github.io/ChernoffInformation/index.html (accessed on 30 July 2022).

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

I would like to thank Rob Brekelmans for many fruitful discussions on likelihood ratio exponential families and related topics. I also warmly thank the three Reviewers for the careful and insightful review of this manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Background on Statistical Divergences

We introduce some statistical dissimilarities like the Kullback–Leibler divergence, the Bhattacharyya distance, or the Hellinger divergence, which have proven useful, among others, in characterizing or bounding the probability of error in Bayesian statistical hypothesis testing [4,25,67].
The Kullback–Leibler divergence [13] (KLD) between two probability measures (PMs) P and Q is defined as
D_KL[P : Q] = ∫_X log( dP/dQ ) dP if P ≪ Q, and D_KL[P : Q] = +∞ if P ⋠ Q.
Two PMs P and Q are mutually singular when there exists an event A A such that P ( A ) = 0 and Q ( X A ) = 0 . Mutually singular measures P and Q are notationally written as P Q . Let P and Q be two non-singular probability measures on ( X , A ) dominated by a common σ -finite measure μ , and denote by p = d P d μ and q = d Q d μ their Radon–Nikodym densities with respect to μ . Then the KLD between P and Q can be calculated equivalently by the KLD between their densities as follows:
D KL [ P : Q ] = D KL [ p : q ] = X p log p q d μ .
It can be shown that D_KL[p : q] is independent of the chosen dominating measure μ [4], and thus when P, Q ≪ μ, we write for short D_KL[P : Q] = D_KL[p : q]. Although the dominating measure μ can always be set to μ = (P + Q)/2, it is often chosen either as the Lebesgue measure μ_L for continuous sample spaces R^d (with the σ-algebra A = B(R^d) of Borel sets) or as the counting measure μ_# for discrete sample spaces (with the σ-algebra A of power sets). The KLD is not a metric distance because it is asymmetric and does not satisfy the triangle inequality.
Let supp(μ) = cl{A ∈ A : μ(A) ≠ 0} denote the support of a Radon positive measure [1] μ, where cl denotes the topological closure operation. Notice that D_KL[p : q] = +∞ when the definite integral of Equation (A1) diverges (e.g., the KLD between a standard Cauchy distribution and a standard normal distribution is +∞, but the KLD between a standard normal distribution and a standard Cauchy distribution is finite), and D_KL[P : Q] = +∞ when the probability measures have disjoint supports (P ⊥ Q). Thus, when the supports of P and Q are distinct but not nested, both the forward KLD D_KL[P : Q] and the reverse KLD D_KL[Q : P] are infinite.
Let f α ( u ) = u α for α R . The functions f α ( u ) are convex for α R [ 0 , 1 ] and concave for α [ 0 , 1 ] . Thus, we can define the f-divergences [59,68]
I f α [ p : q ] = p f α ( q / p ) d μ = p 1 α q α d μ ,
for α ∈ R ∖ [0, 1], and I_{−f_α}[p : q] = −∫ p^{1−α} q^α dμ for α ∈ (0, 1) (or, equivalently, take the convex generator h_α(u) = −u^α for α ∈ (0, 1)). Notice that the conjugate f-divergence is obtained for the generator f*_α(u) = u f_α(1/u) = u^{1−α}: I_{f_α}[q : p] = I_{f*_α}[p : q]. By Jensen's inequality, the f-divergences are lower bounded by f(1). Thus, I_{h_α}[p : q] ≥ h_α(1) = −1. Since f-divergences are upper bounded by f(0) + f*(0), we have that I_{h_α}[p : q] < 0 for α ∈ (0, 1). This gives another proof that the Bhattacharyya coefficient ρ_α[p : q] = −I_{h_{1−α}}[p : q] is bounded between 0 and 1 since the I_{h_{1−α}}-divergence is bounded between −1 and 0. Moreover, Ali and Silvey [59] further defined the (f, g)-divergences as I_{f,g}[p : q] = g(I_f[p : q]) for a strictly monotonically increasing function g(v). Letting g(v) = −log(−v) (a strictly increasing function on (−1, 0)), we get that the (h_{1−α}, g)-divergences are the Bhattacharyya distances for α ∈ (0, 1). However, the Chernoff information is not an f-divergence despite the fact that Bhattacharyya distances are Ali–Silvey (f, g)-divergences because of the maximization criterion [59] of Equation (4).
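As a concrete illustration of these bounds (a small worked example added for this text, not from the original appendix), consider the two exponential densities p(x) = e^{−x} and q(x) = 2 e^{−2x} on (0, ∞): we get ρ_α[p : q] = ∫ p^α q^{1−α} dμ = 2^{1−α}/(2 − α), which indeed lies in (0, 1) for α ∈ (0, 1). In Maxima:
/* Bhattacharyya coefficient between two exponential densities (illustrative) */
assume(alpha>0, alpha<1)$
/* Maxima may ask for the sign of 2-alpha: it is positive since alpha < 1 */
ratsimp(integrate(exp(-alpha*x) * (2*exp(-2*x))**(1-alpha), x, 0, inf));
/* result: 2^(1-alpha)/(2-alpha), which lies in (0,1) for alpha in (0,1) */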
We refer the reader to [69] (Chapter 14), [70] (Figure 1) and [71] (Figure 3) for other statistical distances and statistical dissimilarities with their connections.

Appendix B. Exponential Family of Univariate Gaussian Distributions

Consider the family of univariate normal distributions:
N = { p_{μ,σ²}(x) = (1/√(2πσ²)) exp( −(x − μ)²/(2σ²) ) : μ ∈ R, σ² > 0 }.
Let λ = ( λ 1 = μ , λ 2 = σ 2 ) denote the mean-variance parameterization, and consider the sufficient statistic vector t ( x ) = ( x , x 2 ) . Then the densities of N can be written in the canonical form of exponential families:
p λ ( x ) = exp θ ( λ ) , t ( x ) F ( θ ) ,
where θ(λ) = ( λ1/λ2, −1/(2λ2) ) and the log-normalizer is
F(θ) = −θ1²/(4θ2) + (1/2) log( −π/θ2 ).
The dual moment parameterization is η(λ) = E_{p_λ}[t(x)] = ( λ1, λ1² + λ2 ), and the convex conjugate is:
F*(η) = sup_{θ ∈ Θ} { ⟨θ, η⟩ − F(θ) } = −(1/2) log( 2πe (η2 − η1²) ).
We check that the convex conjugate coincides with the negentropy [72]:
F*(η(λ)) = −h[p_λ].
The conversion formulæ between the dual natural/moment parameters and the ordinary parameters are given by:
θ(λ) = ( λ1/λ2, −1/(2λ2) ),
λ(θ) = ( −θ1/(2θ2), −1/(2θ2) ),
η(λ) = ( λ1, λ1² + λ2 ),
λ(η) = ( η1, η2 − η1² ),
η(θ) = ( E[x], E[x²] ) = ∇F(θ) = ( −θ1/(2θ2), −1/(2θ2) + θ1²/(4θ2²) ),
θ(η) = ∇F*(η) = ( η1/(η2 − η1²), −1/(2(η2 − η1²)) ).
We check that
D_KL[p_λ : p_{λ′}] = (1/2) ( (λ1′ − λ1)²/λ2′ + λ2/λ2′ − log(λ2/λ2′) − 1 ) = B_F(θ(λ′) : θ(λ)) = B_{F*}(η(λ) : η(λ′)) = Y_{F,F*}(θ(λ′) : η(λ)) = Y_{F*,F}(η(λ) : θ(λ′)),
where B F and B F * are the dual Bregman divergences and Y F , F * and Y F * , F are the dual Fenchel–Young divergences.
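As a small numerical illustration of these identities (a check written for this text, reusing the natural parameterization above; the helper names are ad hoc), one may verify in Maxima that the KLD between p_{0,1} and p_{1,2} coincides with the Bregman divergence B_F(θ(λ′) : θ(λ)):
/* Check numerically that KL[p_{0,1} : p_{1,2}] = B_F(theta(1,2) : theta(0,1)) (illustrative) */
F(t1,t2):=(-t1**2)/(4*t2)+(1/2)*log(-%pi/t2)$
theta(mu,v):=[mu/v, -1/(2*v)]$
gradF(t1,t2):=[-t1/(2*t2), t1**2/(4*t2**2)-1/(2*t2)]$  /* = eta = (E[x], E[x^2]) */
BF(t,tp):=block([g:gradF(tp[1],tp[2])],
  F(t[1],t[2]) - F(tp[1],tp[2]) - (t[1]-tp[1])*g[1] - (t[2]-tp[2])*g[2])$
KLD(mu1,v1,mu2,v2):=(1/2)*((mu2-mu1)**2/v2 + v1/v2 - log(v1/v2) - 1)$
float([KLD(0,1,1,2), BF(theta(1,2), theta(0,1))]);  /* both approximately 0.3466 */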

Appendix C. Code Snippets in MAXIMA

Code for plotting Figure 1.
Listing A1: Plot the cumulant function of a log ratio exponential family induced by two normal distributions.
  • varalpha(v1,v2,alpha):=(v1*v2)/((1-alpha)*v1+alpha*v2)$
  • mualpha(mu1,v1,mu2,v2,alpha):=(alpha*mu1*v2+(1-alpha)*mu2*v1)/((1-alpha)*v1+alpha*v2)$
  • assume(v1>0)$assume(v2>0)$
  • theta1(mu,v):=mu/v$
  • theta2(mu,v):=-1/(2*v)$;
  • F(theta1,theta2):=((-theta1**2)/(4*theta2))+(1/2)*log(-%pi/theta2)$
  • JF(alpha,theta1,theta2,theta1p,theta2p):=alpha*F(theta1,theta2)+(1-alpha)*F(theta1p,theta2p)-F(alpha*theta1+(1-alpha)*theta1p,alpha*theta2+(1-alpha)*theta2p);
  •  
  • m1:0;v1:1;m2:1;v2:2;
  •  
  • plot2d([JF(alpha,theta1(m1,v1),theta2(m1,v1),theta1(m2,v2),theta2(m2,v2)),
  • -JF(alpha,theta1(m1,v1),theta2(m1,v1),theta1(m2,v2),theta2(m2,v2)),
  •  [discrete,[[0.4215580558605244,-0.15],[0.4215580558605244,0.15]]],
  • [discrete, [0.4215580558605244], [0.1155433222682347]],
  • [discrete, [0.4215580558605244], [-0.1155433222682347]]
  • ],
  • [alpha,0,1], [xlabel,"alpha"], [ylabel,"F_{pq}(alpha)=-D_{B,alpha}[p:q]"],
  • [style,  [lines,1,1],[lines,1,2],
  •  [lines,2,0], [points, 3,3],[points, 3,3]],[legend, "skew Bhattacharyya D_{B,alpha}[p:q]","LREF log-normalizer F_{pq}(alpha)","","","" ],
  • [color, blue, red,   black, black,black],[point_type,asterisk]);
Code for calculating the Chernoff information between two univariate Gaussian distributions (Proposition 10):
Listing A2: Calculate symbolically the exact Chernoff information between two univariate normal distributions.
  • varalpha(v1,v2,alpha):=(v1*v2)/((1-alpha)*v1+alpha*v2)$
  • mualpha(mu1,v1,mu2,v2,alpha):=(alpha*mu1*v2+(1-alpha)*mu2*v1)/((1-alpha)*v1+alpha*v2)$
  •  
  • /* Kullback--Leibler divergence */
  • KLD(mu1,v1,mu2,v2):=(1/2)*((((mu2-mu1)**2)/v2)+(v1/v2)-log(v1/v2)-1)$
  •  
  • assume(alpha>0)$assume(alpha<1)$
  • assume(v1>0)$assume(v2>0)$
  • theta1(mu,v):=mu/v$
  • theta2(mu,v):=-1/(2*v)$;
  • F(theta1,theta2):=-theta1**2/(4*theta2)+0.5*log(-1/theta2)$
  •  
  • eq: (theta1(mu1,v1)-theta1(mu2,v2))*mualpha(mu1,v1,mu2,v2,alpha)+(theta2(mu1,v1)-theta2(mu2,v2))*(mualpha(mu1,v1,mu2,v2,alpha)**2+varalpha(v1,v2,alpha))-F(theta1(mu1,v1),theta2(mu1,v1))+F(theta1(mu2,v2),theta2(mu2,v2));
  • solalpha: solve(eq,alpha)$
  • alphastar:rhs(solalpha[1]);
  •  
  • ChernoffInformation: KLD(mualpha(mu1,v1,mu2,v2,alphastar),varalpha(v1,v2,alphastar),mu1,v1)$
  • print("Chernoff information=")$ratsimp(ChernoffInformation);
Example of a plot of the α-Bhattacharyya distance for α ∈ [0, 1] when p and q are two normal distributions.
Listing A3: Plot the skewed Bhattacharyya divergences between two normal distributions as an equivalent skewed Jensen divergence between two normal distributions.
  • varalpha(v1,v2,alpha):=(v1*v2)/((1-alpha)*v1+alpha*v2)$
  • mualpha(mu1,v1,mu2,v2,alpha):=(alpha*mu1*v2+(1-alpha)*mu2*v1)/((1-alpha)*v1+alpha*v2)$
  • assume(v1>0)$assume(v2>0)$
  • theta1(mu,v):=mu/v$
  • theta2(mu,v):=-1/(2*v)$;
  • F(theta1,theta2):=((-theta1**2)/(4*theta2))+(1/2)*log(-%pi/theta2)$
  • JF(alpha,theta1,theta2,theta1p,theta2p):=alpha*F(theta1,theta2)+(1-alpha)*F(theta1p,theta2p)-F(alpha*theta1+(1-alpha)*theta1p,alpha*theta2+(1-alpha)*theta2p);
  • m1:0;v1:1;m2:1;v2:2;
  • plot2d(JF(alpha,theta1(m1,v1),theta2(m1,v1),theta1(m2,v2),theta2(m2,v2)),[alpha,0,1]);
Example which calculates exactly the Chernoff exponent between two centered 4D Gaussians by solving the polynomial roots of the Chernoff optimal condition:
Listing A4: Calculate the Chernoff information between two 4D centered normal distributions based on their eigenvalues.
  • assume(l1>0);assume(l2>0);assume(l3>0);assume(l4>0);
  • assume(alpha>0);assume(alpha<1);
  • l1:1;l2:2;l3:3;l4:4;
  • eq: (1-l1)/(alpha+(1-alpha)*l1)+ (1-l2)/(alpha+(1-alpha)*l2)+ (1-l3)/(alpha+(1-alpha)*l3)+ (1-l4)/(alpha+(1-alpha)*l4) + log(l1)+log(l2)+log(l3)+log(l4);
  • solve(eq,alpha);
  • sol:float(%);
  • realpart(sol);imagpart(sol);
  • /* alpha=0.5969427599369763 */
Example of choosing two different exponential families: The half-normal distributions and the exponential distributions:
Listing A5: Calculate symbolically the Kullback–Leibler divergence and the Bhattacharyya coefficient between a half normal distribution and an exponential distribution.
  • assume(sigma>0);
  • halfnormal(x,sigma):=(sqrt(2)/(sqrt(%pi*sigma**2)))*exp(-x**2/(2*sigma**2));
  • assume(lambda>0);
  • exponential(x,lambda):=lambda*exp(-lambda*x);
  • /* KLD diverges */
  • integrate(exponential(x,lambda)*log(exponential(x,lambda)/halfnormal(x,sigma)),x,0,inf);
  • /* KLD converges */
  • integrate(halfnormal(x,sigma)*log(halfnormal(x,sigma)/exponential(x,lambda)),x,0,inf);
  • /* Bhattacharyya coefficient */
  • assume(alpha>0);
  • assume(alpha<1);
  • integrate( (halfnormal(x,sigma)**alpha) * (exponential(x,lambda)**(1-alpha)),x,0,inf);

References

  1. Keener, R.W. Theoretical Statistics: Topics for a Core Course; Springer Science & Business Media: New York, NY, USA, 2010. [Google Scholar]
  2. Chernoff, H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat. 1952, 23, 493–507. [Google Scholar] [CrossRef]
  3. Csiszár, I. A class of measures of informativity of observation channels. Period. Math. Hung. 1972, 2, 191–213. [Google Scholar] [CrossRef]
  4. Torgersen, E. Comparison of Statistical Experiments; Cambridge University Press: Cambridge, UK, 1991; Volume 36. [Google Scholar]
  5. Audenaert, K.M.; Calsamiglia, J.; Munoz-Tapia, R.; Bagan, E.; Masanes, L.; Acin, A.; Verstraete, F. Discriminating states: The quantum Chernoff bound. Phys. Rev. Lett. 2007, 98, 160501. [Google Scholar] [PubMed] [Green Version]
  6. Audenaert, K.M.; Nussbaum, M.; Szkoła, A.; Verstraete, F. Asymptotic error rates in quantum hypothesis testing. Commun. Math. Phys. 2008, 279, 251–283. [Google Scholar] [CrossRef] [Green Version]
  7. Bhattacharyya, A. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 1943, 35, 99–109. [Google Scholar]
  8. Nielsen, F.; Boltz, S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466. [Google Scholar] [CrossRef] [Green Version]
  9. Grünwald, P.D. The Minimum Description Length Principle; MIT Press: Cambridge, MA, USA, 2007. [Google Scholar]
  10. Grünwald, P.D. Information-Theoretic Properties of Exponential Families. In The Minimum Description Length Principle; MIT Press: Cambridge, MA, USA, 2007; pp. 623–650. [Google Scholar]
  11. Van Erven, T.; Harremos, P. Rényi divergence and Kullback–Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820. [Google Scholar] [CrossRef] [Green Version]
  12. Nakiboğlu, B. The Rényi capacity and center. IEEE Trans. Inf. Theory 2018, 65, 841–860. [Google Scholar] [CrossRef] [Green Version]
  13. Cover, T.M. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 1999. [Google Scholar]
  14. Borade, S.; Zheng, L. I-projection and the geometry of error exponents. In Proceedings of the Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 27–29 September 2006. [Google Scholar]
  15. Boyer, R.; Nielsen, F. On the error exponent of a random tensor with orthonormal factor matrices. In International Conference on Geometric Science of Information; Springer: Cham, Switzerland, 2017; pp. 657–664. [Google Scholar]
  16. D’Costa, A.; Ramachandran, V.; Sayeed, A.M. Distributed classification of Gaussian space-time sources in wireless sensor networks. IEEE J. Sel. Areas Commun. 2004, 22, 1026–1036. [Google Scholar] [CrossRef]
  17. Yu, N.; Zhou, L. Comments on and Corrections to “When Is the Chernoff Exponent for Quantum Operations Finite?”. IEEE Trans. Inf. Theory 2022, 68, 3989–3990. [Google Scholar] [CrossRef]
  18. Konishi, S.; Yuille, A.L.; Coughlan, J.; Zhu, S.C. Fundamental bounds on edge detection: An information theoretic evaluation of different edge cues. In Proceedings of the 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149), Fort Collins, CO, USA, 23–25 June 1999; IEEE: Piscataway, NJ, USA, 1999; Volume 1, pp. 573–579. [Google Scholar]
  19. Julier, S.J. An empirical study into the use of Chernoff information for robust, distributed fusion of Gaussian mixture models. In Proceedings of the 2006 9th International Conference on Information Fusion, Florence, Italy, 10–13 July 2006; IEEE: Piscataway, NJ, USA, 2006; pp. 1–8. [Google Scholar]
  20. Kakizawa, Y.; Shumway, R.H.; Taniguchi, M. Discrimination and clustering for multivariate time series. J. Am. Stat. Assoc. 1998, 93, 328–340. [Google Scholar] [CrossRef]
  21. Dutta, S.; Wei, D.; Yueksel, H.; Chen, P.Y.; Liu, S.; Varshney, K. Is there a trade-off between fairness and accuracy? A perspective using mismatched hypothesis testing. In Proceedings of the 37th International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 2803–2813. [Google Scholar]
  22. Agarwal, S.; Varshney, L.R. Limits of deepfake detection: A robust estimation viewpoint. arXiv 2019, arXiv:1905.03493. [Google Scholar]
  23. Maherin, I.; Liang, Q. Radar sensor network for target detection using Chernoff information and relative entropy. Phys. Commun. 2014, 13, 244–252. [Google Scholar] [CrossRef]
  24. Nielsen, F. An information-geometric characterization of Chernoff information. IEEE Signal Process. Lett. 2013, 20, 269–272. [Google Scholar] [CrossRef] [Green Version]
  25. Nielsen, F. Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means. Pattern Recognit. Lett. 2014, 42, 25–34. [Google Scholar] [CrossRef] [Green Version]
  26. Westover, M.B. Asymptotic geometry of multiple hypothesis testing. IEEE Trans. Inf. Theory 2008, 54, 3327–3329. [Google Scholar] [CrossRef]
  27. Nielsen, F. Hypothesis testing, information divergence and computational geometry. In International Conference on Geometric Science of Information; Springer: Berlin/Heidelberg, Germany, 2013; pp. 241–248. [Google Scholar]
  28. Leang, C.C.; Johnson, D.H. On the asymptotics of M-hypothesis Bayesian detection. IEEE Trans. Inf. Theory 1997, 43, 280–282. [Google Scholar] [CrossRef]
  29. Cena, A.; Pistone, G. Exponential statistical manifold. Ann. Inst. Stat. Math. 2007, 59, 27–56. [Google Scholar] [CrossRef]
  30. Barndorff-Nielsen, O. Information and Exponential Families: In Statistical Theory; John Wiley & Sons: Hoboken, NJ, USA, 2014. [Google Scholar]
  31. Brekelmans, R.; Nielsen, F.; Makhzani, A.; Galstyan, A.; Steeg, G.V. Likelihood Ratio Exponential Families. arXiv 2020, arXiv:2012.15480. [Google Scholar]
  32. De Andrade, L.H.; Vieira, F.L.; Vigelis, R.F.; Cavalcante, C.C. Mixture and exponential arcs on generalized statistical manifold. Entropy 2018, 20, 147. [Google Scholar] [CrossRef] [Green Version]
  33. Siri, P.; Trivellato, B. Minimization of the Kullback–Leibler Divergence over a Log-Normal Exponential Arc. In International Conference on Geometric Science of Information; Springer: Cham, Switzerland, 2019; pp. 453–461. [Google Scholar]
  34. Azoury, K.S.; Warmuth, M.K. Relative loss bounds for on-line density estimation with the exponential family of distributions. Mach. Learn. 2001, 43, 211–246. [Google Scholar] [CrossRef] [Green Version]
  35. Collins, M.; Dasgupta, S.; Schapire, R.E. A generalization of principal components analysis to the exponential family. Adv. Neural Inf. Process. Syst. 2001, 14, 617–624. [Google Scholar]
  36. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6. [Google Scholar] [CrossRef] [Green Version]
  37. Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 55, 2882–2904. [Google Scholar] [CrossRef] [Green Version]
  38. Sundberg, R. Statistical Modelling by Exponential Families; Cambridge University Press: Cambridge, UK, 2019; Volume 12. [Google Scholar]
  39. Nielsen, F.; Okamura, K. On f-divergences between Cauchy distributions. arXiv 2021, arXiv:2101.12459. [Google Scholar]
  40. Chyzak, F.; Nielsen, F. A closed-form formula for the Kullback–Leibler divergence between Cauchy distributions. arXiv 2019, arXiv:1905.10965. [Google Scholar]
  41. Huzurbazar, V.S. Exact forms of some invariants for distributions admitting sufficient statistics. Biometrika 1955, 42, 533–537. [Google Scholar] [CrossRef]
  42. Burbea, J.; Rao, C. On the convexity of some divergence measures based on entropy functions. IEEE Trans. Inf. Theory 1982, 28, 489–495. [Google Scholar] [CrossRef]
  43. Chen, P.; Chen, Y.; Rao, M. Metrics defined by Bregman divergences: Part 2. Commun. Math. Sci. 2008, 6, 927–948. [Google Scholar] [CrossRef] [Green Version]
  44. Nielsen, F. On the Jensen–Shannon symmetrization of distances relying on abstract means. Entropy 2019, 21, 485. [Google Scholar] [CrossRef] [Green Version]
  45. Han, Q.; Kato, K. Berry–Esseen bounds for Chernoff-type nonstandard asymptotics in isotonic regression. Ann. Appl. Probab. 2022, 32, 1459–1498. [Google Scholar] [CrossRef]
  46. Neal, R.M. Annealed importance sampling. Stat. Comput. 2001, 11, 125–139. [Google Scholar] [CrossRef]
  47. Grosse, R.B.; Maddison, C.J.; Salakhutdinov, R. Annealing between distributions by averaging moments. In Advances in Neural Information Processing Systems 26, Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA, 5–10 December 2013; Citeseer: La Jolla, CA, USA, 2013; pp. 2769–2777. [Google Scholar]
  48. Takenouchi, T. Parameter Estimation with Generalized Empirical Localization. In International Conference on Geometric Science of Information; Springer: Cham, Switzerland, 2019; pp. 368–376. [Google Scholar]
  49. Rockafellar, R.T. Conjugates and Legendre transforms of convex functions. Can. J. Math. 1967, 19, 200–205. [Google Scholar] [CrossRef]
  50. Del Castillo, J. The singly truncated normal distribution: A non-steep exponential family. Ann. Inst. Stat. Math. 1994, 46, 57–66. [Google Scholar] [CrossRef] [Green Version]
  51. Amari, S.I. Information Geometry and Its Applications; Springer: Tokyo, Japan, 2016; Volume 194. [Google Scholar]
  52. Boissonnat, J.D.; Nielsen, F.; Nock, R. Bregman Voronoi diagrams. Discret. Comput. Geom. 2010, 44, 281–307. [Google Scholar] [CrossRef] [Green Version]
  53. Lê, H.V. Statistical manifolds are statistical models. J. Geom. 2006, 84, 83–93. [Google Scholar] [CrossRef]
  54. Nielsen, F. On a Variational Definition for the Jensen–Shannon Symmetrization of Distances Based on the Information Radius. Entropy 2021, 23, 464. [Google Scholar] [CrossRef]
  55. Nock, R.; Nielsen, F. Fitting the smallest enclosing Bregman ball. In Proceedings of the European Conference on Machine Learning, Porto, Portugal, 3–7 October 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 649–656. [Google Scholar]
  56. Nielsen, F.; Nock, R. On the smallest enclosing information disk. Inf. Process. Lett. 2008, 105, 93–97. [Google Scholar] [CrossRef] [Green Version]
  57. Costa, R. Information Geometric Probability Models in Statistical Signal Processing. Ph.D. Thesis, University of Rhode Island, Kingston, RI, USA, 2016. [Google Scholar]
  58. Nielsen, F.; Garcia, V. Statistical exponential families: A digest with flash cards. arXiv 2009, arXiv:0911.4863. [Google Scholar]
  59. Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B (Methodol.) 1966, 28, 131–142. [Google Scholar] [CrossRef]
  60. Nielsen, F.; Okamura, K. A note on the f-divergences between multivariate location-scale families with either prescribed scale matrices or location parameters. arXiv 2022, arXiv:2204.10952. [Google Scholar]
  61. Athreya, A.; Fishkind, D.E.; Tang, M.; Priebe, C.E.; Park, Y.; Vogelstein, J.T.; Levin, K.; Lyzinski, V.; Qin, Y. Statistical inference on random dot product graphs: A survey. J. Mach. Learn. Res. 2017, 18, 8393–8484. [Google Scholar]
  62. Li, B.; Wei, S.; Wang, Y.; Yuan, J. Topological and algebraic properties of Chernoff information between Gaussian graphs. In Proceedings of the 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 2–5 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 670–675. [Google Scholar]
  63. Tang, M.; Priebe, C.E. Limit theorems for eigenvectors of the normalized Laplacian for random graphs. Ann. Stat. 2018, 46, 2360–2415. [Google Scholar] [CrossRef] [Green Version]
  64. Calvo, M.; Oller, J.M. An explicit solution of information geodesic equations for the multivariate normal model. Stat. Risk Model. 1991, 9, 119–138. [Google Scholar] [CrossRef]
  65. Boyd, S.P.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
  66. Chen, P.; Chen, Y.; Rao, M. Metrics defined by Bregman divergences. Commun. Math. Sci. 2008, 6, 915–926. [Google Scholar] [CrossRef] [Green Version]
  67. Kailath, T. The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans. Commun. Technol. 1967, 15, 52–60. [Google Scholar] [CrossRef]
  68. Csiszar, I. Eine information’s theoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizitat von Markoschen Ketten. Publ. Math. Inst. Hung. Acad. Sc. 1963, 3, 85–107. [Google Scholar]
  69. Deza, M.M.; Deza, E. Encyclopedia of distances. In Encyclopedia of Distances; Springer: Berlin/Heidelberg, Germany, 2009; pp. 1–583. [Google Scholar]
  70. Gibbs, A.L.; Su, F.E. On choosing and bounding probability metrics. Int. Stat. Rev. 2002, 70, 419–435. [Google Scholar] [CrossRef] [Green Version]
  71. Jian, B.; Vemuri, B.C. Robust point set registration using Gaussian mixture models. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 1633–1645. [Google Scholar] [CrossRef]
  72. Nielsen, F.; Nock, R. Entropies and cross-entropies of exponential families. In Proceedings of the 2010 IEEE International Conference on Image Processing, Hong Kong, China, 26–29 September 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 3621–3624. [Google Scholar]
Figure 1. Plot of the Bhattacharyya distance D_{B,α}(p : q) (strictly concave, displayed in blue) and the log-normalizer F_pq(α) of the induced LREF E_pq (strictly convex, displayed in red) for two univariate normal densities p = p_{0,1} (standard normal) and q = p_{1,2}: The curves D_{B,α}(p : q) = −F_pq(α) and F_pq(α) are mirror symmetric to each other. The Chernoff information optimal skewing value α* between these two univariate normal distributions can be calculated exactly in closed form, see Section 5.2 (approximated numerically here for plotting the vertical grey line by α* ≈ 0.4215580558605244).
Entropy 24 01400 g001
Figure 2. The best unique parameter α* defining the Chernoff information optimal skewing parameter is found by setting the derivative of the strictly convex function F_pq(α) to zero. At the optimal value α*, we have D_C[p : q] = D_KL[(pq)^G_{α*} : p] = D_KL[(pq)^G_{α*} : q] = −F_pq(α*) > 0.
Entropy 24 01400 g002
Figure 3. The Chernoff information distribution ( P Q ) α * G with density ( p q ) α * G is obtained as the unique intersection of the exponential arc γ G ( p , q ) linking density p to density q of L 1 ( μ ) with the left-sided Kullback–Leibler divergence bisector Bi KL left ( p , q ) of p and q: ( p q ) α * G = γ G ( p , q ) Bi KL left ( p , q ) .
Entropy 24 01400 g003
Figure 4. Illustration of the dichotomic search for approximating the optimal skewing parameter α * to within some prescribed numerical precision ϵ > 0 .
Entropy 24 01400 g004
Figure 5. Taxonomy of exponential families: Regular (and always steep) or steep (but not necessarily regular). The Kullback–Leibler divergence between two densities of a regular exponential family amounts to dual Bregman divergences.
Entropy 24 01400 g005
Figure 6. The Chernoff information optimal skewing parameter α* for two densities p_{θ1} and p_{θ2} of some regular exponential family E inducing an exponential family dually flat manifold M = ({p_θ}, g_F = ∇²F(θ), ∇^m, ∇^e) is characterized by the intersection of their e-flat exponential geodesic with their mixture bisector, an m-flat right-sided Bregman bisector.
Entropy 24 01400 g006
Figure 7. Interpolation along the e-geodesic and the m-geodesic passing through two given multivariate normal distributions. No closed form is known for the Riemannian geodesic with respect to the metric Levi–Civita connection (shown in dashed style).
Entropy 24 01400 g007
Figure 8. The natural parameter space of the non-regular full exponential family of singly truncated normal distributions is not regular (i.e., not open): The negative real axis corresponds to the exponential family of exponential distributions.
Entropy 24 01400 g008
Table 1. Summary of the optimal conditions characterizing the Chernoff exponent.
Generic case
Primal LREF OC α : D KL [ ( p q ) α * G : p ] = D KL [ ( p q ) α * G : q ]
Dual LREF OC β : β ( α * ) = E ( p q ) α * G log p ( x ) q ( x ) = 0
Geometric OC ( p q ) α * G = γ G ( p , q ) Bi KL left ( p , q ]
Case of exponential families
Bregman OC EF : B F ( θ 1 : θ α * ) = B F ( θ 2 : θ α * )
Fenchel–Young OC YF : Y F , F * ( θ 1 : η α * ) = Y F , F * ( θ 2 : η α * )
Simplified OC SEF : F θ 1 , θ 2 ( α ) = 0
OC SEF : ( θ 2 θ 1 ) F ( θ 1 + α * ( θ 2 θ 1 ) ) = F ( θ 2 ) F ( θ 1 )
Geometric OC γ p q e ( α ) Bi m ( p , q )
1D EF α * = F 1 F ( θ 2 ) F ( θ 1 ) θ 2 θ 1 θ 2 θ 1 θ 2
Gaussian case
1D Gaussians OC Gaussian : μ 2 σ 2 2 μ 1 σ 1 2 m α 1 2 σ 2 2 1 2 σ 1 2 v α = 1 2 log σ 2 2 σ 1 2 + μ 2 2 2 σ 2 2 μ 1 2 2 σ 1 2
α * is root of quadratic polynomial in ( 0 , 1 )
Centered Gaussians OC Centered Gaussians : i = 1 d 1 λ i α * + ( 1 α * ) λ i + log λ i = 0
where λ i is the i-th eigenvalue of Σ 1 Σ 2 1
Centered Gaussians α * = s 1 log s ( s 1 ) log s ( 0 , 1 )
scaled covarianceswhen Σ 2 = s Σ 1
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
