Article

Revisiting Chernoff Information with Likelihood Ratio Exponential Families

Sony Computer Science Laboratories, Tokyo 141-0022, Japan
Entropy 2022, 24(10), 1400; https://doi.org/10.3390/e24101400
Submission received: 31 July 2022 / Revised: 23 September 2022 / Accepted: 28 September 2022 / Published: 1 October 2022

Abstract

The Chernoff information between two probability measures is a statistical divergence measuring their deviation defined as their maximally skewed Bhattacharyya distance. Although the Chernoff information was originally introduced for bounding the Bayes error in statistical hypothesis testing, the divergence found many other applications due to its empirical robustness property found in applications ranging from information fusion to quantum information. From the viewpoint of information theory, the Chernoff information can also be interpreted as a minmax symmetrization of the Kullback–Leibler divergence. In this paper, we first revisit the Chernoff information between two densities of a measurable Lebesgue space by considering the exponential families induced by their geometric mixtures: The so-called likelihood ratio exponential families. Second, we show how to (i) solve exactly the Chernoff information between any two univariate Gaussian distributions or get a closed-form formula using symbolic computing, (ii) report a closed-form formula of the Chernoff information of centered Gaussians with scaled covariance matrices and (iii) use a fast numerical scheme to approximate the Chernoff information between any two multivariate Gaussian distributions.


1. Introduction

1.1. Chernoff Information: Definition and Related Statistical Divergences

Let (𝒳, A) denote a measurable space [1] with sample space 𝒳 and finite σ-algebra A of events. A measure P is absolutely continuous with respect to another measure Q if P(A) = 0 whenever Q(A) = 0: P is said to be dominated by Q, written notationally for short as P ≪ Q. We shall write P ⋠ Q when P is not dominated by Q. When P ≪ Q, we denote by dP/dQ the Radon–Nikodym density [1] of P with respect to Q.
The Chernoff information [2], also called Chernoff information number [3,4] or the Chernoff divergence [5,6], is the following symmetric measure of dissimilarity (see Appendix A for some background on statistical divergences) between any two comparable probability measures P and Q dominated by μ :
D_C[P,Q] := max_{α∈(0,1)} −log ρ_α[P:Q] = D_C[Q,P],
where
ρ_α[P:Q] := ∫ p^α q^{1−α} dμ = ρ_{1−α}[Q:P]
is the α -skewed Bhattacharyya affinity coefficient [7] (a coefficient measuring the similarity of two densities). In the remainder, we shall use the following conventions: When a (dis)similarity is asymmetric (e.g., ρ α [ P : Q ] ), we use the colon notation “:” to separate its arguments. When the (dis)similarity is symmetric (e.g., D C [ P , Q ] ), we use the comma notation “,” to separate its arguments.
The α-skewed Bhattacharyya coefficients are always upper bounded by 1 and are strictly greater than zero for non-empty intersecting supports (non-singular PMs):
0 < ρ_α[P:Q] ≤ 1.
A proof can be obtained by applying Hölder’s inequality (see also Appendix A for an alternative proof).
Since the affinity coefficient ρ_α[P:Q] does not depend on the underlying dominating measure μ [4], we shall write D_C[p,q] instead of D_C[P,Q] in the remainder.
Let D B , α [ p : q ] denote the α -skewed Bhattacharyya distance [7,8]:
D_{B,α}[p:q] := −log ρ_α[p:q] = D_{B,1−α}[q:p].
The α -skewed Bhattacharyya distances are not metric distances since they can be asymmetric and do not satisfy the triangle inequality even when α = 1 2 .
Thus, the Chernoff information is defined as the maximal skewed Bhattacharyya distance:
D C [ p , q ] = max α ( 0 , 1 ) D B , α [ p : q ] .
Grünwald [9,10] called the skewed Bhattacharyya coefficients and distances the α -Rényi affinity and the unnormalized Rényi divergence, respectively, (see Section 19.6 of [9]) since the Rényi divergence [11,12] is defined by
D_{R,α}[P:Q] = 1/(α−1) log ∫ p^α q^{1−α} dμ = 1/(1−α) D_{B,α}[P:Q].
Thus D_{B,α}[P:Q] = (1−α) D_{R,α}[P:Q] can be interpreted as the unnormalized Rényi divergence in [9]. However, let us notice that the Rényi α-divergences are defined in general for a wider range α ∈ [0,∞] \ {1}, with lim_{α→1} D_{R,α}[P:Q] = D_KL[P:Q], whereas the skew Bhattacharyya distances are defined for α ∈ (0,1) in general.
The Chernoff information was originally introduced to upper bound the probability of misclassification error in Bayesian binary hypothesis testing [2], where the optimal skewing parameter α* such that D_C[p,q] = D_{B,α*}[p:q] is referred to in the statistical literature as the Chernoff error exponent [13,14,15] or Chernoff exponent [16,17] for short. The Chernoff information has found many other fruitful applications beyond its original statistical hypothesis testing scope, like in computer vision [18], information fusion [19], time-series clustering [20], and more generally in machine learning [21] (just to cite a few use cases). It has been observed empirically that the Chernoff information exhibits superior robustness [22] compared to the Kullback–Leibler divergence in distributed fusion of Gaussian Mixture Models [19] (GMMs) or in target detection in radar sensor networks [23]. The Chernoff information has also been used for analyzing the deepfake detection performance of Generative Adversarial Networks [22] (GANs).

1.2. Prior Work and Contributions

The Chernoff information between any two categorical distributions (multinomial distributions with one trial, also called "multinoulli" distributions since they extend the Bernoulli distributions) has been very well studied and described in many reference textbooks of information theory or statistics (e.g., see Section 12.9 of [13]). The Chernoff information between two probability distributions of an exponential family was considered from the viewpoint of information geometry in [24], and in the general case from the viewpoint of unnormalized Rényi divergences in [11] (Theorem 32). By replacing the weighted geometric mean in the definition of the Bhattacharyya coefficient ρ_α of Equation (2) by an arbitrary weighted mean, the generalized Bhattacharyya coefficient and its associated divergences, including the Chernoff information, were generalized in [25]. The geometry of the Chernoff error exponent was studied in [26,27] when dealing with a finite set of mutually absolutely continuous probability distributions P_1, …, P_n. In this case, the Chernoff information amounts to the minimum pairwise Chernoff information of the probability distributions [28]:
D_C[P_1, …, P_n] := min_{i ≠ j, i,j ∈ {1,…,n}} D_C[P_i, P_j].
We summarize our contributions as follows: In Section 2, we study the Chernoff information between two given mutually non-singular probability measures P and Q by considering their “exponential arc” [29] as a special 1D exponential family termed a Likelihood Ratio Exponential Family (LREF) in [10]. We show that the optimal skewing value (Chernoff exponent) defining their Chernoff information is unique (Proposition 1) and can be characterized geometrically on the Banach vector space L 1 ( μ ) of equivalence classes of measurable functions (i.e., two functions f 1 and f 2 are said equivalent in L 1 ( μ ) if they are equal μ -almost everywhere, abbreviated as μ -a.e. in the remainder) for which their absolute value is Lebesgue integrable (Proposition 4). This geometric characterization allows us to design a generic dichotomic search algorithm (Algorithm 1) to approximate the Chernoff optimal skewing parameter, generalizing the prior work [24]. When P and Q belong to a same exponential family, we recover in Section 3 the results of [24]. This geometric characterization also allows us to reinterpret the Chernoff information as a minmax symmetrization of the Kullback–Leibler divergence, and we define by analogy the forward and reverse Chernoff–Bregman divergences in Section 4 (Definition 2). In Section 5, we consider the Chernoff information between Gaussian distributions: We show that the optimality condition for the Chernoff information between univariate Gaussian distributions can be solved exactly and report a closed-form formula for the Chernoff information between any two univariate Gaussian distributions (Proposition 10). For multivariate Gaussian distributions, we show how to implement the dichotomic search algorithms to approximate the Chernoff information, and report a closed-form formula for the Chernoff information between two centered multivariate Gaussian distributions with scaled covariance matrices (Proposition 11). Finally, we conclude in Section 7.

2. Chernoff Information from the Viewpoint of Likelihood Ratio Exponential Families

2.1. LREFs and the Chernoff Information

Recall that L 1 ( μ ) denotes the Lebesgue vector space of measurable functions f such that X | f | d μ < . Given two prescribed densities p and q of L 1 ( μ ) , consider building a uniparametric exponential family [30] E p q which consists of the weighted geometric mixtures of p and q:
E_pq := { (pq)_α^G(x) := p(x)^α q(x)^{1−α} / Z_pq(α) : α ∈ Θ },
where
Z_pq(α) = ∫_𝒳 p(x)^α q(x)^{1−α} dμ(x) = ρ_α[p:q]
denotes the normalizer (or partition function) of the geometric mixture
(pq)_α^G(x) ∝ p(x)^α q(x)^{1−α},
so that ∫_𝒳 (pq)_α^G dμ = 1. The parameter space Θ is defined as the set of α values which yield convergence of the definite integral Z_pq(α):
Θ := { α ∈ ℝ : Z_pq(α) < ∞ }.
Let us express the density ( p q ) α G in the canonical form (∗) of exponential families [30]:
(pq)_α^G(x) = exp( α log(p(x)/q(x)) − log Z_pq(α) ) q(x)
=:^{(∗)} exp( α t(x) − F_pq(α) + k(x) ).
It follows from this decomposition that α Θ R is the scalar natural parameter, t ( x ) = log p ( x ) q ( x ) denotes the sufficient statistic (minimal when p ( x ) q ( x ) μ -a.e.), k ( x ) = log q ( x ) is an auxiliary carrier term wrt. measure μ (i.e., measure d ν ( x ) = q ( x ) d μ ( x ) ), and
F_pq(α) = log Z_pq(α) = −D_{B,α}[p:q] ≤ 0
is the log-normalizer (or log-partition or cumulant function). Since the sufficient statistic is the logarithm of the likelihood ratio of p ( x ) and q ( x ) , Grünwald [9] (Section 19.6) termed E p q a Likelihood Ratio Exponential Family (LREF). See also [31] for applications of LREFs to Markov chain Monte Carlo (McMC) methods.
We have p = ( p q ) 1 G and q = ( p q ) 0 G . Thus, let α p = 1 and α q = 0 , and let us interpret geometrically { ( p q ) α G , α Θ } as a maximal exponential arc [29,32,33] where Θ R is an interval. We denote by E p q ¯ the open exponential arc with extremities p and q.
Since the log-normalizers F(θ) of exponential families are always strictly convex and real analytic [30] (i.e., F(θ) ∈ C^ω(ℝ)), we deduce that D_{B,α}[p:q] = −F_pq(α) is strictly concave and real analytic. Moreover, we have D_{B,0}[p:q] = D_{B,1}[p:q] = 0. Hence, the Chernoff optimal skewing parameter α* is unique when p ≠ q μ-a.e., and we get the Chernoff information calculated as
D_C[p,q] = D_{B,α*}[p:q].
See Figure 1 for a plot of the strictly concave function D_{B,α}[p:q] and the strictly convex function F_pq(α) = −D_{B,α}[p:q] when p = p_{0,1} is the standard normal density and q = p_{1,2} is a normal density of mean 1 and variance 2.
Consider the full natural parameter space Θ p q of E p q :
Θ_pq = { α ∈ ℝ : ρ_α(p:q) < +∞ }.
The natural parameter space Θ_pq is always convex [30] and since ρ_0(p:q) = ρ_1(p:q) = 1, we necessarily have (0,1) ⊂ Θ_pq but not necessarily [0,1] ⊂ Θ_pq, as detailed in the following remark:
Remark 1.
In order to be an exponential family, the densities (pq)_α^G shall have the same coinciding support for all values of α belonging to the natural parameter space. The support of the geometric mixture density (pq)_α^G is
supp((pq)_α^G) = supp(p) ∩ supp(q) for α ∈ Θ_pq \ {0,1}, supp(p) for α = 1, and supp(q) for α = 0.
This condition is trivially satisfied when the supports of p and q coincide, and therefore [0,1] ⊂ Θ_pq in that case. Otherwise, we may consider the common support 𝒳_pq = supp(p) ∩ supp(q) for α ∈ (0,1). In this latter case, we restrict the natural parameter space to Θ_pq = (0,1) even if ρ_α(p:q) < ∞ for some α outside that range.
To emphasize that α* depends on p and q, we shall use the notation α*(p:q) whenever necessary. We have α*(q:p) = 1 − α*(p:q) since D_{B,α}(p:q) = D_{B,1−α}(q:p), and we check that
D_C[p,q] = D_{B,α*(p:q)}(p:q) = D_{B,α*(q:p)}(q:p) = D_C[q,p].
Thus the skewing value α * ( q : p ) may be called the conjugate Chernoff exponent (i.e., depends on the convention chosen for interpolating on the exponential arc).
However, since the Chernoff information does not satisfy the triangle inequality, it is not a metric distance; it is thus called a quasi-distance.
Proposition 1
(Uniqueness of the Chernoff information optimal skewing parameter [11,12]). Let P and Q be two probability measures dominated by a positive measure μ with corresponding Radon–Nikodym densities p and q, respectively. The Chernoff information optimal skewing parameter α * ( p : q ) is unique when p q  μ-almost everywhere, and
D C [ p , q ] = D B , α * ( p : q ) ( p : q ) = D B , α * ( q : p ) ( q : p ) = D C [ q , p ] .
When p = q  μ-a.e., we have D C [ p : q ] = 0 and α * is undefined since it can range in [ 0 , 1 ] .
Definition 1.
An exponential family is called regular [30] when the natural parameter space Θ is open, i.e., Θ = int(Θ), where int(Θ) denotes the interior of Θ (here, an open interval).
Proposition 2
(Finite sided Kullback–Leibler divergences). When the LREF E_pq is a regular exponential family with natural parameter space Θ ⊇ [0,1], both the forward Kullback–Leibler divergence D_KL[p:q] and the reverse Kullback–Leibler divergence D_KL[q:p] are finite.
Proof. 
A reverse parameter divergence D * ( θ 1 : θ 2 ) is a parameter divergence on the swapped parameter order: D * ( θ 1 : θ 2 ) : = D ( θ 2 : θ 1 ) . Similarly, a reverse statistical divergence D * [ p : q ] is a statistical divergence on the swapped parameter order: D * [ p : q ] : = D [ q : p ] . We shall use the result pioneered in [34,35] that the KLD between two densities p θ 1 and p θ 2 of a regular exponential family E = { p θ : θ Θ } amounts to a reverse Bregman divergence ( B F ) * (i.e., a Bregman divergence on swapped parameter order) induced by the log-normalizer of the family:
D KL [ p θ 1 : p θ 2 ] = ( B F ) * ( θ 1 : θ 2 ) = B F ( θ 2 : θ 1 ) ,
where B F is the Bregman divergence defined on domain D = dom ( F ) (see Definition 1 of [36]):
B_F : D × ri(D) → [0,∞), (θ1, θ2) ↦ B_F(θ1:θ2) = F(θ1) − F(θ2) − (θ1 − θ2)ᵀ ∇F(θ2) < +∞,
where ri ( D ) denotes the relative interior of domain D. Bregman divergences are always finite and the only symmetric Bregman divergences are squared Mahalanobis distances [37] (i.e., with corresponding Bregman generators defining quadratic forms).
For completeness, we recall the proof as follows: We have
log( p_{θ1}(x) / p_{θ2}(x) ) = (θ1 − θ2)ᵀ t(x) − F(θ1) + F(θ2).
Thus we get
D_KL[p_{θ1} : p_{θ2}] = E_{p_{θ1}}[ log(p_{θ1}/p_{θ2}) ] = F(θ2) − F(θ1) + (θ1 − θ2)ᵀ E_{p_{θ1}}[t(x)],
using the linearity property of the expectation operator. When E is regular, we also have E_{p_{θ1}}[t(x)] = ∇F(θ1) (see [38]), and therefore we get
D_KL[p_{θ1} : p_{θ2}] = F(θ2) − F(θ1) − (θ2 − θ1)ᵀ ∇F(θ1) =: B_F(θ2 : θ1) = (B_F)*(θ1 : θ2).
In our LREF setting, we thus have:
D KL [ p : q ] = ( B F ) * ( α p : α q ) = B F p q ( α q : α p ) = B F p q ( 0 : 1 ) ,
and D KL [ q : p ] = B F p q ( α p : α q ) = B F p q ( 1 : 0 ) where B F p q ( α 1 : α 2 ) denotes the following scalar Bregman divergence:
B_{F_pq}(α1 : α2) = F_pq(α1) − F_pq(α2) − (α1 − α2) F'_pq(α2).
Since F_pq(0) = F_pq(1) = 0 and B_{F_pq} : Θ × ri(Θ) → [0,∞), we have
D_KL[p:q] = B_{F_pq}(α_q : α_p) = B_{F_pq}(0:1) = F'_pq(1) < ∞.
Similarly,
D_KL[q:p] = B_{F_pq}(α_p : α_q) = B_{F_pq}(1:0) = −F'_pq(0) < ∞.
Notice that since B_{F_pq}(α1:α2) > 0, we have F'_pq(1) > 0 and F'_pq(0) < 0 when p ≠ q μ-almost everywhere (a.e.). Moreover, since F_pq(α) is strictly convex, F'_pq(α) is strictly monotonically increasing, and therefore there exists a unique α* ∈ (0,1) such that F'_pq(α*) = 0.    □
Example 1.
When p and q belong to the same regular exponential family E (e.g., p and q are two normal densities), their sided KLDs [37] are both finite. The LREF induced by two Cauchy distributions p_{l1,s1} and p_{l2,s2} is such that [0,1] ⊂ Θ since the skewed Bhattacharyya distance is defined and finite for all α ∈ ℝ [39]. Therefore the KLDs between two Cauchy distributions are always finite [39]; see the closed-form formula in [40].
Remark 2.
If 0 lies in the interior of Θ, then B_{F_pq}(1:0) < ∞ and therefore D_KL[q:p] < ∞. Since the KLD between a standard Cauchy distribution p and a standard normal distribution q is +∞, we deduce that D_KL[p:q] = B_{F_pq}(0:1) cannot be finite, and therefore 1 does not lie in the interior of Θ in that case. Similarly, when 1 lies in the interior of Θ, we have B_{F_pq}(0:1) < ∞ and therefore D_KL[p:q] < ∞.
Proposition 3
(Chernoff information expressed as KLDs). (see also Theorem 32 of [11]) We have at the Chernoff information optimal skewing value α * ( 0 , 1 ) the following identities:
D C [ p : q ] = D KL [ ( p q ) α * G : p ] = D KL [ ( p q ) α * G : q ] .
Proof. 
Since the skewed Bhattacharyya distance between two densities p θ 1 and p θ 2 of an exponential family with log-normalizer F amounts to a skew Jensen divergence for the log-normalizer [8,41], we have:
D B , α ( p θ 1 : p θ 2 ) = J F , α ( θ 1 : θ 2 ) ,
where the skew Jensen divergence [42] is given by
J F , α ( θ 1 : θ 2 ) = α F ( θ 1 ) + ( 1 α ) F ( θ 2 ) F ( α θ 1 + ( 1 α ) θ 2 ) .
In the setting of the LREF, we have
D B , α ( ( p q ) α 1 G : ( p q ) α 2 G ) = J F p q , α ( α 1 : α 2 ) , = α F p q ( α 1 ) + ( 1 α ) F p q ( α 2 ) F p q ( α α 1 + ( 1 α ) α 2 ) .
At the optimal value α*, we have F'_pq(α*) = 0. Since D_KL[(pq)_{α*}^G : p] = B_{F_pq}(1:α*) = −F_pq(α*) and D_KL[(pq)_{α*}^G : q] = B_{F_pq}(0:α*) = −F_pq(α*), and D_C[p:q] = −log ρ_{α*}(p:q) = J_{F_pq,α*}(1:0) = −F_pq(α*), we get
D C [ p : q ] = D KL [ ( p q ) α * G : p ] = D KL [ ( p q ) α * G : q ] .
Figure 2 illustrates the proposition on the plot of the scalar function F p q ( α ) .    □
Corollary 1.
The Chernoff information optimal skewing value α * ( p : q ) ( 0 , 1 ) can be used to calculate the Chernoff information D C [ p , q ] as a Bregman divergence induced by the LREF:
D C [ p : q ] = B F p q [ 1 : α * ] = B F p q [ 0 : α * ] = J F p q , α * ( 1 : 0 ) .
In general, the divergence J F C ( θ 1 , θ 2 ) = max α ( 0 , 1 ) J F , α ( θ 1 : θ 2 ) is called a Jensen–Chernoff divergence.
Proposition 3 lets us interpret the Chernoff information as a special symmetrization of the Kullback–Leibler divergence [43], different from the Jeffreys divergence or the Jensen–Shannon divergence [44]. Indeed, the Chernoff information can be rewritten as
D_C[p,q] = min_{r ∈ Ē_pq} max{ D_KL[r:p], D_KL[r:q] }.
As such, we can interpret the Chernoff information as the radius of a minimum enclosing left-sided Kullback–Leibler ball on the space L 1 ( μ ) . A related concept is the radius [12] of two densities p and q with respect to Rényi divergences of order α (see Equation (2) of [12]):
r α ( p , q ) : = inf c max { D R , α [ p : c ] , D R , α [ q : c ] } .
When α = 1 , the radius is called the Shannon radius [12] since the Rényi divergences of order 1 corresponds to the Kullback–Leibler divergence (relative entropy).

2.2. Geometric Characterization of the Chernoff Information and the Chernoff Information Distribution

Let us term the probability distribution ( P Q ) α * G μ with corresponding density ( p q ) α * G the Chernoff information distribution to avoid confusion with another concept of Chernoff distributions [45] used in statistics. We can characterize geometrically the Chernoff information distribution ( p q ) α * G on L 1 ( μ ) as the intersection of a left-sided Kullback–Leibler divergence bisector:
Bi KL left ( p , q ) : = r L 1 ( μ ) : D KL [ r : p ] = D KL [ r : q ] ,
with an exponential arc [29]
γ G ( p , q ) : = ( p q ) α G : α [ 0 , 1 ] .
We thus interpret Proposition 3 geometrically by the following proposition (see Figure 3):
Proposition 4
(Geometric characterization of the Chernoff information). On the vector space L 1 ( μ ) , the Chernoff information distribution is the unique distribution
(pq)_{α*}^G = γ_G(p,q) ∩ Bi_KL^left(p,q).
The point ( p q ) α * G has been called the Chernoff point in [24].
Proposition 4 allows us to design a dichotomic search to numerically approximate α * as reported in pseudo-code in Algorithm 1 (see also the illustration in Figure 4).
Algorithm 1 Dichotomic search for approximating the Chernoff information by approximating the optimal skewing parameter value α̃ ≈ α* and reporting D_C[p:q] ≈ D_KL[(pq)_{α̃}^G : p]. The search requires ⌈log_2(1/ϵ)⌉ iterations to guarantee |α* − α̃| ≤ ϵ.
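Since the pseudocode of Algorithm 1 is given as a figure, we include a minimal numerical sketch of the dichotomic search (our own function names; SciPy quadrature over a finite window assumed suitable for well-localized univariate densities):

```python
import numpy as np
from scipy.integrate import quad

def geometric_mixture(p, q, alpha, lo=-30.0, hi=30.0):
    """Normalized geometric mixture (p q)_alpha^G; Z_pq(alpha) = rho_alpha[p:q]."""
    unnorm = lambda x: p(x) ** alpha * q(x) ** (1.0 - alpha)
    Z, _ = quad(unnorm, lo, hi)
    return (lambda x: unnorm(x) / Z), Z

def kl(r, s, lo=-30.0, hi=30.0):
    """Kullback-Leibler divergence D_KL[r : s] by quadrature."""
    val, _ = quad(lambda x: r(x) * np.log(r(x) / s(x)), lo, hi)
    return val

def chernoff_alpha(p, q, eps=1e-6):
    """Bisection on alpha using the geometric characterization of Proposition 4."""
    a, b = 0.0, 1.0
    while b - a > eps:
        alpha = 0.5 * (a + b)
        mix, _ = geometric_mixture(p, q, alpha)
        # (pq)_1^G = p: if the mixture is KL-farther from p than from q, increase alpha.
        if kl(mix, p) > kl(mix, q):
            a = alpha
        else:
            b = alpha
    return 0.5 * (a + b)

# Example 4 of Section 5.2: p = N(0,1), q = N(1,2); expect alpha* ~ 0.42156, D_C ~ 0.11554.
p = lambda x: np.exp(-x ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
q = lambda x: np.exp(-(x - 1.0) ** 2 / 4.0) / np.sqrt(4.0 * np.pi)
a_star = chernoff_alpha(p, q)
mix, _ = geometric_mixture(p, q, a_star)
print(a_star, kl(mix, p))   # approximates (alpha*, D_C[p, q])
```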
Remark 3.
We do not need to necessarily handle normalized densities p and q since we have for α R { 0 , 1 } :
( p q ) α G = ( p ˜ q ˜ ) α G ,
where p(x) = p̃(x)/Z_p and q(x) = q̃(x)/Z_q, with p̃ and q̃ denoting the computationally-friendly unnormalized positive densities. This property of geometric mixtures is used in Annealed Importance Sampling [46,47] (AIS), and for designing an asymptotically efficient estimator for computationally-intractable parametric densities [48] q̃_θ (e.g., distributions learned by Boltzmann machines).

2.3. Dual Parameterization of LREFs

The densities ( p q ) α G of a LREF can also be parameterized by their dual moment parameter [30] (or mean parameter):
β = β(α) := E_{(pq)_α^G}[t(x)] = E_{(pq)_α^G}[ log(p(x)/q(x)) ].
When the LREF is regular (and therefore steep [38]), we have β(α) = F'_pq(α) and α = (F*_pq)'(β), where F*_pq denotes the Legendre transform of F_pq. At the optimal value α*, we have F'_pq(α*) = 0. Therefore an equivalent condition of optimality is
β(α*) = F'_pq(α*) = 0 = E_{(pq)_{α*}^G}[ log(p(x)/q(x)) ].
Notice that when [0,1] ⊂ Θ, we have finite forward and reverse Kullback–Leibler divergences:
  • When α = 1, we have (pq)_1^G = p and
    β(1) = E_p[ log(p(x)/q(x)) ] = D_KL[p:q] = F'_pq(1) > 0.
  • When α = 0, we have (pq)_0^G = q and
    β(0) = E_q[ log(p(x)/q(x)) ] = −D_KL[q:p] = F'_pq(0) < 0.
Since F_pq(α) is strictly convex, we have F''_pq(α) > 0 and F'_pq is strictly increasing, with F'_pq(0) = −D_KL[q:p] < 0 and F'_pq(1) = D_KL[p:q] > 0. The value α* is thus the unique value such that F'_pq(α*) = 0.
Proposition 5
(Dual optimality condition for the Chernoff information). The unique Chernoff information optimal skewing parameter α * is such that
OC α : D KL [ ( p q ) α * G : p ] = D KL [ ( p q ) α * G : q ] OC β : β ( α * ) = E ( p q ) α * G log p ( x ) q ( x ) = 0 .
One can understand why the Chernoff information is more robust or stable than a skewed Bhattacharyya distance by considering the derivative of the skewed Bhattacharyya distance with respect to the skewing parameter. Consider without loss of generality densities p_{θ1} and p_{θ2} of a 1D exponential family. Their skewed Bhattacharyya distances amount to skew Jensen divergences, and we have:
J'_{F,α}(θ1:θ2) := d/dα J_{F,α}(θ1:θ2) = F(θ1) − F(θ2) − (θ1 − θ2) F'(αθ1 + (1−α)θ2).
Since J_{F,α*} is by definition maximal, we have J'_{F,α*}(θ1:θ2) = 0, and therefore |J'_{F,α}(θ1:θ2) − J'_{F,α*}(θ1:θ2)| = |(θ1 − θ2) ( F'(α*θ1 + (1−α*)θ2) − F'(αθ1 + (1−α)θ2) )| > 0 for α ≠ α*. Further assuming without loss of generality that θ2 − θ1 = 1, we get |J'_{F,α}(θ1:θ2)| = |F'(αθ1 + (1−α)θ2) − F'(α*θ1 + (1−α*)θ2)| > 0.
As a side remark, let us notice that the Fisher information of a likelihood ratio exponential family E_pq is
I_pq(α) = −E_{(pq)_α^G}[ (log (pq)_α^G(x))'' ] = F''_pq(α) > 0,
and we have F''_pq(α) (F*_pq)''(β) = 1.

3. Chernoff Information between Densities of an Exponential Family

3.1. General Case

We shall now consider that the densities p and q (with respect to measure μ) belong to the same exponential family [30]:
E = { P_λ : dP_λ/dμ = p_λ(x) = exp( θ(λ)ᵀ t(x) − F(θ(λ)) ), λ ∈ Λ },
where θ ( λ ) denotes the natural parameter associated with the ordinary parameter λ , t ( x ) the sufficient statistic vector and F ( θ ( λ ) ) the log-normalizer. When θ ( λ ) = λ and t ( x ) = x , the exponential family is called a natural exponential family (NEF). The exponential family E is defined by μ and t ( x ) , hence we may write when necessary E = E μ , t .
Example 2.
The set of univariate Gaussian distributions
N = { p μ , σ 2 ( x ) : λ = ( μ , σ 2 ) Λ = R × R + + }
forms an exponential family with the following decomposition terms:
λ = (μ, σ²) ∈ Λ = ℝ × ℝ_{++}, θ(λ) = (θ1, θ2) = ( μ/σ², −1/(2σ²) ) ∈ Θ = ℝ × ℝ_{−−}, t(x) = (x, x²), F(θ) = −θ1²/(4θ2) + ½ log(−π/θ2),
where ℝ_{++} = {x ∈ ℝ : x > 0} and ℝ_{−−} = {x ∈ ℝ : x < 0} denote the set of positive real numbers and the set of negative real numbers, respectively. Letting v = σ² be the variance parameter, we get the equivalent natural parameters (μ/v, −1/(2v)). The log-normalizer can be written using the (μ,v)-parameterization as F(μ,v) = ½ log(2πv) + μ²/(2v), with θ = (θ1, θ2) = (μ/v, −1/(2v)). See Appendix B for further details concerning this normal exponential family.
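As a quick numerical sanity check of this decomposition (with helper names of our own choosing, not taken from the paper), one may verify that the log-normalizer expressed in the natural parameters agrees with its (μ,v)-parameterization:

```python
import numpy as np

def natural_params(mu, v):
    """theta(lambda) = (mu/sigma^2, -1/(2 sigma^2)) with v = sigma^2."""
    return mu / v, -1.0 / (2.0 * v)

def log_normalizer(theta1, theta2):
    """F(theta) = -theta1^2/(4 theta2) + (1/2) log(-pi/theta2)."""
    return -theta1 ** 2 / (4.0 * theta2) + 0.5 * np.log(-np.pi / theta2)

mu, v = 1.5, 2.0
print(log_normalizer(*natural_params(mu, v)),
      0.5 * np.log(2.0 * np.pi * v) + mu ** 2 / (2.0 * v))   # both print F(mu, v)
```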
Notice that we can check easily that the LREF induced by two densities of an exponential family forms a 1D sub-exponential family of that exponential family:
p_{θ1}(x)^α p_{θ2}(x)^{1−α} = exp( ⟨αθ1 + (1−α)θ2, t(x)⟩ − αF(θ1) − (1−α)F(θ2) ) = p_{αθ1+(1−α)θ2}(x) exp( F(αθ1+(1−α)θ2) − αF(θ1) − (1−α)F(θ2) ) = p_{αθ1+(1−α)θ2}(x) exp( −J_{F,α}(θ1:θ2) ),
where J_{F,α} denotes the skew Jensen divergence induced by F.
The optimal skewing value condition of the Chernoff information between two categorical distributions [13] was extended to densities p_{θ1} and p_{θ2} of an exponential family in [24]. The family of categorical distributions with d choices forms an exponential family with natural parameter of dimension d − 1. Thus, Proposition 7 generalizes the analysis in [13].
Let p = p θ 1 and q = p θ 2 . Then we have the property that exponential families are closed under geometric mixtures:
( p θ 1 p θ 2 ) α G = p α θ 1 + ( 1 α ) θ 2 .
Since the natural parameter space Θ is convex, we have αθ1 + (1−α)θ2 ∈ Θ.
The KLD between two densities p θ 1 and p θ 2 of a regular exponential family E amounts to a reverse Bregman divergence for the log-normalizer of E :
D KL [ p θ 1 : p θ 2 ] = B F ( θ 2 : θ 1 ) ,
where B F ( θ 2 : θ 1 ) denotes the Bregman divergence:
B_F(θ2 : θ1) = F(θ2) − F(θ1) − (θ2 − θ1)ᵀ ∇F(θ1).
Thus, when the exponential family E is regular, both the forward and reverse KLD are finite, and we can rewrite Proposition 3 to characterize α * as follows:
OC EF : B F ( θ 1 : θ α * ) = B F ( θ 2 : θ α * ) ,
where θ α * = α * θ 1 + ( 1 α * ) θ 2 .
The Legendre–Fenchel transform of F ( θ ) yields the convex conjugate
F * ( η ) = sup θ Θ { θ η F ( θ ) }
with η ( θ ) = F ( θ ) . Let H = { η ( θ ) : θ Θ } denote the dual moment parameter space also called domain of means. The Legendre transform associates to ( Θ , F ( θ ) ) the convex conjugate ( H , F * ( η ) ) . In order for ( H , F * ( η ) ) to be of the same well-behaved type of ( Θ , F ( θ ) ) , we shall consider convex functions F ( θ ) which are steep, meaning that their gradient diverges when nearing the boundary bd ( Θ )  [49] and thus ensures that domain H is also convex. Steep convex functions are said of Legendre-type, and ( ( Θ , F ( θ ) ) * ) * = ( Θ , F ( θ ) ) (Moreau biconjugation theorem which shows that the Legendre transform is involutive). For Legendre-type functions, there is a one-to-one mapping between parameters θ ( η ) and parameters η ( θ ) as follows:
θ(η) = ∇F*(η) = (∇F)^{−1}(η),
and
η(θ) = ∇F(θ) = (∇F*)^{−1}(θ).
Exponential families with log-normalizers of Legendre-type are called steep exponential families [30]. All regular exponential families are steep, and the maximum likelihood estimator in steep exponential families exists and is unique [38] (with the likelihood equations corresponding to the method of moments for the sufficient statistics). The set of inverse Gaussian distributions form a non-regular but steep exponential family, and the set of singly truncated normal distributions form a non-regular and non-steep exponential family [50] (but the exponential family of doubly truncated normal distributions is regular and hence steep).
For Legendre-type convex generators F(θ), we can express the Bregman divergence B_F(θ1:θ2) using the dual Bregman divergence: B_F(θ1:θ2) = B_{F*}(η2:η1), since there is a one-to-one correspondence between η = ∇F(θ) and θ = ∇F*(η).
For Legendre-type generators F ( θ ) , the Bregman divergence B F ( θ 1 : θ 2 ) can be rewritten as the following Fenchel–Young divergence:
B_F(θ1 : θ2) = F(θ1) + F*(η2) − θ1ᵀ η2 =: Y_{F,F*}(θ1 : η2).
Proposition 6.
(KLD between densities of a regular (and steep) exponential family). The KLD between two densities p θ 1 and p θ 2 of a regular and steep exponential family can be obtained equivalently as
D KL [ p θ 1 : p θ 2 ] = B F ( θ 2 : θ 1 ) = Y F , F * ( θ 2 : η 1 ) = Y F * , F ( η 1 : θ 2 ) = B F * ( η 1 : η 2 ) ,
where F ( θ ) and its convex conjugate F * ( η ) are Legendre-type functions.
Figure 5 illustrates the taxonomy of regularity and steepness of exponential families by a Venn diagram.
It follows that the optimal condition of Equation (18) can be restated as
OC YF : Y F , F * ( θ 1 : η α * ) = Y F , F * ( θ 2 : η α * ) ,
where η_{α*} = ∇F(α*θ1 + (1−α*)θ2). From the equality of Equation (22), we get the following simplified optimality condition:
OC_SEF: (θ2 − θ1)ᵀ η_{α*} = F(θ2) − F(θ1),
where η_{α*} = ∇F(α*θ1 + (1−α*)θ2).
Remark 4.
We can recover (OC_SEF) by instantiating the equivalent condition E_{p_{θ̄_{α*}}}[ log(p_{θ1}/p_{θ2}) ] = 0. Indeed, since log(p_{θ1}/p_{θ2}) = (θ1 − θ2)ᵀ t(x) − F(θ1) + F(θ2), we get
E_{p_{θ̄_{α*}}}[ (θ1 − θ2)ᵀ t(x) − F(θ1) + F(θ2) ] = 0, i.e., (θ1 − θ2)ᵀ η̄_{α*} = F(θ1) − F(θ2).
Since the α-skewed Bhattacharyya distance amounts to an α-skewed Jensen divergence [8], we get the Chernoff information as
D C [ p λ 1 : p λ 2 ] = J F , α * ( θ ( λ 1 ) : θ ( λ 2 ) ) , = B F ( θ 1 : θ α * ) = B F ( θ 2 : θ α * ) ,
where J F , α ( θ 1 : θ 2 ) is the Jensen divergence:
J F , α ( θ 1 : θ 2 ) = α F ( θ 1 ) + ( 1 α ) F ( θ 2 ) F ( α θ 1 + ( 1 α ) θ 2 ) .
Notice that the induced LREF has its log-normalizer expressed as the negative skew Jensen divergence induced by the log-normalizer of E:
F_{p_{θ1} p_{θ2}}(α) = log ρ_α[p_{θ1} : p_{θ2}] = −J_{F,α}(θ1 : θ2).
We summarize the result in the following proposition:
Proposition 7.
Let p λ 1 and p λ 2 be two densities of a regular exponential family E with natural parameter θ ( λ ) and log-normalizer F ( θ ) . Then the Chernoff information is
D C [ p λ 1 : p λ 2 ] = J F , α * ( θ ( λ 1 ) : θ ( λ 2 ) ) = B F ( θ 1 : θ α * ) = B F ( θ 2 : θ α * ) ,
where θ 1 = θ ( λ 1 ) , θ 2 = θ ( λ 2 ) , and the optimal skewing parameter α * is unique and satisfies the following optimality condition:
OC EF : ( θ 2 θ 1 ) η α * = F ( θ 2 ) F ( θ 1 ) ,
where η_{α*} = ∇F(α*θ1 + (1−α*)θ2) = E_{p_{α*θ1+(1−α*)θ2}}[t(x)].
Figure 6 illustrates geometrically the Chernoff point [24] which is the geometric mixture ( p θ 1 p θ 2 ) α * induced by two comparable probability measures P θ 1 , P θ 2 μ .
In information geometry [51], the manifold of densities M = {p_θ : θ ∈ Θ} of this exponential family is a dually flat space [51] M = ({p_θ}, g_F(θ) = ∇²F(θ), ∇^m, ∇^e) with respect to the exponential connection ∇^e and the mixture connection ∇^m, where g_F(θ) is the Fisher information metric expressed in the θ-coordinate system as ∇²F(θ) (and in the dual moment parameter η as g_F(η) = ∇²F*(η)). The exponential geodesic is ∇^e-flat and corresponds to the exponential arc of geometric mixtures when parameterized with the ∇^e-affine coordinate system θ.
The left-sided Kullback–Leibler bisector
Bi_KL^left(p_{θ1}, p_{θ2}) = { p_θ : D_KL[p_θ : p_{θ1}] = D_KL[p_θ : p_{θ2}] }
corresponds to a right-sided Bregman bisector [52] and is ∇^m-flat (i.e., an affine subspace in the η-coordinate system):
Bi_F^right(θ1, θ2) = { θ ∈ Θ : B_F(θ1 : θ) = B_F(θ2 : θ) }.
The Chernoff information distribution ( p θ 1 p θ 2 ) α * G is called the Chernoff point on this exponential family manifold (see Figure 6). Since the Chernoff point is unique and since in general statistical manifolds ( M , g , , * ) can be realized by statistical models [53], we deduce the following proposition of interest for information geometry [51]:
Proposition 8.
Let ( M , g , , * ) be a dually flat space with corresponding canonical divergence a Bregman divergence B F . Let γ p q e ( α ) and γ p q m ( α ) be a e-geodesic and m-geodesic passing through the points p and q of M , respectively. Let Bi m ( p , q ) and Bi e ( p , q ) be the right-sided m -flat and left-sided e -flat Bregman bisectors, respectively. Then the intersection of γ p q e ( α ) with Bi m ( p , q ) and the intersection of γ p q m ( α ) with Bi e ( p , q ) are unique. The point γ p q e ( α ) Bi m ( p , q ) is called the Chernoff point and the point γ p q m ( α ) Bi e ( p , q ) is termed the reverse or dual Chernoff point.

3.2. Case of One-Dimensional Parameters

When the exponential family has a one-dimensional natural parameter (Θ ⊂ ℝ), denoting by α1 and α2 the natural parameters of the two densities, we thus get from OC_SEF:
η_{α*} = ( F(α2) − F(α1) ) / ( α2 − α1 ).
That is, α* can be obtained as the following closed-form formula:
α* = ( (F')^{−1}( (F(α2) − F(α1)) / (α2 − α1) ) − α2 ) / ( α1 − α2 ).
Example 3.
Consider the exponential family { p v ( x ) : v > 0 } of 0-centered scale univariate normal distributions with variance v = σ 2 and density
p_v(x) = 1/√(2πv) exp( −x²/(2v) ).
The natural parameter corresponding to the sufficient statistic t(x) = x² is θ = −1/(2v). The log-normalizer is F(θ) = ½ log(−π/θ). We have η = F'(θ) = −1/(2θ) and (F')^{−1}(η) = −1/(2η). It follows that
α*(p_{v1} : p_{v2}) = ( v1 log(v1/v2) + v2 − v1 ) / ( (v2 − v1) log(v2/v1) ).
Let s = v2/v1. Then we can rewrite α* as
α*(p_{v1} : p_{v2}) = ( s − 1 − log s ) / ( (s − 1) log s ).
The Chernoff information is D_C[p_{v1}, p_{v2}] = −log ρ_{α*}[p_{v1}, p_{v2}], with
ρ_α[p_{v1}, p_{v2}] = σ1^{1−α} σ2^{α} / √( (1−α)σ1² + ασ2² ).
This result will be generalized in Proposition 11 to multivariate centered Gaussians with scaled covariance matrices.
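A minimal numerical check of these closed forms (with function names of our own choosing), cross-validated against the KLD from the Chernoff distribution to p_{v1}:

```python
import numpy as np

def chernoff_scale_normals(v1, v2):
    """Closed-form alpha* and Chernoff information between N(0, v1) and N(0, v2)."""
    s = v2 / v1
    alpha = (s - 1.0 - np.log(s)) / ((s - 1.0) * np.log(s))
    # rho_alpha[p_{v1}, p_{v2}] = sigma1^(1-alpha) sigma2^alpha / sqrt((1-alpha) v1 + alpha v2)
    rho = (np.sqrt(v1) ** (1.0 - alpha)) * (np.sqrt(v2) ** alpha) \
          / np.sqrt((1.0 - alpha) * v1 + alpha * v2)
    # cross-check: D_C = KL between the Chernoff distribution N(0, v*) and N(0, v1)
    v_star = v1 * v2 / (alpha * v2 + (1.0 - alpha) * v1)
    kl_check = 0.5 * (v_star / v1 - np.log(v_star / v1) - 1.0)
    return alpha, -np.log(rho), kl_check

print(chernoff_scale_normals(1.0, 4.0))   # alpha* ~ 0.388; the two D_C values agree (~0.117)
```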
For multi-dimensional parameters θ, we may consider the one-dimensional LREF E_{p_{θ1} p_{θ2}} induced by p_{θ1} and p_{θ2} with F_{θ1,θ2}(α) = F((1−α)θ1 + αθ2), and write F'_pq(α) as the following directional derivative:
∇_{θ2−θ1} F_{θ1,θ2}(α) := lim_{ϵ→0} ( F(θ1 + (ϵ+α)(θ2−θ1)) − F(θ1 + α(θ2−θ1)) ) / ϵ
= (θ2 − θ1)ᵀ ∇F(θ1 + α(θ2−θ1)),
using a first-order Taylor expansion. Thus, the optimality condition
OC_SEF: F'_{θ1,θ2}(α) = 0
amounts to
OC_SEF: (θ2 − θ1)ᵀ ∇F(θ1 + α*(θ2−θ1)) = F(θ2) − F(θ1).
This is equivalent to Equation (8) of [24].
Remark 5.
In general, we may consider multivariate Bregman divergences as univariate Bregman divergences: We have
B F ( θ 1 : θ 2 ) = B F θ 1 , θ 2 ( 0 : 1 ) , θ 1 , θ 2 Θ
where
F θ 1 , θ 2 ( u ) : = F ( θ 1 + u ( θ 2 θ 1 ) ) .
The functions F_{θ1,θ2} are 1D Bregman generators (i.e., strictly convex and C¹), and we have the directional derivative
∇_{θ2−θ1} F_{θ1,θ2}(u) = lim_{ϵ→0} ( F(θ1 + (ϵ+u)(θ2−θ1)) − F(θ1 + u(θ2−θ1)) ) / ϵ = (θ2 − θ1)ᵀ ∇F(θ1 + u(θ2−θ1)).
Since F_{θ1,θ2}(0) = F(θ1), F_{θ1,θ2}(1) = F(θ2), and F'_{θ1,θ2}(u) = ∇_{θ2−θ1} F_{θ1,θ2}(u), it follows that
B_{F_{θ1,θ2}}(0:1) = F_{θ1,θ2}(0) − F_{θ1,θ2}(1) − (0 − 1) ∇_{θ2−θ1} F_{θ1,θ2}(1) = F(θ1) − F(θ2) + (θ2 − θ1)ᵀ ∇F(θ2) = B_F(θ1 : θ2).
Similarly, we can reparameterize Bregman divergences on a k-dimensional simplex by k-dimensional Bregman divergences.
Remark 6.
Closing the loop: although the Chernoff information was obtained here from one-dimensional likelihood ratio exponential families, it yields as a corollary the formula for general multi-parametric exponential families, which include as special instances the one-dimensional exponential families (e.g., LREFs!).

4. Forward and Reverse Chernoff–Bregman Divergences

In this section, we shall define Chernoff-type symmetrizations of Bregman divergences inspired by the study of Chernoff information, and briefly mention applications of these Chernoff–Bregman divergences in information theory.

4.1. Chernoff–Bregman Divergence

Let us define a Chernoff-like symmetrization of Bregman divergences [43] different from the traditional Jeffreys–Bregman symmetrization:
B_F^J(θ1:θ2) = B_F(θ1:θ2) + B_F(θ2:θ1) = (θ1 − θ2)ᵀ ( ∇F(θ1) − ∇F(θ2) ),
or Jensen–Shannon-type symmetrization [44,54] which yields a Jensen divergence [42]:
B_F^JS(θ1:θ2) = ½ ( B_F(θ1 : (θ1+θ2)/2) + B_F(θ2 : (θ1+θ2)/2) ) = ( F(θ1) + F(θ2) )/2 − F( (θ1+θ2)/2 ) =: J_F(θ1, θ2).
Definition 2
(Chernoff–Bregman divergence). Let the Chernoff symmetrization of Bregman divergence B F ( θ 1 ; θ 2 ) be the forward Chernoff–Bregman divergence C F ( θ 1 , θ 2 ) defined by
C F ( θ 1 , θ 2 ) = max α ( 0 , 1 ) J F , α ( θ 1 : θ 2 ) ,
where J F , α is the α-skewed Jensen divergence.
The optimization problem in Equation (31) may be equivalently rewritten [43] as minimizing R over (θ, R) such that both B_F(θ1:θ) ≤ R and B_F(θ2:θ) ≤ R. Thus, the optimal value of α defines the circumcenter θ* = αθ1 + (1−α)θ2 of the minimum enclosing right-sided Bregman sphere [55,56], and the Chernoff–Bregman divergence:
C_F(θ1,θ2) = min_θ max{ B_F(θ1:θ), B_F(θ2:θ) },
corresponds to the radius of a minimum enclosing Bregman ball. To summarize, this Chernoff symmetrization is a min–max symmetrization, and we have the following identities:
C_F(θ1,θ2) = min_θ max{ B_F(θ1:θ), B_F(θ2:θ) } = min_{θ∈Θ} max_{α∈[0,1]} { α B_F(θ1:θ) + (1−α) B_F(θ2:θ) } = max_{α∈(0,1)} { α B_F(θ1 : αθ1+(1−α)θ2) + (1−α) B_F(θ2 : αθ1+(1−α)θ2) } = max_{α∈(0,1)} J_{F,α}(θ1:θ2).
The second identity shows that the Chernoff symmetrization can be interpreted as a variational Jensen–Shannon-type divergence [54].
Notice that in general C_F(θ1,θ2) ≠ C_{F*}(η1,η2) because the primal and dual geodesics do not coincide. Those geodesics coincide only for symmetric Bregman divergences, which are squared Mahalanobis divergences [52].
When F ( θ ) = F Shannon ( θ ) = i = 1 D θ i log θ i (discrete Shannon negentropy), the Chernoff–Bregman divergence is related to the capacity of a discrete memoryless channel in information theory [13,43].
Conditions for which C_F(θ1,θ2)^a (with a > 0) becomes a metric have been studied in [43]: For example, C_{F_Shannon}^{1/e} is a metric distance [43] (i.e., a = 1/e ≈ 0.36787944117). It is also known that the square root of the Chernoff distance between two univariate normal distributions is a metric distance [57].
We can thus use the Bregman generalization of the Badoiu–Clarkson (BC) algorithm [55] (Algorithm 2) to compute an approximation of the smallest enclosing Bregman ball which in turn yields an approximation of the Chernoff–Bregman divergence:    
Algorithm 2 Approximating the circumcenter of the Bregman smallest enclosing ball of two parameters θ 1 and θ 2 .
    Notice that when there are only two points to compute their smallest enclosing Bregman ball, all the arcs ( θ ( i 1 ) θ f i ) G are sub-arcs of the exponential arc ( θ 1 θ 2 ) G . See [55] for convergence results of this iterative algorithm. Let us notice that Algorithm 1 approximates α * while the Bregman BC Algorithm 2 approximates in spirit D C ( θ 1 , θ 2 ) (and as a byproduct  α * ).
Remark 7.
To compute the farthest point to the current circumcenter with respect to Bregman divergence, we need to find the sign of
B_F(θ2:θ) − B_F(θ1:θ) = F(θ2) − F(θ1) − (θ2 − θ1)ᵀ ∇F(θ).
Thus we need to pre-calculate only once F ( θ 1 ) and F ( θ 2 ) which can be costly (e.g., log det Σ functions need to be calculated only once when approximating the Chernoff information between Gaussians).
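Since the pseudocode of Algorithm 2 is given as a figure, here is a hedged sketch of one Bregman Badoiu–Clarkson iteration for two parameters, under the assumption (one standard reading of [55]) that the center walks a 1/(i+1) fraction toward the current farthest point along the θ-segment; the farthest point is determined with the sign trick of Remark 7, and F, ∇F are user-supplied:

```python
import numpy as np

def bregman(F, gradF, t1, t2):
    """B_F(t1 : t2) = F(t1) - F(t2) - <t1 - t2, gradF(t2)>."""
    return F(t1) - F(t2) - np.dot(t1 - t2, gradF(t2))

def bregman_bc_two_points(F, gradF, theta1, theta2, iters=2000):
    """Approximate circumcenter and radius of the smallest enclosing right-sided Bregman ball."""
    c = np.array(theta1, dtype=float)
    for i in range(1, iters + 1):
        # Remark 7: sign of B_F(theta2:c) - B_F(theta1:c) = F(theta2) - F(theta1) - <theta2 - theta1, gradF(c)>
        gap = F(theta2) - F(theta1) - np.dot(theta2 - theta1, gradF(c))
        far = theta2 if gap > 0 else theta1
        c = c + (far - c) / (i + 1.0)      # walk along the theta-segment (exponential arc)
    return c, max(bregman(F, gradF, theta1, c), bregman(F, gradF, theta2, c))

# Sanity check with F(t) = 0.5 ||t||^2 (B_F is then half the squared Euclidean distance):
# the center tends to the midpoint of the two points.
F = lambda t: 0.5 * float(np.dot(t, t))
gradF = lambda t: t
print(bregman_bc_two_points(F, gradF, np.array([0.0, 0.0]), np.array([2.0, 0.0])))
```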

4.2. Reverse Chernoff–Bregman Divergence and Universal Coding

Similarly, we may define the reverse Chernoff–Bregman divergence by considering the minimum enclosing left-sided Bregman ball:
C_F^R(θ1,θ2) = min_θ max{ B_F(θ:θ1), B_F(θ:θ2) }.
Thus the reverse Bregman Chernoff divergence D C R [ θ 1 , θ 2 ] = R * is the radius of a minimum enclosing left-sided Bregman ball.
This reverse Chernoff–Bregman divergence finds application in universal coding in information theory (Chapter 13 of [13], pp. 428–433): Let 𝒳 = {A_1, …, A_d} be a finite discrete alphabet of d letters, and let X be a random variable with probability mass function p on 𝒳. Let p_λ(x) denote the categorical distribution of X so that Pr(X = A_i) = p_λ(A_i), with λ = (λ_1, …, λ_d) ∈ ℝ_{++}^d and Σ_{i=1}^d λ_i = 1. The Huffman codeword for x ∈ 𝒳 has length l(x) = −log p(x) (ignoring the integer ceiling rounding), and the expected codeword length of X is thus given by Shannon's entropy H(X) = −Σ_x p(x) log p(x).
If we code according to a distribution p_{λ'} instead of the true distribution p_λ, the code is not optimal, and the redundancy R(p_λ, p_{λ'}) is defined as the difference between the expected codeword lengths under p_{λ'} and under p_λ:
R(p_λ, p_{λ'}) = ( −E_{p_λ}[log p_{λ'}(x)] ) − ( −E_{p_λ}[log p_λ(x)] ) = D_KL[p_λ : p_{λ'}] ≥ 0,
where D_KL is the Kullback–Leibler divergence.
Now, suppose that the true distribution p_λ is one of two prescribed distributions, but we do not know which one: p_λ ∈ P = {p_{λ_1}, p_{λ_2}}. Then we seek the minimax redundancy:
R* = min_{p_{λ'}} max_{i∈{1,2}} D_KL[p_{λ_i} : p_{λ'}].
The distribution p λ * achieving the minimax redundancy is the circumcenter of the right-centered KL ball enclosing the distributions P .
Using the natural coordinates θ = (θ_1, …, θ_D) ∈ ℝ^D with θ_i = log(λ_i/λ_d) of the categorical distributions (an exponential family of order D = d − 1), we end up with calculating the smallest enclosing left-sided Bregman ball for the Bregman generator [58] F_Categorical(θ) = log(1 + Σ_{i=1}^D exp θ_i):
R* = min_{θ∈ℝ^D} max_{i∈{1,2}} B_{F_Categorical}(θ : θ_i).
This latter minimax problem is unconstrained since θ R D = R d 1 .
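As an illustration (a sketch with our own function names, not code from the paper), the minimax redundancy for two known categorical distributions can be computed by a simple bisection: the optimal coding distribution is the KL circumcenter, which equalizes the two redundancies and lies on the mixture segment between the two probability vectors (i.e., on the m-geodesic in the mean parameters):

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def minimax_redundancy(p1, p2, eps=1e-10):
    """R* = min_q max_i KL[p_i : q]; the optimal q is the equalizing mixture of p1 and p2."""
    lo, hi = 0.0, 1.0                    # weight of p1 in the mixture q_w = w p1 + (1-w) p2
    while hi - lo > eps:
        w = 0.5 * (lo + hi)
        qw = w * p1 + (1.0 - w) * p2
        # KL[p1:q_w] decreases in w while KL[p2:q_w] increases: bisect on their difference.
        if kl(p1, qw) > kl(p2, qw):
            lo = w
        else:
            hi = w
    w = 0.5 * (lo + hi)
    qw = w * p1 + (1.0 - w) * p2
    return max(kl(p1, qw), kl(p2, qw)), qw

p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.1, 0.3, 0.6])
print(minimax_redundancy(p1, p2))   # minimax redundancy R* and the optimal coding distribution
```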

5. Chernoff Information between Gaussian Distributions

5.1. Invariance of Chernoff Information under the Action of the Affine Group

The d-variate Gaussian density p λ ( x ) with parameter λ = ( λ v = μ , λ M = Σ ) where μ R d denotes the mean ( μ = E p λ [ x ] ) and Σ is a positive-definite covariance matrix ( Σ = Cov p λ [ X ] for X p λ ) is given by
p_λ(x; λ) = 1/( (2π)^{d/2} √|λ_M| ) exp( −½ (x − λ_v)ᵀ λ_M^{−1} (x − λ_v) ),
where |·| denotes the matrix determinant. The set of d-variate Gaussian distributions forms a regular (and hence steep) exponential family with natural parameters θ(λ) = ( λ_M^{−1} λ_v, −½ λ_M^{−1} ) and sufficient statistics t(x) = (x, x xᵀ).
The Bhattacharyya distance between two multivariate Gaussian distributions p_{μ1,Σ1} and p_{μ2,Σ2} is
D_{B,α}[p_{μ1,Σ1}, p_{μ2,Σ2}] = ½ ( α μ1ᵀ Σ1^{−1} μ1 + (1−α) μ2ᵀ Σ2^{−1} μ2 − μ_αᵀ Σ_α^{−1} μ_α + log( |Σ1|^α |Σ2|^{1−α} / |Σ_α| ) ),
where
Σ_α = ( α Σ1^{−1} + (1−α) Σ2^{−1} )^{−1}, μ_α = Σ_α ( α Σ1^{−1} μ1 + (1−α) Σ2^{−1} μ2 ).
The Gaussian density can be rewritten as a multivariate location-scale family:
p_λ(x; λ) = |λ_M|^{−1/2} p_std( λ_M^{−1/2} (x − λ_v) ),
where
p_std(x) = 1/(2π)^{d/2} exp( −½ xᵀ x ) = p_{(0,I)}
denotes the standard multivariate Gaussian density. The matrix λ_M^{1/2} is the unique symmetric square-root matrix which is positive-definite when λ_M is positive-definite.
Remark 8.
Notice that the product of two symmetric positive-definite matrices P_1 and P_2 may not be symmetric, but P_1^{1/2} P_2 P_1^{1/2} is always symmetric positive-definite, and the eigenvalues of P_1^{1/2} P_2 P_1^{1/2} coincide with the eigenvalues of P_1 P_2. Hence, we have λ_sp(P_1^{−1/2} P_2 P_1^{−1/2}) = λ_sp(P_1^{−1} P_2), where λ_sp(M) denotes the eigenspectrum of matrix M.
We may interpret the Gaussian family as obtained by the action of the affine group Aff ( R d ) = R d GL d ( R ) on the standard density p std : Let the dot symbol “.” denotes the group action. The affine group is equipped with the following (outer) semidirect product:
( l 1 , A 1 ) . ( l 2 , A 2 ) = ( l 1 + A 1 l 2 , A 1 A 2 ) ,
and this group can be handled as a matrix group with the following mapping of its elements to matrices:
(l, A) ↦ [ A  l ; 0  1 ].
Then we have
p ( μ , Σ ) ( x ) = ( μ , Σ 1 2 ) . p std ( x ) = ( μ , Σ 1 2 ) . p 0 , I ( x ) = p ( μ , Σ 1 2 ) . ( 0 , I ) ( x ) .
We can show the following invariance of the skewed Bhattacharyya divergences:
Proposition 9
(Invariance of the Bhattacharyya divergence and f-divergences under the action of the affine group (Equation (35))). We have
D_{B,α}[ (μ,Σ^{1/2}).p_{μ1,Σ1} : (μ,Σ^{1/2}).p_{μ2,Σ2} ] := D_{B,α}[ p_{(μ,Σ^{1/2}).(μ1,Σ1)} : p_{(μ,Σ^{1/2}).(μ2,Σ2)} ] = D_{B,α}[ p_{Σ^{−1/2}(μ1−μ), Σ^{−1/2}Σ1Σ^{−1/2}} : p_{Σ^{−1/2}(μ2−μ), Σ^{−1/2}Σ2Σ^{−1/2}} ] = D_{B,α}[ p_{μ1,Σ1} : p_{μ2,Σ2} ].
Proof. 
The proof follows from the (f,g)-form of Ali and Silvey's divergences [59]. We can express D_{B,α}[p:q] = g(I_{h_α}[p:q]) where h_α(u) = −u^α (convex for α ∈ (0,1)) and g(v) = −log(−v). Then we rely on the proof of invariance of f-divergences under the action of the affine group (see Proposition 3 of [60], relying on a change of variable in the integral):
I_f[p_{μ1,Σ1} : p_{μ2,Σ2}] = I_f[ p_{0,I} : p_{Σ1^{−1/2}(μ2−μ1), Σ1^{−1/2}Σ2Σ1^{−1/2}} ] = I_f[ p_{Σ2^{−1/2}(μ1−μ2), Σ2^{−1/2}Σ1Σ2^{−1/2}} : p_{0,I} ],
where I denotes the identity matrix.    □
Thus, by choosing ( μ , Σ ) = ( μ 1 , Σ 1 ) and ( μ , Σ ) = ( μ 2 , Σ 2 ) , we obtain the following corollary:
Corollary 2
(Bhattacharyya divergence from canonical Bhattacharyya divergences). We have
D_{B,α}[p_{μ1,Σ1} : p_{μ2,Σ2}] = D_{B,α}[ p_{0,I} : p_{Σ1^{−1/2}(μ2−μ1), Σ1^{−1/2}Σ2Σ1^{−1/2}} ] = D_{B,α}[ p_{Σ2^{−1/2}(μ1−μ2), Σ2^{−1/2}Σ1Σ2^{−1/2}} : p_{0,I} ].
It follows that the Chernoff optimal skewing parameter enjoys the same invariance property:
α*(p_{μ1,Σ1} : p_{μ2,Σ2}) = α*( p_{0,I} : p_{Σ1^{−1/2}(μ2−μ1), Σ1^{−1/2}Σ2Σ1^{−1/2}} ) = α*( p_{Σ2^{−1/2}(μ1−μ2), Σ2^{−1/2}Σ1Σ2^{−1/2}} : p_{0,I} ).
As a byproduct, we get the invariance of the Chernoff information under the action of the affine group:
Corollary 3
(Invariance of the Chernoff information under the action of the affine group). We have:
D_C[p_{μ1,Σ1}, p_{μ2,Σ2}] = D_C[ p_{0,I}, p_{Σ1^{−1/2}(μ2−μ1), Σ1^{−1/2}Σ2Σ1^{−1/2}} ] = D_C[ p_{Σ2^{−1/2}(μ1−μ2), Σ2^{−1/2}Σ1Σ2^{−1/2}}, p_{0,I} ].
Thus, the formula for the Chernoff information between two Gaussians
D C ( μ 1 , Σ 1 , μ 2 , Σ 2 ) : = D C [ p μ 1 , Σ 1 , p μ 2 , Σ 2 ] = D C ( μ 12 , Σ 12 )
can be written as a function of the two terms μ12 = Σ1^{−1/2}(μ2 − μ1) and Σ12 = Σ1^{−1/2} Σ2 Σ1^{−1/2}.

5.2. Closed-Form Formula for the Chernoff Information between Univariate Gaussian Distributions

We shall report the exact solution for the Chernoff information between univariate Gaussian distributions by solving a quadratic equation. We can also report a complex closed-form formula by using symbolic computing because the calculations are lengthy and thus prone to human error.
Instantiating Equation (24) for the case of univariate Gaussian distributions parameterized by (μ, σ²), we get the following equation for the optimality condition of α*:
⟨θ2 − θ1, η_{α*}⟩ = F(θ2) − F(θ1),
⟨( μ2/σ2² − μ1/σ1², 1/(2σ1²) − 1/(2σ2²) ), ( m_{α*}, v_{α*} + m_{α*}² )⟩ = ½ log(σ2²/σ1²) + μ2²/(2σ2²) − μ1²/(2σ1²),
where ⟨·,·⟩ denotes the scalar product, (m_α, v_α + m_α²) is the moment parameter of the geometric mixture, and the interpolated mean and variance along the exponential arc {(m_α, v_α)}_{α∈(0,1)} passing through (μ1, σ1²) when α = 1 and (μ2, σ2²) when α = 0 are given by
m_α = ( α μ1 σ2² + (1−α) μ2 σ1² ) / ( (1−α) σ1² + α σ2² ) = ( α (μ1 σ2² − μ2 σ1²) + μ2 σ1² ) / ( σ1² + α (σ2² − σ1²) ),
v_α = σ1² σ2² / ( (1−α) σ1² + α σ2² ) = σ1² σ2² / ( σ1² + α (σ2² − σ1²) ).
That is, for p = p μ 1 , σ 1 2 and q = p μ 2 , σ 2 2 , we have the weighted geometric mixture ( p q ) α G = p m α , v α .
Thus, the optimality condition for the Chernoff optimal skewing parameter is given by:
OC_Gaussian: ( μ2/σ2² − μ1/σ1² ) m_α + ( 1/(2σ1²) − 1/(2σ2²) ) ( v_α + m_α² ) = ½ log(σ2²/σ1²) + μ2²/(2σ2²) − μ1²/(2σ1²).
Let us rewrite compactly Equation (40) as
OC_Gaussian: a_{12} m_α + b_{12} ( v_α + m_α² ) + c_{12} = 0,
with the following coefficients:
a_{12} = μ2/σ2² − μ1/σ1²,
b_{12} = 1/(2σ1²) − 1/(2σ2²),
c_{12} = ½ log(σ1²/σ2²) + μ1²/(2σ1²) − μ2²/(2σ2²).
By multiplying both sides of Equation (41) by (σ1² + α Δv)², where Δv := σ2² − σ1², and rearranging terms, we get a quadratic equation in α whose unique root in (0,1) is α*.
Using the computer algebra system (CAS) Maxima, we can also solve exactly this quadratic equation in α as a function of μ 1 , σ 1 2 , μ 2 , and σ 2 2 : See listing in Appendix C.
Once we get the optimal value α* = α*(μ1, σ1², μ2, σ2²), we get the Chernoff information as
D_C[p_{μ1,σ1²}, p_{μ2,σ2²}] = D_KL[ p_{m_{α*}, v_{α*}} : p_{μ1,σ1²} ],
with the Kullback–Leibler divergence between two univariate Gaussian distributions p_{μ1,σ1²} and p_{μ2,σ2²} given by
D_KL[p_{μ1,σ1²} : p_{μ2,σ2²}] = ½ ( (μ2 − μ1)²/σ2² + σ1²/σ2² − log(σ1²/σ2²) − 1 ).
Notice that from the invariance of Proposition 9, we have for any (μ, σ²) ∈ ℝ × ℝ_{++}:
D_KL[p_{μ1,σ1²} : p_{μ2,σ2²}] = D_KL[ p_{(μ1−μ)/σ, σ1²/σ²} : p_{(μ2−μ)/σ, σ2²/σ²} ],
and therefore by choosing (μ, σ²) = (μ1, σ1²), we have
D_KL[p_{μ1,σ1²} : p_{μ2,σ2²}] = D_KL[ p_{0,1} : p_{(μ2−μ1)/σ1, σ2²/σ1²} ].
Proposition 10.
The Chernoff information between two univariate Gaussian distributions can be calculated exactly in closed form.
One can also program these closed-form solutions in Python using the SymPy package (https://www.sympy.org/en/index.html (accessed on 30 July 2022)) for performing symbolic computations.
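For instance, the following small SymPy sketch (with our own variable names, offered as an alternative to the Maxima listing of Appendix C) solves the quadratic optimality condition symbolically and evaluates α* and the Chernoff information on Example 4 below:

```python
import sympy as sp

mu1, mu2 = sp.symbols('mu1 mu2', real=True)
v1, v2 = sp.symbols('v1 v2', positive=True)
a = sp.symbols('alpha', real=True)

den = (1 - a) * v1 + a * v2
m = (a * mu1 * v2 + (1 - a) * mu2 * v1) / den    # mean of the geometric mixture
v = v1 * v2 / den                                # variance of the geometric mixture

a12 = mu2 / v2 - mu1 / v1
b12 = sp.Rational(1, 2) / v1 - sp.Rational(1, 2) / v2
c12 = sp.log(v1 / v2) / 2 + mu1**2 / (2 * v1) - mu2**2 / (2 * v2)

# OC_Gaussian: a12*m + b12*(v + m^2) + c12 = 0 becomes quadratic after clearing denominators.
num, _ = sp.fraction(sp.together(a12 * m + b12 * (v + m**2) + c12))
sols = sp.solve(sp.Eq(sp.expand(num), 0), a)

# Example 4: N(0,1) versus N(1,2); keep the root lying in (0,1).
vals = {mu1: 0, v1: 1, mu2: 1, v2: 2}
root = [s.subs(vals) for s in sols if 0 < float(s.subs(vals)) < 1][0]
m_o, v_o = m.subs(vals).subs(a, root), v.subs(vals).subs(a, root)
DC = sp.Rational(1, 2) * (m_o**2 + v_o - sp.log(v_o) - 1)   # KLD to N(0,1)
print(float(root), float(DC))   # ~0.421558..., ~0.115543...
```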
Let us report special cases with some illustrating examples.
  • First, let us consider the Gaussian subfamily with prescribed variance. When σ1² = σ2² = σ², we always have α* = ½, and the Chernoff information is
    D_C[p_{μ1,σ²} : p_{μ2,σ²}] = (μ2 − μ1)² / (8σ²).
    Notice that it amounts to one eighth of the squared Mahalanobis distance (see [60] for a detailed explanation).
  • Second, let us consider the Gaussian subfamily with prescribed mean. When μ1 = μ2 = μ, we get the optimal skewing value independent of the mean μ:
    α* = ( v2 − v1 − v1 log(v2/v1) ) / ( (v2 − v1) log(v2/v1) ),
    where v1 = σ1² and v2 = σ2² (the same expression as in Example 3, by translation invariance). The Chernoff information is
    D_C[p_{μ,v1} : p_{μ,v2}] = ½ ( log( (v2 − v1) / (v1 log(v2/v1)) ) − (v2 − v1 − v1 log(v2/v1)) / (v2 − v1) ).
  • Third, consider the Chernoff information between the standard normal distribution and another normal distribution. When (μ1, σ1²) = (0, 1) and (μ2, σ2²) = (μ, v), the optimality condition reduces, with w := v − 1, to the quadratic equation
    ( μ²w + μ²w² + v w² log v ) α² + ( 2μ² + 2μ²w + 2 v w log v − v w² ) α − ( μ² + μ²w + v w − v log v ) = 0,
    whose unique root in (0,1) is α*; solving it by the quadratic formula with a computer algebra system yields an explicit but lengthy closed-form expression of α* in μ and v.
Example 4.
Let us consider N ( μ 1 = 0 , σ 1 2 = 1 ) and N ( μ 2 = 1 , σ 2 2 = 2 ) . The Chernoff exponent is
α* = ( √(8 log 4 − 8 log 2 + 9) − 2 log 4 + 2 log 2 − 1 ) / ( 2 log 4 − 2 log 2 + 2 ) ≈ 0.4215580558605244,
and the Chernoff information admits a lengthy closed-form expression in log 2 and log 4 (obtained by symbolic computing) which evaluates numerically to
D_C[p_{0,1}, p_{1,2}] ≈ 0.1155433222682347.
Using the bisection search of [24] with ϵ = 10 8 takes 28 iterations, and we get
α * 0.42155805602669716 ,
and the Chernoff information is approximately 0.11554332226823472. Now, if we swap p_{μ1,σ1²} ↔ p_{μ2,σ2²}, we find α* ≈ 0.5784419439733028 (and 0.5784419439733028 + 0.42155805602669716 ≃ 1).
Notice that in general, we may evaluate how good an approximation α̃ of α* is by evaluating the deficiency of the optimality condition:
(θ2 − θ1)ᵀ η_{α̃} − F(θ2) + F(θ1).
Example 5.
Let us consider μ 1 = 1 , σ 1 2 = 3 and μ 2 = 5 and σ 2 2 = 5 . We get
α* = ( √(120 log 10 − 120 log 6 + 961) − 3 log 10 + 3 log 6 − 23 ) / ( 2 log 10 − 2 log 6 + 16 ) ≈ 0.4371453168322306
and the Chernoff information is reported in closed form and evaluated numerically as
0.5242883659200144 .
In comparison, the bisection algorithm of [24] with ϵ = 10 8 takes 28 iterations, and reports α * 0.43714531883597374 and the Chernoff information about
0.5242883659200137 .
Corollary 4.
The smallest enclosing left-sided Kullback–Leibler disk of n univariate Gaussian distributions can be calculated exactly in randomized linear time [56].

5.3. Fast Approximation of the Chernoff Information of Multivariate Gaussian Distributions

In general, the Chernoff information between d-variate Gaussians distributions is not known in closed-form formula when d > 1 , see for example [61,62,63]. We shall consider below some special cases:
  • When the Gaussians have the same covariance matrix Σ, the Chernoff information optimal skewing parameter is α* = ½ and the Chernoff information is
    D_C[p_{μ1,Σ}, p_{μ2,Σ}] = ⅛ Δ_Σ²(μ1, μ2),
    where Δ_Σ²(μ1, μ2) = (μ2 − μ1)ᵀ Σ^{−1} (μ2 − μ1) is the squared Mahalanobis distance. The Mahalanobis distance enjoys the following invariance under congruence transformations:
    Δ_Σ(μ1, μ2) = Δ_{AΣAᵀ}(Aμ1, Aμ2), ∀ A ∈ GL(d).
    Notice that we can rewrite the squared Mahalanobis distance as
    Δ_Σ²(μ1, μ2) = tr( Σ^{−1} (μ2 − μ1)(μ2 − μ1)ᵀ )
    using the cyclic property of the matrix trace. Then we check that
    Δ_{AΣAᵀ}²(Aμ1, Aμ2) = tr( A^{−ᵀ} Σ^{−1} A^{−1} A (μ2 − μ1)(μ2 − μ1)ᵀ Aᵀ ) = tr( Σ^{−1} (μ2 − μ1)(μ2 − μ1)ᵀ ) = Δ_Σ²(μ1, μ2).
  • The Chernoff information for the special case of centered multivariate Gaussian distributions was studied in [62]. The KLD between two centered Gaussians p_{μ,Σ1} and p_{μ,Σ2} is half of the matrix Burg distance:
    D_KL[p_{μ,Σ1} : p_{μ,Σ2}] = ½ ( log( det Σ2 / det Σ1 ) + tr( Σ2^{−1} Σ1 ) − d ) =: ½ D_Burg[Σ1 : Σ2].
    When d = 1, the Burg distance corresponds to the well-known Itakura–Saito divergence. The matrix Burg distance is a matrix spectral distance [62]:
    D_Burg[Σ1 : Σ2] = Σ_{i=1}^d ( λ_i − log λ_i − 1 ),
    where the λ_i's are the eigenvalues of Σ2^{−1} Σ1. The reverse KLD D_KL[p_{μ,Σ2} : p_{μ,Σ1}] = ½ D_Burg[Σ2 : Σ1] is obtained by replacing λ_i ← 1/λ_i:
    D_KL[p_{μ,Σ2} : p_{μ,Σ1}] = ½ Σ_{i=1}^d ( 1/λ_i + log λ_i − 1 ).
    More generally, the f-divergences between centered Gaussian distributions are always matrix spectral divergences [60] (see the numerical check after this list).
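The spectral formula of the second item can be checked numerically as follows (a small sketch with our own function names, assuming symmetric positive-definite covariance matrices given as NumPy arrays):

```python
import numpy as np

def kl_centered_gaussians(S1, S2):
    """D_KL[p_{0,S1} : p_{0,S2}] = (1/2) D_Burg[S1 : S2] from the spectrum of S2^{-1} S1."""
    lam = np.real(np.linalg.eigvals(np.linalg.solve(S2, S1)))
    return 0.5 * float(np.sum(lam - np.log(lam) - 1.0))

S1 = np.array([[2.0, 0.3], [0.3, 1.0]])
S2 = np.array([[1.0, 0.0], [0.0, 3.0]])
d = S1.shape[0]
# cross-check against the trace/log-det expression of the KLD
direct = 0.5 * (np.trace(np.linalg.solve(S2, S1))
                + np.log(np.linalg.det(S2) / np.linalg.det(S1)) - d)
print(kl_centered_gaussians(S1, S2), direct)   # the two values coincide
```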
Otherwise, for the general multivariate case, we implement the dichotomic search of Algorithm 1 in Algorithm 3 with the KLD between two multivariate Gaussian distributions expressed as
D_KL[p_{μ1,Σ1} : p_{μ2,Σ2}] = ½ Δ_{Σ2}²(μ1, μ2) + ½ D_Burg[Σ1 : Σ2]
= ½ ( tr( Σ2^{−1} Σ1 ) + log( det Σ2 / det Σ1 ) − d + (μ2 − μ1)ᵀ Σ2^{−1} (μ2 − μ1) ).
Algorithm 3 Dichotomic search for approximating the Chernoff information between two multivariate normal distributions p μ 1 , Σ 1 and p μ 2 , Σ 2 by approximating the optimal skewing parameter value α α * .
Entropy 24 01400 i003
Example 6.
Let d = 2, p_{μ1,Σ1} = p_{0,I} be the standard bivariate Gaussian distribution and p_{μ2,Σ2} be the bivariate Gaussian distribution with mean μ2 = [1 2]^⊤ and covariance matrix Σ2 = [[1, 1], [1, 2]]. Setting the numerical precision threshold to ϵ = 10^{−8}, the dichotomic search performs 28 split iterations and approximates α* by
α * 0.5825489424169064 .
The Chernoff information D C [ p 0 , I , p μ 2 , Σ 2 ] is approximated by 0.8827640697808525 .
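To make Algorithm 3 concrete, the following Maxima snippet is a minimal sketch of such a dichotomic search written for this text (it is not the article's reference implementation, whose code is available online; the helper names kldmvn, egeo and chernoffmvn are ad hoc). It assumes the convention where the skewing parameter α weights the first normal distribution along the exponential arc, so the reported α may correspond to 1 − α* depending on the convention used in Algorithm 3, while the Chernoff information value itself is unaffected.
/* Illustrative dichotomic search for the Chernoff information between two multivariate normals */
mattrace(M):=block([s:0], for i:1 thru length(M) do s:s+M[i,i], s)$
/* KLD between N(mu1,S1) and N(mu2,S2) */
kldmvn(mu1,S1,mu2,S2):=block([P2:invert(S2), dm:mu2-mu1, q],
  q: transpose(dm) . P2 . dm,
  (1/2)*(mattrace(P2 . S1) - log(determinant(S1)/determinant(S2)) - length(S1) + q[1,1]))$
/* e-geodesic point (pq)_alpha^G: linear interpolation of the natural parameters,
   with weight alpha on the first normal */
egeo(alpha,mu1,S1,mu2,S2):=block([P1:invert(S1), P2:invert(S2), Sa],
  Sa: invert(alpha*P1 + (1-alpha)*P2),
  [Sa . (alpha*(P1 . mu1) + (1-alpha)*(P2 . mu2)), Sa])$
chernoffmvn(mu1,S1,mu2,S2,eps):=block([lo:0.0, hi:1.0, a, g, gap],
  while hi-lo > eps do (
    a:(lo+hi)/2, g:egeo(a,mu1,S1,mu2,S2),
    gap: kldmvn(g[1],g[2],mu1,S1) - kldmvn(g[1],g[2],mu2,S2),
    /* if the mixture is KL-closer to the second normal, increase alpha (more weight on the first) */
    if gap > 0 then lo:a else hi:a),
  a:(lo+hi)/2, g:egeo(a,mu1,S1,mu2,S2),
  [float(a), float(kldmvn(g[1],g[2],mu1,S1))])$
/* Example 6: */
chernoffmvn(matrix([0],[0]), ident(2), matrix([1],[2]), matrix([1,1],[1,2]), 1e-8);
/* Chernoff information approximately 0.8828 (cf. Example 6); the returned alpha is about
   0.58 or about 0.42 depending on the skewing convention */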
The m-interpolation of multivariate Gaussian distributions p μ 1 , Σ 1 and p μ 2 , Σ 2 with respect to the mixture connection m is given by
γ p μ 1 , Σ 1 , p μ 2 , Σ 2 m ( α ) = p μ α m , Σ α m ,
where
μ α m = ( 1 α ) μ 1 + α μ 2 = : μ ¯ α , Σ α m = ( 1 α ) Σ 1 + α Σ 2 + ( 1 α ) μ 1 μ 1 + α μ 2 μ 2 μ ¯ α μ ¯ α .
The e-interpolation of multivariate Gaussian distributions p μ 1 , Σ 1 and p μ 2 , Σ 2 with respect to the exponential connection e is given by
γ p μ 1 , Σ 1 , p μ 2 , Σ 2 e ( α ) = p μ α e , Σ α e ,
where
μ α e = Σ α e ( 1 α ) Σ 1 1 μ 1 + α Σ 2 1 μ 2 , Σ α e = ( 1 α ) Σ 1 1 + α Σ 2 1 1 .
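One can check these formulas as follows (a short derivation sketch added here for clarity): the mixture connection interpolates linearly the moment parameters η = (E[x], E[x x^⊤]) = (μ, Σ + μμ^⊤), so that η_α = (1 − α) η1 + α η2 yields μ_α^m = μ̄_α and a second moment (1 − α)(Σ1 + μ1 μ1^⊤) + α (Σ2 + μ2 μ2^⊤), from which Σ_α^m is recovered by subtracting μ̄_α μ̄_α^⊤. Dually, the exponential connection interpolates linearly the natural parameters θ = (Σ^{-1} μ, −(1/2) Σ^{-1}), so that the precision matrices average as (Σ_α^e)^{-1} = (1 − α) Σ1^{-1} + α Σ2^{-1}, and the first component gives (Σ_α^e)^{-1} μ_α^e = (1 − α) Σ1^{-1} μ1 + α Σ2^{-1} μ2.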
In information geometry, both these e- and m-connections defined with respect to an exponential family are shown to be flat. These geodesics correspond to linear interpolations in the e -affine coordinate system θ and in the dual m coordinate system η , respectively.
Figure 7 displays the e-geodesic and the m-geodesic between two multivariate normal distributions. Notice that the Riemannian geodesic induced by the Levi–Civita metric connection (the average of the e- and m-connections) is not known in closed form for boundary value conditions. The expression of the Riemannian geodesic is known only for initial value conditions [64] (i.e., a starting point with a given vector direction).

5.4. Chernoff Information between Centered Multivariate Normal Distributions

The set
N_0 = { p_Σ(x) = (1/√(det(2πΣ))) exp( −(1/2) x^⊤ Σ^{-1} x ) : Σ ≻ 0 }
of centered multivariate normal distributions is a regular exponential family with natural parameter θ = Σ^{-1}, sufficient statistic t(x) = −(1/2) x x^⊤, log-normalizer F(θ) = −(1/2) log det θ, and auxiliary carrier term k(x) = −(d/2) log(2π). Family N_0 is also a multivariate scale family with scale matrices Σ^{1/2} (standard deviation σ in 1D).
Let ⟨A, B⟩ = tr(AB) denote the inner product between two symmetric matrices A and B. Then we can write the centered Gaussian density p_Σ(x) in the canonical form of exponential families:
p θ ( x ) = exp θ , t ( x ) F ( θ ) + k ( x ) .
The function log det of a positive-definite matrix is strictly concave [65], and hence we check that F(θ) is strictly convex. Furthermore, we have ∇_X log det X = X^{-1}, so that ∇_θ F(θ) = −(1/2) θ^{-1}.
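As a quick symbolic sanity check of this gradient identity (a small snippet written for this text, not from the original appendix), one can differentiate log det of a generic 2×2 symmetric positive-definite matrix in Maxima; the diagonal partial derivatives match the entries of the inverse, while the off-diagonal one picks up the usual factor 2 coming from the symmetric parameterization:
/* Check d/dX log det X = X^(-1) on a symmetric 2x2 matrix (illustrative snippet) */
X: matrix([a,b],[b,c])$
Xinv: invert(X)$
ratsimp(diff(log(determinant(X)), a) - Xinv[1,1]);   /* 0 */
ratsimp(diff(log(determinant(X)), c) - Xinv[2,2]);   /* 0 */
ratsimp(diff(log(determinant(X)), b) - 2*Xinv[1,2]); /* 0: factor 2 for the off-diagonal entry */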
The optimality condition equation of Chernoff best skewing parameter α * becomes:
θ 2 θ 1 , F ( θ 1 + α * ( θ 2 θ 1 ) ) = F ( θ 2 ) F ( θ 1 ) ,
1 2 tr ( ( θ 2 θ 1 ) ( θ 1 + α * ( θ 2 θ 1 ) ) 1 ) = 1 2 log det θ 2 det θ 1 ,
tr ( ( θ 2 θ 1 ) ( θ 1 + α * ( θ 2 θ 1 ) ) 1 ) = log det θ 2 det θ 1 ,
tr ( ( Σ 2 1 Σ 1 1 ) ( Σ 1 1 + α * ( Σ 2 1 Σ 1 1 ) ) 1 ) = log det Σ 1 det Σ 2 = log det Σ 1 Σ 2 1 .
When Σ 2 = s Σ 1 (and Σ 2 1 = 1 s Σ 1 1 ) for s > 0 and s 1 , we get a closed-form for α * using the fact that det I s = 1 s d and tr ( I ) = d for d-dimensional identity matrix I. Solving Equation (54) yields
α * ( s ) = s 1 log s ( s 1 ) log s ( 0 , 1 ) .
Therefore the Chernoff information between two scaled centered Gaussian distributions p μ , Σ and p μ , s Σ is available in closed form.
Proposition 11.
The Chernoff information between two scaled d-dimensional centered Gaussian distributions p_{μ,Σ} and p_{μ,sΣ} of N_μ (for s > 0, s ≠ 1) is available in closed form:
D_C[p_{μ,Σ}, p_{μ,sΣ}] = D_{B,α*}[p_{μ,Σ}, p_{μ,sΣ}] = (d/2) ( (s log s − s + 1)/(s − 1) + log( (s − 1)/(s log s) ) ),
where α* = (s − 1 − log s)/((s − 1) log s) ∈ (0, 1).
Notice that α*(p_{μ,Σ} : p_{μ,sΣ}) = 1 − α*(p_{μ,Σ} : p_{μ,(1/s)Σ}) and D_C[p_{μ,Σ}, p_{μ,sΣ}] = D_C[p_{μ,Σ}, p_{μ,(1/s)Σ}].
Example 7.
Consider μ1 = μ2 = 0 and Σ1 = I, Σ2 = (1/2) I. We find that α* = (2 log 2 − 1)/log 2, which is independent of the dimension of the matrices. The Chernoff information depends on the dimension:
D_C[p_{0,I}, p_{0,(1/2)I}] = d ( log 2 − log log 2 − 1 )/2.
Notice that when d = 1 , we have s = σ 2 2 σ 1 2 , and we recover a special case of the closed-form formula for the Chernoff information between univariate Gaussians.
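A quick numerical check of Proposition 11 and Example 7 in dimension d = 1 can be done in Maxima (an illustrative snippet added for this text, not from the article; the helper KLD is ad hoc and the skewing parameter weights the first density):
/* p = N(0,1), q = N(0,s) with s = 1/2 */
s: 1/2$
alphastar: (s-1-log(s))/((s-1)*log(s))$
/* geometric mixture (pq)_alpha^G: the precisions interpolate linearly */
va: 1/(alphastar*1 + (1-alphastar)/s)$
KLD(v1,v2):=(1/2)*(v1/v2 - log(v1/v2) - 1)$   /* KLD between N(0,v1) and N(0,v2) */
float([alphastar, KLD(va,1), KLD(va,s), (log(2)-log(log(2))-1)/2]);
/* approximately [0.5573, 0.0298, 0.0298, 0.0298]: the two KLDs coincide and match Example 7 */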
In [62], the following equation is reported for finding α * based on Equation (54):
OC_{Centered Gaussians}: ∑_{i=1}^d [ (1 − λ_i)/( α* + (1 − α*) λ_i ) + log λ_i ] = 0,
where the λ i ’s are generalized eigenvalues of Σ 1 Σ 2 1 (this excludes the case of all λ i ’s equal to one). The value of α * satisfying Equation (57) is unique. Let us notice that the product of two symmetric positive-definite matrices is not necessarily symmetric anymore. We can derive Equation (57) by expressing Equation (54) using the identity matrix I and matrix  Σ 2 1 2 Σ 1 Σ 2 1 2 .
Remark 9.
We can get closed-form solutions for α* and the corresponding Chernoff information in some particular cases. For example, when the dimension d = 2, we need to solve a quadratic equation to get α*. Thus, for d ≤ 4, we get a closed-form solution for α* by solving a polynomial equation characterizing the optimality condition, and obtain the Chernoff information in closed form as a byproduct.
Example 8.
Consider the Chernoff information between p 0 , I and p 0 , Λ with Λ = diag ( 1 , 2 , 3 , 4 ) . We get the exact Chernoff exponent value α * by taking the root of a quartic polynomial equation falling in ( 0 , 1 ) . By evaluating numerically this root, we find that α * 0.59694 and that the Chernoff information is D C [ p 0 , I , p 0 , Λ ] 0.22076 . See Appendix C for some symbolic computation code.

6. Chernoff Information between Densities of Different Exponential Families

Let
E 1 = { p θ = exp ( θ , t 1 ( x ) F 1 ( θ ) ) : θ Θ } ,
and
E 2 = { q θ = exp ( θ , t 2 ( x ) F 2 ( θ ) : θ Θ } ,
be two distinct exponential families, and consider the Chernoff information between the densities p θ 1 and q θ 2 . The exponential arc induced by p θ 1 and q θ 2 is
{ ( p θ 1 q θ 2 ) α G p θ 1 α q θ 2 1 α : α ( 0 , 1 ) } .
Let E 12 denote the exponential family with sufficient statistics ( t 1 ( x ) , t 2 ( x ) ), log-normalizer F 12 ( θ , θ′ ), and denote by Θ 12 its natural parameter space. Family E 12 is a product exponential family, and is therefore itself an exponential family. We have
( p θ 1 q θ 2 ) α G = exp ( t 1 ( x ) , t 2 ( x ) ) , ( α θ 1 , ( 1 α ) θ 2 ) F 12 ( α θ 1 , ( 1 α ) θ 2 ) .
Thus the induced LREF E p θ 1 q θ 2 with natural parameter space Θ p θ 1 q θ 2 can be interpreted as a 1D curved exponential family of the product exponential family E 12 .
The optimal skewing parameter α * is found by setting the derivative of F 12 ( α θ 1 , ( 1 − α ) θ 2 ) with respect to α to zero:
d d α F 12 ( α θ 1 , ( 1 α ) θ 2 ) = 0 .
Example 9.
Let E1 be the exponential family of exponential distributions
E1 = { e_λ(x) = λ exp(−λx) : λ ∈ (0, +∞) }
defined on the support X1 = (0, ∞), and let E2 be the exponential family of half-normal distributions
E2 = { h_σ(x) = √(2/(πσ²)) exp( −x²/(2σ²) ) : σ² > 0 }
with support X2 = (0, ∞).
The product exponential family corresponds to the singly truncated normal family [50], which is non-regular (i.e., its natural parameter space is not topologically an open set):
Θ12 = ( R × R_{++} ) ∪ Θ0,
with Θ 0 = { ( θ , 0 ) : θ < 0 } (the part corresponding to the exponential family of exponential distributions). This exponential family E 12 = { p θ 1 , θ 2 } of singly truncated normal distributions is also non-steep [50]. The log-normalizer is
F12(θ1, θ2) = (1/2) log( π/θ2 ) + log Φ( θ1/√(2θ2) ) + θ1²/(4θ2),
where θ1 = μ/σ², θ2 = 1/(2σ²), and Φ denotes the cumulative distribution function of the standard normal distribution. Function F12 is of class C¹ on Θ12 (see Proposition 3.1 of [50]), with F12(θ, 0) = −log(−θ) for θ < 0.
Notice that the KLD between an exponential distribution and a half-normal distribution is +∞ since the corresponding definite integral diverges (hence D_KL[e_λ : h_σ] is not equivalent to a Bregman divergence, and Θ_{e_{θ1} h_{θ2}} is not open at 1), but the reverse KLD between a half-normal distribution and an exponential distribution is available in closed form (obtained using symbolic computing):
D_KL[h_σ : e_λ] = ( √8 σλ − √π ( 1 + log( πλ²σ²/2 ) ) ) / ( 2√π ).
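This expression can be cross-checked against the expectation E_{h_σ}[ log( h_σ/e_λ ) ] = λσ√(2/π) − 1/2 − log λ + (1/2) log( 2/(πσ²) ) (a simplification worked out here for illustration; see also Listing A5 in Appendix C), for instance at σ = λ = 1:
/* Illustrative numerical cross-check of the closed form for KL[half-normal : exponential] */
closedform(sigma,lam):=(sqrt(8)*sigma*lam - sqrt(%pi)*(1+log(%pi*lam**2*sigma**2/2)))/(2*sqrt(%pi))$
simplified(sigma,lam):=lam*sigma*sqrt(2/%pi) - 1/2 - log(lam) + (1/2)*log(2/(%pi*sigma**2))$
float([closedform(1,1), simplified(1,1)]);  /* both approximately 0.0721 */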
Figure 8 illustrates the domain of the singly truncated normal distributions and displays an exponential arc between an exponential distribution and a half-normal distribution. Notice that we could also have considered a similar but different example by taking the exponential family of Rayleigh distributions, which exhibits an additional carrier term k(x).
The Bhattacharyya α-skewed coefficient calculated using symbolic computing (see Appendix C) is
ρ_α[h_σ : e_λ] = ρ_{1−α}[e_λ : h_σ] = (2/(πσ²))^{α/2} λ^{1−α} (σ/√α) √(π/2) exp( (1−α)² λ² σ² / (2α) ) erfc( (1−α) λ σ / √(2α) ),
where erfc = 1 − erf denotes the complementary error function (erf being the error function).

7. Conclusions

In this work, we revisited the Chernoff information [2] (1952), which was originally introduced to upper bound the Bayes error in binary hypothesis testing. A general characterization of the Chernoff information between two arbitrary probability measures was given in [11] (Theorem 32) by considering Rényi divergences, which can be interpreted as scaled skewed Bhattacharyya divergences. Since its inception, the Chernoff information has proven useful as a statistical divergence (the Chernoff divergence) in many applications ranging from information fusion to quantum metrology due to its empirical robustness property [19]. Informally, one observes empirically that in practice the skewed Bhattacharyya divergence is more stable around the Chernoff exponent α* than in other parts of the range (0, 1). By considering the maximal extension of the exponential arc joining two densities p and q of a Lebesgue space L¹(μ), we built full likelihood ratio exponential families [10] E_pq (LREFs) in Section 2. When the LREF E_pq is a regular exponential family (with coinciding supports of p and q), both the forward and reverse Kullback–Leibler divergences are finite and can be rewritten as finite Bregman divergences induced by the log-normalizer F_pq of E_pq, which amounts to minus the skewed Bhattacharyya divergences. Since log-normalizers of exponential families are strictly convex, we deduced that the skewed Bhattacharyya divergences are strictly concave in the skewing parameter, and hence that their maximization yielding the Chernoff information admits a unique maximizer. As a byproduct, this geometric characterization in L¹(μ) allowed us to prove that the intersection of an e-geodesic with an m-bisector is unique in dually flat subspaces of L¹(μ), and similarly that the intersection of an m-geodesic with an e-bisector is unique (Proposition 8). We then considered the exponential families of univariate and multivariate normal distributions: We reported closed-form solutions for the Chernoff information between univariate normal distributions and between centered normal distributions with scaled covariance matrices, and showed how to implement efficiently a dichotomic search for approximating the Chernoff information between two multivariate normal distributions (Algorithm 3). Table 1 summarizes the various optimality conditions characterizing the Chernoff exponent. Finally, inspired by this study, we defined in Section 4 the forward and reverse Bregman–Chernoff divergences [66], and showed how these divergences are related to the capacity of a discrete memoryless channel and to the minimax redundancy of universal coding in information theory [13].
Additional material including Maxima and Java® snippet codes is available online at https://franknielsen.github.io/ChernoffInformation/index.html (accessed on 30 July 2022).

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

I would like to thank Rob Brekelmans for many fruitful discussions on likelihood ratio exponential families and related topics. I also warmly thank the three Reviewers for the careful and insightful review of this manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Background on Statistical Divergences

We introduce some statistical dissimilarities like the Kullback–Leibler divergence, the Bhattacharyya distance, or the Hellinger divergence, which have proven useful, among others, in characterizing or bounding the probability of error in Bayesian statistical hypothesis testing [4,25,67].
The Kullback–Leibler divergence [13] (KLD) between two probability measures (PMs) P and Q is defined as
D_KL[P : Q] = ∫_X log( dP/dQ ) dP if P ≪ Q, and D_KL[P : Q] = +∞ if P ⋠ Q.
Two PMs P and Q are mutually singular when there exists an event A A such that P ( A ) = 0 and Q ( X A ) = 0 . Mutually singular measures P and Q are notationally written as P Q . Let P and Q be two non-singular probability measures on ( X , A ) dominated by a common σ -finite measure μ , and denote by p = d P d μ and q = d Q d μ their Radon–Nikodym densities with respect to μ . Then the KLD between P and Q can be calculated equivalently by the KLD between their densities as follows:
D KL [ P : Q ] = D KL [ p : q ] = X p log p q d μ .
It can be shown that D_KL[p : q] is independent of the chosen dominating measure μ [4], and thus when P, Q ≪ μ, we write for short D_KL[P : Q] = D_KL[p : q]. Although the dominating measure μ can always be set to μ = (P + Q)/2, it is often chosen either as the Lebesgue measure μ_L for continuous sample spaces R^d (with the σ-algebra A = B(R^d) of Borel sets) or as the counting measure μ_# for discrete sample spaces (with the σ-algebra A of power sets). The KLD is not a metric distance because it is asymmetric and does not satisfy the triangle inequality.
Let supp(μ) = cl{A ∈ A : μ(A) ≠ 0} denote the support of a Radon positive measure [1] μ, where cl denotes the topological closure operation. Notice that D_KL[p : q] = +∞ when the definite integral of Equation (A1) diverges (e.g., the KLD between a standard Cauchy distribution and a standard normal distribution is +∞, but the KLD between a standard normal distribution and a standard Cauchy distribution is finite), and D_KL[P : Q] = +∞ when the probability measures have disjoint supports (P ⊥ Q). Thus, when the supports of P and Q are distinct but not nested, both the forward KLD D_KL[P : Q] and the reverse KLD D_KL[Q : P] are infinite.
Let f α ( u ) = u α for α R . The functions f α ( u ) are convex for α R [ 0 , 1 ] and concave for α [ 0 , 1 ] . Thus, we can define the f-divergences [59,68]
I f α [ p : q ] = p f α ( q / p ) d μ = p 1 α q α d μ ,
for α ∈ R ∖ [0, 1], and I_{−f_α}[p : q] = −∫ p^{1−α} q^α dμ for α ∈ (0, 1) (or, equivalently, take the convex generator h_α(u) = −u^α for α ∈ (0, 1)). Notice that the conjugate f-divergence is obtained for the generator f*_α(u) = u f_α(1/u) = u^{1−α}: I_{f_α}[q : p] = I_{f*_α}[p : q]. By Jensen's inequality, the f-divergences are lower bounded by f(1). Thus, I_{h_α}[p : q] ≥ h_α(1) = −1. Since f-divergences are upper bounded by f(0) + f*(0), we have that I_{h_α}[p : q] < 0 for α ∈ (0, 1). This gives another proof that the Bhattacharyya coefficient ρ_α[p : q] = −I_{h_{1−α}}[p : q] is bounded between 0 and 1 since the I_{h_{1−α}}-divergence is bounded between −1 and 0. Moreover, Ali and Silvey [59] further defined the (f, g)-divergences as I_{f,g}[p : q] = g(I_f[p : q]) for a strictly monotonically increasing function g(v). Letting g(v) = −log(−v) (a strictly increasing function on (−1, 0)), we get that the (h_{1−α}, g)-divergences are the Bhattacharyya distances for α ∈ (0, 1). However, the Chernoff information is not an f-divergence despite the fact that Bhattacharyya distances are Ali–Silvey (f, g)-divergences because of the maximization criterion [59] of Equation (4).
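As a concrete illustration of these bounds (a small worked example added for this text, not from the original appendix), consider the two exponential densities p(x) = e^{−x} and q(x) = 2 e^{−2x} on (0, ∞): we get ρ_α[p : q] = ∫ p^α q^{1−α} dμ = 2^{1−α}/(2 − α), which indeed lies in (0, 1) for α ∈ (0, 1). In Maxima:
/* Bhattacharyya coefficient between two exponential densities (illustrative) */
assume(alpha>0, alpha<1)$
/* Maxima may ask for the sign of 2-alpha: it is positive since alpha < 1 */
ratsimp(integrate(exp(-alpha*x) * (2*exp(-2*x))**(1-alpha), x, 0, inf));
/* result: 2^(1-alpha)/(2-alpha), which lies in (0,1) for alpha in (0,1) */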
We refer the reader to [69] (Chapter 14), [70] (Figure 1) and [71] (Figure 3) for other statistical distances and statistical dissimilarities with their connections.

Appendix B. Exponential Family of Univariate Gaussian Distributions

Consider the family of univariate normal distributions:
N = { p_{μ,σ²}(x) = (1/√(2πσ²)) exp( −(x − μ)²/(2σ²) ) : μ ∈ R, σ² > 0 }.
Let λ = ( λ 1 = μ , λ 2 = σ 2 ) denote the mean-variance parameterization, and consider the sufficient statistic vector t ( x ) = ( x , x 2 ) . Then the densities of N can be written in the canonical form of exponential families:
p λ ( x ) = exp θ ( λ ) , t ( x ) F ( θ ) ,
where θ(λ) = ( λ1/λ2, −1/(2λ2) ) and the log-normalizer is
F(θ) = −θ1²/(4θ2) + (1/2) log( −π/θ2 ).
The dual moment parameterization is η(λ) = E_{p_λ}[t(x)] = ( λ1, λ1² + λ2 ), and the convex conjugate is:
F*(η) = sup_{θ ∈ Θ} { ⟨θ, η⟩ − F(θ) } = −(1/2) log( 2πe (η2 − η1²) ).
We check that the convex conjugate coincides with the negentropy [72]:
F*(η(λ)) = −h[p_λ].
The conversion formulæ between the dual natural/moment parameters and the ordinary parameters are given by:
θ(λ) = ( λ1/λ2, −1/(2λ2) ),
λ(θ) = ( −θ1/(2θ2), −1/(2θ2) ),
η(λ) = ( λ1, λ1² + λ2 ),
λ(η) = ( η1, η2 − η1² ),
η(θ) = ( E[x], E[x²] ) = ∇F(θ) = ( −θ1/(2θ2), −1/(2θ2) + θ1²/(4θ2²) ),
θ(η) = ∇F*(η) = ( η1/(η2 − η1²), −1/(2(η2 − η1²)) ).
We check that
D_KL[p_λ : p_{λ′}] = (1/2) ( (λ1′ − λ1)²/λ2′ + λ2/λ2′ − log(λ2/λ2′) − 1 ) = B_F(θ(λ′) : θ(λ)) = B_{F*}(η(λ) : η(λ′)) = Y_{F,F*}(θ(λ′) : η(λ)) = Y_{F*,F}(η(λ) : θ(λ′)),
where B F and B F * are the dual Bregman divergences and Y F , F * and Y F * , F are the dual Fenchel–Young divergences.
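As a small numerical illustration of these identities (a check written for this text, reusing the natural parameterization above; the helper names are ad hoc), one may verify in Maxima that the KLD between p_{0,1} and p_{1,2} coincides with the Bregman divergence B_F(θ(λ′) : θ(λ)):
/* Check numerically that KL[p_{0,1} : p_{1,2}] = B_F(theta(1,2) : theta(0,1)) (illustrative) */
F(t1,t2):=(-t1**2)/(4*t2)+(1/2)*log(-%pi/t2)$
theta(mu,v):=[mu/v, -1/(2*v)]$
gradF(t1,t2):=[-t1/(2*t2), t1**2/(4*t2**2)-1/(2*t2)]$  /* = eta = (E[x], E[x^2]) */
BF(t,tp):=block([g:gradF(tp[1],tp[2])],
  F(t[1],t[2]) - F(tp[1],tp[2]) - (t[1]-tp[1])*g[1] - (t[2]-tp[2])*g[2])$
KLD(mu1,v1,mu2,v2):=(1/2)*((mu2-mu1)**2/v2 + v1/v2 - log(v1/v2) - 1)$
float([KLD(0,1,1,2), BF(theta(1,2), theta(0,1))]);  /* both approximately 0.3466 */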

Appendix C. Code Snippets in MAXIMA

Code for plotting Figure 1.
Listing A1: Plot the cumulant function of a log ratio exponential family induced by two normal distributions.
  • varalpha(v1,v2,alpha):=(v1*v2)/((1-alpha)*v1+alpha*v2)$
  • mualpha(mu1,v1,mu2,v2,alpha):=(alpha*mu1*v2+(1-alpha)*mu2*v1)/((1-alpha)*v1+alpha*v2)$
  • assume(v1>0)$assume(v2>0)$
  • theta1(mu,v):=mu/v$
  • theta2(mu,v):=-1/(2*v)$;
  • F(theta1,theta2):=((-theta1**2)/(4*theta2))+(1/2)*log(-%pi/theta2)$
  • JF(alpha,theta1,theta2,theta1p,theta2p):=alpha*F(theta1,theta2)+(1-alpha)*F(theta1p,theta2p)-F(alpha*theta1+(1-alpha)*theta1p,alpha*theta2+(1-alpha)*theta2p);
  •  
  • m1:0;v1:1;m2:1;v2:2;
  •  
  • plot2d([JF(alpha,theta1(m1,v1),theta2(m1,v1),theta1(m2,v2),theta2(m2,v2)),
  • -JF(alpha,theta1(m1,v1),theta2(m1,v1),theta1(m2,v2),theta2(m2,v2)),
  •  [discrete,[[0.4215580558605244,-0.15],[0.4215580558605244,0.15]]],
  • [discrete, [0.4215580558605244], [0.1155433222682347]],
  • [discrete, [0.4215580558605244], [-0.1155433222682347]]
  • ],
  • [alpha,0,1], [xlabel,"alpha"], [ylabel,"F_{pq}(alpha)=-D_{B,alpha}[p:q]"],
  • [style,  [lines,1,1],[lines,1,2],
  •  [lines,2,0], [points, 3,3],[points, 3,3]],[legend, "skew Bhattacharyya D_{B,alpha}[p:q]","LREF log-normalizer F_{pq}(alpha)","","","" ],
  • [color, blue, red,   black, black,black],[point_type,asterisk]);
Code for calculating the Chernoff information between two univariate Gaussian distributions (Proposition 10):
Listing A2: Calculate symbolically the exact Chernoff information between two univariate normal distributions.
  • varalpha(v1,v2,alpha):=(v1*v2)/((1-alpha)*v1+alpha*v2)$
  • mualpha(mu1,v1,mu2,v2,alpha):=(alpha*mu1*v2+(1-alpha)*mu2*v1)/((1-alpha)*v1+alpha*v2)$
  •  
  • /* Kullback--Leibler divergence */
  • KLD(mu1,v1,mu2,v2):=(1/2)*((((mu2-mu1)**2)/v2)+(v1/v2)-log(v1/v2)-1)$
  •  
  • assume(alpha>0)$assume(alpha<1)$
  • assume(v1>0)$assume(v2>0)$
  • theta1(mu,v):=mu/v$
  • theta2(mu,v):=-1/(2*v)$;
  • F(theta1,theta2):=-theta1**2/(4*theta2)+0.5*log(-1/theta2)$
  •  
  • eq: (theta1(mu1,v1)-theta1(mu2,v2))*mualpha(mu1,v1,mu2,v2,alpha)+(theta2(mu1,v1)-theta2(mu2,v2))*(mualpha(mu1,v1,mu2,v2,alpha)**2+varalpha(v1,v2,alpha))-F(theta1(mu1,v1),theta2(mu1,v1))+F(theta1(mu2,v2),theta2(mu2,v2));
  • solalpha: solve(eq,alpha)$
  • alphastar:rhs(solalpha[1]);
  •  
  • ChernoffInformation: KLD(mualpha(mu1,v1,mu2,v2,alphastar),varalpha(v1,v2,alphastar),mu1,v1)$
  • print("Chernoff information=")$ratsimp(ChernoffInformation);
Example of a plot of the α-Bhattacharyya distance for α ∈ [0, 1] when p and q are two normal distributions.
Listing A3: Plot the skewed Bhattacharyya divergences between two normal distributions as an equivalent skewed Jensen divergence between two normal distributions.
  • varalpha(v1,v2,alpha):=(v1*v2)/((1-alpha)*v1+alpha*v2)$
  • mualpha(mu1,v1,mu2,v2,alpha):=(alpha*mu1*v2+(1-alpha)*mu2*v1)/((1-alpha)*v1+alpha*v2)$
  • assume(v1>0)$assume(v2>0)$
  • theta1(mu,v):=mu/v$
  • theta2(mu,v):=-1/(2*v)$;
  • F(theta1,theta2):=((-theta1**2)/(4*theta2))+(1/2)*log(-%pi/theta2)$
  • JF(alpha,theta1,theta2,theta1p,theta2p):=alpha*F(theta1,theta2)+(1-alpha)*F(theta1p,theta2p)-F(alpha*theta1+(1-alpha)*theta1p,alpha*theta2+(1-alpha)*theta2p);
  • m1:0;v1:1;m2:1;v2:2;
  • plot2d(JF(alpha,theta1(m1,v1),theta2(m1,v1),theta1(m2,v2),theta2(m2,v2)),[alpha,0,1]);
Example which calculates exactly the Chernoff exponent between two centered 4D Gaussians by solving the polynomial roots of the Chernoff optimal condition:
Listing A4: Calculate the Chernoff information between two 4D centered normal distributions based on their eigenvalues.
  • assume(l1>0);assume(l2>0);assume(l3>0);assume(l4>0);
  • assume(alpha>0);assume(alpha<1);
  • l1:1;l2:2;l3:3;l4:4;
  • eq: (1-l1)/(alpha+(1-alpha)*l1)+ (1-l2)/(alpha+(1-alpha)*l2)+ (1-l3)/(alpha+(1-alpha)*l3)+ (1-l4)/(alpha+(1-alpha)*l4) + log(l1)+log(l2)+log(l3)+log(l4);
  • solve(eq,alpha);
  • sol:float(%);
  • realpart(sol);imagpart(sol);
  • /* alpha=0.5969427599369763 */
Example of choosing two different exponential families: The half-normal distributions and the exponential distributions:
Listing A5: Calculate symbolically the Kullback–Leibler divergence and the Bhattacharyya coefficient between a half normal distribution and an exponential distribution.
  • assume(sigma>0);
  • halfnormal(x,sigma):=(sqrt(2)/(sqrt(%pi*sigma**2)))*exp(-x**2/(2*sigma**2));
  • assume(lambda>0);
  • exponential(x,lambda):=lambda*exp(-lambda*x);
  • /* KLD diverges */
  • integrate(exponential(x,lambda)*log(exponential(x,lambda)/halfnormal(x,sigma)),x,0,inf);
  • /* KLD converges */
  • integrate(halfnormal(x,sigma)*log(halfnormal(x,sigma)/exponential(x,lambda)),x,0,inf);
  • /* Bhattacharyya coefficient */
  • assume(alpha>0);
  • assume(alpha<1);
  • integrate( (halfnormal(x,sigma)**alpha) * (exponential(x,lambda)**(1-alpha)),x,0,inf);

References

  1. Keener, R.W. Theoretical Statistics: Topics for a Core Course; Springer Science & Business Media: New York, NY, USA, 2010. [Google Scholar]
  2. Chernoff, H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat. 1952, 23, 493–507. [Google Scholar] [CrossRef]
  3. Csiszár, I. A class of measures of informativity of observation channels. Period. Math. Hung. 1972, 2, 191–213. [Google Scholar] [CrossRef]
  4. Torgersen, E. Comparison of Statistical Experiments; Cambridge University Press: Cambridge, UK, 1991; Volume 36. [Google Scholar]
  5. Audenaert, K.M.; Calsamiglia, J.; Munoz-Tapia, R.; Bagan, E.; Masanes, L.; Acin, A.; Verstraete, F. Discriminating states: The quantum Chernoff bound. Phys. Rev. Lett. 2007, 98, 160501. [Google Scholar] [PubMed] [Green Version]
  6. Audenaert, K.M.; Nussbaum, M.; Szkoła, A.; Verstraete, F. Asymptotic error rates in quantum hypothesis testing. Commun. Math. Phys. 2008, 279, 251–283. [Google Scholar] [CrossRef] [Green Version]
  7. Bhattacharyya, A. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 1943, 35, 99–109. [Google Scholar]
  8. Nielsen, F.; Boltz, S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466. [Google Scholar] [CrossRef] [Green Version]
  9. Grünwald, P.D. The Minimum Description Length Principle; MIT Press: Cambridge, MA, USA, 2007. [Google Scholar]
  10. Grünwald, P.D. Information-Theoretic Properties of Exponential Families. In The Minimum Description Length Principle; MIT Press: Cambridge, MA, USA, 2007; pp. 623–650. [Google Scholar]
  11. Van Erven, T.; Harremos, P. Rényi divergence and Kullback–Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820. [Google Scholar] [CrossRef] [Green Version]
  12. Nakiboğlu, B. The Rényi capacity and center. IEEE Trans. Inf. Theory 2018, 65, 841–860. [Google Scholar] [CrossRef] [Green Version]
  13. Cover, T.M. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 1999. [Google Scholar]
  14. Borade, S.; Zheng, L. I-projection and the geometry of error exponents. In Proceedings of the Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 27–29 September 2006. [Google Scholar]
  15. Boyer, R.; Nielsen, F. On the error exponent of a random tensor with orthonormal factor matrices. In International Conference on Geometric Science of Information; Springer: Cham, Switzerland, 2017; pp. 657–664. [Google Scholar]
  16. D’Costa, A.; Ramachandran, V.; Sayeed, A.M. Distributed classification of Gaussian space-time sources in wireless sensor networks. IEEE J. Sel. Areas Commun. 2004, 22, 1026–1036. [Google Scholar] [CrossRef]
  17. Yu, N.; Zhou, L. Comments on and Corrections to “When Is the Chernoff Exponent for Quantum Operations Finite?”. IEEE Trans. Inf. Theory 2022, 68, 3989–3990. [Google Scholar] [CrossRef]
  18. Konishi, S.; Yuille, A.L.; Coughlan, J.; Zhu, S.C. Fundamental bounds on edge detection: An information theoretic evaluation of different edge cues. In Proceedings of the 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149), Fort Collins, CO, USA, 23–25 June 1999; IEEE: Piscataway, NJ, USA, 1999; Volume 1, pp. 573–579. [Google Scholar]
  19. Julier, S.J. An empirical study into the use of Chernoff information for robust, distributed fusion of Gaussian mixture models. In Proceedings of the 2006 9th International Conference on Information Fusion, Florence, Italy, 10–13 July 2006; IEEE: Piscataway, NJ, USA, 2006; pp. 1–8. [Google Scholar]
  20. Kakizawa, Y.; Shumway, R.H.; Taniguchi, M. Discrimination and clustering for multivariate time series. J. Am. Stat. Assoc. 1998, 93, 328–340. [Google Scholar] [CrossRef]
  21. Dutta, S.; Wei, D.; Yueksel, H.; Chen, P.Y.; Liu, S.; Varshney, K. Is there a trade-off between fairness and accuracy? A perspective using mismatched hypothesis testing. In Proceedings of the 37th International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 2803–2813. [Google Scholar]
  22. Agarwal, S.; Varshney, L.R. Limits of deepfake detection: A robust estimation viewpoint. arXiv 2019, arXiv:1905.03493. [Google Scholar]
  23. Maherin, I.; Liang, Q. Radar sensor network for target detection using Chernoff information and relative entropy. Phys. Commun. 2014, 13, 244–252. [Google Scholar] [CrossRef]
  24. Nielsen, F. An information-geometric characterization of Chernoff information. IEEE Signal Process. Lett. 2013, 20, 269–272. [Google Scholar] [CrossRef] [Green Version]
  25. Nielsen, F. Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means. Pattern Recognit. Lett. 2014, 42, 25–34. [Google Scholar] [CrossRef] [Green Version]
  26. Westover, M.B. Asymptotic geometry of multiple hypothesis testing. IEEE Trans. Inf. Theory 2008, 54, 3327–3329. [Google Scholar] [CrossRef]
  27. Nielsen, F. Hypothesis testing, information divergence and computational geometry. In International Conference on Geometric Science of Information; Springer: Berlin/Heidelberg, Germany, 2013; pp. 241–248. [Google Scholar]
  28. Leang, C.C.; Johnson, D.H. On the asymptotics of M-hypothesis Bayesian detection. IEEE Trans. Inf. Theory 1997, 43, 280–282. [Google Scholar] [CrossRef]
  29. Cena, A.; Pistone, G. Exponential statistical manifold. Ann. Inst. Stat. Math. 2007, 59, 27–56. [Google Scholar] [CrossRef]
  30. Barndorff-Nielsen, O. Information and Exponential Families: In Statistical Theory; John Wiley & Sons: Hoboken, NJ, USA, 2014. [Google Scholar]
  31. Brekelmans, R.; Nielsen, F.; Makhzani, A.; Galstyan, A.; Steeg, G.V. Likelihood Ratio Exponential Families. arXiv 2020, arXiv:2012.15480. [Google Scholar]
  32. De Andrade, L.H.; Vieira, F.L.; Vigelis, R.F.; Cavalcante, C.C. Mixture and exponential arcs on generalized statistical manifold. Entropy 2018, 20, 147. [Google Scholar] [CrossRef] [Green Version]
  33. Siri, P.; Trivellato, B. Minimization of the Kullback–Leibler Divergence over a Log-Normal Exponential Arc. In International Conference on Geometric Science of Information; Springer: Cham, Switzerland, 2019; pp. 453–461. [Google Scholar]
  34. Azoury, K.S.; Warmuth, M.K. Relative loss bounds for on-line density estimation with the exponential family of distributions. Mach. Learn. 2001, 43, 211–246. [Google Scholar] [CrossRef] [Green Version]
  35. Collins, M.; Dasgupta, S.; Schapire, R.E. A generalization of principal components analysis to the exponential family. Adv. Neural Inf. Process. Syst. 2001, 14, 617–624. [Google Scholar]
  36. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6. [Google Scholar] [CrossRef] [Green Version]
  37. Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 55, 2882–2904. [Google Scholar] [CrossRef] [Green Version]
  38. Sundberg, R. Statistical Modelling by Exponential Families; Cambridge University Press: Cambridge, UK, 2019; Volume 12. [Google Scholar]
  39. Nielsen, F.; Okamura, K. On f-divergences between Cauchy distributions. arXiv 2021, arXiv:2101.12459. [Google Scholar]
  40. Chyzak, F.; Nielsen, F. A closed-form formula for the Kullback–Leibler divergence between Cauchy distributions. arXiv 2019, arXiv:1905.10965. [Google Scholar]
  41. Huzurbazar, V.S. Exact forms of some invariants for distributions admitting sufficient statistics. Biometrika 1955, 42, 533–537. [Google Scholar] [CrossRef]
  42. Burbea, J.; Rao, C. On the convexity of some divergence measures based on entropy functions. IEEE Trans. Inf. Theory 1982, 28, 489–495. [Google Scholar] [CrossRef]
  43. Chen, P.; Chen, Y.; Rao, M. Metrics defined by Bregman divergences: Part 2. Commun. Math. Sci. 2008, 6, 927–948. [Google Scholar] [CrossRef] [Green Version]
  44. Nielsen, F. On the Jensen–Shannon symmetrization of distances relying on abstract means. Entropy 2019, 21, 485. [Google Scholar] [CrossRef] [Green Version]
  45. Han, Q.; Kato, K. Berry–Esseen bounds for Chernoff-type nonstandard asymptotics in isotonic regression. Ann. Appl. Probab. 2022, 32, 1459–1498. [Google Scholar] [CrossRef]
  46. Neal, R.M. Annealed importance sampling. Stat. Comput. 2001, 11, 125–139. [Google Scholar] [CrossRef]
  47. Grosse, R.B.; Maddison, C.J.; Salakhutdinov, R. Annealing between distributions by averaging moments. In Advances in Neural Information Processing Systems 26, Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA, 5–10 December 2013; Citeseer: La Jolla, CA, USA, 2013; pp. 2769–2777. [Google Scholar]
  48. Takenouchi, T. Parameter Estimation with Generalized Empirical Localization. In International Conference on Geometric Science of Information; Springer: Cham, Switzerland, 2019; pp. 368–376. [Google Scholar]
  49. Rockafellar, R.T. Conjugates and Legendre transforms of convex functions. Can. J. Math. 1967, 19, 200–205. [Google Scholar] [CrossRef]
  50. Del Castillo, J. The singly truncated normal distribution: A non-steep exponential family. Ann. Inst. Stat. Math. 1994, 46, 57–66. [Google Scholar] [CrossRef] [Green Version]
  51. Amari, S.I. Information Geometry and Its Applications; Springer: Tokyo, Japan, 2016; Volume 194. [Google Scholar]
  52. Boissonnat, J.D.; Nielsen, F.; Nock, R. Bregman Voronoi diagrams. Discret. Comput. Geom. 2010, 44, 281–307. [Google Scholar] [CrossRef] [Green Version]
  53. Lê, H.V. Statistical manifolds are statistical models. J. Geom. 2006, 84, 83–93. [Google Scholar] [CrossRef]
  54. Nielsen, F. On a Variational Definition for the Jensen–Shannon Symmetrization of Distances Based on the Information Radius. Entropy 2021, 23, 464. [Google Scholar] [CrossRef]
  55. Nock, R.; Nielsen, F. Fitting the smallest enclosing Bregman ball. In Proceedings of the European Conference on Machine Learning, Porto, Portugal, 3–7 October 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 649–656. [Google Scholar]
  56. Nielsen, F.; Nock, R. On the smallest enclosing information disk. Inf. Process. Lett. 2008, 105, 93–97. [Google Scholar] [CrossRef] [Green Version]
  57. Costa, R. Information Geometric Probability Models in Statistical Signal Processing. Ph.D. Thesis, University of Rhode Island, Kingston, RI, USA, 2016. [Google Scholar]
  58. Nielsen, F.; Garcia, V. Statistical exponential families: A digest with flash cards. arXiv 2009, arXiv:0911.4863. [Google Scholar]
  59. Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B (Methodol.) 1966, 28, 131–142. [Google Scholar] [CrossRef]
  60. Nielsen, F.; Okamura, K. A note on the f-divergences between multivariate location-scale families with either prescribed scale matrices or location parameters. arXiv 2022, arXiv:2204.10952. [Google Scholar]
  61. Athreya, A.; Fishkind, D.E.; Tang, M.; Priebe, C.E.; Park, Y.; Vogelstein, J.T.; Levin, K.; Lyzinski, V.; Qin, Y. Statistical inference on random dot product graphs: A survey. J. Mach. Learn. Res. 2017, 18, 8393–8484. [Google Scholar]
  62. Li, B.; Wei, S.; Wang, Y.; Yuan, J. Topological and algebraic properties of Chernoff information between Gaussian graphs. In Proceedings of the 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 2–5 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 670–675. [Google Scholar]
  63. Tang, M.; Priebe, C.E. Limit theorems for eigenvectors of the normalized Laplacian for random graphs. Ann. Stat. 2018, 46, 2360–2415. [Google Scholar] [CrossRef] [Green Version]
  64. Calvo, M.; Oller, J.M. An explicit solution of information geodesic equations for the multivariate normal model. Stat. Risk Model. 1991, 9, 119–138. [Google Scholar] [CrossRef]
  65. Boyd, S.P.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
  66. Chen, P.; Chen, Y.; Rao, M. Metrics defined by Bregman divergences. Commun. Math. Sci. 2008, 6, 915–926. [Google Scholar] [CrossRef] [Green Version]
  67. Kailath, T. The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans. Commun. Technol. 1967, 15, 52–60. [Google Scholar] [CrossRef]
  68. Csiszar, I. Eine information’s theoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizitat von Markoschen Ketten. Publ. Math. Inst. Hung. Acad. Sc. 1963, 3, 85–107. [Google Scholar]
  69. Deza, M.M.; Deza, E. Encyclopedia of distances. In Encyclopedia of Distances; Springer: Berlin/Heidelberg, Germany, 2009; pp. 1–583. [Google Scholar]
  70. Gibbs, A.L.; Su, F.E. On choosing and bounding probability metrics. Int. Stat. Rev. 2002, 70, 419–435. [Google Scholar] [CrossRef] [Green Version]
  71. Jian, B.; Vemuri, B.C. Robust point set registration using Gaussian mixture models. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 1633–1645. [Google Scholar] [CrossRef]
  72. Nielsen, F.; Nock, R. Entropies and cross-entropies of exponential families. In Proceedings of the 2010 IEEE International Conference on Image Processing, Hong Kong, China, 26–29 September 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 3621–3624. [Google Scholar]
Figure 1. Plot of the Bhattacharyya distance D_{B,α}(p : q) (strictly concave, displayed in blue) and the log-normalizer F_pq(α) of the induced LREF E_pq (strictly convex, displayed in red) for two univariate normal densities p = p_{0,1} (standard normal) and q = p_{1,2}: The curves D_{B,α}(p : q) = −F_pq(α) and F_pq(α) are mirror symmetric to each other. The Chernoff information optimal skewing value α* between these two univariate normal distributions can be calculated exactly in closed form, see Section 5.2 (approximated numerically here for plotting the vertical grey line by α* ≈ 0.4215580558605244).
Entropy 24 01400 g001
Figure 2. The best unique parameter α* defining the Chernoff information optimal skewing parameter is found by setting the derivative of the strictly convex function F_pq(α) to zero. At the optimal value α*, we have D_C[p : q] = D_KL[(pq)^G_{α*} : p] = D_KL[(pq)^G_{α*} : q] = −F_pq(α*) > 0.
Entropy 24 01400 g002
Figure 3. The Chernoff information distribution ( P Q ) α * G with density ( p q ) α * G is obtained as the unique intersection of the exponential arc γ G ( p , q ) linking density p to density q of L 1 ( μ ) with the left-sided Kullback–Leibler divergence bisector Bi KL left ( p , q ) of p and q: ( p q ) α * G = γ G ( p , q ) Bi KL left ( p , q ) .
Entropy 24 01400 g003
Figure 4. Illustration of the dichotomic search for approximating the optimal skewing parameter α * to within some prescribed numerical precision ϵ > 0 .
Entropy 24 01400 g004
Figure 5. Taxonomy of exponential families: Regular (and always steep) or steep (but not necessarily regular). The Kullback–Leibler divergence between two densities of a regular exponential family amounts to dual Bregman divergences.
Entropy 24 01400 g005
Figure 6. The Chernoff information optimal skewing parameter α* for two densities p_{θ1} and p_{θ2} of some regular exponential family E inducing an exponential family dually flat manifold M = ({p_θ}, g_F = ∇²F(θ), ∇^m, ∇^e) is characterized by the intersection of their e-flat exponential geodesic with their mixture bisector, an m-flat right-sided Bregman bisector.
Entropy 24 01400 g006
Figure 7. Interpolation along the e-geodesic and the m-geodesic passing through two given multivariate normal distributions. No closed form is known for the Riemannian geodesic with respect to the metric Levi–Civita connection (shown in dashed style).
Entropy 24 01400 g007
Figure 8. The natural parameter space of the non-regular full exponential family of singly truncated normal distributions is not regular (i.e., not open): The negative real axis corresponds to the exponential family of exponential distributions.
Entropy 24 01400 g008
Table 1. Summary of the optimal conditions characterizing the Chernoff exponent.
Generic case
Primal LREF OC α : D KL [ ( p q ) α * G : p ] = D KL [ ( p q ) α * G : q ]
Dual LREF OC β : β ( α * ) = E ( p q ) α * G log p ( x ) q ( x ) = 0
Geometric OC ( p q ) α * G = γ G ( p , q ) Bi KL left ( p , q ]
Case of exponential families
Bregman OC EF : B F ( θ 1 : θ α * ) = B F ( θ 2 : θ α * )
Fenchel–Young OC YF : Y F , F * ( θ 1 : η α * ) = Y F , F * ( θ 2 : η α * )
Simplified OC SEF : F θ 1 , θ 2 ( α ) = 0
OC SEF : ( θ 2 θ 1 ) F ( θ 1 + α * ( θ 2 θ 1 ) ) = F ( θ 2 ) F ( θ 1 )
Geometric OC γ p q e ( α ) Bi m ( p , q )
1D EF α * = F 1 F ( θ 2 ) F ( θ 1 ) θ 2 θ 1 θ 2 θ 1 θ 2
Gaussian case
1D Gaussians OC Gaussian : μ 2 σ 2 2 μ 1 σ 1 2 m α 1 2 σ 2 2 1 2 σ 1 2 v α = 1 2 log σ 2 2 σ 1 2 + μ 2 2 2 σ 2 2 μ 1 2 2 σ 1 2
α * is root of quadratic polynomial in ( 0 , 1 )
Centered Gaussians OC Centered Gaussians : i = 1 d 1 λ i α * + ( 1 α * ) λ i + log λ i = 0
where λ i is the i-th eigenvalue of Σ 1 Σ 2 1
Centered Gaussians α * = s 1 log s ( s 1 ) log s ( 0 , 1 )
scaled covarianceswhen Σ 2 = s Σ 1
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
