Revisiting Chernoff Information with Likelihood Ratio Exponential Families

The Chernoff information between two probability measures is a statistical divergence measuring their deviation, defined as their maximally skewed Bhattacharyya distance. Although the Chernoff information was originally introduced for bounding the Bayes error in statistical hypothesis testing, the divergence has found many other applications, ranging from information fusion to quantum information, owing to its empirical robustness property. From the viewpoint of information theory, the Chernoff information can also be interpreted as a minmax symmetrization of the Kullback–Leibler divergence. In this paper, we first revisit the Chernoff information between two densities of a measurable Lebesgue space by considering the exponential families induced by their geometric mixtures: the so-called likelihood ratio exponential families. Second, we show how to (i) solve exactly the Chernoff information between any two univariate Gaussian distributions, or get a closed-form formula using symbolic computing, (ii) report a closed-form formula of the Chernoff information of centered Gaussians with scaled covariance matrices, and (iii) use a fast numerical scheme to approximate the Chernoff information between any two multivariate Gaussian distributions.


Chernoff Information: Definition and Related Statistical Divergences
Let (X, A) denote a measurable space [1] with sample space X and σ-algebra A of events. A measure P is absolutely continuous with respect to another measure Q if P(A) = 0 whenever Q(A) = 0: P is then said to be dominated by Q, written for short as P ≪ Q. We shall write P ⋠ Q when P is not dominated by Q. When P ≪ Q, we denote by dP/dQ the Radon–Nikodym density [1] of P with respect to Q. The Chernoff information [2], also called the Chernoff information number [3,4] or the Chernoff divergence [5,6], is the following symmetric measure of dissimilarity (see Appendix A for some background on statistical divergences) between any two probability measures P and Q dominated by µ, with respective densities p = dP/dµ and q = dQ/dµ:

D_C[P, Q] := max_{α∈(0,1)} D_{B,α}[P : Q],   (1)

where D_{B,α}[P : Q] := −log ρ_α[P : Q] denotes the α-skewed Bhattacharyya distance and

ρ_α[P : Q] := ∫_X p(x)^α q(x)^{1−α} dµ(x)   (2)

is the α-skewed Bhattacharyya affinity coefficient [7] (a coefficient measuring the similarity of two densities). In the remainder, we shall use the following conventions: when a (dis)similarity is asymmetric (e.g., ρ_α[P : Q]), we use the colon notation ":" to separate its arguments; when the (dis)similarity is symmetric (e.g., D_C[P, Q]), we use the comma notation "," to separate its arguments.
The α-skewed Bhattacharyya distances are not metric distances since they can be asymmetric and do not satisfy the triangle inequality, even when α = 1/2. The Chernoff information of Equation (1) is thus defined as the maximal skewed Bhattacharyya distance. Grünwald [9,10] called the skewed Bhattacharyya coefficients and distances the α-Rényi affinity and the unnormalized Rényi divergence, respectively (see Section 19.6 of [9]), since the Rényi divergence [11,12] is defined by

D_{R,α}[P : Q] := (1/(α − 1)) log ρ_α[P : Q].

Thus D_{B,α}[P : Q] = (1 − α) D_{R,α}[P : Q] can be interpreted as the unnormalized Rényi divergence in [9]. However, let us notice that the Rényi α-divergences are defined in general for a wider range α ∈ [0, ∞]\{1}, with lim_{α→1} D_{R,α}[P : Q] = D_KL[P : Q], whereas the skewed Bhattacharyya distances are defined for α ∈ (0, 1) in general. The Chernoff information was originally introduced to upper bound the probability of misclassification error in Bayesian binary hypothesis testing [2], where the optimal skewing parameter α* such that D_C[p, q] = D_{B,α*}[p : q] is referred to in the statistical literature as the Chernoff error exponent [13][14][15] or the Chernoff exponent [16,17] for short. The Chernoff information has found many other fruitful applications beyond its original statistical hypothesis testing scope, like in computer vision [18], information fusion [19], time-series clustering [20], and more generally in machine learning [21] (just to cite a few use cases). It has been observed empirically that the Chernoff information exhibits superior robustness [22] compared to the Kullback–Leibler divergence in distributed fusion of Gaussian Mixture Models [19] (GMMs) or in target detection in radar sensor networks [23]. The Chernoff information has also been used to analyze the deepfake detection performance of Generative Adversarial Networks [22] (GANs).

Prior Work and Contributions
The Chernoff information between any two categorical distributions (multinomial distributions with one trial, also called "multinoulli" since they are extensions of the Bernoulli distributions) has been very well studied and described in many reference textbooks of information theory or statistics (e.g., see Section 12.9 of [13]). The Chernoff information between two probability distributions of an exponential family was considered from the viewpoint of information geometry in [24], and in the general case from the viewpoint of unnormalized Rényi divergences in [11] (Theorem 32). By replacing the weighted geometric mean in the definition of the Bhattacharyya coefficient ρ_α of Equation (2) by an arbitrary weighted mean, the generalized Bhattacharyya coefficients and their associated divergences, including the Chernoff information, were studied in [25]. The geometry of the Chernoff error exponent was studied in [26,27] when dealing with a finite set of mutually absolutely continuous probability distributions P_1, ..., P_n. In this case, the Chernoff information amounts to the minimum pairwise Chernoff information of the probability distributions [28]:

D_C[P_1, ..., P_n] := min_{i≠j ∈ {1,...,n}} D_C[P_i, P_j].

We summarize our contributions as follows: In Section 2, we study the Chernoff information between two given mutually non-singular probability measures P and Q by considering their "exponential arc" [29] as a special 1D exponential family, termed a Likelihood Ratio Exponential Family (LREF) in [10]. We show that the optimal skewing value (Chernoff exponent) defining their Chernoff information is unique (Proposition 1) and can be characterized geometrically on the Banach vector space L¹(µ) of equivalence classes of measurable functions (i.e., two functions f_1 and f_2 are said to be equivalent in L¹(µ) if they are equal µ-almost everywhere, abbreviated as µ-a.e. in the remainder) whose absolute value is Lebesgue integrable (Proposition 4). This geometric characterization allows us to design a generic dichotomic search algorithm (Algorithm 1) to approximate the Chernoff optimal skewing parameter, generalizing the prior work [24]. When P and Q belong to the same exponential family, we recover in Section 3 the results of [24]. This geometric characterization also allows us to reinterpret the Chernoff information as a minmax symmetrization of the Kullback–Leibler divergence, and we define by analogy the forward and reverse Chernoff–Bregman divergences in Section 4 (Definition 2). In Section 5, we consider the Chernoff information between Gaussian distributions: We show that the optimality condition for the Chernoff information between univariate Gaussian distributions can be solved exactly and report a closed-form formula for the Chernoff information between any two univariate Gaussian distributions (Proposition 10). For multivariate Gaussian distributions, we show how to implement the dichotomic search algorithms to approximate the Chernoff information, and report a closed-form formula for the Chernoff information between two centered multivariate Gaussian distributions with scaled covariance matrices (Proposition 11). Finally, we conclude in Section 7.

LREFs and the Chernoff Information
Recall that L¹(µ) denotes the Lebesgue vector space of measurable functions f such that ∫_X |f| dµ < ∞. Given two prescribed densities p and q of L¹(µ), consider building a uniparametric exponential family [30] E_pq which consists of the weighted geometric mixtures of p and q:

(pq)^G_α(x) := p(x)^α q(x)^{1−α} / Z_pq(α),

where Z_pq(α) := ∫_X p(x)^α q(x)^{1−α} dµ(x) denotes the normalizer (or partition function) of the geometric mixture. The parameter space Θ is defined as the set of α values which yield convergence of the definite integral Z_pq(α). Let us express the density (pq)^G_α in the canonical form of exponential families [30]:

(pq)^G_α(x) = exp(α t(x) − F_pq(α) + k(x)).

It follows from this decomposition that α ∈ Θ ⊂ R is the scalar natural parameter, t(x) = log(p(x)/q(x)) denotes the sufficient statistic (minimal when p(x) ≠ q(x) µ-a.e.), k(x) = log q(x) is an auxiliary carrier term with respect to measure µ (i.e., measure dν(x) = q(x) dµ(x)), and F_pq(α) = log Z_pq(α) is the log-normalizer (or log-partition or cumulant function). Since the sufficient statistic is the logarithm of the likelihood ratio of p(x) and q(x), Grünwald [9] (Section 19.6) termed E_pq a Likelihood Ratio Exponential Family (LREF). See also [31] for applications of LREFs to Markov chain Monte Carlo (MCMC) methods. We have p = (pq)^G_1 and q = (pq)^G_0. Thus, let α_p = 1 and α_q = 0, and let us interpret geometrically {(pq)^G_α : α ∈ Θ} as a maximal exponential arc [29,32,33], where Θ ⊆ R is an interval. We denote by E_pq the open exponential arc with extremities p and q.
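To make the LREF construction concrete, the following minimal Python sketch (assuming the SciPy and NumPy libraries, which are not used in the paper; all function names are ours) computes the normalizer Z_pq(α) of the geometric mixture of two univariate normal densities by numerical quadrature and checks that the geometric mixture is a normalized density with Z_pq(0) = Z_pq(1) = 1.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

p = norm(0.0, 1.0).pdf           # p = p_{0,1} (standard normal)
q = norm(1.0, np.sqrt(2.0)).pdf  # q = p_{1,2} (mean 1, variance 2)

def Z(alpha):
    """Normalizer Z_pq(alpha) of the geometric mixture p^alpha q^(1-alpha)."""
    return quad(lambda x: p(x)**alpha * q(x)**(1.0 - alpha), -np.inf, np.inf)[0]

def geometric_mixture(alpha):
    """Density (pq)^G_alpha as a callable."""
    Za = Z(alpha)
    return lambda x: p(x)**alpha * q(x)**(1.0 - alpha) / Za

m = geometric_mixture(0.3)
print(quad(m, -np.inf, np.inf)[0])   # ~ 1.0: the geometric mixture is normalized
print(Z(0.0), Z(1.0))                # ~ 1.0, 1.0: endpoints recover q and p
```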
Since the log-normalizers F(θ) of exponential families are always strictly convex and real analytic [30] (i.e., F(θ) ∈ C^ω(R)), we deduce that D_{B,α}[p : q] = −F_pq(α) is strictly concave and real analytic. Moreover, we have D_{B,0}[p : q] = D_{B,1}[p : q] = 0. Hence, the Chernoff optimal skewing parameter α* is unique when p ≠ q µ-a.e., and we get the Chernoff information calculated as D_C[p, q] = D_{B,α*}[p : q] = −F_pq(α*). See Figure 1 for a plot of the strictly concave function D_{B,α}[p : q] and the strictly convex function F_pq(α) = −D_{B,α}[p : q] when p = p_{0,1} is the standard normal density and q = p_{1,2} is a normal density of mean 1 and variance 2.

Figure 1. Plot of the Bhattacharyya distance D_{B,α}(p : q) (strictly concave, displayed in blue) and the log-normalizer F_pq(α) of the induced LREF E_pq (strictly convex, displayed in red) for two univariate normal densities p = p_{0,1} (standard normal) and q = p_{1,2}: the curves D_{B,α}(p : q) = −F_pq(α) are mirror images of each other about the horizontal axis. The Chernoff information optimal skewing value α* between these two univariate normal distributions can be calculated exactly in closed form, see Section 5.2 (approximated numerically here for plotting the vertical grey line by α* ≈ 0.4215580558605244).

Consider the full natural parameter space Θ_pq of E_pq: Θ_pq := {α ∈ R : Z_pq(α) < ∞}. The natural parameter space Θ_pq is always convex [30], and since ρ_0(p : q) = ρ_1(p : q) = 1, we necessarily have (0, 1) ⊂ Θ_pq but not necessarily [0, 1] ⊂ Θ_pq, as detailed in the following remark:

Remark 1. In order to be an exponential family, the densities (pq)^G_α shall have the same coinciding support for all values of α belonging to the natural parameter space. The support of the geometric mixture density (pq)^G_α is supp(p) ∩ supp(q) for α ∈ (0, 1) (with supp((pq)^G_1) = supp(p) and supp((pq)^G_0) = supp(q)). This condition is trivially satisfied when the supports of p and q coincide, and therefore [0, 1] ⊂ Θ_pq in that case. Otherwise, we may consider the common support X_pq = supp(p) ∩ supp(q) for α ∈ (0, 1). In this latter case, we restrict the natural parameter space to Θ_pq = (0, 1) even if ρ_α(p : q) < ∞ for some α outside that range.
To emphasize that α* depends on p and q, we shall use the notation α*(p : q) whenever necessary. Since D_{B,α}(p : q) = D_{B,1−α}(q : p), we have α*(q : p) = 1 − α*(p : q), and we check that

D_C[p, q] = D_{B,α*(p:q)}[p : q] = D_{B,α*(q:p)}[q : p] = D_C[q, p].

Thus the skewing value α*(q : p) may be called the conjugate Chernoff exponent (i.e., it depends on the convention chosen for interpolating on the exponential arc).
However, since the Chernoff information does not satisfy the triangle inequality, it is not a metric distance: the Chernoff information is thus called a quasi-distance.
Proposition 1 (Uniqueness of the Chernoff information optimal skewing parameter [11,12]). Let P and Q be two probability measures dominated by a positive measure µ with corresponding Radon–Nikodym densities p and q, respectively. The Chernoff information optimal skewing parameter α*(p : q) is unique when p ≠ q µ-almost everywhere, and

D_C[p, q] = D_{B,α*}[p : q] = −F_pq(α*).

When p = q µ-a.e., we have D_C[p : q] = 0 and α* is undefined since it can range in [0, 1].

Definition 1.
An exponential family is called regular [30] when its natural parameter space Θ is topologically an open set.

Proof.
A reverse parameter divergence D*(θ_1 : θ_2) is a parameter divergence with swapped parameter order: D*(θ_1 : θ_2) := D(θ_2 : θ_1). Similarly, a reverse statistical divergence D*[p : q] is a statistical divergence with swapped argument order: D*[p : q] := D[q : p]. We shall use the result pioneered in [34,35] that the KLD between two densities p_{θ_1} and p_{θ_2} of a regular exponential family E = {p_θ : θ ∈ Θ} amounts to a reverse Bregman divergence (B_F)* (i.e., a Bregman divergence with swapped parameter order) induced by the log-normalizer of the family:

D_KL[p_{θ_1} : p_{θ_2}] = (B_F)*(θ_1 : θ_2) = B_F(θ_2 : θ_1),

where B_F is the Bregman divergence defined on domain D = dom(F) (see Definition 1 of [36]):

B_F(θ_1 : θ_2) := F(θ_1) − F(θ_2) − (θ_1 − θ_2)^⊤ ∇F(θ_2),   θ_1 ∈ D, θ_2 ∈ ri(D),

where ri(D) denotes the relative interior of domain D. Bregman divergences are always finite, and the only symmetric Bregman divergences are squared Mahalanobis distances [37] (i.e., with corresponding Bregman generators defining quadratic forms). For completeness, we recall the proof as follows: we have

D_KL[p_{θ_1} : p_{θ_2}] = E_{p_{θ_1}}[log(p_{θ_1}(x)/p_{θ_2}(x))] = E_{p_{θ_1}}[(θ_1 − θ_2)^⊤ t(x)] − F(θ_1) + F(θ_2),

using the linearity property of the expectation operator. When E is regular, we also have E_{p_θ}[t(x)] = ∇F(θ) [38], and therefore we get

D_KL[p_{θ_1} : p_{θ_2}] = F(θ_2) − F(θ_1) − (θ_2 − θ_1)^⊤ ∇F(θ_1) = B_F(θ_2 : θ_1).

In our LREF setting, we thus have D_KL[p : q] = B_{F_pq}(α_q : α_p) = B_{F_pq}(0 : 1) and D_KL[q : p] = B_{F_pq}(α_p : α_q) = B_{F_pq}(1 : 0), where B_{F_pq}(α_1 : α_2) denotes the following scalar Bregman divergence:

B_{F_pq}(α_1 : α_2) := F_pq(α_1) − F_pq(α_2) − (α_1 − α_2) F'_pq(α_2).

Notice that since F_pq(0) = F_pq(1) = 0 and B_{F_pq}(α_1 : α_2) > 0 for α_1 ≠ α_2, we have F'_pq(1) > 0 and F'_pq(0) < 0 when p ≠ q µ-almost everywhere (a.e.). Moreover, since F_pq(α) is strictly convex, F'_pq(α) is strictly monotonically increasing, and therefore there exists a unique α* ∈ (0, 1) such that F'_pq(α*) = 0.
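The identity D_KL[q : p] = B_{F_pq}(1 : 0) = −F'_pq(0) can be checked numerically. The following sketch (assuming SciPy/NumPy; the finite-difference step and function names are our choices) approximates F'_pq(0) by a central difference and compares −F'_pq(0) against the closed-form KLD between the two normal densities of Figure 1.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

p = norm(0.0, 1.0).pdf           # p = p_{0,1}
q = norm(1.0, np.sqrt(2.0)).pdf  # q = p_{1,2}

def F(alpha):
    """Log-normalizer F_pq(alpha) of the LREF, computed by quadrature."""
    Z = quad(lambda x: p(x)**alpha * q(x)**(1 - alpha), -np.inf, np.inf)[0]
    return np.log(Z)

# B_{F_pq}(1:0) = F(1) - F(0) - F'(0) = -F'(0) since F(0) = F(1) = 0.
h = 1e-5
dF0 = (F(h) - F(-h)) / (2 * h)   # central finite difference for F'_pq(0)
print(-dF0)                      # ~ 0.6534

# Closed-form KL[q : p] between N(1,2) and N(0,1) for comparison:
print(0.5 * (2.0 + 1.0 - 1.0 - np.log(2.0)))   # = (2 - log 2)/2 ~ 0.6534
```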

Example 1.
When p and q belong to the same regular exponential family E (e.g., p and q are two normal densities), their sided KLDs [37] are both finite. The LREF induced by two Cauchy distributions p_{l_1,s_1} and p_{l_2,s_2} is such that [0, 1] ⊂ Θ since the skewed Bhattacharyya distance is defined and finite for α ∈ R [39]. Therefore the KLDs between two Cauchy distributions are always finite [39]; see the closed-form formula in [40].
Since the α-skewed Bhattacharyya distance between densities of an exponential family amounts to an α-skewed Jensen divergence between their natural parameters [8], where the skew Jensen divergence [42] is given by

J_{F,α}(θ_1 : θ_2) := α F(θ_1) + (1 − α) F(θ_2) − F(α θ_1 + (1 − α) θ_2),

in the setting of the LREF we have

D_{B,α}[p : q] = J_{F_pq,α}(α_p : α_q) = J_{F_pq,α}(1 : 0) = −F_pq(α),

using F_pq(0) = F_pq(1) = 0. The best unique parameter α* defining the Chernoff information optimal skewing parameter is found by setting the derivative of the strictly convex function F_pq(α) to zero. At the optimal value α*, we have

D_C[p, q] = −F_pq(α*) = B_{F_pq}(0 : α*) = B_{F_pq}(1 : α*).

Corollary 1. The Chernoff information optimal skewing value α*(p : q) ∈ (0, 1) can be used to calculate the Chernoff information D_C[p, q] as a Bregman divergence induced by the LREF:

D_C[p, q] = B_{F_pq}(0 : α*) = B_{F_pq}(1 : α*).

In general, the divergence J^C_F(θ_1, θ_2) = max_{α∈(0,1)} J_{F,α}(θ_1 : θ_2) is called a Jensen–Chernoff divergence.
Proposition 3 lets us interpret the Chernoff information as a special symmetrization of the Kullback–Leibler divergence [43], different from the Jeffreys divergence or the Jensen–Shannon divergence [44]. Indeed, the Chernoff information can be rewritten as

D_C[p, q] = D_KL[(pq)^G_{α*} : p] = D_KL[(pq)^G_{α*} : q] = min_r max{D_KL[r : p], D_KL[r : q]}.

As such, we can interpret the Chernoff information as the radius of a minimum enclosing left-sided Kullback–Leibler ball on the space L¹(µ). A related concept is the radius [12] of two densities p and q with respect to Rényi divergences of order α (see Equation (2) of [12]):

R_α(p, q) := min_c max{D_{R,α}[p : c], D_{R,α}[q : c]}.

When α = 1, the radius is called the Shannon radius [12] since the Rényi divergence of order 1 corresponds to the Kullback–Leibler divergence (relative entropy).

Geometric Characterization of the Chernoff Information and the Chernoff Information Distribution
Let us term the probability distribution (PQ)^G_{α*} ≪ µ with corresponding density (pq)^G_{α*} the Chernoff information distribution, to avoid confusion with another concept of Chernoff distributions [45] used in statistics. We can characterize geometrically the Chernoff information distribution (pq)^G_{α*} on L¹(µ) as the intersection of a left-sided Kullback–Leibler divergence bisector:

Bi^left_KL(p, q) := {r ∈ L¹(µ) : D_KL[r : p] = D_KL[r : q]},

with an exponential arc [29]:

γ_G(p, q) := {(pq)^G_α : α ∈ (0, 1)}.

We thus interpret Proposition 3 geometrically by the following proposition (see Figure 3):

Figure 3. The Chernoff information distribution (PQ)^G_{α*} with density (pq)^G_{α*} is obtained as the unique intersection of the exponential arc γ_G(p, q) linking density p to density q of L¹(µ) with the left-sided Kullback–Leibler divergence bisector Bi^left_KL(p, q) of p and q.

Proposition 4 (Geometric characterization of the Chernoff information).
On the vector space L¹(µ), the Chernoff information distribution is the unique distribution at the intersection of the exponential arc with the left-sided Kullback–Leibler bisector:

(pq)^G_{α*} = γ_G(p, q) ∩ Bi^left_KL(p, q).

The point (pq)^G_{α*} has been called the Chernoff point in [24]. Proposition 4 allows us to design a dichotomic search to numerically approximate α*, as reported in pseudo-code in Algorithm 1 (see also the illustration in Figure 4).

Algorithm 1
Dichotomic search for approximating the Chernoff information by approximating the optimal skewing parameter value α̃ ≈ α* and reporting D_{B,α̃}[p : q]. The search requires ⌈log₂(1/ε)⌉ iterations to guarantee |α* − α̃| ≤ ε.

input : Two densities p, q of L¹(µ), and a numerical precision threshold ε > 0
α_min ← 0; α_max ← 1;
while α_max − α_min > ε do
    α ← (α_min + α_max)/2;
    // Bisector test of Proposition 4
    if D_KL[(pq)^G_α : p] > D_KL[(pq)^G_α : q] then α_min ← α else α_max ← α;
end
α̃ ← (α_min + α_max)/2;
return D_{B,α̃}[p : q]

See Figure 4 for an illustration and Proposition 4.

Figure 4. Illustration of the dichotomic search for approximating the optimal skewing parameter α* to within some prescribed numerical precision ε > 0.
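The following Python sketch implements Algorithm 1 for densities given as black-box functions (assuming SciPy/NumPy; the finite integration window [−15, 15] is an assumption adequate for these rapidly decaying test densities). On the two normal densities of Figure 1 it recovers α* ≈ 0.42156.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

LO, HI = -15.0, 15.0   # finite integration window (assumption for these densities)

def kl(a, b):
    """D_KL[a : b] by numerical quadrature."""
    return quad(lambda x: a(x) * np.log(a(x) / b(x)), LO, HI)[0]

def chernoff_dichotomic(p, q, eps=1e-6):
    """Algorithm 1: bisect on alpha using the left-sided KL bisector test."""
    lo_a, hi_a = 0.0, 1.0
    while hi_a - lo_a > eps:
        alpha = 0.5 * (lo_a + hi_a)
        Za = quad(lambda x: p(x)**alpha * q(x)**(1 - alpha), LO, HI)[0]
        m = lambda x: p(x)**alpha * q(x)**(1 - alpha) / Za  # geometric mixture
        if kl(m, p) > kl(m, q):   # mixture still closer to q: increase alpha
            lo_a = alpha
        else:
            hi_a = alpha
    alpha = 0.5 * (lo_a + hi_a)
    Za = quad(lambda x: p(x)**alpha * q(x)**(1 - alpha), LO, HI)[0]
    return alpha, -np.log(Za)

p = norm(0.0, 1.0).pdf
q = norm(1.0, np.sqrt(2.0)).pdf
print(chernoff_dichotomic(p, q))   # alpha* ~ 0.42156 (cf. Figure 1)
```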

Remark 3.
We do not necessarily need to handle normalized densities p and q since we have for α ∈ R\{0, 1}:

(pq)^G_α = (p̃ q̃)^G_α,

where p(x) = p̃(x)/Z_p and q(x) = q̃(x)/Z_q, with p̃ and q̃ denoting the computationally-friendly unnormalized positive densities (the normalizing constants Z_p and Z_q cancel out in the normalized geometric mixture). This property of geometric mixtures is used in Annealed Importance Sampling [46,47] (AIS), and for designing an asymptotically efficient estimator for computationally intractable parametric densities q̃_θ [48] (e.g., distributions learned by Boltzmann machines).

Dual Parameterization of LREFs
The densities (pq)^G_α of a LREF can also be parameterized by their dual moment parameter [30] (or mean parameter):

β(α) := E_{(pq)^G_α}[t(x)] = E_{(pq)^G_α}[log(p(x)/q(x))].

When the LREF is regular (and therefore steep [38]), we have β(α) = F'_pq(α) and α = (F*_pq)'(β), where F*_pq denotes the Legendre transform of F_pq. At the optimal value α*, we have F'_pq(α*) = 0. Therefore an equivalent condition of optimality is β(α*) = 0. Notice that when [0, 1] ⊂ Θ°, we have finite forward and reverse Kullback–Leibler divergences: D_KL[p : q] = B_{F_pq}(0 : 1) < ∞ and D_KL[q : p] = B_{F_pq}(1 : 0) < ∞.

Proposition 5 (Dual optimality condition for the Chernoff information). The unique Chernoff information optimal skewing parameter α* is such that

E_{(pq)^G_{α*}}[log(p(x)/q(x))] = 0.

One can understand that the Chernoff information is more robust or stable than a skewed Bhattacharyya distance by considering the derivative of the corresponding skewed Bhattacharyya distance. Consider without loss of generality densities p_{θ_1} and p_{θ_2} of a 1D exponential family. Their skewed Bhattacharyya distances amount to skew Jensen divergences, and we have:

(d/dα) J_{F,α}(θ_1 : θ_2) = F(θ_1) − F(θ_2) − (θ_1 − θ_2) F'(α θ_1 + (1 − α) θ_2).

Further assuming without loss of generality that θ_2 − θ_1 = 1, we get |(d/dα) J_{F,α}(θ_1 : θ_2)| = |F(θ_1) − F(θ_2) + F'(α θ_1 + (1 − α) θ_2)|, a quantity which vanishes at α = α*: small perturbations of the skewing parameter around α* only change the divergence value at second order. As a side remark, let us notice that the Fisher information of a likelihood ratio exponential family E_pq is

I_pq(α) = F''_pq(α) = Var_{(pq)^G_α}[log(p(x)/q(x))].

General Case
We shall now consider that the densities p and q (with respect to measure µ) belong to the same exponential family [30]:

E = {p_λ(x) = exp(θ(λ)^⊤ t(x) − F(θ(λ)) + k(x)) : λ ∈ Λ},

where θ(λ) denotes the natural parameter associated with the ordinary parameter λ, t(x) the sufficient statistic vector, and F(θ(λ)) the log-normalizer. When θ(λ) = λ and t(x) = x, the exponential family is called a natural exponential family (NEF). The exponential family E is defined by µ and t(x), hence we may write when necessary E = E_{µ,t}.

Example 2. The set of univariate Gaussian distributions
forms an exponential family with the following decomposition terms: densities p_{µ,σ}(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)), natural parameter θ = (µ/σ², −1/(2σ²)) ∈ R × R_{−−}, and sufficient statistics t(x) = (x, x²), where R_{++} and R_{−−} denote the set of positive real numbers and the set of negative real numbers, respectively. Letting v = σ² be the variance parameter, we get the equivalent natural parameters (µ/v, −1/(2v)). The log-normalizer can be written using the natural parameters as F(θ) = −θ_1²/(4θ_2) + (1/2) log(−π/θ_2). See Appendix B for further details concerning this normal exponential family.
Notice that we can easily check that the LREF induced by two densities p_{θ_1} and p_{θ_2} of an exponential family forms a 1D sub-exponential family of that exponential family, with log-normalizer

F_pq(α) = F(α θ_1 + (1 − α) θ_2) − (α F(θ_1) + (1 − α) F(θ_2)) = −J_{F,α}(θ_1 : θ_2),

where J_F denotes the Jensen divergence induced by F. The optimal skewing value condition of the Chernoff information between two categorical distributions [13] was extended to densities p_{θ_1} and p_{θ_2} of an exponential family in [24]. The family of categorical distributions with d choices forms an exponential family with natural parameter of dimension d − 1. Thus, Proposition 7 generalizes the analysis of [13].
Let p = p_{θ_1} and q = p_{θ_2}. Then we have the property that exponential families are closed under geometric mixtures:

(p_{θ_1} p_{θ_2})^G_α = p_{α θ_1 + (1 − α) θ_2}.

Since the natural parameter space Θ is convex, we have θ_α := α θ_1 + (1 − α) θ_2 ∈ Θ for all α ∈ [0, 1]. The KLD between two densities p_{θ_1} and p_{θ_2} of a regular exponential family E amounts to a reverse Bregman divergence for the log-normalizer of E:

D_KL[p_{θ_1} : p_{θ_2}] = B_F(θ_2 : θ_1),

where B_F(θ_2 : θ_1) denotes the Bregman divergence:

B_F(θ_2 : θ_1) := F(θ_2) − F(θ_1) − (θ_2 − θ_1)^⊤ ∇F(θ_1).

Thus, when the exponential family E is regular, both the forward and reverse KLD are finite, and we can rewrite Proposition 3 to characterize α* as follows:

D_KL[p_{θ_{α*}} : p_{θ_1}] = D_KL[p_{θ_{α*}} : p_{θ_2}].   (18)

The Legendre–Fenchel transform of F(θ) yields the convex conjugate F*(η) := sup_{θ∈Θ} {θ^⊤ η − F(θ)}, with η(θ) = ∇F(θ). Let H = {η(θ) : θ ∈ Θ} denote the dual moment parameter space, also called the domain of means. The Legendre transform associates to (Θ, F(θ)) the convex conjugate (H, F*(η)). In order for (H, F*(η)) to be of the same well-behaved type as (Θ, F(θ)), we shall consider convex functions F(θ) which are steep, meaning that their gradient diverges when nearing the boundary bd(Θ) [49], thus ensuring that domain H is also convex. Steep convex functions are said to be of Legendre type, and ((Θ, F(θ))*)* = (Θ, F(θ)) (Moreau biconjugation theorem, which shows that the Legendre transform is involutive). For Legendre-type functions, there is a one-to-one mapping between parameters θ(η) and parameters η(θ) as follows:

η(θ) = ∇F(θ)   and   θ(η) = ∇F*(η).

Exponential families with log-normalizers of Legendre type are called steep exponential families [30]. All regular exponential families are steep, and the maximum likelihood estimator in steep exponential families exists and is unique [38] (with the likelihood equations corresponding to the method of moments for the sufficient statistics). The set of inverse Gaussian distributions forms a non-regular but steep exponential family, and the set of singly truncated normal distributions forms a non-regular and non-steep exponential family [50] (but the exponential family of doubly truncated normal distributions is regular and hence steep).
For Legendre-type generators F(θ), the Bregman divergence B_F(θ_1 : θ_2) can be rewritten as the following Fenchel–Young divergence:

B_F(θ_1 : θ_2) = Y_{F,F*}(θ_1 : η_2) := F(θ_1) + F*(η_2) − θ_1^⊤ η_2.

Proposition 6 (KLD between densities of a regular (and steep) exponential family). The KLD between two densities p_{θ_1} and p_{θ_2} of a regular and steep exponential family can be obtained equivalently as

D_KL[p_{θ_1} : p_{θ_2}] = B_F(θ_2 : θ_1) = B_{F*}(η_1 : η_2) = Y_{F,F*}(θ_2 : η_1) = Y_{F*,F}(η_1 : θ_2),   (22)

where F(θ) and its convex conjugate F*(η) are Legendre-type functions. It follows that the optimal condition of Equation (18) can be restated as

Y_{F,F*}(θ_1 : η_{α*}) = Y_{F,F*}(θ_2 : η_{α*}),

where η_{α*} = ∇F(α* θ_1 + (1 − α*) θ_2). From the equality of Equation (22), we get the following simplified optimality condition:

(θ_1 − θ_2)^⊤ η_{α*} = F(θ_1) − F(θ_2),   (OC_SEF)

where η_{α*} = ∇F(α* θ_1 + (1 − α*) θ_2).
Figure 5. Taxonomy of exponential families: regular (and always steep) or steep (but not necessarily regular). The Kullback–Leibler divergence between two densities of a regular exponential family amounts to dual Bregman divergences.

Remark 4.
We can recover (OC_SEF) by instantiating the equivalent condition E_{p_{θ_{α*}}}[log(p_{θ_1}(x)/p_{θ_2}(x))] = 0 of Proposition 5. Indeed, since log(p_{θ_1}(x)/p_{θ_2}(x)) = (θ_1 − θ_2)^⊤ t(x) − F(θ_1) + F(θ_2) and E_{p_{θ_{α*}}}[t(x)] = η_{α*}, we recover (θ_1 − θ_2)^⊤ η_{α*} = F(θ_1) − F(θ_2). Since the α-skewed Bhattacharyya distance amounts to an α-skewed Jensen divergence [8], we get the Chernoff information as

D_C[p_{θ_1}, p_{θ_2}] = max_{α∈(0,1)} J_{F,α}(θ_1 : θ_2),

where J_{F,α}(θ_1 : θ_2) is the Jensen divergence:

J_{F,α}(θ_1 : θ_2) := α F(θ_1) + (1 − α) F(θ_2) − F(α θ_1 + (1 − α) θ_2).

Notice that the induced LREF has its log-normalizer expressed as the negative Jensen divergence induced by the log-normalizer of E:

F_pq(α) = −J_{F,α}(θ_1 : θ_2).

We summarize the result in the following proposition:

Proposition 7. Let p_{λ_1} and p_{λ_2} be two densities of a regular exponential family E with natural parameter θ(λ) and log-normalizer F(θ). Then the Chernoff information is

D_C[p_{λ_1}, p_{λ_2}] = J_{F,α*}(θ_1 : θ_2) = B_F(θ_1 : θ_{α*}) = B_F(θ_2 : θ_{α*}),

where θ_1 = θ(λ_1), θ_2 = θ(λ_2), and the optimal skewing parameter α* is unique and satisfies the following optimality condition:

(θ_1 − θ_2)^⊤ η_{α*} = F(θ_1) − F(θ_2),

where η_{α*} = ∇F(α* θ_1 + (1 − α*) θ_2). Figure 6 illustrates geometrically the Chernoff point [24], which is the geometric mixture (p_{θ_1} p_{θ_2})^G_{α*} induced by two comparable probability measures P_{θ_1}, P_{θ_2} ≪ µ.

Figure 6. The Chernoff information optimal skewing parameter α* for two densities p_{θ_1} and p_{θ_2} of some regular exponential family E inducing an exponential family dually flat manifold.

In information geometry [51], the manifold of densities M = {p_θ : θ ∈ Θ} of this exponential family is a dually flat space [51] with respect to the exponential connection ∇^e and the mixture connection ∇^m, where g_F(θ) is the Fisher information metric expressed in the θ-coordinate system as ∇²F(θ) (and in the dual moment parameter η as g_F(η) = ∇²F*(η)). The exponential connection ∇^e is flat, and its geodesics correspond to the exponential arcs of geometric mixtures when parameterized with the ∇^e-affine coordinate system θ.
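The optimality condition of Proposition 7 is easy to solve numerically in natural coordinates. The following sketch (assuming SciPy/NumPy; nat, F, and gradF are our helper names for the univariate normal exponential family of Example 2) finds α* with a root-finder and recovers the value α* ≈ 0.4215580558605244 of Figure 1.

```python
import numpy as np
from scipy.optimize import brentq

def nat(mu, var):
    """Natural parameters of N(mu, var): theta = (mu/var, -1/(2 var))."""
    return np.array([mu / var, -0.5 / var])

def F(t):
    """Log-normalizer of the univariate normal exponential family."""
    return -t[0]**2 / (4 * t[1]) + 0.5 * np.log(-np.pi / t[1])

def gradF(t):
    """Moment parameters eta = (E[x], E[x^2])."""
    return np.array([-t[0] / (2 * t[1]), t[0]**2 / (4 * t[1]**2) - 1 / (2 * t[1])])

t1, t2 = nat(0.0, 1.0), nat(1.0, 2.0)   # p_{0,1} and p_{1,2}

# (OC_SEF): (theta1 - theta2)^T grad F(theta_alpha) = F(theta1) - F(theta2)
g = lambda a: (t1 - t2) @ gradF(a * t1 + (1 - a) * t2) - (F(t1) - F(t2))
astar = brentq(g, 1e-9, 1 - 1e-9)

# Chernoff information as the skew Jensen divergence at alpha*:
J = astar * F(t1) + (1 - astar) * F(t2) - F(astar * t1 + (1 - astar) * t2)
print(astar, J)   # astar ~ 0.4215580558605244 (cf. Figure 1)
```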
The left-sided Kullback–Leibler Voronoi bisector

Bi^left_KL(p_{θ_1}, p_{θ_2}) := {p_θ : D_KL[p_θ : p_{θ_1}] = D_KL[p_θ : p_{θ_2}]}

corresponds to a right-sided Bregman bisector [52] and is ∇^m-flat (i.e., an affine subspace in the η-coordinate system):

{θ : (θ_1 − θ_2)^⊤ η(θ) = F(θ_1) − F(θ_2)}.

The Chernoff information distribution (p_{θ_1} p_{θ_2})^G_{α*} is called the Chernoff point on this exponential family manifold (see Figure 6). Since the Chernoff point is unique and since in general statistical manifolds (M, g, ∇, ∇*) can be realized by statistical models [53], we deduce the following proposition of interest for information geometry [51]:

Proposition 8. Let (M, g, ∇, ∇*) be a dually flat space with corresponding canonical divergence a Bregman divergence B_F. Let γ^e_pq(α) and γ^m_pq(α) be the e-geodesic and m-geodesic passing through the points p and q of M, respectively. Let Bi_m(p, q) and Bi_e(p, q) be the right-sided ∇^m-flat and left-sided ∇^e-flat Bregman bisectors, respectively. Then the intersection of γ^e_pq(α) with Bi_m(p, q) and the intersection of γ^m_pq(α) with Bi_e(p, q) are unique. The point γ^e_pq(α) ∩ Bi_m(p, q) is called the Chernoff point, and the point γ^m_pq(α) ∩ Bi_e(p, q) is termed the reverse or dual Chernoff point.

Case of One-Dimensional Parameters
When the exponential family has a one-dimensional natural parameter θ ∈ Θ ⊂ R, we thus get from (OC_SEF):

F'(θ_{α*}) = (F(θ_1) − F(θ_2))/(θ_1 − θ_2).

That is, α* can be obtained as the following closed-form formula whenever (F')^{-1} is available in closed form:

α* = (θ_{α*} − θ_2)/(θ_1 − θ_2),   with θ_{α*} = (F')^{-1}((F(θ_1) − F(θ_2))/(θ_1 − θ_2)).

Example 3. Consider the exponential family {p_v(x) : v > 0} of 0-centered scale univariate normal distributions with variance v = σ² and density

p_v(x) = (1/√(2πv)) exp(−x²/(2v)).

The natural parameter corresponding to the sufficient statistic t(x) = x² is θ = −1/(2v), with log-normalizer F(θ) = (1/2) log(−π/θ) and F'(θ) = −1/(2θ) = v, so that (F')^{-1}(η) = −1/(2η) is available in closed form. This result will be generalized in Proposition 11 to multivariate centered Gaussians with scaled covariance matrices (a code sketch follows this paragraph). For multi-dimensional parameters θ, we may consider the one-dimensional LREF E_{p_{θ_1} p_{θ_2}} induced by p_{θ_1} and p_{θ_2} with F_{θ_1,θ_2}(α) = F((1 − α)θ_1 + αθ_2), and write F'_{θ_1,θ_2}(α) as the following directional derivative obtained from a first-order Taylor expansion:

F'_{θ_1,θ_2}(α) = (θ_2 − θ_1)^⊤ ∇F((1 − α)θ_1 + αθ_2).

Thus, the optimality condition reads (θ_2 − θ_1)^⊤ ∇F((1 − α*)θ_1 + α*θ_2) = F(θ_2) − F(θ_1). This is equivalent to Equation (8) of [24].
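As a short sketch of the one-dimensional closed-form recipe (assuming NumPy; the θ_α = αθ_1 + (1 − α)θ_2 convention of Section 2 is used, with α weighting the first density), the centered scale-normal family of Example 3 can be coded as follows; for variances (1, 1/2) it recovers the value (2 log 2 − 1)/log 2 of Example 7 below.

```python
import numpy as np

F = lambda t: 0.5 * np.log(-np.pi / t)   # log-normalizer, t = -1/(2v) < 0
Finv_prime = lambda e: -0.5 / e          # (F')^{-1}: inverse of F'(t) = -1/(2t)

def alpha_star_scale_normal(v1, v2):
    """Closed-form Chernoff exponent for p_{0,v1} vs p_{0,v2} (Example 3)."""
    t1, t2 = -0.5 / v1, -0.5 / v2
    t_star = Finv_prime((F(t1) - F(t2)) / (t1 - t2))
    return (t_star - t2) / (t1 - t2)

print(alpha_star_scale_normal(1.0, 0.5))        # ~ 0.557305
print((2 * np.log(2) - 1) / np.log(2))          # same value (Example 7)
```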

The functions F_{θ_1,θ_2} are 1D Bregman generators (i.e., strictly convex and C¹), and we have the directional derivative F'_{θ_1,θ_2}(α) = (θ_2 − θ_1)^⊤ ∇F((1 − α)θ_1 + αθ_2). Similarly, we can reparameterize Bregman divergences on a k-dimensional simplex by k-dimensional Bregman divergences.

Remark 6.
Closing the loop: although the Chernoff information was obtained here from one-dimensional likelihood ratio exponential families, it yields as a corollary the formula for general multi-parametric exponential families, which in turn include as special instances the one-dimensional exponential families (e.g., LREFs!).

Forward and Reverse Chernoff-Bregman Divergences
In this section, we shall define Chernoff-type symmetrizations of Bregman divergences inspired by the study of Chernoff information, and briefly mention applications of these Chernoff-Bregman divergences in information theory.
The optimization problem in Equation (31) may be equivalently rewritten [43] as min_θ R such that both B_F(θ_1 : θ) ≤ R and B_F(θ_2 : θ) ≤ R. Thus, the optimal value of α defines the circumcenter θ* = αθ_1 + (1 − α)θ_2 of the minimum enclosing right-sided Bregman sphere [55,56], and the Chernoff–Bregman divergence

C_F(θ_1, θ_2) := min_θ max{B_F(θ_1 : θ), B_F(θ_2 : θ)}

corresponds to the radius of a minimum enclosing Bregman ball. To summarize, this Chernoff symmetrization is a min-max symmetrization, and we have the following identities:

C_F(θ_1, θ_2) = B_F(θ_1 : θ*) = B_F(θ_2 : θ*) = max_{α∈(0,1)} J_{F,α}(θ_1 : θ_2).

The second identity shows that the Chernoff symmetrization can be interpreted as a variational Jensen–Shannon-type divergence [54]. Notice that in general C_F(θ_1, θ_2) ≠ C_{F*}(η_1, η_2) because the primal and dual geodesics do not coincide. Those geodesics coincide only for symmetric Bregman divergences, which are squared Mahalanobis divergences [52].
Conditions for which C_F(θ_1, θ_2)^a (with a > 0) becomes a metric have been studied in [43]: for example, C_{F_Shannon}^{1/e} is a metric distance [43] (i.e., a = 1/e ≈ 0.36787944117). It is also known that the square root of the Chernoff distance between two univariate normal distributions is a metric distance [57].
We can thus use the Bregman generalization of the Badoiu–Clarkson (BC) algorithm [55] (Algorithm 2) to compute an approximation of the smallest enclosing Bregman ball, which in turn yields an approximation of the Chernoff–Bregman divergence:

Algorithm 2 Approximating the circumcenter of the Bregman smallest enclosing ball of two parameters θ_1 and θ_2.
input : Two parameters θ_1 and θ_2 of Θ, and a number of iterations T ∈ N
// Initialize the circumcenter
i ← 0; θ^(0) ← (θ_1 + θ_2)/2;
// Repeat T times
while i < T do
    // Find which of θ_1 or θ_2 is the farthest parameter from θ^(i)
    if B_F(θ_1 : θ^(i)) > B_F(θ_2 : θ^(i)) then f_i ← 1 else f_i ← 2;
    // Update the circumcenter by walking on the θ-geodesic
    // This update corresponds to walking on the exponential arc (θ^(i) θ_{f_i})^G
    θ^(i+1) ← θ^(i) + (θ_{f_i} − θ^(i))/(i + 2); i ← i + 1;
end
// Report the approximation of the circumcenter
return θ^(T)

Notice that when there are only two points for which to compute the smallest enclosing Bregman ball, all the arcs (θ^(i−1) θ_{f_i})^G are sub-arcs of the exponential arc (θ_1 θ_2)^G. See [55] for convergence results of this iterative algorithm. Let us notice that Algorithm 1 approximates α*, while the Bregman BC Algorithm 2 approximates in spirit D_C(θ_1, θ_2) (and, as a byproduct, α*).
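A compact Python sketch of Algorithm 2 follows (assuming NumPy; the step rule θ ← θ + (θ_far − θ)/(i + 2) is our reading of the Badoiu–Clarkson update of [55], and the extended negative-entropy generator is only an illustrative choice).

```python
import numpy as np

def bregman(F, gradF, t1, t2):
    """Bregman divergence B_F(t1 : t2)."""
    return F(t1) - F(t2) - gradF(t2) @ (t1 - t2)

def bregman_bc(theta1, theta2, F, gradF, T=10000):
    """Badoiu-Clarkson-type approximation of the circumcenter of the
    smallest enclosing right-sided Bregman ball of {theta1, theta2}."""
    c = 0.5 * (theta1 + theta2)              # initialize the circumcenter
    for i in range(T):
        # farthest point from the current center c
        far = theta1 if bregman(F, gradF, theta1, c) > bregman(F, gradF, theta2, c) else theta2
        c = c + (far - c) / (i + 2)          # walk on the theta-geodesic
    radius = max(bregman(F, gradF, theta1, c), bregman(F, gradF, theta2, c))
    return c, radius

# Illustration: extended negative entropy generator on the positive orthant
F = lambda t: np.sum(t * np.log(t) - t)
gradF = lambda t: np.log(t)
c, r = bregman_bc(np.array([0.2, 0.8]), np.array([0.7, 0.3]), F, gradF)
print(c, r)   # approximate circumcenter and Chernoff-Bregman radius
```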

Remark 7.
To compute the farthest point from the current circumcenter with respect to the Bregman divergence, we need to find the sign of

B_F(θ_1 : θ^(i)) − B_F(θ_2 : θ^(i)) = F(θ_1) − F(θ_2) − (θ_1 − θ_2)^⊤ ∇F(θ^(i)).

Thus we need to pre-calculate F(θ_1) and F(θ_2) only once, which can be costly (e.g., the −log det(Σ) terms need to be calculated only once when approximating the Chernoff information between Gaussians).

Reverse Chernoff-Bregman Divergence and Universal Coding
Similarly, we may define the reverse Chernoff–Bregman divergence by considering the minimum enclosing left-sided Bregman ball:

R* := min_θ max{B_F(θ : θ_1), B_F(θ : θ_2)}.

Thus the reverse Chernoff–Bregman divergence D^R_C[θ_1, θ_2] = R* is the radius of a minimum enclosing left-sided Bregman ball.
This reverse Chernoff–Bregman divergence finds application in universal coding in information theory (Chapter 13 of [13], pp. 428–433): Let X = {A_1, ..., A_d} be a finite discrete alphabet of d letters, and X be a random variable with probability mass function p on X. Let p_λ(x) denote the categorical distribution corresponding to X so that Pr(X = A_i) = λ_i. The Huffman codeword for x ∈ X is of length l(x) = −log p(x) (ignoring integer ceiling rounding), and the expected codeword length of X is thus given by Shannon's entropy H(X) = −∑_x p(x) log p(x).
If we code according to a distribution p_λ̃ instead of the true distribution p_λ, the code is not optimal, and the redundancy R(p_λ, p_λ̃) is defined as the difference between the expected lengths of the codewords for p_λ̃ and p_λ:

R(p_λ, p_λ̃) := D_KL[p_λ : p_λ̃],

where D_KL is the Kullback–Leibler divergence. Now, suppose that the true distribution p_λ belongs to one of two prescribed distributions that we do not know: p_λ ∈ P = {p_{λ_1}, p_{λ_2}}. Then we seek the minimax redundancy:

R* := min_{λ̃} max_{i∈{1,2}} D_KL[p_{λ_i} : p_λ̃].

The distribution p_{λ*} achieving the minimax redundancy is the circumcenter of the right-centered KL ball enclosing the distributions of P.
Using the natural coordinates θ = (θ_1, ..., θ_D) ∈ R^D with θ_i = log(λ_i/λ_d) of the log-normalizer of the categorical distributions (an exponential family of order D = d − 1), we end up calculating the smallest left-sided Bregman enclosing ball for the Bregman generator [58]:

F(θ) = log(1 + ∑_{i=1}^{D} e^{θ_i}).

This latter minimax problem is unconstrained since θ ∈ R^D = R^{d−1}.
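For two Bernoulli sources (d = 2, D = 1), the generator reduces to F(θ) = log(1 + e^θ), and the minimax redundancy can be computed by balancing the two enclosing constraints; the following sketch (assuming SciPy/NumPy; the source probabilities 0.1 and 0.6 are arbitrary illustration values) does so with a root-finder.

```python
import numpy as np
from scipy.optimize import brentq

F = lambda t: np.log1p(np.exp(t))               # categorical log-normalizer, D = 1
gradF = lambda t: 1.0 / (1.0 + np.exp(-t))      # eta = grad F = Bernoulli mean
B = lambda t1, t2: F(t1) - F(t2) - gradF(t2) * (t1 - t2)   # B_F(t1 : t2)

def minimax_redundancy(l1, l2):
    """min over coding distributions of max_i KL[p_{lambda_i} : p_coding]."""
    t1, t2 = np.log(l1 / (1 - l1)), np.log(l2 / (1 - l2))  # natural coordinates
    # KL[p_{t_i} : p_t] = B_F(t : t_i); at the optimum both redundancies are equal:
    g = lambda t: B(t, t1) - B(t, t2)
    ts = brentq(g, min(t1, t2), max(t1, t2))
    return ts, B(ts, t1)

ts, R = minimax_redundancy(0.1, 0.6)
print(gradF(ts), R)   # optimal coding probability, minimax redundancy
```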

Invariance of Chernoff Information under the Action of the Affine Group
The d-variate Gaussian density p_λ(x) with parameter λ = (λ_v, λ_M) = (µ, Σ), where µ ∈ R^d denotes the mean (µ = E_{p_λ}[x]) and Σ is a positive-definite covariance matrix (Σ = Cov_{p_λ}[X] for X ∼ p_λ), is given by

p_λ(x) = (2π)^{−d/2} |Σ|^{−1/2} exp(−(1/2)(x − µ)^⊤ Σ^{−1}(x − µ)),

where |·| denotes the matrix determinant. The set of d-variate Gaussian distributions forms a regular (and hence steep) exponential family with natural parameters θ = (Σ^{−1}µ, −(1/2)Σ^{−1}) and sufficient statistics t(x) = (x, x x^⊤). The Bhattacharyya distance between two multivariate Gaussian distributions p_{µ_1,Σ_1} and p_{µ_2,Σ_2} is

D_{B,α}[p_{µ_1,Σ_1} : p_{µ_2,Σ_2}] = (α(1 − α)/2)(µ_2 − µ_1)^⊤ Σ_α^{−1}(µ_2 − µ_1) + (1/2) log(|Σ_α|/(|Σ_1|^{1−α} |Σ_2|^α)),

where Σ_α := (1 − α)Σ_1 + αΣ_2. The Gaussian density can be rewritten as a multivariate location-scale family:

p_λ(x) = (1/|λ_M|^{1/2}) p_std(λ_M^{−1/2}(x − λ_v)),

where p_std := p_{0,I} denotes the standard multivariate Gaussian density. The matrix λ_M^{1/2} is the unique symmetric square-root matrix, which is positive-definite when λ_M is positive-definite.

Remark 8.
Notice that the product of two symmetric positive-definite matrices P_1 and P_2 may not be symmetric, but P_1 P_2 has real positive eigenvalues since it is similar to the symmetric positive-definite matrix P_1^{1/2} P_2 P_1^{1/2}. We may interpret the Gaussian family as obtained by the action of the affine group Aff(R^d) = R^d ⋊ GL_d(R) on the standard density p_std. Let the dot symbol "." denote the group action. The affine group is equipped with the following (outer) semidirect product:

(a_1, A_1) · (a_2, A_2) := (a_1 + A_1 a_2, A_1 A_2),

and this group can be handled as a matrix group with the following mapping of its elements to (d + 1) × (d + 1) matrices:

(a, A) ↦ [[A, a], [0, 1]].

Then we have p_{µ,Σ} = (µ, Σ^{1/2}).p_std. We can show the following invariance of the skewed Bhattacharyya divergences:

Proposition 9 (Invariance of the Bhattacharyya divergence and f-divergences under the action of the affine group (Equation (35))). We have

D_{B,α}[(a, A).p : (a, A).q] = D_{B,α}[p : q],   (a, A) ∈ Aff(R^d).

Proof. The proof follows from the (f, g)-form of Ali and Silvey's divergences [59]: we can express D_{B,α}[p : q] = g(I_{h_α}[p : q]), where I_f denotes the f-divergence, h_α(u) = −u^α (convex for α ∈ (0, 1)), and g(v) = −log(−v), and f-divergences are invariant under the invertible affine transformations of the sample space.

Corollary 2 (Bhattacharyya divergence from canonical Bhattacharyya divergences).
We have

D_{B,α}[p_{µ_1,Σ_1} : p_{µ_2,Σ_2}] = D_{B,α}[p_{0,I} : p_{µ_{12},Σ_{12}}],

where µ_{12} := Σ_1^{−1/2}(µ_2 − µ_1) and Σ_{12} := Σ_1^{−1/2} Σ_2 Σ_1^{−1/2}. It follows that the Chernoff optimal skewing parameter enjoys the same invariance property:

α*(p_{µ_1,Σ_1} : p_{µ_2,Σ_2}) = α*(p_{0,I} : p_{µ_{12},Σ_{12}}).

As a byproduct, we get the invariance of the Chernoff information under the action of the affine group:

Corollary 3 (Invariance of the Chernoff information under the action of the affine group).
We have:

D_C[p_{µ_1,Σ_1}, p_{µ_2,Σ_2}] = D_C[(a, A).p_{µ_1,Σ_1}, (a, A).p_{µ_2,Σ_2}],   ∀(a, A) ∈ Aff(R^d).

Thus, the formula for the Chernoff information between two Gaussians can be written as a function of the two terms µ_{12} = Σ_1^{−1/2}(µ_2 − µ_1) and Σ_{12} = Σ_1^{−1/2} Σ_2 Σ_1^{−1/2}.

Closed-Form Formula for the Chernoff Information between Univariate Gaussian Distributions
We shall report the exact solution for the Chernoff information between univariate Gaussian distributions, obtained by solving a quadratic equation. We also report the resulting closed-form formula obtained using symbolic computing, because the calculations are lengthy and thus prone to human error.
One can also program these closed-form solutions in Python using the SymPy package (https://www.sympy.org/en/index.html (accessed on 30 July 2022)) for performing symbolic computations.
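Along those lines, the following SymPy sketch (assuming the α-skewed Bhattacharyya distance formula for univariate normals, with the variance mixture (1 − α)v_1 + αv_2 and α weighting the first density, matching the convention used for Figure 1) solves the stationarity condition numerically.

```python
import sympy as sp

a, m1, v1, m2, v2 = sp.symbols('alpha mu1 v1 mu2 v2')

# alpha-skewed Bhattacharyya distance between N(mu1, v1) and N(mu2, v2),
# with alpha weighting the first density:
va = (1 - a) * v1 + a * v2
D = sp.Rational(1, 2) * (a * (1 - a) * (m2 - m1)**2 / va
                         + sp.log(va / (v1**(1 - a) * v2**a)))

# Stationarity condition dD/dalpha = 0 characterizes the Chernoff exponent:
cond = sp.diff(D, a)

# Instantiate Figure 1's example: p_{0,1} and p_{1,2}
cond01 = cond.subs({m1: 0, v1: 1, m2: 1, v2: 2})
astar = sp.nsolve(cond01, a, 0.5)
print(astar)   # ~ 0.4215580558605244
print(sp.N(D.subs({m1: 0, v1: 1, m2: 1, v2: 2, a: astar})))  # Chernoff information
```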
Let us report special cases with some illustrative examples.
In comparison, the bisection algorithm of [24] only yields a numerical approximation of α* (and hence of the Chernoff information).

Fast Approximation of the Chernoff Information of Multivariate Gaussian Distributions
In general, the Chernoff information between d-variate Gaussian distributions is not known in closed form when d > 1; see for example [61][62][63]. We shall consider below some special cases:

• When the Gaussians have the same covariance matrix Σ, the Chernoff information optimal skewing parameter is α* = 1/2 and the Chernoff information is

D_C[p_{µ_1,Σ}, p_{µ_2,Σ}] = (1/8) Δ²_Σ(µ_1, µ_2),

where Δ²_Σ(µ_1, µ_2) := (µ_2 − µ_1)^⊤ Σ^{−1}(µ_2 − µ_1) is the squared Mahalanobis distance. The Mahalanobis distance enjoys the following property by congruence transformation: Δ²_{AΣA^⊤}(Aµ_1, Aµ_2) = Δ²_Σ(µ_1, µ_2) for any invertible matrix A. Notice that we can rewrite the (squared) Mahalanobis distance as Δ²_Σ(µ_1, µ_2) = tr(Σ^{−1}(µ_2 − µ_1)(µ_2 − µ_1)^⊤) using the matrix trace cyclic property. Then we check that Δ²_{AΣA^⊤}(Aµ_1, Aµ_2) = tr((AΣA^⊤)^{−1} A(µ_2 − µ_1)(µ_2 − µ_1)^⊤ A^⊤) = tr(Σ^{−1}(µ_2 − µ_1)(µ_2 − µ_1)^⊤).

• The Chernoff information for the special case of centered multivariate Gaussian distributions was studied in [62]. The KLD between two centered Gaussians p_{µ,Σ_1} and p_{µ,Σ_2} is half of the matrix Burg distance:

D_KL[p_{µ,Σ_1} : p_{µ,Σ_2}] = (1/2)(tr(Σ_2^{−1}Σ_1) − log |Σ_2^{−1}Σ_1| − d) =: (1/2) D_Burg(Σ_1, Σ_2).

When d = 1, the Burg distance corresponds to the well-known Itakura–Saito divergence. The matrix Burg distance is a matrix spectral distance [62]:

D_Burg(Σ_1, Σ_2) = ∑_{i=1}^{d} (λ_i − log λ_i − 1),

where the λ_i's are the eigenvalues of Σ_2^{−1}Σ_1. More generally, the f-divergences between centered Gaussian distributions are always matrix spectral divergences [60].
Otherwise, for the general multivariate case, we implement the dichotomic search of Algorithm 1 in Algorithm 3, with the KLD between two multivariate Gaussian distributions expressed as

D_KL[p_{µ_1,Σ_1} : p_{µ_2,Σ_2}] = (1/2)(tr(Σ_2^{−1}Σ_1) + (µ_2 − µ_1)^⊤ Σ_2^{−1}(µ_2 − µ_1) − d + log(|Σ_2|/|Σ_1|)).   (50)

Algorithm 3 Dichotomic search for approximating the Chernoff information between two multivariate normal distributions p_{µ_1,Σ_1} and p_{µ_2,Σ_2} by approximating the optimal skewing parameter value α̃ ≈ α*.
input : Two normal densities p_{µ_1,Σ_1} and p_{µ_2,Σ_2}, and a numerical precision threshold ε > 0
α_min ← 0; α_max ← 1;
while α_max − α_min > ε do
    α ← (α_min + α_max)/2;
    // Geometric mixture (Section 2 convention, α weighting p_{µ_1,Σ_1}):
    Σ_α ← (α Σ_1^{−1} + (1 − α) Σ_2^{−1})^{−1}; µ_α ← Σ_α (α Σ_1^{−1} µ_1 + (1 − α) Σ_2^{−1} µ_2);
    // Formula of the KLD between two normal distributions in Equation (50)
    if D_KL[p_{µ_α,Σ_α} : p_{µ_1,Σ_1}] > D_KL[p_{µ_α,Σ_α} : p_{µ_2,Σ_2}] then α_min ← α else α_max ← α;
end
return α̃ = (α_min + α_max)/2 and D_KL[p_{µ_α̃,Σ_α̃} : p_{µ_1,Σ_1}]

(A Python sketch of Algorithm 3 is given after this section's interpolation formulas.) The m-interpolation of the multivariate Gaussian distributions p_{µ_1,Σ_1} and p_{µ_2,Σ_2} with respect to the mixture connection ∇^m is given by linear interpolation in the moment parameterization η = (µ, Σ + µµ^⊤), going from p_{µ_1,Σ_1} at α = 0 to p_{µ_2,Σ_2} at α = 1:

µ^m_α = (1 − α)µ_1 + αµ_2,   Σ^m_α = (1 − α)Σ_1 + αΣ_2 + α(1 − α)(µ_2 − µ_1)(µ_2 − µ_1)^⊤.

The e-interpolation of the multivariate Gaussian distributions p_{µ_1,Σ_1} and p_{µ_2,Σ_2} with respect to the exponential connection ∇^e is given by

µ^e_α = Σ^e_α ((1 − α)Σ_1^{−1}µ_1 + αΣ_2^{−1}µ_2),   where Σ^e_α = ((1 − α)Σ_1^{−1} + αΣ_2^{−1})^{−1}.

In information geometry, both these e- and m-connections defined with respect to an exponential family are shown to be flat. These geodesics correspond to linear interpolations in the ∇^e-affine coordinate system θ and in the dual ∇^m-affine coordinate system η, respectively. Figure 7 displays these e-geodesic and m-geodesic between two multivariate normal distributions. Notice that the Riemannian geodesic with respect to the Levi-Civita metric connection (∇^e + ∇^m)/2 is not known in closed form for boundary value conditions. The expression of the Riemannian geodesic is known only for initial value conditions [64] (i.e., a starting point with a given vector direction).
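A sketch of Algorithm 3 follows (assuming NumPy; α weights the first Gaussian as in Section 2, and the closed-form KLD of Equation (50) drives the bisector test). On the univariate pair of Figure 1, coded as 1 × 1 matrices, it recovers α* ≈ 0.42156.

```python
import numpy as np

def kl_gauss(m0, S0, m1, S1):
    """Closed-form KLD between N(m0, S0) and N(m1, S1), cf. Equation (50)."""
    d = len(m0)
    S1inv = np.linalg.inv(S1)
    dm = m1 - m0
    return 0.5 * (np.trace(S1inv @ S0) + dm @ S1inv @ dm - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def chernoff_gauss(m1, S1, m2, S2, eps=1e-10):
    """Algorithm 3: dichotomic search for the Chernoff exponent."""
    P1, P2 = np.linalg.inv(S1), np.linalg.inv(S2)
    lo, hi = 0.0, 1.0
    while hi - lo > eps:
        a = 0.5 * (lo + hi)
        # e-interpolation (geometric mixture): precision and mean
        Sa = np.linalg.inv(a * P1 + (1 - a) * P2)
        ma = Sa @ (a * P1 @ m1 + (1 - a) * P2 @ m2)
        # bisector test of Proposition 4: compare the two KLDs
        if kl_gauss(ma, Sa, m1, S1) > kl_gauss(ma, Sa, m2, S2):
            lo = a   # mixture still closer to the second Gaussian
        else:
            hi = a
    a = 0.5 * (lo + hi)
    Sa = np.linalg.inv(a * P1 + (1 - a) * P2)
    ma = Sa @ (a * P1 @ m1 + (1 - a) * P2 @ m2)
    return a, kl_gauss(ma, Sa, m1, S1)

# Univariate example of Figure 1 as 1x1 matrices:
print(chernoff_gauss(np.array([0.]), np.array([[1.]]),
                     np.array([1.]), np.array([[2.]])))  # alpha* ~ 0.42156
```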
Let ⟨A, B⟩ := tr(A^⊤ B) define the inner product between two symmetric matrices A and B. Then we can write the centered Gaussian density p_Σ(x) in the canonical form of exponential families:

p_Σ(x) = exp(⟨θ, x x^⊤⟩ − F(θ)),   θ = −(1/2)Σ^{−1},   F(θ) = (d/2) log π − (1/2) log |−θ|.

The function log det of a positive-definite matrix is strictly concave [65], and hence we check that F(θ) is strictly convex. Furthermore, we have ∇_X log |X| = X^{−1} for a symmetric invertible matrix X, so that ∇_θ F(θ) = −(1/2)θ^{−1}. The optimality condition equation of the Chernoff best skewing parameter α* becomes:

⟨θ_1 − θ_2, −(1/2)θ_{α*}^{−1}⟩ = F(θ_1) − F(θ_2).   (54)

When Σ_2 = sΣ_1 (and Σ_2^{−1} = (1/s)Σ_1^{−1}) for s > 0 and s ≠ 1, we get a closed form for α* using the facts that |(1/s)I| = 1/s^d and tr(I) = d for the d-dimensional identity matrix I. Solving Equation (54) yields

α* = 1/log s − 1/(s − 1) = (s − 1 − log s)/((s − 1) log s).

Therefore the Chernoff information between two scaled centered Gaussian distributions p_{µ,Σ} and p_{µ,sΣ} is available in closed form.
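A quick numerical sketch of this closed form (assuming SciPy/NumPy; the Bhattacharyya expression below follows the convention where α weights p_{µ,Σ}, consistent with the α* formula above): the bisection-free closed form and a direct numerical maximization agree, and s = 1/2 recovers the value of Example 7 below.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def alpha_star_scaled(s):
    """Closed-form Chernoff exponent between p_{mu,Sigma} and p_{mu,s Sigma}."""
    return 1.0 / np.log(s) - 1.0 / (s - 1.0)

def bhattacharyya_scaled(alpha, s, d=1):
    """D_{B,alpha}[p_{0,Sigma} : p_{0,s Sigma}] for Sigma = I_d."""
    return 0.5 * d * (np.log(1 - alpha + alpha * s) - alpha * np.log(s))

s = 0.5
res = minimize_scalar(lambda a: -bhattacharyya_scaled(a, s),
                      bounds=(1e-9, 1 - 1e-9), method='bounded')
print(res.x, alpha_star_scaled(s))          # both ~ 0.557305
print((2 * np.log(2) - 1) / np.log(2))      # Example 7's value
```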
Notice that α*(p_{µ,Σ} : p_{µ,sΣ}) = 1 − α*(p_{µ,Σ} : p_{µ,(1/s)Σ}), as can be checked from the closed-form formula. Example 7. Consider µ_1 = µ_2 = 0 and Σ_1 = I, Σ_2 = (1/2)I. We find that α* = (2 log 2 − 1)/log 2, which is independent of the dimension of the matrices. The Chernoff information depends on the dimension:

D_C[p_{0,I}, p_{0,(1/2)I}] = (d/2)(2 log 2 − 1 − log(2 log 2)).

Notice that when d = 1, we have s = v_2/v_1, and we recover a special case of the closed-form formula for the Chernoff information between univariate Gaussians. In [62], the following equation is reported for finding α* based on Equation (54):

∑_{i=1}^{d} ( log λ_i + (1 − λ_i)/(λ_i + α*(1 − λ_i)) ) = 0,   (57)

where the λ_i's are the generalized eigenvalues of Σ_1 Σ_2^{−1} (this excludes the case of all λ_i's equal to one). The value of α* satisfying Equation (57) is unique. Let us notice that the product of two symmetric positive-definite matrices is not necessarily symmetric anymore, although its eigenvalues are real and positive.
We can derive Equation (57) by expressing Equation (54) in terms of the eigenvalues of Σ_1 Σ_2^{−1}.

Remark 9. We can get closed-form solutions for α* and the corresponding Chernoff information in some particular cases. For example, when the dimension d = 2, we need to solve a quadratic equation to get α*. Thus, for d ≤ 4, we get a closed-form solution for α* by solving a polynomial equation characterizing the optimality condition, and obtain the Chernoff information in closed form as a byproduct.
Example 8. Consider the Chernoff information between p_{0,I} and p_{0,Λ} with Λ = diag(1, 2, 3, 4). We get the exact Chernoff exponent value α* by taking the root of a quartic polynomial equation falling in (0, 1). By evaluating numerically this root, we find that α* ≈ 0.59694 and that the Chernoff information is D_C[p_{0,I}, p_{0,Λ}] ≈ 0.22076. See Appendix C for some symbolic computation code.
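Example 8 can be cross-checked with SymPy under stated assumptions: here the skewed Bhattacharyya distance is written with α weighting p_{0,Λ} (the convention matching the reported exponent; the Chernoff information itself is symmetric), and nsolve locates the root of its derivative.

```python
import sympy as sp

a = sp.symbols('alpha')
lams = [1, 2, 3, 4]   # eigenvalues of Lambda = diag(1, 2, 3, 4)

# Skewed Bhattacharyya distance D_{B,alpha}[p_{0,Lambda} : p_{0,I}]:
D = sp.Rational(1, 2) * sum(sp.log(a + (1 - a) * lam) - (1 - a) * sp.log(lam)
                            for lam in lams)

astar = sp.nsolve(sp.diff(D, a), a, 0.5)
print(astar)                   # ~ 0.59694
print(sp.N(D.subs(a, astar)))  # ~ 0.22076
```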

Chernoff Information between Densities of Different Exponential Families
Let E_1 = {p_θ : θ ∈ Θ_1} and E_2 = {q_{θ'} : θ' ∈ Θ_2} be two distinct exponential families, and consider the Chernoff information between the densities p_{θ_1} and q_{θ_2}. The exponential arc induced by p_{θ_1} and q_{θ_2} is

(p_{θ_1} q_{θ_2})^G_α(x) ∝ exp(α θ_1^⊤ t_1(x) + (1 − α) θ_2^⊤ t_2(x)).

Let E_{12} denote the exponential family with sufficient statistics (t_1(x), t_2(x)), log-normalizer F_{12}(θ, θ'), and natural parameter space Θ_{12}. Family E_{12} can be interpreted as a product exponential family, which yields an exponential family. We have

(p_{θ_1} q_{θ_2})^G_α = p_{(α θ_1, (1−α) θ_2)} ∈ E_{12}.

Thus the induced LREF E_{p_{θ_1} q_{θ_2}} with natural parameter space Θ_{p_{θ_1} q_{θ_2}} can be interpreted as a 1D curved exponential family of the product exponential family E_{12}. The optimal skewing parameter α* is found by setting the derivative of F_{12}(α θ_1, (1 − α) θ_2) with respect to α to zero:

(d/dα) F_{12}(α θ_1, (1 − α) θ_2) = 0.

Example 9. Let E_1 be the exponential family of exponential distributions defined on the support X_1 = (0, ∞), and let E_2 be the exponential family of half-normal distributions with support X_2 = (0, ∞). The product exponential family corresponds to the singly truncated normal family [50], which is non-regular (i.e., its natural parameter space is not topologically an open set):

Θ_{12} = {(θ_1, θ_2) : θ_2 > 0} ∪ Θ_0,

with Θ_0 = {(θ, 0) : θ < 0} (the part corresponding to the exponential family of exponential distributions). This exponential family E_{12} = {p_{θ_1,θ_2}} of singly truncated normal distributions is also non-steep [50]. For the density p_{θ_1,θ_2}(x) ∝ exp(θ_1 x − θ_2 x²) on (0, ∞), the log-normalizer is

F_{12}(θ_1, θ_2) = θ_1²/(4θ_2) + (1/2) log(π/θ_2) + log Φ(θ_1/√(2θ_2)),   θ_2 > 0,

where θ_1 = µ/σ², θ_2 = 1/(2σ²), and Φ denotes the cumulative distribution function of the standard normal. Function F_{12} is proven to be of class C¹ on Θ_{12} (see Proposition 3.1 of [50]), with F_{12}(θ, 0) = −log(−θ) for θ < 0.
Notice that the KLD between an exponential distribution and a half-normal distribution is +∞ since the corresponding definite integral diverges (hence D_KL[e_λ : h_σ] is not equivalent to a Bregman divergence, and Θ_{e_λ h_σ} is not open at 1), but the reverse KLD between a half-normal distribution and an exponential distribution is available in closed form (using symbolic computing). Figure 8 illustrates the domain of the singly truncated normal distributions and displays an exponential arc between an exponential distribution and a half-normal distribution. Notice that we could have also considered a similar but different example by taking the exponential family of Rayleigh distributions, which exhibits an additional carrier term k(x).
The Bhattacharyya α-skewed coefficient, calculated using symbolic computing (see Appendix C), admits a closed form in which the error function erf appears.

Conclusions
In this work, we revisited the Chernoff information [2] (1952), which was originally introduced to upper bound the Bayes error in binary hypothesis testing. A general characterization of the Chernoff information between two arbitrary probability measures was given in [11] (Theorem 32) by considering Rényi divergences, which can be interpreted as scaled skewed Bhattacharyya distances. Since its inception, the Chernoff information has proven useful as a statistical divergence (Chernoff divergence) in many applications ranging from information fusion to quantum metrology due to its empirical robustness property [19]. Informally, we may observe empirically that in practice the skewed Bhattacharyya divergence is more stable around the Chernoff exponent α* than in other parts of the range (0, 1). By considering the maximal extension of the exponential arc joining two densities p and q of a Lebesgue space L¹(µ), we built full likelihood ratio exponential families [10] E_pq (LREFs) in Section 2. When the LREF E_pq is a regular exponential family (with coinciding supports of p and q), both the forward and reverse Kullback–Leibler divergences are finite and can be rewritten as finite Bregman divergences induced by the log-normalizer F_pq of E_pq, which amounts to minus the skewed Bhattacharyya divergence. Since log-normalizers of exponential families are strictly convex, we deduced that the skewed Bhattacharyya divergences are strictly concave in the skewing parameter, and hence that their maximization yielding the Chernoff information admits a unique maximizer. As a byproduct, this geometric characterization in L¹(µ) allowed us to prove that the intersection of an e-geodesic with a m-bisector is unique in dually flat subspaces of L¹(µ), and similarly that the intersection of a m-geodesic with an e-bisector is unique (Proposition 8). We then considered the exponential families of univariate and multivariate normal distributions: we reported closed-form solutions for the Chernoff information between univariate normal distributions and between centered normal distributions with scaled covariance matrices, and showed how to implement efficiently a dichotomic search for approximating the Chernoff information between two multivariate normal distributions (Algorithm 3). Table 1 summarizes the various optimality conditions studied characterizing the Chernoff exponent. Finally, inspired by this study of the Chernoff information, we defined in Section 4 the forward and reverse Chernoff–Bregman divergences [66], and showed how these divergences are related to the capacity of a discrete memoryless channel and the minimax redundancy of universal coding in information theory [13].