Statistical Divergences between Densities of Truncated Exponential Families with Nested Supports: Duo Bregman and Duo Jensen Divergences

By calculating the Kullback–Leibler divergence between two probability measures belonging to different exponential families dominated by the same measure, we obtain a formula that generalizes the ordinary Fenchel–Young divergence. Inspired by this formula, we define the duo Fenchel–Young divergence and report a majorization condition on its pair of strictly convex generators, which guarantees that this divergence is always non-negative. The duo Fenchel–Young divergence is also equivalent to a duo Bregman divergence. We show how to use these duo divergences by calculating the Kullback–Leibler divergence between densities of truncated exponential families with nested supports, and report a formula for the Kullback–Leibler divergence between truncated normal distributions. Finally, we prove that the skewed Bhattacharyya distances between truncated exponential families amount to equivalent skewed duo Jensen divergences.


Introduction

Exponential families
Let (X, Σ) be a measurable space, and consider a regular minimal exponential family [33] E of probability measures P_θ all dominated by a base measure µ (P_θ ≪ µ). The Radon-Nikodym derivatives of the probability measures P_θ with respect to µ can be written canonically as

dP_θ/dµ(x) = exp(θ^⊤ t(x) − F(θ) + k(x)),

where θ denotes the natural parameter, t(x) the sufficient statistic, and F(θ) the log-normalizer [33] (or cumulant function). The optional auxiliary term k(x) allows one to change the base measure µ into the measure ν such that dν/dµ(x) = e^{k(x)}. The distributions P_θ of the exponential family E can be interpreted as distributions obtained by tilting the base measure µ [12]. Thus when t(x) = x, these natural exponential families [33] are also called tilted exponential families [15] in the literature.
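The canonical form above can be made concrete with a small numerical sketch. The Poisson family is used here as an illustration (an assumption of ours, not an example from the text), with t(x) = x, θ = log λ, F(θ) = e^θ, and k(x) = −log x!:

```python
import math

# Canonical exponential-family form: p_theta(x) = exp(theta*t(x) - F(theta) + k(x)).
# Illustration with the Poisson family: t(x)=x, theta=log(lam), F(theta)=exp(theta),
# k(x) = -log(x!).

def poisson_pmf_canonical(x, lam):
    theta = math.log(lam)          # natural parameter
    t = x                          # sufficient statistic
    F = math.exp(theta)           # log-normalizer (cumulant function)
    k = -math.lgamma(x + 1)       # auxiliary term k(x) = -log(x!)
    return math.exp(theta * t - F + k)

def poisson_pmf_direct(x, lam):
    return lam**x * math.exp(-lam) / math.factorial(x)

# The two expressions of the pmf agree on the support.
for x in range(20):
    assert abs(poisson_pmf_canonical(x, 3.5) - poisson_pmf_direct(x, 3.5)) < 1e-12
```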

Kullback-Leibler divergence between exponential family distributions
For two σ-finite probability measures P and Q on (X, Σ) such that P is dominated by Q (P ≪ Q), the Kullback-Leibler divergence (KLD) between P and Q is defined by

D_KL[P : Q] := ∫ log(dP/dQ) dP = E_P[log(dP/dQ)].
When P ̸≪ Q, we set D_KL[P : Q] = +∞. Gibbs' inequality [9] D_KL[P : Q] ≥ 0 shows that the KLD is always non-negative. The proof of Gibbs' inequality relies on Jensen's inequality and holds for the wide class of f-divergences [10] induced by convex generators f(u): the KLD is an f-divergence obtained for the convex generator f(u) = −log u.

Kullback-Leibler divergence between exponential family densities
It is well-known that the KLD between two distributions P_θ1 and P_θ2 of E amounts to computing an equivalent Fenchel-Young divergence [3]:

D_KL[P_θ1 : P_θ2] = Y_{F,F*}(θ2, η1),

where η = ∇F(θ) = E_{P_θ}[t(x)] is the moment parameter [33], and the Fenchel-Young divergence is defined for a pair of strictly convex conjugate functions [32] F(θ) and F*(η) related by the Legendre-Fenchel transform by

Y_{F,F*}(θ1, η2) := F(θ1) + F*(η2) − θ1^⊤ η2.

Amari (1985) first introduced this formula as the canonical divergence of dually flat spaces in information geometry [2] (Eq. 3.21), and proved that the Fenchel-Young divergence is obtained as the KLD between densities belonging to a same exponential family [2] (Theorem 3.7). Azoury and Warmuth (2001) expressed the KLD D_KL[P_θ1 : P_θ2] using dual Bregman divergences in [3]:

D_KL[P_θ1 : P_θ2] = B_F(θ2 : θ1) = B_{F*}(η1 : η2),

where a Bregman divergence [7] B_F(θ1 : θ2) is defined for a strictly convex and differentiable generator F(θ) by:

B_F(θ1 : θ2) := F(θ1) − F(θ2) − (θ1 − θ2)^⊤ ∇F(θ2).

Acharyya termed the divergence Y_{F,F*} the Fenchel-Young divergence in his PhD thesis [1] (2013), and Blondel et al. (2020) called them Fenchel-Young losses in the context of machine learning [6] (Eq. 9 in Definition 2). It was also called the Legendre-Fenchel divergence by the author in [19]. The Fenchel-Young divergence stems from the Fenchel-Young inequality [32]:

F(θ1) + F*(η2) ≥ θ1^⊤ η2,

with equality iff η2 = ∇F(θ1). Figure 1 visualizes the 1D Fenchel-Young divergence and gives a geometric proof that Y_{F,F*}(θ1, η2) ≥ 0 with equality if and only if η2 = F′(θ1).
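A small numerical sketch makes the KLD/Bregman correspondence concrete. We use the exponential distributions with the standard parameterization θ = −λ and F(θ) = −log(−θ) (these choices and the rate values are ours, for illustration):

```python
import math

# Check numerically that D_KL[p_{theta1} : p_{theta2}] = B_F(theta2 : theta1)
# for exponential distributions: t(x) = x, theta = -lambda, F(theta) = -log(-theta).

def bregman(F, gradF, t1, t2):
    # B_F(t1 : t2) = F(t1) - F(t2) - (t1 - t2) F'(t2)
    return F(t1) - F(t2) - (t1 - t2) * gradF(t2)

F = lambda th: -math.log(-th)
gradF = lambda th: -1.0 / th

lam1, lam2 = 2.0, 5.0
th1, th2 = -lam1, -lam2

# Closed-form KLD between exponential distributions with rates lam1 and lam2.
kld = math.log(lam1 / lam2) + lam2 / lam1 - 1.0

# KLD equals the Bregman divergence on the swapped natural parameters.
assert abs(kld - bregman(F, gradF, th2, th1)) < 1e-12
```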
The symmetrized Kullback-Leibler divergence D_J[P_θ1 : P_θ2] := D_KL[P_θ1 : P_θ2] + D_KL[P_θ2 : P_θ1] between two distributions P_θ1 and P_θ2 of E is called Jeffreys' divergence [17] and amounts to a symmetrized Bregman divergence [26]:

D_J[P_θ1 : P_θ2] = B_F(θ1 : θ2) + B_F(θ2 : θ1) = (θ2 − θ1)^⊤ (η2 − η1).

Notice that the Bregman divergence B_F(θ1 : θ2) can also be interpreted as a surface area. Figure 2 illustrates the sided and symmetrized Bregman divergences.

Contributions and paper outline
We recall in §2 the formula obtained for the Kullback-Leibler divergence between two exponential family densities equivalent to each other [20] (Eq. 5). Inspired by this formula, we give a definition of the duo Fenchel-Young divergence induced by a pair of strictly convex functions F1 and F2 (Definition 1) in §3, and prove that the divergence is always non-negative provided that F1 upper bounds F2. We then define the duo Bregman divergence (Definition 2) corresponding to the duo Fenchel-Young divergence. In §4, we show that the Kullback-Leibler divergence between a truncated density and a density of the same parametric exponential family amounts to a duo Fenchel-Young divergence, or equivalently to a duo Bregman divergence on swapped parameters (Theorem 1). As an example, we report a formula for the Kullback-Leibler divergence between truncated normal distributions (Example 6). In §5, we further consider the skewed Bhattacharyya distance between nested exponential family densities and prove that it amounts to a duo Jensen divergence (Theorem 2). Finally, we conclude in §7.
Figure 2: Visualizing the sided and symmetrized Bregman divergences.

Kullback-Leibler divergence between different exponential families
Consider now two exponential families [33] P and Q defined by their Radon-Nikodym derivatives with respect to two positive measures µ_P and µ_Q on (X, Σ):

dP_θ/dµ_P(x) = exp(θ^⊤ t_P(x) − F_P(θ) + k_P(x)), dQ_θ′/dµ_Q(x) = exp(θ′^⊤ t_Q(x) − F_Q(θ′) + k_Q(x)),

where F_P(θ) and F_Q(θ′) denote the corresponding log-normalizers of P and Q, respectively, with corresponding natural parameter spaces Θ_P and Θ_Q.
The functions F_P and F_Q are strictly convex and real analytic [33]. Hence, those functions are infinitely differentiable on their open natural parameter spaces.
Consider the KLD between P_θ ∈ P and Q_θ′ ∈ Q such that µ_P = µ_Q (and hence P_θ ≪ Q_θ′). The KLD between P_θ and Q_θ′ was first considered in [20]:

D_KL[P_θ : Q_θ′] = F_Q(θ′) − F_P(θ) + E_{P_θ}[θ^⊤ t_P(x) − θ′^⊤ t_Q(x) + k_P(x) − k_Q(x)].

Recall that the dual parameterization of an exponential family density P_θ is P_η with η = E_{P_θ}[t_P(x)] = ∇F_P(θ) [33], and that the Fenchel-Young equality is F(θ) + F*(η) = θ^⊤ η for η = ∇F(θ). Thus the KLD between P_θ and Q_θ′ can be rewritten as

D_KL[P_θ : Q_θ′] = F_Q(θ′) + F_P*(η) − θ′^⊤ E_{P_θ}[t_Q(x)] + E_{P_θ}[k_P(x) − k_Q(x)].    (5)

This formula was reported in [20] and generalizes the Fenchel-Young divergence [1] obtained when P = Q (with t_P(x) = t_Q(x), k_P(x) = k_Q(x), and F_P = F_Q). The formula of Eq. 5 was illustrated in [20] with two examples: the KLD between Laplacian distributions and zero-centered Gaussian distributions, and the KLD between two Weibull distributions. Both these examples use the Lebesgue base measure for µ_P and µ_Q. Let us report another example which uses the counting measure as the base measure for µ_P and µ_Q.
Example 1. Consider the KLD between a Poisson probability mass function (pmf) and a geometric pmf. The canonical decomposition of the Poisson and geometric pmfs is summarized in Table 1. The KLD between a Poisson pmf p_λ and a geometric pmf q_p is equal to

D_KL[p_λ : q_p] = E_{p_λ}[log(p_λ(x)/q_p(x))].

Since E_{p_λ}[x] = λ, we have

D_KL[p_λ : q_p] = λ log(λ/(1 − p)) − λ − log p − E_{p_λ}[log x!].

Table 1: Canonical decomposition of the Poisson and the geometric discrete exponential families.

                        Poisson                      geometric
base measure            counting measure             counting measure
ordinary parameter      rate λ > 0                   success probability p ∈ (0, 1)
pmf                     λ^x e^{−λ}/x!                (1 − p)^x p
sufficient statistic    t(x) = x                     t(x) = x
natural parameter       θ = log λ                    θ = log(1 − p)
log-normalizer          F(θ) = e^θ                   F(θ) = −log(1 − e^θ)
auxiliary term          k(x) = −log x!               k(x) = 0
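The Poisson-versus-geometric KLD of Example 1 can be sanity-checked by direct summation over the support. The sketch below (with arbitrary parameter values of our choosing) uses the closed form D_KL = λ log(λ/(1 − p)) − λ − log p − E[log x!], which follows from the canonical decompositions:

```python
import math

# Direct-summation check of the Poisson-vs-geometric KLD (illustrative parameters).
lam, p = 3.0, 0.4
poisson = lambda x: lam**x * math.exp(-lam) / math.factorial(x)
geometric = lambda x: (1 - p)**x * p

# Truncating the infinite sums at x = 60 leaves a negligible tail for lam = 3.
direct = sum(poisson(x) * math.log(poisson(x) / geometric(x)) for x in range(60))
E_log_fact = sum(poisson(x) * math.lgamma(x + 1) for x in range(60))
formula = lam * math.log(lam / (1 - p)) - lam - math.log(p) - E_log_fact

assert abs(direct - formula) < 1e-9
```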
Notice that we can calculate the KLD between two geometric distributions Q_p1 and Q_p2 as a Bregman divergence B_F(θ2 : θ1) for F(θ) = −log(1 − e^θ) and θ_i = log(1 − p_i). We get:

D_KL[Q_p1 : Q_p2] = log(p1/p2) + ((1 − p1)/p1) log((1 − p1)/(1 − p2)).

The duo Fenchel-Young divergence and its corresponding duo Bregman divergence

Inspired by the formula of Eq. 5, we shall define the duo Fenchel-Young divergence using a dominance condition on a pair (F1(θ), F2(θ)) of strictly convex generators:

Definition 1 (duo Fenchel-Young divergence). Let F1(θ) and F2(θ) be two strictly convex functions such that F1(θ) ≥ F2(θ). The duo Fenchel-Young divergence is defined by

Y_{F1,F2*}(θ, η) := F1(θ) + F2*(η) − θ^⊤ η.

When F1 = F2 = F, we retrieve the ordinary Fenchel-Young divergence [1]:

Y_{F,F*}(θ, η) = F(θ) + F*(η) − θ^⊤ η.

Property 1 (Non-negative duo Fenchel-Young divergence). The duo Fenchel-Young divergence is always non-negative.
Proof. The proof relies on the reverse dominance property of strictly convex and differentiable conjugate functions: namely, if F1(θ) ≥ F2(θ) for all θ, then F1*(η) ≤ F2*(η) for all η. This property is graphically illustrated in Figure 3. The reverse dominance property of the Legendre-Fenchel transformation can be checked algebraically as follows:

F1*(η) = sup_θ {θ^⊤ η − F1(θ)} ≤ sup_θ {θ^⊤ η − F2(θ)} = F2*(η).

Thus we have

Y_{F1,F2*}(θ, η) = F1(θ) + F2*(η) − θ^⊤ η ≥ F1(θ) + F1*(η) − θ^⊤ η = Y_{F,F*}(θ, η) ≥ 0,

where Y_{F,F*} is the ordinary Fenchel-Young divergence, which is guaranteed to be non-negative by the Fenchel-Young inequality.
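The reverse dominance property can be illustrated numerically. The sketch below (a choice of ours, not from the text) takes F1(θ) = cosh(θ) and F2(θ) = θ²/2, so that F1 ≥ F2 pointwise, and approximates each conjugate F*(η) = sup_θ {θη − F(θ)} by a maximum over a fine grid:

```python
import math

# Reverse dominance of the Legendre transform: F1 >= F2 pointwise implies F1* <= F2*.
# Illustration with F1(theta) = cosh(theta) >= F2(theta) = theta^2/2.

F1 = lambda th: math.cosh(th)
F2 = lambda th: 0.5 * th * th

def conjugate(F, eta, lo=-10.0, hi=10.0, n=40001):
    # Grid approximation of F*(eta) = sup_theta { theta*eta - F(theta) }.
    h = (hi - lo) / (n - 1)
    return max((lo + i * h) * eta - F(lo + i * h) for i in range(n))

for eta in (0.5, 1.0, 2.0, 5.0):
    assert F1(eta) >= F2(eta)                                  # dominance in the primal
    assert conjugate(F1, eta) <= conjugate(F2, eta) + 1e-9     # reversed for conjugates
```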
We can express the duo Fenchel-Young divergence using the primal coordinate systems as a generalization of the Bregman divergence to two generators that we term the duo Bregman divergence (see Figure 4):

B_{F1,F2}(θ1 : θ2) := F1(θ1) − F2(θ2) − (θ1 − θ2)^⊤ ∇F2(θ2).

This generalized Bregman divergence is non-negative when F1(θ) ≥ F2(θ). Indeed, we check that

B_{F1,F2}(θ1 : θ2) ≥ F2(θ1) − F2(θ2) − (θ1 − θ2)^⊤ ∇F2(θ2) = B_{F2}(θ1 : θ2) ≥ 0.

Definition 2 (duo Bregman divergence). Let F1(θ) and F2(θ) be two strictly convex functions such that F1(θ) ≥ F2(θ) for any θ ∈ Θ12 = dom(F1) ∩ dom(F2). Then the duo Bregman divergence is defined by

B_{F1,F2}(θ1 : θ2) := F1(θ1) − F2(θ2) − (θ1 − θ2)^⊤ ∇F2(θ2).

Example 2. Consider F1(θ) = (a/2) θ² and F2(θ) = (1/2) θ² for a > 0, so that F1(θ) ≥ F2(θ) when a ≥ 1. We have η = ∇F1(θ) = aθ, θ = η/a, and F1*(η) = η²/(2a). We can express the duo Bregman divergence in the primal coordinate system as

B_{F1,F2}(θ1 : θ2) = (a/2) θ1² + (1/2) θ2² − θ1 θ2.

When a = 1, we recover (1/2)(θ1 − θ2)², half the squared Euclidean distance, as expected. Figure 5 displays the graph plot of the duo Bregman divergence for several values of a.
When a < 1, the majorization condition fails: for example, B_{F1,F2}(θ : θ) = ((a − 1)/2) θ² < 0 for θ ≠ 0, which shows that the divergence can be negative since a < 1.
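Example 2 can be explored numerically. The sketch below (grid and parameter values are arbitrary choices of ours) confirms non-negativity for a ≥ 1 and exhibits a negative value for a < 1:

```python
# Duo Bregman divergence B_{F1,F2}(t1 : t2) = F1(t1) - F2(t2) - (t1 - t2) F2'(t2)
# with the quadratic generators F1(t) = (a/2) t^2 and F2(t) = (1/2) t^2.

def duo_bregman(F1, F2, gradF2, t1, t2):
    return F1(t1) - F2(t2) - (t1 - t2) * gradF2(t2)

def quad_pair(a):
    return (lambda t: 0.5 * a * t * t,   # F1
            lambda t: 0.5 * t * t,       # F2
            lambda t: t)                 # F2'

# a >= 1 (F1 majorizes F2): the divergence is non-negative on a grid.
F1, F2, g2 = quad_pair(2.0)
assert all(duo_bregman(F1, F2, g2, x / 10, y / 10) >= 0
           for x in range(-50, 51) for y in range(-50, 51))

# a < 1: the majorization fails and negative values occur, e.g. at t1 = t2 = 1.
F1, F2, g2 = quad_pair(0.5)
assert duo_bregman(F1, F2, g2, 1.0, 1.0) < 0
```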

Figure 6: The Legendre transform reverses the dominance ordering: if F1(θ) ≥ F2(θ) then F1*(η) ≤ F2*(η). Figure 6 plots a pair of convex functions F1(θ) and F2(θ) (with η2(θ) = ∇F2(θ) = 4θ³) and their convex conjugates F1*(η) and F2*(η). We now state a property between dual duo Bregman divergences:

Property 2 (Dual duo Fenchel-Young/Bregman divergences). We have

B_{F1,F2}(θ1 : θ2) = B_{F2*,F1*}(η2 : η1) = Y_{F1,F2*}(θ1, η2),

where η1 = ∇F1(θ1) and η2 = ∇F2(θ2).

Proof. From the Fenchel-Young equalities F1(θ1) + F1*(η1) = θ1^⊤ η1 and F2(θ2) + F2*(η2) = θ2^⊤ η2, and since ∇F1*(η1) = θ1, we have

B_{F2*,F1*}(η2 : η1) = F2*(η2) − F1*(η1) − (η2 − η1)^⊤ θ1 = F1(θ1) − F2(θ2) − (θ1 − θ2)^⊤ η2 = B_{F1,F2}(θ1 : θ2).

Since F1 ≥ F2 implies F2* ≥ F1*, the dual duo Bregman divergence is non-negative.
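Property 2 can be checked numerically on the quadratic generators of Example 2, for which the conjugates are available in closed form (the values of a, θ1, θ2 below are arbitrary choices of ours):

```python
# Dual duo Bregman divergences: B_{F1,F2}(t1 : t2) = B_{F2*,F1*}(e2 : e1)
# with F1(t) = (a/2) t^2 (so F1*(e) = e^2/(2a)) and F2(t) = t^2/2 (so F2*(e) = e^2/2).

a = 3.0
F1  = lambda t: 0.5 * a * t * t
F2  = lambda t: 0.5 * t * t
F1s = lambda e: e * e / (2 * a)   # conjugate of F1
F2s = lambda e: 0.5 * e * e       # conjugate of F2

def duo_bregman(F, G, gradG, x1, x2):
    return F(x1) - G(x2) - (x1 - x2) * gradG(x2)

t1, t2 = 0.7, -1.2
e1, e2 = a * t1, t2               # e1 = F1'(t1), e2 = F2'(t2)

lhs = duo_bregman(F1, F2, lambda t: t, t1, t2)          # primal duo Bregman
rhs = duo_bregman(F2s, F1s, lambda e: e / a, e2, e1)    # dual duo Bregman
assert abs(lhs - rhs) < 1e-12
```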

Kullback-Leibler divergence between truncated exponential family densities

Consider an exponential family E2 = {q_θ} of densities with support X2, and let E1 = {p_θ} denote the family obtained by truncating the densities of E2 to a nested support X1 ⊂ X2. Both families share the same sufficient statistic t(x), and the common natural parameter space is Θ = Θ1 ∩ Θ2. Since Z1(θ) = ∫_{X1} exp(θ^⊤ t(x)) dµ(x) ≤ Z2(θ) = ∫_{X2} exp(θ^⊤ t(x)) dµ(x), the log-normalizers satisfy F1(θ) ≤ F2(θ).
Notice that the reverse Kullback-Leibler divergence D*_KL[p_θ1 : q_θ2] := D_KL[q_θ2 : p_θ1] is +∞ since Q_θ2 ̸≪ P_θ1 when X1 ⊊ X2.

Theorem 1 (Kullback-Leibler divergence between nested exponential family densities). Let E2 = {q_θ2} be an exponential family with support X2, and E1 = {p_θ1} a truncated exponential family of E2 with support X1 ⊂ X2. Let F1 and F2 denote the log-normalizers of E1 and E2, and η1 and η2 the moment parameters corresponding to the natural parameters θ1 and θ2. Then the Kullback-Leibler divergence between a truncated density of E1 and a density of E2 is

D_KL[p_θ1 : q_θ2] = B_{F2,F1}(θ2 : θ1) = Y_{F2,F1*}(θ2, η1).

For example, consider the calculation of the KLD between an exponential distribution (viewed as half a Laplacian distribution) and a Laplacian distribution.
Example 4. Let E1 and E2 denote the exponential families of exponential distributions and Laplacian distributions, respectively. We have the sufficient statistic t(x) = −|x| and natural parameter θ = λ so that the unnormalized density is p̃_θ(x) = exp(−|x|θ). The log-normalizers are F1(θ) = −log θ (since Z1(θ) = ∫_0^∞ e^{−θx} dx = 1/θ) and F2(θ) = log 2 − log θ (since Z2(θ) = 2/θ). Thus using the duo Bregman divergence, we have:

D_KL[p_θ1 : q_θ2] = B_{F2,F1}(θ2 : θ1) = log 2 + log(θ1/θ2) + θ2/θ1 − 1.

Moreover, we can interpret that divergence using the Itakura-Saito divergence [16] D_IS(a : b) := a/b − log(a/b) − 1: we have

D_KL[p_θ1 : q_θ2] = log 2 + D_IS(θ2 : θ1).

We check the result using the duo Fenchel-Young divergence: with η1 = ∇F1(θ1) = −1/θ1 and F1*(η) = −1 − log(−η), we get

Y_{F2,F1*}(θ2, η1) = F2(θ2) + F1*(η1) − θ2 η1 = log 2 − log θ2 − 1 + log θ1 + θ2/θ1,

which coincides with the duo Bregman expression above. Next, consider the calculation of the KLD between a half-normal distribution and a (full) normal distribution:

Example 5. Let E1 and E2 be the scale family of the half standard normal distributions and the scale family of the standard normal distributions, respectively. We have p̃_θ(x) = exp(−θx²/2). Let the sufficient statistic be t(x) = −x²/2 so that the natural parameter is θ = 1/σ² ∈ R++. Here, we have both Θ1 = Θ2 = R++. For this example, we check that Z1(θ) = (1/2) Z2(θ) with Z2(θ) = √(2π/θ). We have F2(θ) = (1/2) log(2π/θ) and F1(θ) = F2(θ) − log 2. The KLD between two half scale normal distributions is

D_KL[p_θ1 : p_θ2] = B_{F1}(θ2 : θ1) = (1/2)(θ2/θ1 − log(θ2/θ1) − 1).

Since F1(θ) and F2(θ) differ only by a constant and the Bregman divergence is invariant under an affine term of its generator, we have B_{F1}(θ2 : θ1) = B_{F2}(θ2 : θ1). Moreover, we can interpret those Bregman divergences as half of the Itakura-Saito divergence: B_{F1}(θ2 : θ1) = (1/2) D_IS(θ2 : θ1). It follows that

D_KL[p_θ1 : q_θ2] = B_{F2,F1}(θ2 : θ1) = log 2 + (1/2) D_IS(θ2 : θ1).

Notice that (1/2) log 4 = log 2 > 0 so that D_KL[p_θ1 : q_θ2] ≥ D_KL[q_θ1 : q_θ2]. Thus the Kullback-Leibler divergence between a truncated density and another density of the same exponential family amounts to calculating a duo Bregman divergence on the reverse parameter order:

D_KL[p_θ1 : q_θ2] = B_{F2,F1}(θ2 : θ1).

Notice that truncated exponential families are also exponential families, but they may be non-steep [11].
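The exponential-versus-Laplacian closed form can be verified by direct numerical integration (a sketch; the rates and the integration grid below are arbitrary choices of ours):

```python
import math

# KLD between an exponential density p (rate th1, support [0, inf)) and a
# Laplacian density q (rate th2) should equal log 2 + D_IS(th2 : th1),
# where D_IS(a : b) = a/b - log(a/b) - 1 is the Itakura-Saito divergence.

th1, th2 = 1.5, 0.8
p = lambda x: th1 * math.exp(-th1 * x)
q = lambda x: 0.5 * th2 * math.exp(-th2 * abs(x))

# Midpoint-rule integration of p log(p/q) over [0, 60] (tail is negligible).
n, hi = 200000, 60.0
h = hi / n
integral = sum(p((i + 0.5) * h) * math.log(p((i + 0.5) * h) / q((i + 0.5) * h))
               for i in range(n)) * h

d_is = th2 / th1 - math.log(th2 / th1) - 1.0
assert abs(integral - (math.log(2.0) + d_is)) < 1e-5
```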
The next example shows how to compute the Kullback-Leibler divergence between two truncated normal distributions. The normalizing factor Z_{a,b}(m, s) is related to the partition function [8] and is expressed using the cumulative distribution function (CDF) Φ_{m,s}(x):

Z_{a,b}(m, s) = s √(2π) (Φ_{m,s}(b) − Φ_{m,s}(a)), Φ_{m,s}(x) = (1/2)(1 + erf((x − m)/(s√2))),

where erf(x) is the error function:

erf(x) = (2/√π) ∫_0^x e^{−t²} dt.

Thus we have erf(x) = 2Φ(√2 x) − 1, where Φ(x) = Φ_{0,1}(x). The pdf can also be written as

p^{a,b}_{m,s}(x) = (1/s) ϕ((x − m)/s) / (Φ((b − m)/s) − Φ((a − m)/s)),

where ϕ(x) denotes the standard normal pdf and Φ(x) = Φ_{0,1}(x) the standard normal CDF. The natural parameter space is Θ = R × R−−, where R−− denotes the set of negative real numbers. The log-normalizer can be expressed using the source parameters (m, s) (which are not the mean and standard deviation when the support is truncated, hence the notation m and s). We shall use the fact that the gradient of the log-normalizer of any exponential family distribution amounts to the expectation of the sufficient statistics [33]: ∇F(θ) = E_{p_θ}[t(x)] = η. Parameter η is called the moment or expectation parameter [33].
The moment parameter of the truncated normal p^{a,b}_{m,s} can be expressed using the following formula [18,8] (page 25): with α = (a − m)/s and β = (b − m)/s,

E[x] = m + s (ϕ(α) − ϕ(β)) / (Φ(β) − Φ(α)).

Now consider two truncated normal distributions p^{a1,b1}_{m1,s1} and p^{a2,b2}_{m2,s2} with nested supports (a1, b1) ⊆ (a2, b2). Notice that F_{m2,s2}(θ) ≥ F_{m1,s1}(θ).
This formula is valid for (1) the KLD between two nested truncated normal distributions, or (2) the KLD between a truncated normal distribution and a (full support) normal distribution. Notice that the formula depends on the erf function used in function Φ. Furthermore, when a1 = a2 = −∞ and b1 = b2 = +∞, we recover (3) the KLD between two univariate normal distributions. The entropy of a truncated normal distribution (an exponential family [27]) is

h(p^{a,b}_{m,s}) = log(√(2πe) s (Φ(β) − Φ(α))) + (α ϕ(α) − β ϕ(β)) / (2(Φ(β) − Φ(α))),

with α = (a − m)/s and β = (b − m)/s. Since ϕ is an even function and lim_{β→+∞} βϕ(β) = 0, we recover the differential entropy of a normal distribution in the limit: h = log(√(2πe) s). Consider the Kullback-Leibler divergence extended to positive densities as follows:

D+_KL[p : q] := ∫ (p(x) log(p(x)/q(x)) + q(x) − p(x)) dµ(x),

where p(x) and q(x) are positive densities, not necessarily normalized to one. When p(x) = exp(⟨θ1, t(x)⟩) = p̃_θ1(x) and q(x) = exp(⟨θ2, t(x)⟩) = p̃_θ2(x) are two unnormalized densities of a same exponential family with F(θ) = log Z(θ), we have

D+_KL[p̃_θ1 : p̃_θ2] = B_Z(θ2 : θ1),

where Z(θ) is the partition function (a log-convex, hence convex, function [23]). Furthermore, if we truncate p̃_θ1 to support X1 and p̃_θ2 to support X2 with X1 ⊂ X2 ⊂ X, then we have

D+_KL[p̃′_θ1 : p̃′_θ2] = B_{Z2,Z1}(θ2 : θ1),

where Z_i(θ) = ∫_{X_i} exp(⟨θ, t(x)⟩) dµ(x). That is, the extended KLD between two truncated unnormalized exponential family densities with nested supports amounts to a duo Bregman divergence with respect to the convex partition function generators Z1 and Z2.
Example 7. Consider the unnormalized exponential densities p̃_θ(x) = exp(−θx) with sufficient statistic t(x) = −x and natural parameter θ > 0. The partition function on X = [0, ∞) is Z(θ) = ∫_0^∞ e^{−θx} dx = 1/θ. Let p̃′_θ1 and p̃′_θ2 be the truncations of p̃_θ1 and p̃_θ2 to the nested supports X1 and X2, respectively (X1 ⊂ X2 ⊆ X). We have

D+_KL[p̃′_θ1 : p̃′_θ2] = B_{Z2,Z1}(θ2 : θ1) = Z2(θ2) − Z1(θ1) − (θ2 − θ1) Z1′(θ1),

where Z_i(θ) = ∫_{X_i} e^{−θx} dx. Last, we may define a KLD-type pseudo-divergence between three densities as follows:

D_KL[p : q ; r] := ∫ r(x) log(p(x)/q(x)) dµ(x).

Notice that depending on the choice of density r this 3-density KLD generalization may potentially be negative. This generalization is sometimes met in statistics and information theory [14]. When the densities p = p_θ1, q = p_θ2, and r = p_θ all belong to the same exponential family, we have

D_KL[p_θ1 : p_θ2 ; p_θ] = F(θ2) − F(θ1) − (θ2 − θ1)^⊤ ∇F(θ).

In particular, when θ = αθ1 + (1 − α)θ2 for α ∈ (0, 1), i.e., r ∝ p_θ1^α p_θ2^{1−α} is a normalized geometric mixture, we have

D_KL[p_θ1 : p_θ2 ; p_{αθ1+(1−α)θ2}] = F(θ2) − F(θ1) − (θ2 − θ1)^⊤ ∇F(αθ1 + (1 − α)θ2).

This 3-density KL pseudo-divergence is related to the Bregman tangent divergence defined in [28,29], but it may be negative.
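The duo Bregman identity for the extended KLD of Example 7 can be checked numerically. The sketch below uses the illustrative nested supports X1 = [a, ∞) ⊂ X2 = [0, ∞) with a = 0.5 and arbitrary rates (these specific values are our choices, not the text's):

```python
import math

# Extended KLD D+[p:q] = int (p log(p/q) + q - p) dmu between truncated unnormalized
# exponential densities exp(-theta*x), compared to the duo Bregman divergence
# B_{Z2,Z1}(th2 : th1) on the partition functions.
a, th1, th2 = 0.5, 2.0, 1.2
Z1 = lambda th: math.exp(-a * th) / th                    # partition function on X1 = [a, inf)
Z2 = lambda th: 1.0 / th                                  # partition function on X2 = [0, inf)
dZ1 = lambda th: -(a / th + 1.0 / th**2) * math.exp(-a * th)

duo_bregman = Z2(th2) - Z1(th1) - (th2 - th1) * dZ1(th1)

def integrate(f, lo, hi, n=200000):                       # midpoint rule
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

p = lambda x: math.exp(-th1 * x)
q = lambda x: math.exp(-th2 * x)
ext_kld = (integrate(lambda x: p(x) * math.log(p(x) / q(x)), a, 40.0)
           + integrate(q, 0.0, 40.0) - integrate(p, a, 40.0))

assert abs(ext_kld - duo_bregman) < 1e-5
```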

Bhattacharyya skewed divergence between nested densities of an exponential family
The Bhattacharyya α-skewed divergence [5,24] between two densities p(x) and q(x) with respect to µ is defined for a skewing scalar parameter α ∈ (0, 1) as:

D_{B,α}[p : q] := −log ∫_X p(x)^α q(x)^{1−α} dµ(x),

where X denotes the support of the distributions. Let I_α[p : q] := ∫_X p(x)^α q(x)^{1−α} dµ(x) denote the skewed affinity coefficient so that D_{B,α}[p : q] = −log I_α[p : q]. Consider an exponential family E = {p_θ} with log-normalizer F(θ). Then it is well-known that the α-skewed Bhattacharyya divergence between two densities of an exponential family amounts to a skewed Jensen divergence [24] (originally called Jensen difference in [31]):

D_{B,α}[p_θ1 : p_θ2] = J_{F,α}(θ1 : θ2),

where the skewed Jensen divergence is defined by

J_{F,α}(θ1 : θ2) := α F(θ1) + (1 − α) F(θ2) − F(αθ1 + (1 − α)θ2).

The convexity of the log-normalizer F(θ) ensures that J_{F,α}(θ1 : θ2) ≥ 0. The Jensen divergence can be extended to all real α by rescaling it by 1/(α(1 − α)); see [34].
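The identity D_{B,α}[p_θ1 : p_θ2] = J_{F,α}(θ1 : θ2) can be verified numerically for two exponential distributions (a sketch; the parameterization θ = −λ with F(θ) = −log(−θ) is a standard choice, and the skew and rate values are arbitrary):

```python
import math

# Check D_{B,alpha}[p1 : p2] = J_{F,alpha}(th1 : th2) for exponential distributions.
alpha, lam1, lam2 = 0.3, 2.0, 0.7
p = lambda lam, x: lam * math.exp(-lam * x)

# Numerical skewed affinity I_alpha = int p1^alpha p2^(1-alpha) dx over [0, 80].
n, hi = 200000, 80.0
h = hi / n
I = sum(p(lam1, (i + 0.5) * h)**alpha * p(lam2, (i + 0.5) * h)**(1 - alpha)
        for i in range(n)) * h
bhat = -math.log(I)

# Skewed Jensen divergence on the natural parameters theta = -lambda.
F = lambda th: -math.log(-th)
th1, th2 = -lam1, -lam2
jensen = alpha * F(th1) + (1 - alpha) * F(th2) - F(alpha * th1 + (1 - alpha) * th2)

assert abs(bhat - jensen) < 1e-5
```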
Theorem 2. The α-skewed Bhattacharyya divergence for α ∈ (0, 1) between a truncated density of E1 with log-normalizer F1(θ) and another density of an exponential family E2 with log-normalizer F2(θ) amounts to a duo Jensen divergence:

D_{B,α}[p_θ1 : q_θ2] = J_{F1,F2,α}(θ1 : θ2),

where J_{F1,F2,α}(θ1 : θ2) is the duo skewed Jensen divergence induced by two strictly convex functions F1(θ) and F2(θ) such that F2(θ) ≥ F1(θ):

J_{F1,F2,α}(θ1 : θ2) := α F1(θ1) + (1 − α) F2(θ2) − F1(αθ1 + (1 − α)θ2).

In [24], it is reported that

lim_{α→0} (1/(α(1 − α))) J_{F,α}(θ1 : θ2) = B_F(θ1 : θ2), lim_{α→1} (1/(α(1 − α))) J_{F,α}(θ1 : θ2) = B_F(θ2 : θ1).

Indeed, using the first-order Taylor expansion of F(αθ1 + (1 − α)θ2) at θ2 when α → 0, we check that we have J_{F,α}(θ1 : θ2) = α B_F(θ1 : θ2) + o(α). Moreover, we have

lim_{α→1} (1/(1 − α)) J_{F1,F2,α}(θ1 : θ2) = B_{F2,F1}(θ2 : θ1) = D_KL[p_θ1 : q_θ2].

Similarly, we can prove that

lim_{α→0} J_{F1,F2,α}(θ1 : θ2) = F2(θ2) − F1(θ2),

which can be reinterpreted as D_KL[p_θ2 : q_θ2].
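Theorem 2 can be checked numerically on the exponential-versus-Laplacian pair, for which F1(θ) = −log θ and F2(θ) = log 2 − log θ (the skew and rate values below are arbitrary choices of ours):

```python
import math

# Check D_{B,alpha}[p_{th1} : q_{th2}] = J_{F1,F2,alpha}(th1 : th2) for an
# exponential density p (truncated Laplacian, support [0, inf)) versus a
# Laplacian density q (support R), with F1(th) = -log th, F2(th) = log 2 - log th.

alpha, th1, th2 = 0.6, 1.5, 0.8
p = lambda x: th1 * math.exp(-th1 * x)               # support [0, inf)
q = lambda x: 0.5 * th2 * math.exp(-th2 * abs(x))    # support (-inf, inf)

# The integrand p^alpha q^(1-alpha) vanishes outside [0, inf).
n, hi = 200000, 80.0
h = hi / n
I = sum(p((i + 0.5) * h)**alpha * q((i + 0.5) * h)**(1 - alpha)
        for i in range(n)) * h
bhat = -math.log(I)

F1 = lambda th: -math.log(th)
F2 = lambda th: math.log(2.0) - math.log(th)
duo_jensen = (alpha * F1(th1) + (1 - alpha) * F2(th2)
              - F1(alpha * th1 + (1 - alpha) * th2))

assert abs(bhat - duo_jensen) < 1e-5
```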

Concluding remarks
We considered the Kullback-Leibler divergence between two parametric densities p_θ ∈ E1 and q_θ′ ∈ E2 belonging to nested exponential families E1 and E2, and we showed that their KLD is equivalent to a duo Bregman divergence on swapped parameter order (Theorem 1). This result generalizes the study of Azoury and Warmuth [3]. The duo Bregman divergence can be rewritten as a duo Fenchel-Young divergence using mixed natural/moment parameterizations of the exponential family densities (Definition 1). This second result generalizes the approach taken in information geometry [2]. We showed how to calculate the Kullback-Leibler divergence between two truncated normal distributions as a duo Bregman divergence. More generally, we proved that the skewed Bhattacharyya distances between parametric nested exponential family densities amount to duo Jensen divergences (Theorem 2). We also showed that scaled duo Jensen divergences tend asymptotically to duo Bregman divergences, generalizing a result of [34,24]. This study of duo divergences induced by pairs of generators was motivated by the formula obtained for the Kullback-Leibler divergence between two densities of two different exponential families originally reported in [20] (Eq. 5). We called these duo divergences, although they are pseudo-divergences since they are always strictly greater than zero when the first generator strictly majorizes the second generator.
It would be interesting to find applications of the duo Fenchel-Young, Bregman, and Jensen divergences beyond the scope of calculating statistical distances between nested exponential family densities. Let us point out that nested exponential families have been seldom considered in the literature (see [30] for a recent work). Notice that in [25], the authors exhibit a relationship between densities with nested supports and quasi-convex Bregman divergences. Recently, Khan and Swaroop [13] used the duo Fenchel-Young divergence in machine learning for knowledge-adaptation priors in the so-called change-regularizer task.

Example 6. Let N_{a,b}(m, s) denote a truncated normal distribution with support the open interval (a, b) (a < b) and probability density function (pdf) defined by:

p^{a,b}_{m,s}(x) = exp(−(x − m)²/(2s²)) / Z_{a,b}(m, s), x ∈ (a, b).
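The truncated-normal KLD of Example 6 can be sanity-checked numerically. The sketch below (with nested supports and source parameters of our own choosing) compares direct integration against a closed form assembled from the normalizers and the standard first two truncated-normal moments:

```python
import math

# KLD between two truncated normals with nested supports, computed two ways:
# direct numerical integration, and via normalizers plus truncated moments.
phi = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)   # standard normal pdf
Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))            # standard normal CDF

def trunc_moments(m, s, a, b):
    # Standard truncated-normal moments: E[x] and E[x^2] on (a, b).
    al, be = (a - m) / s, (b - m) / s
    Zc = Phi(be) - Phi(al)
    mean = m + s * (phi(al) - phi(be)) / Zc
    ex2 = s * s * (1 + (al * phi(al) - be * phi(be)) / Zc) + 2 * m * mean - m * m
    return Zc, mean, ex2

def trunc_pdf(x, m, s, a, b):
    al, be = (a - m) / s, (b - m) / s
    return phi((x - m) / s) / (s * (Phi(be) - Phi(al)))

m1, s1, a1, b1 = 0.5, 1.0, -1.0, 2.0
m2, s2, a2, b2 = 0.0, 1.5, -2.0, 3.0   # nested supports: (a1, b1) inside (a2, b2)

Z1c, mu, ex2 = trunc_moments(m1, s1, a1, b1)
Z2c, _, _ = trunc_moments(m2, s2, a2, b2)
E_sq1 = ex2 - 2 * m1 * mu + m1 * m1    # E[(x - m1)^2] under the first density
E_sq2 = ex2 - 2 * m2 * mu + m2 * m2    # E[(x - m2)^2] under the first density
closed = (math.log((s2 * Z2c) / (s1 * Z1c))
          + E_sq2 / (2 * s2 * s2) - E_sq1 / (2 * s1 * s1))

n = 200000
h = (b1 - a1) / n
numeric = sum(trunc_pdf(a1 + (i + 0.5) * h, m1, s1, a1, b1)
              * math.log(trunc_pdf(a1 + (i + 0.5) * h, m1, s1, a1, b1)
                         / trunc_pdf(a1 + (i + 0.5) * h, m2, s2, a2, b2))
              for i in range(n)) * h

assert abs(numeric - closed) < 1e-5
```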