1. Introduction: Background and Motivations
The goal of this work is to methodologically contribute to an extension of Sibson's information radius [1], with a particular focus on the analysis of the families of distributions called exponential families [2].
Let $(\mathcal{X},\mathcal{A})$ denote a measurable space [3] with sample space $\mathcal{X}$ and $\sigma$-algebra $\mathcal{A}$ on the set $\mathcal{X}$. The Jensen-Shannon divergence [4] (JSD) between two probability measures $P$ and $Q$ (or probability distributions) on $(\mathcal{X},\mathcal{A})$ is defined by:
$$ D_{\mathrm{JS}}[P:Q] := \frac{1}{2}\left( D_{\mathrm{KL}}\!\left[P:\frac{P+Q}{2}\right] + D_{\mathrm{KL}}\!\left[Q:\frac{P+Q}{2}\right] \right), \qquad (1) $$
where $D_{\mathrm{KL}}$ denotes the Kullback–Leibler divergence [5,6] (KLD):
$$ D_{\mathrm{KL}}[P:Q] := \begin{cases} \int_{\mathcal{X}} \log\!\left(\frac{\mathrm{d}P}{\mathrm{d}Q}\right)\mathrm{d}P, & P \ll Q,\\ +\infty, & \text{otherwise}, \end{cases} \qquad (2) $$
where $P \ll Q$ means that $P$ is absolutely continuous with respect to $Q$ [3], and $\frac{\mathrm{d}P}{\mathrm{d}Q}$ is the Radon–Nikodym derivative of $P$ with respect to $Q$. Equation (2) can be rewritten using the chain rule as:
$$ D_{\mathrm{KL}}[P:Q] = \int_{\mathcal{X}} \frac{\mathrm{d}P}{\mathrm{d}Q}\,\log\!\left(\frac{\mathrm{d}P}{\mathrm{d}Q}\right)\mathrm{d}Q. \qquad (3) $$
Consider a measure $\mu$ for which both the Radon–Nikodym derivatives $p=\frac{\mathrm{d}P}{\mathrm{d}\mu}$ and $q=\frac{\mathrm{d}Q}{\mathrm{d}\mu}$ exist (e.g., $\mu=\frac{P+Q}{2}$). Subsequently, the Kullback–Leibler divergence can be rewritten as (see Equation (2.5) page 5 of [5] and page 251 of the Cover & Thomas textbook [6]):
$$ D_{\mathrm{KL}}[p:q] := \int_{\mathcal{X}} p(x)\,\log\!\left(\frac{p(x)}{q(x)}\right)\mathrm{d}\mu(x). $$
Denote by $\mathcal{D}$ the set of all densities with full support $\mathcal{X}$ (Radon–Nikodym derivatives of probability measures with respect to $\mu$):
$$ \mathcal{D} := \left\{ p:\mathcal{X}\to\mathbb{R} \ :\ p(x)>0\ \mu\text{-almost everywhere},\ \int_{\mathcal{X}} p(x)\,\mathrm{d}\mu(x)=1 \right\}. $$
Subsequently, the Jensen-Shannon divergence [4] between two densities $p$ and $q$ of $\mathcal{D}$ is defined by:
$$ D_{\mathrm{JS}}[p:q] := \frac{1}{2}\left( D_{\mathrm{KL}}\!\left[p:\frac{p+q}{2}\right] + D_{\mathrm{KL}}\!\left[q:\frac{p+q}{2}\right] \right). \qquad (4) $$
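As a quick illustration (added here, not part of the original text), the following minimal Python sketch evaluates the JSD of Equation (4) on a finite alphabet (counting measure) as the average of the two KLDs to the mid density; function names are illustrative.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL[p:q] = sum_x p(x) log(p(x)/q(x)) (natural log)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    """Jensen-Shannon divergence: average of the two KLDs to the mid density (p+q)/2."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * (kl(p, m) + kl(q, m))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.4, 0.5])
print(jsd(p, q), jsd(q, p))      # symmetric
print(jsd(p, q) <= np.log(2))    # bounded by log 2
```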
Often, one considers the Lebesgue measure [3] $\mu_L$ on $(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))$, where $\mathcal{B}(\mathbb{R}^d)$ is the Borel $\sigma$-algebra, or the counting measure [3] $\mu_c$ on $(\mathcal{X},2^{\mathcal{X}})$, where $\mathcal{X}$ is a countable set, for defining the measure space $(\mathcal{X},\mathcal{A},\mu)$.
The JSD belongs to the class of $f$-divergences [7,8,9], which are known as the invariant decomposable divergences of information geometry (see [10], pp. 52–57). Although the KLD is asymmetric (i.e., $D_{\mathrm{KL}}[p:q]\neq D_{\mathrm{KL}}[q:p]$ in general), the JSD is symmetric (i.e., $D_{\mathrm{JS}}[p:q]=D_{\mathrm{JS}}[q:p]$). The notation ':' is used as a parameter separator to indicate that the parameters are not permutation invariant, and that the order of parameters is important.
In this work, a distance $D(O_1,O_2)$ is a measure of dissimilarity between two objects $O_1$ and $O_2$, which need not be symmetric nor satisfy the triangle inequality of metric distances. A distance only satisfies the identity of indiscernibles: $D(O_1,O_2)=0$ if and only if $O_1=O_2$. When the objects $O_1$ and $O_2$ are probability densities with respect to $\mu$, we call this distance a statistical distance, use brackets to enclose the arguments of the statistical distance (i.e., $D[p_1:p_2]$), and we have $D[p_1:p_2]=0$ if and only if $p_1(x)=p_2(x)$ $\mu$-almost everywhere.
The 2-point JSD of Equation (4) can be extended to a weighted set of $n$ densities $\mathcal{P}:=\{(w_1,p_1),\ldots,(w_n,p_n)\}$ (with positive $w_i$'s normalized to sum up to unity, i.e., $\sum_{i=1}^n w_i=1$), thus providing a diversity index, i.e., an $n$-point JSD for $n\geq 2$:
$$ D_{\mathrm{JS}}(w_1,p_1;\ldots;w_n,p_n) := \sum_{i=1}^n w_i\, D_{\mathrm{KL}}[p_i:\bar{p}], $$
where $\bar{p}:=\sum_{i=1}^n w_i\,p_i$ denotes the statistical mixture [11] of the densities of $\mathcal{P}$. We have $D_{\mathrm{JS}}[p:q]=D_{\mathrm{JS}}\!\left(\tfrac{1}{2},p;\tfrac{1}{2},q\right)$. We call $D_{\mathrm{JS}}(w_1,p_1;\ldots;w_n,p_n)$ the Jensen-Shannon diversity index.
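The following short Python sketch (an added illustration, with hypothetical function names) computes the Jensen-Shannon diversity index of a weighted set of discrete densities and checks the entropy bound discussed below.

```python
import numpy as np

def js_diversity(weights, densities):
    """Jensen-Shannon diversity index: sum_i w_i KL[p_i : pbar], pbar the w-mixture."""
    w = np.asarray(weights, float)
    P = np.asarray(densities, float)      # shape (n, |X|): each row is a density
    pbar = w @ P                          # statistical mixture
    return float(np.sum(w * np.sum(P * np.log(P / pbar), axis=1)))

w = np.array([0.2, 0.3, 0.5])
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.1, 0.8]])
print(js_diversity(w, P))                             # diversity index value
print(js_diversity(w, P) <= -np.sum(w * np.log(w)))   # upper bounded by the entropy H(w)
```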
The KLD is also called the relative entropy, since it can be expressed as the difference between the cross-entropy $h^{\times}[p:q]$ and the entropy $h[p]$:
$$ D_{\mathrm{KL}}[p:q] = h^{\times}[p:q] - h[p], $$
with the cross-entropy and entropy defined, respectively, by
$$ h^{\times}[p:q] := -\int_{\mathcal{X}} p(x)\log q(x)\,\mathrm{d}\mu(x), \qquad h[p] := -\int_{\mathcal{X}} p(x)\log p(x)\,\mathrm{d}\mu(x). $$
Because $h[p]=h^{\times}[p:p]$, we may say that the entropy is the self-cross-entropy.
When $\mu$ is the Lebesgue measure, the Shannon entropy is also called the differential entropy [6]. Although the discrete entropy $H[p]$ (i.e., entropy with respect to the counting measure) is always positive and bounded by $\log|\mathcal{X}|$, the differential entropy may be negative (e.g., entropy of a Gaussian distribution with small variance).
The Jensen-Shannon divergence of Equation (6) can be rewritten as:
$$ D_{\mathrm{JS}}(w_1,p_1;\ldots;w_n,p_n) = h[\bar p] - \sum_{i=1}^n w_i\, h[p_i]. \qquad (12) $$
The JSD representation of Equation (12) is a Jensen divergence [12] for the strictly convex negentropy $F(p)=-h[p]$, since the entropy function $h$ is strictly concave. Therefore, it is appropriate to call this divergence the Jensen-Shannon divergence.
Because $\bar p(x)\geq w_i\,p_i(x)$ for all $i$ (so that $D_{\mathrm{KL}}[p_i:\bar p]\leq\log\frac{1}{w_i}$), it can be shown that the Jensen-Shannon diversity index is upper bounded by $H(w):=\sum_i w_i\log\frac{1}{w_i}$, the discrete Shannon entropy of the weight distribution. Thus, the Jensen-Shannon diversity index is bounded by $\log n$, and the 2-point JSD is bounded by $\log 2$, although the KLD is unbounded and it may even be equal to $+\infty$ when the definite integral diverges (e.g., KLD between the standard Cauchy distribution and the standard Gaussian distribution). Another nice property of the JSD is that its square root yields a metric distance [13,14]. This property further holds for the quantum JSD [15]. The JSD has gained interest in machine learning. See, for example, the Generative Adversarial Networks [16] (GANs) in deep learning [17], where it was proven that minimizing the GAN objective function by adversarial training is equivalent to minimizing a JSD.
To delineate the different roles that are played by the factor $\frac{1}{2}$ in the ordinary Jensen-Shannon divergence (i.e., in weighting the two KLDs and in weighting the two densities), let us introduce two scalars $\alpha,\beta\in(0,1)$, and define a generic $(\alpha,\beta)$-skewed Jensen-Shannon divergence, as follows:
$$ D_{\mathrm{JS}}^{\alpha,\beta}[p:q] := (1-\beta)\,D_{\mathrm{KL}}[p:m_\alpha] + \beta\,D_{\mathrm{KL}}[q:m_\alpha] = (1-\beta)\,h^{\times}[p:m_\alpha] + \beta\,h^{\times}[q:m_\alpha] - (1-\beta)\,h[p] - \beta\,h[q], $$
where $m_\alpha:=(1-\alpha)p+\alpha q$. This identity holds, because the cross-entropy $h^{\times}[p:m_\alpha]$ is bounded by $h[p]+\log\frac{1}{1-\alpha}$ (and $h^{\times}[q:m_\alpha]$ by $h[q]+\log\frac{1}{\alpha}$), see [18]. Thus, when $\alpha=\beta$, we have $D_{\mathrm{JS}}^{\alpha,\alpha}[p:q]=h[m_\alpha]-\left((1-\alpha)h[p]+\alpha h[q]\right)$, since the self-cross entropy corresponds to the entropy: $(1-\alpha)h^{\times}[p:m_\alpha]+\alpha h^{\times}[q:m_\alpha]=h^{\times}[m_\alpha:m_\alpha]=h[m_\alpha]$.
An $f$-divergence [9,19,20] is defined for a convex generator $f$, which is strictly convex at 1 (to satisfy the identity of the indiscernibles) and that satisfies $f(1)=0$, by
$$ I_f[p:q] := \int_{\mathcal{X}} p(x)\, f\!\left(\frac{q(x)}{p(x)}\right)\mathrm{d}\mu(x) \;\geq\; f\!\left(\int_{\mathcal{X}} q(x)\,\mathrm{d}\mu(x)\right) = f(1) = 0, $$
where the right-hand-side follows from Jensen's inequality [20]. For example, the total variation distance $D_{\mathrm{TV}}[p:q]:=\frac{1}{2}\int_{\mathcal{X}}|p(x)-q(x)|\,\mathrm{d}\mu(x)$ is an $f$-divergence for the generator $f_{\mathrm{TV}}(u):=\frac{1}{2}|u-1|$: $D_{\mathrm{TV}}[p:q]=I_{f_{\mathrm{TV}}}[p:q]$. The generator $f_{\mathrm{TV}}$ is convex on $(0,\infty)$, strictly convex at 1, and it satisfies $f_{\mathrm{TV}}(1)=0$.
The Jensen-Shannon divergence is an $f$-divergence,
$$ D_{\mathrm{JS}}[p:q] = I_{f_{\mathrm{JS}}}[p:q], $$
for the generator:
$$ f_{\mathrm{JS}}(u) := \frac{1}{2}\left( u\log u - (1+u)\log\frac{1+u}{2} \right). $$
We check that the generator $f_{\mathrm{JS}}$ is strictly convex, since, for any $u>0$, we have
$$ f_{\mathrm{JS}}''(u) = \frac{1}{2u(1+u)} > 0 $$
when $u>0$.
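As a sanity check (added illustration, assuming the convention $I_f[p:q]=\int p\,f(q/p)\,\mathrm{d}\mu$ used above), the following Python sketch evaluates the JSD both from its generator $f_{\mathrm{JS}}$ and from the mixture-based definition.

```python
import numpy as np

def f_divergence(p, q, f):
    """I_f[p:q] = sum_x p(x) f(q(x)/p(x)) for discrete densities (counting measure)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * f(q / p)))

def f_js(u):
    """JSD generator: f_JS(u) = (u log u - (1+u) log((1+u)/2)) / 2."""
    return 0.5 * (u * np.log(u) - (1.0 + u) * np.log(0.5 * (1.0 + u)))

def jsd(p, q):
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * (kl(p, m) + kl(q, m))

p = np.array([0.5, 0.3, 0.2]); q = np.array([0.1, 0.4, 0.5])
print(f_divergence(p, q, f_js), jsd(p, q))   # both expressions evaluate the JSD
```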
The Jensen-Shannon principle of taking the average of the (Kullback–Leibler) divergences from the source parameters to the mid-parameter can be applied to other distances. For example, the Jensen–Bregman divergence is a Jensen-Shannon symmetrization of the Bregman divergence $B_F$ [12]:
$$ \mathrm{JB}_F(\theta_1,\theta_2) := \frac{1}{2}\left( B_F\!\left(\theta_1:\frac{\theta_1+\theta_2}{2}\right) + B_F\!\left(\theta_2:\frac{\theta_1+\theta_2}{2}\right) \right), $$
where the Bregman divergence [21] $B_F$ is defined by
$$ B_F(\theta_1:\theta_2) := F(\theta_1) - F(\theta_2) - \langle \theta_1-\theta_2, \nabla F(\theta_2)\rangle. $$
The Jensen–Bregman divergence $\mathrm{JB}_F$ can also be written as an equivalent Jensen divergence $J_F$:
$$ \mathrm{JB}_F(\theta_1,\theta_2) = J_F(\theta_1,\theta_2) := \frac{F(\theta_1)+F(\theta_2)}{2} - F\!\left(\frac{\theta_1+\theta_2}{2}\right), $$
where $F$ is a strictly convex function ensuring $J_F(\theta_1,\theta_2)\geq 0$ with equality if $\theta_1=\theta_2$.
Because of its use in various fields of information sciences [22], various generalizations of the JSD have been proposed: these generalizations are either based on Equation (5) [23] or Equation (12) [18,24,25]. For example, the (arithmetic) mixture $\bar p$ in Equation (6) was replaced by an abstract statistical mixture with respect to a generic mean $M$ in [23] (e.g., the geometric mixture induced by the geometric mean), and the two KLDs defining the JSD in Equation (5) were further averaged using another abstract mean $N$, thus yielding the following generic $(M,N)$-Jensen-Shannon divergence [23] (abbreviated as $(M,N)$-JSD):
$$ D^{\mathrm{JS}}_{M,N}[p:q] := N\!\left( D_{\mathrm{KL}}\!\left[p:(pq)^{M}_{\frac{1}{2}}\right],\ D_{\mathrm{KL}}\!\left[q:(pq)^{M}_{\frac{1}{2}}\right] \right), \qquad (23) $$
where $(pq)^{M}_{\alpha}$ denotes the statistical weighted $M$-mixture:
$$ (pq)^{M}_{\alpha}(x) := \frac{M_\alpha(p(x),q(x))}{\int_{\mathcal{X}} M_\alpha(p(t),q(t))\,\mathrm{d}\mu(t)}. $$
Notice that, when $M=N=A$ (the arithmetic mean), Equation (23) of the $(M,N)$-JSD reduces to the ordinary JSD of Equation (5). When the means $M$ and $N$ are symmetric, the $(M,N)$-JSD is symmetric.
In general, a weighted mean $M_\alpha(x,y)$ for any $\alpha\in[0,1]$ shall satisfy the in-betweenness property [26] (i.e., a mean should be contained inside its extrema):
$$ \min(x,y) \leq M_\alpha(x,y) \leq \max(x,y). $$
The three Pythagorean means defined for positive scalars $x$ and $y$ are classic examples of means:
The arithmetic mean $A(x,y)=\frac{x+y}{2}$,
the geometric mean $G(x,y)=\sqrt{xy}$, and
the harmonic mean $H(x,y)=\frac{2xy}{x+y}$.
These Pythagorean means may be interpreted as special instances of another parametric family of means: the power means
$$ P_\delta(x,y) := \left( \frac{x^\delta + y^\delta}{2} \right)^{\frac{1}{\delta}}, $$
defined for $\delta\neq 0$ (also called Hölder means). The power means can be extended to the full range $\delta\in\mathbb{R}$ by using the property that $\lim_{\delta\to 0} P_\delta(x,y)=G(x,y)$. The power means are homogeneous means: $P_\delta(\lambda x,\lambda y)=\lambda\,P_\delta(x,y)$ for any $\lambda>0$. We refer to the handbook of means [27] to obtain definitions and principles of other means beyond these power means.
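The following Python sketch (added illustration) evaluates the two-point power mean, recovering the three Pythagorean means and the geometric-mean limit at $\delta\to 0$.

```python
import numpy as np

def power_mean(x, y, delta):
    """Two-point power (Hoelder) mean P_delta(x, y); delta -> 0 recovers the geometric mean."""
    if delta == 0.0:
        return np.sqrt(x * y)
    return (0.5 * (x**delta + y**delta)) ** (1.0 / delta)

x, y = 2.0, 8.0
print(power_mean(x, y, 1.0))     # arithmetic mean: 5.0
print(power_mean(x, y, 0.0))     # geometric mean: 4.0
print(power_mean(x, y, -1.0))    # harmonic mean: 3.2
print(power_mean(x, y, 1e-9))    # close to the geometric mean (limit delta -> 0)
print(power_mean(3 * x, 3 * y, 0.5), 3 * power_mean(x, y, 0.5))  # homogeneity
```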
A weighted mean (also called barycenter) can be built from a non-weighted mean $M(p,q)$ (i.e., $M_{\frac{1}{2}}(p,q)=M(p,q)$) by using the dyadic expansion of the real weight $\alpha\in(0,1)$, see [28]. That is, we can define the weighted mean $M_\alpha(p,q)$ for $\alpha=\frac{i}{2^k}$ with $0<i<2^k$ and $k$ an integer. For example, consider a symmetric mean $M(p,q)$. Subsequently, we get the following weighted means when $\alpha\in\{\frac{1}{4},\frac{2}{4},\frac{3}{4}\}$:
$$ M_{\frac{1}{4}}(p,q)=M(p,M(p,q)), \qquad M_{\frac{2}{4}}(p,q)=M(p,q), \qquad M_{\frac{3}{4}}(p,q)=M(M(p,q),q). $$
Let $\alpha=\sum_{i=1}^{\infty}\frac{d_i}{2^i}$ be the unique dyadic expansion of the real number $\alpha\in(0,1)$, where the $d_i$'s are binary digits (i.e., $d_i\in\{0,1\}$). We define the weighted mean $M_\alpha(p,q)$ of two positive reals $p$ and $q$ for a real weight $\alpha\in(0,1)$ as
$$ M_\alpha(p,q) := \lim_{k\to\infty} M_{\alpha_k}(p,q), \qquad \alpha_k := \sum_{i=1}^{k}\frac{d_i}{2^i}. $$
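The dyadic construction can be sketched programmatically; the following Python snippet (an added illustration, assuming the bracketing convention $M_0=p$, $M_1=q$ stated above) approximates the weighted mean by bisection on the weight and checks it against the closed-form weighted geometric mean.

```python
import numpy as np

def weighted_mean_dyadic(M, p, q, alpha, k=40):
    """Approximate M_alpha(p, q) from a symmetric mean M(p, q) by extracting k dyadic
    digits of alpha: each digit halves the current weight bracket [lo, hi]."""
    lo, hi = p, q               # values at weights 0 and 1 (weight alpha placed on q)
    for _ in range(k):
        mid = M(lo, hi)         # value at the bracket's middle weight
        alpha *= 2
        if alpha < 1.0:
            hi = mid            # digit 0: target weight lies in the lower half
        else:
            lo = mid            # digit 1: target weight lies in the upper half
            alpha -= 1.0
    return M(lo, hi)

G = lambda a, b: np.sqrt(a * b)      # symmetric geometric mean
p, q, alpha = 2.0, 8.0, 0.3
print(weighted_mean_dyadic(G, p, q, alpha))  # approximates p**(1-alpha) * q**alpha
print(p**(1 - alpha) * q**alpha)             # closed-form weighted geometric mean
```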
Choosing the abstract mean $M$ in accordance with the family $\mathcal{F}$ of the densities allows one to obtain closed-form formulas for the $(M,N)$-JSDs that rely on definite integral calculations [23]. For example, the JSD between two Gaussian densities does not admit a closed-form formula because of the log-sum integral, but the $(M,N)$-JSD admits a closed-form formula when using geometric statistical mixtures (i.e., when $M=G$). The calculus trick is to find a weighted mean $M_\alpha$, such that, for two densities $p_{\theta_1}$ and $p_{\theta_2}$ of $\mathcal{F}$, the weighted mean $M_\alpha(p_{\theta_1}(x),p_{\theta_2}(x))=Z_\alpha(\theta_1,\theta_2)\,p_{\theta_\alpha}(x)$, where $Z_\alpha(\theta_1,\theta_2)$ is the normalizing coefficient and $p_{\theta_\alpha}\in\mathcal{F}$. Thus, the integral calculation can be simply calculated as $\int_{\mathcal{X}} M_\alpha(p_{\theta_1}(x),p_{\theta_2}(x))\,\mathrm{d}\mu(x)=Z_\alpha(\theta_1,\theta_2)$, since $\int_{\mathcal{X}} p_{\theta_\alpha}(x)\,\mathrm{d}\mu(x)=1$, and, therefore, $(p_{\theta_1}p_{\theta_2})^{M}_{\alpha}=p_{\theta_\alpha}$. This trick has also been used in Bayesian hypothesis testing for upper bounding the probability of error between two densities of a parametric family of distributions by replacing the usual geometric mean (Section 11.7 of [6], page 375) by a more general quasi-arithmetic mean [29]. For example, the harmonic mean is well-suited to Cauchy distributions, and the power means to Student $t$-distributions [29].
As an application of these generalized JSDs, Deasy et al. [30] used the skewed geometric JSD (namely, the $G_\alpha$-JSD for $\alpha\in(0,1)$, where $G_\alpha$ denotes the weighted geometric mean), which admits a closed-form formula between normal densities [23], and showed how regularizing an optimization task with this G-JSD divergence improved reconstruction and generation of Variational AutoEncoders (VAEs).
More generally, instead of using the KLD, one can also use any arbitrary distance $D$ to define its JS-symmetrization, as follows:
$$ D^{\mathrm{JS}}[p:q] := \frac{1}{2}\left( D\!\left[p:\frac{p+q}{2}\right] + D\!\left[q:\frac{p+q}{2}\right] \right). $$
These symmetrizations may further be skewed by using $M_\alpha$ and/or $N_\beta$ for $\alpha\in(0,1)$ and $\beta\in(0,1)$, yielding the definition [23]:
$$ D^{\mathrm{JS}}_{M_\alpha,N_\beta}[p:q] := N_\beta\!\left( D\!\left[p:(pq)^{M}_{\alpha}\right],\ D\!\left[q:(pq)^{M}_{\alpha}\right] \right). $$
With these notations, the ordinary JSD corresponds to $(D_{\mathrm{KL}})^{\mathrm{JS}}_{A_{\frac{1}{2}},A_{\frac{1}{2}}}$, the JS-symmetrization of the KLD with respect to the arithmetic means $M=A_{\frac{1}{2}}$ and $N=A_{\frac{1}{2}}$.
The JS-symmetrization can be interpreted as the $N$-Jeffreys symmetrization of a generalization of Lin's $\alpha$-skewed $K$-divergence [4]:
$$ K^{M}_\alpha[p:q] := D\!\left[p:(pq)^{M}_{\alpha}\right], $$
with Lin's original skewed $K$-divergence recovered as $K_\alpha[p:q]=D_{\mathrm{KL}}[p:(1-\alpha)p+\alpha q]$ for the arithmetic mean $M=A$.
In this work, we consider symmetrizing an arbitrary distance $D$ (including the KLD), generalizing the Jensen-Shannon divergence by using a variational formula for the JSD. Namely, we observe that the Jensen-Shannon divergence can also be defined as the following minimization problem:
$$ D_{\mathrm{JS}}[p:q] = \min_{c\in\mathcal{D}} \frac{1}{2}\left( D_{\mathrm{KL}}[p:c] + D_{\mathrm{KL}}[q:c] \right), \qquad (32) $$
since the optimal density $c$ is proven unique using the calculus of variations [1,31,32] and it corresponds to the mid density $\frac{p+q}{2}$, a statistical (arithmetic) mixture.
Proof. Let $J(c):=\frac{1}{2}\left(D_{\mathrm{KL}}[p:c]+D_{\mathrm{KL}}[q:c]\right)$. We use the method of the Lagrange multipliers for the constrained optimization problem $\min_c J(c)$ such that $\int_{\mathcal{X}} c(x)\,\mathrm{d}\mu(x)=1$. Let us minimize $\frac{1}{2}\int_{\mathcal{X}}\left(p(x)\log\frac{p(x)}{c(x)}+q(x)\log\frac{q(x)}{c(x)}\right)\mathrm{d}\mu(x)+\lambda\left(\int_{\mathcal{X}} c(x)\,\mathrm{d}\mu(x)-1\right)$. The density $c$ realizing the minimum satisfies the Euler–Lagrange equation $\frac{\partial L}{\partial c}=0$, where $L(c):=\frac{1}{2}\left(p\log\frac{p}{c}+q\log\frac{q}{c}\right)+\lambda c$ is the Lagrangian. That is, $-\frac{1}{2}\left(\frac{p}{c}+\frac{q}{c}\right)+\lambda=0$ or, equivalently, $c=\frac{1}{2\lambda}(p+q)$. Parameter $\lambda$ is then evaluated from the constraint $\int_{\mathcal{X}} c(x)\,\mathrm{d}\mu(x)=1$: we get $\lambda=1$, since $\int_{\mathcal{X}}(p(x)+q(x))\,\mathrm{d}\mu(x)=2$. Therefore, we find that $c(x)=\frac{p(x)+q(x)}{2}$, the mid density of $p$ and $q$. □
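The variational characterization is easy to probe numerically; the following Python sketch (added illustration) compares the objective at the mid density against random candidate densities on a three-letter alphabet.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

def objective(p, q, c):
    return 0.5 * (kl(p, c) + kl(q, c))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.4, 0.5])
mid = 0.5 * (p + q)

# The mid density attains the minimum of the average KL over all densities c.
best_random = min(objective(p, q, c) for c in rng.dirichlet(np.ones(3), size=20000))
print(objective(p, q, mid))                          # equals the JSD value
print(best_random)                                   # random candidates never do better
print(best_random >= objective(p, q, mid) - 1e-12)   # True
```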
Considering Equation (32) instead of Equation (5) for defining the Jensen-Shannon divergence is interesting, because it allows one to consider a novel approach for generalizing the Jensen-Shannon divergence. This variational approach was first considered by Sibson [1] to define the $\alpha$-information radius of a set of weighted distributions while using Rényi $\alpha$-entropies that are based on Rényi principled $\alpha$-means [33]. The $\alpha$-information radius includes the Jensen-Shannon diversity index when $\alpha=1$. Sibson's work is our point of departure for generalizing the Jensen-Shannon divergence and proposing the Jensen-Shannon symmetrizations of arbitrary distances.
The paper is organized as follows: in Section 2, we recall the rationale and definitions of the Rényi $\alpha$-entropy and the Rényi $\alpha$-divergence [33], and explain the information radius of Sibson [1], which includes, as a special case, the ordinary Jensen-Shannon divergence and which can be interpreted as generalized skew Bhattacharyya distances. We report, in Theorem 2, a closed-form formula for calculating the information radius of order $\alpha$ between two densities of an exponential family when $\frac{1}{\alpha}$ is an integer. It is noteworthy to point out that Sibson's work (1969) includes, as a particular case of the information radius, a definition of the JSD, prior to the well-known reference paper of Lin [4] (1991). In Section 3, we present the JS-symmetrization variational definition that is based on a generalization of the information radius with a generic mean (Equation (88) and Definition 3). In Section 4, we constrain the mixture density to belong to a prescribed class of (parametric) probability densities, like an exponential family [2], and obtain a relative information radius generalizing the information radius and related to the concept of information projections. Our Definition 5 generalizes the (relative) normal information radius of Sibson [1], who considered the multivariate normal family (Proposition 4). We illustrate this notion of relative information radius by calculating the density of an exponential family minimizing the reverse Kullback–Leibler divergence from a mixture of densities of that exponential family (Proposition 6). Moreover, we get a semi-closed-form formula for the Kullback–Leibler divergence between the densities of two different exponential families (Proposition 5), generalizing the Fenchel–Young divergence [34]. As an application of these relative variational JSDs, we touch upon the problems of clustering and quantization of probability densities in Section 4.2. Finally, we conclude by summarizing our contributions and discussing related works in Section 5.
  2. Rényi Entropy and Divergence, and Sibson Information Radius
Rényi [33] investigated a generalization of the four axioms of Fadeev [35], yielding the unique Shannon entropy [20]. In doing so, Rényi replaced the ordinary weighted arithmetic mean by a more general class of averaging schemes. Namely, Rényi considered the weighted quasi-arithmetic means [36]. A weighted quasi-arithmetic mean can be induced by a strictly monotone and continuous function $g$, as follows:
$$ M_g(x_1,\ldots,x_n; w_1,\ldots,w_n) := g^{-1}\!\left( \sum_{i=1}^n w_i\, g(x_i) \right), $$
where the $x_i$'s and the $w_i$'s are positive (the weights are normalized, so that $\sum_{i=1}^n w_i=1$). Because $M_g=M_{-g}$, we may assume without loss of generality that $g$ is a strictly increasing and continuous function. The quasi-arithmetic means were investigated independently by Kolmogorov [36], Nagumo [37], and de Finetti [38].
For example, the power means $P_\delta(x,y)$ introduced earlier are quasi-arithmetic means for the generator $g_\delta(u):=u^\delta$ (with $\delta\neq 0$):
$$ P_\delta(x,y) = M_{g_\delta}\!\left(x,y; \tfrac{1}{2},\tfrac{1}{2}\right) = \left(\frac{x^\delta+y^\delta}{2}\right)^{\frac{1}{\delta}}. $$
Rényi proved that, among the class of weighted quasi-arithmetic means, only the means induced by the family of exponential functions
$$ g_\alpha(u) := e^{(\alpha-1)u}, $$
for $\alpha>0$ and $\alpha\neq 1$, yield a proper generalization of Shannon entropy, nowadays called the Rényi $\alpha$-entropy. The Rényi $\alpha$-mean is
$$ R^w_\alpha(x_1,\ldots,x_n) := g_\alpha^{-1}\!\left(\sum_{i=1}^n w_i\, g_\alpha(x_i)\right) = \frac{1}{\alpha-1}\log\!\left(\sum_{i=1}^n w_i\, e^{(\alpha-1)x_i}\right). $$
The Rényi $\alpha$-means $R^w_\alpha$ are not power means: They are not homogeneous means [31]. Let $R_\alpha(x,y):=R^{(\frac{1}{2},\frac{1}{2})}_\alpha(x,y)$. Subsequently, we have $\lim_{\alpha\to\infty} R_\alpha(x,y)=\max(x,y)$ and $\lim_{\alpha\to 1} R_\alpha(x,y)=\frac{x+y}{2}=A(x,y)$. Indeed, we have
$$ R_\alpha(x,y) = \frac{1}{\alpha-1}\log\!\left(\frac{e^{(\alpha-1)x}+e^{(\alpha-1)y}}{2}\right) \;\xrightarrow[\alpha\to 1]{}\; \frac{x+y}{2}, $$
using the following first-order approximations: $e^{(\alpha-1)x}\simeq 1+(\alpha-1)x$ and $\log(1+u)\simeq u$.
To obtain an intuition of the Rényi entropy, we may consider generalized entropies derived from quasi-arithmetic means, as follows:
$$ h_g[p] := g^{-1}\!\left(\sum_{x\in\mathcal{X}} p(x)\, g\!\left(\log\frac{1}{p(x)}\right)\right). $$
When $g(u)=u$, we recover Shannon entropy. When $g(u)=e^{-u}$, we get $h_2[p]=-\log\sum_{x\in\mathcal{X}} p^2(x)$, called the collision entropy, since $-\log\Pr[X_1=X_2]=h_2[p]$, when $X_1$ and $X_2$ are independent and identically distributed random variables with $X_1\sim p$ and $X_2\sim p$. When $g(u)=e^{(1-\alpha)u}$, we get
$$ h_\alpha[p] := \frac{1}{1-\alpha}\log\!\left(\sum_{x\in\mathcal{X}} p^\alpha(x)\right). \qquad (41) $$
The formula of Equation (41) is the discrete Rényi $\alpha$-entropy [33], which can be defined more generally on a measure space $(\mathcal{X},\mathcal{A},\mu)$, as follows:
$$ h_\alpha[p] := \frac{1}{1-\alpha}\log\!\left(\int_{\mathcal{X}} p^\alpha(x)\,\mathrm{d}\mu(x)\right), \qquad \alpha>0,\ \alpha\neq 1. \qquad (42) $$
In the limit case $\alpha\to 1$, the Rényi $\alpha$-entropy converges to Shannon entropy: $\lim_{\alpha\to 1} h_\alpha[p]=h[p]$. Rényi $\alpha$-entropies are non-increasing with respect to increasing $\alpha$: $h_\alpha[p]\geq h_{\alpha'}[p]$ for $\alpha<\alpha'$. In the discrete case (i.e., counting measure $\mu_c$ on a finite alphabet $\mathcal{X}$), we can further define $h_0[p]:=\log|\{x\in\mathcal{X}: p(x)>0\}|$ for $\alpha=0$ (also called max-entropy or Hartley entropy). The Rényi $+\infty$-entropy
$$ h_\infty[p] := -\log\max_{x\in\mathcal{X}} p(x) $$
is also called the min-entropy, since the sequence $(h_\alpha[p])_\alpha$ is non-increasing with respect to increasing $\alpha$.
Similarly, Rényi obtained the $\alpha$-divergences for $\alpha>0$ and $\alpha\neq 1$ (originally called information gain of order $\alpha$):
$$ D^{R}_\alpha[p:q] := \frac{1}{\alpha-1}\log\!\left(\int_{\mathcal{X}} p^\alpha(x)\, q^{1-\alpha}(x)\,\mathrm{d}\mu(x)\right), $$
generalizing the Kullback–Leibler divergence, since $\lim_{\alpha\to 1} D^{R}_\alpha[p:q]=D_{\mathrm{KL}}[p:q]$. Rényi $\alpha$-divergences are non-decreasing with respect to increasing $\alpha$ [39]: $D^{R}_\alpha[p:q]\leq D^{R}_{\alpha'}[p:q]$ for $\alpha\leq\alpha'$.
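A short Python sketch (added illustration) of the discrete Rényi $\alpha$-divergence, showing the monotonicity in $\alpha$ and the KLD limit at $\alpha\to 1$:

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """Discrete Renyi alpha-divergence: log(sum_x p^alpha q^(1-alpha)) / (alpha - 1)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.isclose(alpha, 1.0):
        return float(np.sum(p * np.log(p / q)))   # KLD (limit case)
    return float(np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0))

p = np.array([0.5, 0.3, 0.2]); q = np.array([0.1, 0.4, 0.5])
for a in [0.5, 0.9, 0.999, 1.0, 2.0, 5.0]:
    print(a, renyi_divergence(p, q, a))   # non-decreasing in alpha; alpha -> 1 gives the KLD
```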
Sibson (Robin Sibson (1944–2017) is also renowned for inventing the natural neighbour interpolation [40]) [1] considered both the Rényi $\alpha$-divergence [33] $D^{R}_\alpha$ and the Rényi $\alpha$-weighted mean $R^w_\alpha$ to define the information radius $R_\alpha(\mathcal{P})$ of order $\alpha$ of a weighted set $\mathcal{P}=\{(w_1,p_1),\ldots,(w_n,p_n)\}$ of densities $p_i$'s as the following minimization problem:
$$ R_\alpha(\mathcal{P}) := \min_{c\in\mathcal{D}} R^w_\alpha\!\left( D^{R}_\alpha[p_1:c],\ldots,D^{R}_\alpha[p_n:c] \right), $$
where
$$ R^w_\alpha(x_1,\ldots,x_n) := \frac{1}{\alpha-1}\log\!\left(\sum_{i=1}^n w_i\, e^{(\alpha-1)x_i}\right). $$
The Rényi $\alpha$-weighted mean $R^w_\alpha$ can be rewritten as
$$ R^w_\alpha(x_1,\ldots,x_n) = \frac{1}{\alpha-1}\,\mathrm{lse}\!\left((\alpha-1)x_1+\log w_1,\ldots,(\alpha-1)x_n+\log w_n\right), $$
where the function $\mathrm{lse}(a_1,\ldots,a_n):=\log\left(\sum_{i=1}^n e^{a_i}\right)$ denotes the log-sum-exp (convex) function [41,42].
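The log-sum-exp rewriting makes the information radius objective straightforward to evaluate; here is a Python sketch (added illustration, assuming the definitions above) using scipy's logsumexp.

```python
import numpy as np
from scipy.special import logsumexp

def renyi_weighted_mean(values, weights, alpha):
    """R^w_alpha(x_1,...,x_n) = lse((alpha-1) x_i + log w_i) / (alpha - 1)."""
    values, weights = np.asarray(values, float), np.asarray(weights, float)
    return float(logsumexp((alpha - 1.0) * values + np.log(weights)) / (alpha - 1.0))

def renyi_divergence(p, q, alpha):
    return float(np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0))

def info_radius_objective(densities, weights, c, alpha):
    """Objective minimized over c in Sibson's information radius of order alpha."""
    divs = [renyi_divergence(p, c, alpha) for p in densities]
    return renyi_weighted_mean(divs, weights, alpha)

P = [np.array([0.6, 0.3, 0.1]), np.array([0.1, 0.4, 0.5])]
w = np.array([0.5, 0.5])
c = np.array([0.35, 0.35, 0.30])
print(info_radius_objective(P, w, c, alpha=2.0))
```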
Notice that $e^{(\alpha-1)D^{R}_\alpha[p:c]}=\int_{\mathcal{X}} p^\alpha(x)\,c^{1-\alpha}(x)\,\mathrm{d}\mu(x)=\rho_\alpha[p:c]$, the Bhattacharyya $\alpha$-coefficient [12] (also called Chernoff $\alpha$-coefficient [43,44]):
$$ \rho_\alpha[p:q] := \int_{\mathcal{X}} p^\alpha(x)\, q^{1-\alpha}(x)\,\mathrm{d}\mu(x). $$
Thus, we have
$$ R_\alpha(\mathcal{P}) = \min_{c\in\mathcal{D}} \frac{1}{\alpha-1}\log\!\left(\sum_{i=1}^n w_i\,\rho_\alpha[p_i:c]\right). $$
The ordinary Bhattacharyya coefficient is obtained for $\alpha=\frac{1}{2}$: $\rho[p:q]:=\int_{\mathcal{X}}\sqrt{p(x)q(x)}\,\mathrm{d}\mu(x)$.
Sibson [1] also considered the limit case $\alpha\to\infty$ when defining the information radius:
$$ R_\infty(\mathcal{P}) := \lim_{\alpha\to\infty} R_\alpha(\mathcal{P}). $$
Sibson reported the following theorem in his information radius study [1]:
Theorem 1 (Theorem 2.2 and Corollary 2.3 of [1]). The optimal density $c^*_\alpha$ is unique, and we have:
$$ c^*_\alpha(x) = \frac{\left(\sum_{i=1}^n w_i\, p_i^\alpha(x)\right)^{\frac{1}{\alpha}}}{\int_{\mathcal{X}}\left(\sum_{i=1}^n w_i\, p_i^\alpha(t)\right)^{\frac{1}{\alpha}}\mathrm{d}\mu(t)}, \qquad R_\alpha(\mathcal{P}) = \frac{\alpha}{\alpha-1}\log\int_{\mathcal{X}}\left(\sum_{i=1}^n w_i\, p_i^\alpha(x)\right)^{\frac{1}{\alpha}}\mathrm{d}\mu(x), \qquad \alpha\in(0,1)\cup(1,\infty), $$
with the limit cases $c^*_1=\bar p=\sum_i w_i p_i$ and $R_1(\mathcal{P})=\sum_i w_i D_{\mathrm{KL}}[p_i:\bar p]$ (the Jensen-Shannon diversity index), and $c^*_\infty(x)=\frac{\max_i p_i(x)}{\int_{\mathcal{X}}\max_i p_i(t)\,\mathrm{d}\mu(t)}$ with $R_\infty(\mathcal{P})=\log\int_{\mathcal{X}}\max_i p_i(x)\,\mathrm{d}\mu(x)$. Observe that $c^*_\infty$ does not depend on the (positive) weights.
The proof follows from the following decomposition of the information radius:
$$ R^w_\alpha\!\left(D^{R}_\alpha[p_1:c],\ldots,D^{R}_\alpha[p_n:c]\right) = R_\alpha(\mathcal{P}) + D^{R}_\alpha[c^*_\alpha:c] \;\geq\; R_\alpha(\mathcal{P}). $$
Because the proof is omitted in [1], we report it here:
Proof. Let $S_\alpha(x):=\sum_{i=1}^n w_i\, p_i^\alpha(x)$. We handle the three cases, depending on the $\alpha$ values:
Case $\alpha\in(0,1)\cup(1,\infty)$: Let $Z_\alpha:=\int_{\mathcal{X}} S_\alpha(x)^{\frac{1}{\alpha}}\,\mathrm{d}\mu(x)$, so that $S_\alpha=Z_\alpha^\alpha\,(c^*_\alpha)^\alpha$. We have $e^{(\alpha-1)D^{R}_\alpha[p_i:c]}=\int_{\mathcal{X}} p_i^\alpha c^{1-\alpha}\,\mathrm{d}\mu$. We obtain
$$ R^w_\alpha\!\left(D^{R}_\alpha[p_1:c],\ldots,D^{R}_\alpha[p_n:c]\right) = \frac{1}{\alpha-1}\log\int_{\mathcal{X}} S_\alpha(x)\,c^{1-\alpha}(x)\,\mathrm{d}\mu(x) = \frac{\alpha}{\alpha-1}\log Z_\alpha + \frac{1}{\alpha-1}\log\int_{\mathcal{X}} (c^*_\alpha)^\alpha(x)\, c^{1-\alpha}(x)\,\mathrm{d}\mu(x) = R_\alpha(\mathcal{P}) + D^{R}_\alpha[c^*_\alpha:c]. $$
Case $\alpha=1$: we have $R^w_1\!\left(D_{\mathrm{KL}}[p_1:c],\ldots,D_{\mathrm{KL}}[p_n:c]\right)=\sum_i w_i D_{\mathrm{KL}}[p_i:c]=\sum_i w_i D_{\mathrm{KL}}[p_i:\bar p]+D_{\mathrm{KL}}[\bar p:c]$ with $\bar p=\sum_i w_i p_i=c^*_1$. Because $D_{\mathrm{KL}}[\bar p:c]\geq 0$, we have
$$ \sum_{i=1}^n w_i D_{\mathrm{KL}}[p_i:c] \;\geq\; \sum_{i=1}^n w_i D_{\mathrm{KL}}[p_i:\bar p] = R_1(\mathcal{P}). $$
Case $\alpha=\infty$: we have $R^w_\infty(x_1,\ldots,x_n)=\max_i x_i$, $D^{R}_\infty[p:c]=\log\sup_x\frac{p(x)}{c(x)}$, and $c^*_\infty=\frac{\max_i p_i}{\int\max_i p_i\,\mathrm{d}\mu}$. We have $\max_i\sup_x\frac{p_i(x)}{c(x)}=\sup_x\frac{\max_i p_i(x)}{c(x)}\geq\int_{\mathcal{X}}\frac{\max_i p_i(x)}{c(x)}\,c(x)\,\mathrm{d}\mu(x)=\int_{\mathcal{X}}\max_i p_i(x)\,\mathrm{d}\mu(x)$, with equality when $c=c^*_\infty$. Thus, $R_\infty(\mathcal{P})=\log\int_{\mathcal{X}}\max_i p_i(x)\,\mathrm{d}\mu(x)$. □
It follows that
$$ R_\alpha(\mathcal{P}) = \min_{c\in\mathcal{D}} R^w_\alpha\!\left(D^{R}_\alpha[p_1:c],\ldots,D^{R}_\alpha[p_n:c]\right) = \frac{\alpha}{\alpha-1}\log\int_{\mathcal{X}}\left(\sum_{i=1}^n w_i\, p_i^\alpha(x)\right)^{\frac{1}{\alpha}}\mathrm{d}\mu(x). $$
Thus we have $R_\alpha(\mathcal{P})=R^w_\alpha\!\left(D^{R}_\alpha[p_1:c^*_\alpha],\ldots,D^{R}_\alpha[p_n:c^*_\alpha]\right)$, since the right-hand side of the decomposition is minimized for $c=c^*_\alpha$.
Notice that $c^*_\infty$ is the upper envelope of the densities $p_i$'s normalized to be a density. Provided that the densities $p_i$'s intersect pairwise in at most $s$ locations (i.e., $|\{x: p_i(x)=p_j(x)\}|\leq s$ for $i\neq j$), we can efficiently compute this upper envelope using an output-sensitive algorithm [45] of computational geometry.
When the point set is $\mathcal{P}=\{(\frac{1}{2},p),(\frac{1}{2},q)\}$ with $n=2$, the information radius defines a (2-point) symmetric distance, as follows:
$$ R_\alpha(p,q) := \frac{\alpha}{\alpha-1}\log\int_{\mathcal{X}}\left(\frac{p^\alpha(x)+q^\alpha(x)}{2}\right)^{\frac{1}{\alpha}}\mathrm{d}\mu(x), \qquad \alpha\neq 1. $$
This family of symmetric divergences may be called the Sibson $\alpha$-divergences, and the Jensen-Shannon divergence is interpreted as a limit case when $\alpha\to 1$. Notice that, since we have $\lim_{\alpha\to 1} c^*_\alpha=\frac{p+q}{2}$ and $\lim_{\alpha\to 1} R^w_\alpha=A$ (the arithmetic weighted mean), we have $\lim_{\alpha\to 1} R_\alpha(p,q)=D_{\mathrm{JS}}[p:q]$. Notice that, for $\alpha\neq 1$, the integral and logarithm operations are swapped as compared to $R_1(p,q)=D_{\mathrm{JS}}[p:q]$ for $\alpha=1$.
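The following Python sketch (added illustration, assuming the closed-form power-mean expression reconstructed in Theorem 1) evaluates the Sibson $\alpha$-divergence on a finite alphabet and shows the $\alpha\to 1$ limit approaching the JSD.

```python
import numpy as np

def sibson_alpha(p, q, alpha):
    """Sibson alpha-divergence (discrete case), assuming the closed form
    R_alpha = (alpha/(alpha-1)) * log sum_x ((p^alpha + q^alpha)/2)^(1/alpha)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    power_mean = (0.5 * (p**alpha + q**alpha)) ** (1.0 / alpha)
    return float(alpha / (alpha - 1.0) * np.log(np.sum(power_mean)))

def jsd(p, q):
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * (kl(p, m) + kl(q, m))

p = np.array([0.5, 0.3, 0.2]); q = np.array([0.1, 0.4, 0.5])
for a in [0.5, 0.9, 0.99, 0.999, 1.001, 2.0]:
    print(a, sibson_alpha(p, q, a))
print("JSD:", jsd(p, q))   # the alpha -> 1 limit of the Sibson alpha-divergence
```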
Theorem 2. When $\alpha=\frac{1}{k}$ for an integer $k\geq 2$, the Sibson $\alpha$-divergence between two densities $p_{\theta_1}$ and $p_{\theta_2}$ of an exponential family $\mathcal{E}_F$ with cumulant function $F$ is available in closed form:
$$ R_{\frac{1}{k}}\!\left(p_{\theta_1},p_{\theta_2}\right) = -\frac{1}{k-1}\log\!\left(\frac{1}{2^k}\sum_{j=0}^k\binom{k}{j}\, e^{-J_{F,\frac{j}{k}}(\theta_1:\theta_2)}\right), \qquad J_{F,\beta}(\theta_1:\theta_2) := \beta F(\theta_1)+(1-\beta)F(\theta_2)-F(\beta\theta_1+(1-\beta)\theta_2). $$
Proof. Let $p_{\theta_1}$ and $p_{\theta_2}$ be two densities of an exponential family [2] with cumulant function $F$ and natural parameter space $\Theta$. Without a loss of generality, we may consider a natural exponential family [2] with densities written canonically as $p_\theta(x)=\exp(\langle\theta,x\rangle-F(\theta))$ for $\theta\in\Theta$. It can be shown that the cumulant function $F$ is strictly convex and analytic on the open convex natural parameter space $\Theta$ [2].
When $k=2$ (i.e., $\alpha=\frac{1}{2}$), we have:
$$ \int_{\mathcal{X}}\left(\frac{\sqrt{p_{\theta_1}(x)}+\sqrt{p_{\theta_2}(x)}}{2}\right)^2\mathrm{d}\mu(x) = \frac{1}{4}\left(2+2\int_{\mathcal{X}}\sqrt{p_{\theta_1}(x)\,p_{\theta_2}(x)}\,\mathrm{d}\mu(x)\right) = \frac{1+\rho[p_{\theta_1}:p_{\theta_2}]}{2}, $$
where $\rho$ is the Bhattacharyya coefficient (with $\rho=\rho_{\frac{1}{2}}$). Using Theorem 3 of [12], we have
$$ \rho[p_{\theta_1}:p_{\theta_2}] = e^{-J_F(\theta_1,\theta_2)}, \qquad J_F(\theta_1,\theta_2) := \frac{F(\theta_1)+F(\theta_2)}{2}-F\!\left(\frac{\theta_1+\theta_2}{2}\right), $$
so that we obtain the following closed-form formula:
$$ R_{\frac{1}{2}}\!\left(p_{\theta_1},p_{\theta_2}\right) = -\log\frac{1+e^{-J_F(\theta_1,\theta_2)}}{2}. $$
Now, assume that $k=\frac{1}{\alpha}$ is an arbitrary integer ($k\geq 2$), and let us apply the binomial expansion for $\left(\frac{p_{\theta_1}^{1/k}(x)+p_{\theta_2}^{1/k}(x)}{2}\right)^k$ in the spirit of [46,47]:
$$ \left(\frac{p_{\theta_1}^{1/k}(x)+p_{\theta_2}^{1/k}(x)}{2}\right)^k = \frac{1}{2^k}\sum_{j=0}^k\binom{k}{j}\, p_{\theta_1}^{\frac{j}{k}}(x)\, p_{\theta_2}^{\frac{k-j}{k}}(x). $$
Let $\beta_j:=\frac{j}{k}$. Because $\beta_j\theta_1+(1-\beta_j)\theta_2\in\Theta$ for $j\in\{0,\ldots,k\}$ (the natural parameter space $\Theta$ is convex), we get by following the calculation steps in [12]:
$$ \int_{\mathcal{X}} p_{\theta_1}^{\beta_j}(x)\, p_{\theta_2}^{1-\beta_j}(x)\,\mathrm{d}\mu(x) = \rho_{\beta_j}[p_{\theta_1}:p_{\theta_2}] = e^{-J_{F,\beta_j}(\theta_1:\theta_2)}. $$
Notice that $J_{F,0}(\theta_1:\theta_2)=J_{F,1}(\theta_1:\theta_2)=0$, and $\rho_{0}[p_{\theta_1}:p_{\theta_2}]=\rho_{1}[p_{\theta_1}:p_{\theta_2}]=1$.
Thus, we get the following closed-form formula:
$$ R_{\frac{1}{k}}\!\left(p_{\theta_1},p_{\theta_2}\right) = -\frac{1}{k-1}\log\!\left(\frac{1}{2^k}\sum_{j=0}^k\binom{k}{j}\, e^{-J_{F,\frac{j}{k}}(\theta_1:\theta_2)}\right). $$
 □
This closed-form formula applies, in particular, to the family of (multivariate) normal distributions $N(\mu,\Sigma)$: in this case, the natural parameters $\theta=(v,M)$ are expressed using both a vector parameter component $v$ and a matrix parameter component $M$:
$$ \theta = (v, M) = \left(\Sigma^{-1}\mu,\ -\frac{1}{2}\Sigma^{-1}\right), $$
and the cumulant function is:
$$ F(\theta) = \frac{d}{2}\log\pi - \frac{1}{2}\log|{-M}| - \frac{1}{4}\, v^\top M^{-1} v, $$
where $|\cdot|$ denotes the matrix determinant.
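As a concrete check (added illustration of the $k=2$, i.e., $\alpha=\frac{1}{2}$, case), the following Python sketch computes $R_{\frac{1}{2}}$ between two univariate normal densities from the cumulant function and verifies it against numerical integration of the order-$\frac{1}{2}$ power mean.

```python
import numpy as np
from scipy.integrate import quad

def natural_params(mu, sigma2):
    return np.array([mu / sigma2, -0.5 / sigma2])

def F(theta):
    """Cumulant function of the univariate normal family in natural coordinates."""
    t1, t2 = theta
    return -t1**2 / (4.0 * t2) + 0.5 * np.log(-np.pi / t2)

def jensen_F(th1, th2):
    """Jensen divergence J_F = (F(th1)+F(th2))/2 - F((th1+th2)/2) = Bhattacharyya distance."""
    return 0.5 * (F(th1) + F(th2)) - F(0.5 * (th1 + th2))

def normal_pdf(x, mu, sigma2):
    return np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

mu1, s1 = 0.0, 1.0
mu2, s2 = 2.0, 3.0
th1, th2 = natural_params(mu1, s1), natural_params(mu2, s2)

# Closed form for alpha = 1/2 (k = 2): R_{1/2} = -log((1 + rho)/2) with rho = exp(-J_F).
rho_closed = np.exp(-jensen_F(th1, th2))
R_half_closed = -np.log(0.5 * (1.0 + rho_closed))

# Numerical check: integrate the squared average of the two square-root densities.
integrand = lambda x: (0.5 * (np.sqrt(normal_pdf(x, mu1, s1)) + np.sqrt(normal_pdf(x, mu2, s2))))**2
Z, _ = quad(integrand, -30, 30)
R_half_numeric = -np.log(Z)     # alpha/(alpha-1) = -1 for alpha = 1/2

print(R_half_closed, R_half_numeric)   # the two values agree
```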
In general, the optimal density $c^*_\alpha$ yielding the information radius $R_\alpha(\mathcal{P})$ can be interpreted as a generalized centroid (extending the notion of Fréchet means [48]) with respect to $(R^w_\alpha, D^{R}_\alpha)$, where a $(M,D)$-centroid is defined by:
Definition 1 ($(M,D)$-centroid). Let $\mathcal{P}=\{(w_1,\theta_1),\ldots,(w_n,\theta_n)\}$ be a normalized weighted parameter set, $M$ a mean, and $D$ a distance. Subsequently, the $(M,D)$-centroid is defined as
$$ c_{M,D}(\mathcal{P}) := \arg\min_{c}\ M^w\!\left(D(\theta_1:c),\ldots,D(\theta_n:c)\right). $$
Here, we give a general definition of the $(M,D)$-centroid for an arbitrary distance (not necessarily a symmetric nor metric distance). The parameter set can either be probability measures having densities with respect to a given measure $\mu$ or a set of vectors. In the first case, the distance $D$ is called a statistical distance. When the densities belong to a parametric family of densities $\mathcal{F}=\{p_\theta\}_\theta$, the statistical distance $D[p_{\theta_1}:p_{\theta_2}]$ amounts to a parameter distance: $D(\theta_1:\theta_2):=D[p_{\theta_1}:p_{\theta_2}]$. For example, when all of the densities $p_{\theta_i}$'s belong to a same natural exponential family [2]
$$ \mathcal{E}_F := \left\{ p_\theta(x)=\exp\left(\langle\theta,x\rangle-F(\theta)\right)\ :\ \theta\in\Theta \right\} $$
with cumulant function $F$ (i.e., $F(\theta)=\log\int_{\mathcal{X}}\exp(\langle\theta,x\rangle)\,\mathrm{d}\mu(x)$) and sufficient statistic vector $t(x)=x$, we have $D_{\mathrm{KL}}[p_{\theta_1}:p_{\theta_2}]=B^*_F(\theta_1:\theta_2)$, where $B^*_F(\theta_1:\theta_2):=B_F(\theta_2:\theta_1)$ denotes the reverse Bregman divergence (obtained by swapping the parameter order) of the Bregman divergence [21] $B_F$ defined by
$$ B_F(\theta_1:\theta_2) := F(\theta_1)-F(\theta_2)-\langle\theta_1-\theta_2,\nabla F(\theta_2)\rangle. $$
Thus, we have $D_{\mathrm{KL}}[p_{\theta_1}:p_{\theta_2}]=B_F(\theta_2:\theta_1)$.
Let $\mathcal{W}=\{(w_1,\theta_1),\ldots,(w_n,\theta_n)\}$ be the parameter set corresponding to $\mathcal{P}$. Define
$$ \bar\theta := \sum_{i=1}^n w_i\,\theta_i. $$
Subsequently, we have the equivalent decomposition of Proposition 1:
$$ \sum_{i=1}^n w_i\, B_F(\theta_i:\theta) = \sum_{i=1}^n w_i\, B_F(\theta_i:\bar\theta) + B_F(\bar\theta:\theta), $$
with $\bar\theta=\sum_i w_i\theta_i$ (this decomposition is used to prove Proposition 1 of [21]). The quantity $J_F(\mathcal{W}):=\sum_i w_i B_F(\theta_i:\bar\theta)$ was termed the Bregman information [21,49]. The Bregman information generalizes the variance that is obtained when the Bregman divergence is the squared Euclidean distance. $J_F(\mathcal{W})$ could also be called a Bregman information radius, according to Sibson. Because $D_{\mathrm{KL}}[p_{\theta_1}:p_{\theta_2}]=B_F(\theta_2:\theta_1)$, we can interpret the Bregman information as a Sibson information radius for densities of an exponential family with respect to the arithmetic mean $A$ and the reverse Kullback–Leibler divergence: $J_F(\mathcal{W})=\min_\theta\sum_i w_i D^*_{\mathrm{KL}}[p_{\theta_i}:p_\theta]$, with $D^*_{\mathrm{KL}}[p:q]:=D_{\mathrm{KL}}[q:p]$. This observation yields us the JS-symmetrization of distances based on generalized information radii in Section 3.
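The Bregman information and its variance interpretation can be checked with a few lines of Python (added illustration, using the squared Euclidean generator):

```python
import numpy as np

def bregman(F, gradF, t1, t2):
    """Bregman divergence B_F(t1 : t2) = F(t1) - F(t2) - <t1 - t2, grad F(t2)>."""
    return F(t1) - F(t2) - np.dot(t1 - t2, gradF(t2))

def bregman_information(F, gradF, weights, thetas):
    """J_F(W) = sum_i w_i B_F(theta_i : theta_bar), theta_bar the weighted arithmetic mean."""
    w = np.asarray(weights, float)
    T = np.asarray(thetas, float)
    tbar = w @ T
    return sum(wi * bregman(F, gradF, ti, tbar) for wi, ti in zip(w, T))

# Squared Euclidean generator: the Bregman information reduces to half the weighted variance.
F     = lambda t: 0.5 * np.dot(t, t)
gradF = lambda t: t

w = np.array([0.2, 0.3, 0.5])
T = np.array([[0.0], [1.0], [4.0]])
print(bregman_information(F, gradF, w, T))
print(0.5 * np.sum(w * (T[:, 0] - w @ T[:, 0]) ** 2))   # matches
```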
More generally, we may consider densities belonging to a deformed $q$-exponential family (see [10], pages 85–89, and the monograph [50]). Deformed $q$-exponential families generalize the exponential families and include the $q$-Gaussians [10]. A common way to measure the statistical distance between two densities of a $q$-exponential family is the $q$-divergence [10], which is related to Tsallis' entropy [51]. We may also define another statistical divergence between two densities of a $q$-exponential family which amounts to a Bregman divergence. For example, we refer to [52] for details concerning the family of Cauchy distributions, which are $q$-Gaussians for $q=2$.
Sibson proved that the information radii of any order are all upper bounded (Theorem 2.8 and Theorem 2.9 of [1]), as follows:
$$ R_\alpha(\mathcal{P}) \leq \frac{\alpha}{\alpha-1}\log\sum_{i=1}^n w_i^{\frac{1}{\alpha}}, \qquad \alpha\in(0,1)\cup(1,\infty), \qquad (73) $$
$$ R_1(\mathcal{P}) \leq \sum_{i=1}^n w_i\log\frac{1}{w_i}, \qquad (74) $$
$$ R_\infty(\mathcal{P}) \leq \log n. \qquad (75) $$
We interpret Sibson's upper bounds of Equations (73)–(75), as follows:
Proposition 2 (Information radius upper bound). The information radius of order $\alpha$ of a weighted set of distributions is upper bounded by the discrete Rényi entropy of order $\frac{1}{\alpha}$ of the weight distribution: $R_\alpha(\mathcal{P})\leq H_{\frac{1}{\alpha}}(w)$, where $H_\beta(w):=\frac{1}{1-\beta}\log\sum_{i=1}^n w_i^\beta$.
   3. JS-Symmetrization of Distances Based on Generalized Information Radius
Let us give the following definitions generalizing the information radius (i.e., a Jensen-Shannon symmetrization of the distance when $n=2$) and the ordinary Jensen-Shannon divergence:
Definition 2 ($(M,D)$-information radius). Let $M$ be a weighted mean and $D$ a distance. Subsequently, the $(M,D)$ generalized information radius for a weighted set of points (e.g., vectors or densities) $\mathcal{P}=\{(w_1,p_1),\ldots,(w_n,p_n)\}$ is:
$$ R_{M,D}(\mathcal{P}) := \min_{c}\ M^w\!\left(D(p_1:c),\ldots,D(p_n:c)\right). $$
Recall that we also defined the $(M,D)$-centroid in Definition 1 as follows:
$$ c_{M,D}(\mathcal{P}) := \arg\min_{c}\ M^w\!\left(D(p_1:c),\ldots,D(p_n:c)\right). $$
When $M$ is the arithmetic mean and $D$ a squared metric distance, we recover the notion of Fréchet mean [48]. Notice that, although the minimum $R_{M,D}(\mathcal{P})$ is unique, several generalized centroids $c_{M,D}(\mathcal{P})$ may potentially exist, depending on $(M,D)$. In particular, Definition 2 and Definition 1 apply when $D$ is a statistical distance, i.e., a distance between densities (Radon–Nikodym derivatives of corresponding probability measures with respect to a dominating measure $\mu$).
The generalized information radius can be interpreted as a diversity index or an $n$-point distance. When $n=2$, we get the following (2-point) distances, which are considered as a generalization of the Jensen-Shannon divergence or Jensen-Shannon symmetrization:
Definition 3 ($M$-vJS symmetrization of $D$). Let $M$ be a mean and $D$ a statistical distance. Subsequently, the variational Jensen-Shannon symmetrization of $D$ is defined by the formula of a generalized information radius:
$$ D^{\mathrm{vJS}}_{M}[p:q] := \min_{c\in\mathcal{D}}\ M\!\left(D[p:c],\,D[q:c]\right). $$
We use the acronym vJS to distinguish it from the JS-symmetrization reported in [23]:
$$ D^{\mathrm{JS}}_{M,N}[p:q] := N\!\left(D\!\left[p:(pq)^{M}_{\frac{1}{2}}\right],\,D\!\left[q:(pq)^{M}_{\frac{1}{2}}\right]\right). $$
We recover Sibson's information radius $R_\alpha(p,q)$ induced by two densities $p$ and $q$ from Definition 3 as the $R^w_\alpha$-vJS symmetrization of the Rényi divergence $D^{R}_\alpha$. For densities of an exponential family, the $A$-vJS symmetrization of the reverse KLD $D^*_{\mathrm{KL}}$ yields the Bregman information [21]. Notice that we may skew these generalized JSDs by taking a weighted mean $M_\beta$ instead of $M$ for $\beta\in(0,1)$, yielding the general definition:
Definition 4 (Skew $M_\beta$-vJS symmetrization of $D$). Let $M_\beta$ be a weighted mean and $D$ a statistical distance. Subsequently, the variational skewed Jensen-Shannon symmetrization of $D$ is defined by the formula of a generalized information radius:
$$ D^{\mathrm{vJS}}_{M_\beta}[p:q] := \min_{c\in\mathcal{D}}\ M_\beta\!\left(D[p:c],\,D[q:c]\right). $$
Example 1. For example, the skewed Jensen–Bregman divergence of Equation (20), $\mathrm{JB}_F^\beta(\theta_1:\theta_2):=(1-\beta)B_F(\theta_1:(\theta_1\theta_2)_\beta)+\beta B_F(\theta_2:(\theta_1\theta_2)_\beta)$ with $(\theta_1\theta_2)_\beta:=(1-\beta)\theta_1+\beta\theta_2$, can be interpreted as a Jensen-Shannon symmetrization of the Bregman divergence $B_F$ [12], since we have:
$$ \mathrm{JB}_F^\beta(\theta_1:\theta_2) = \min_{\theta}\left\{(1-\beta)B_F(\theta_1:\theta)+\beta B_F(\theta_2:\theta)\right\}. $$
Indeed, the Bregman barycenter $\arg\min_{\theta}(1-\beta)B_F(\theta_1:\theta)+\beta B_F(\theta_2:\theta)$ is unique and it corresponds to $(\theta_1\theta_2)_\beta$, see [21]. The skewed Jensen–Bregman divergence $\mathrm{JB}_F^\beta$ can also be rewritten as an equivalent skewed Jensen divergence (see Equation (22)):
$$ \mathrm{JB}_F^\beta(\theta_1:\theta_2) = J_F^\beta(\theta_1:\theta_2) := (1-\beta)F(\theta_1)+\beta F(\theta_2)-F((1-\beta)\theta_1+\beta\theta_2). $$
Example 2. Consider a conformal Bregman divergence [53] that is defined by
$$ B_{F,\rho}(\theta_1:\theta_2) := \rho(\theta_1)\,B_F(\theta_1:\theta_2), $$
where $\rho(\theta)>0$ is a conformal factor. Subsequently, we have
$$ \min_{\theta}\left\{(1-\beta)\,B_{F,\rho}(\theta_1:\theta)+\beta\,B_{F,\rho}(\theta_2:\theta)\right\} = \left((1-\beta)\rho(\theta_1)+\beta\rho(\theta_2)\right) J_F^{\gamma}(\theta_1:\theta_2), $$
where $\gamma:=\frac{\beta\rho(\theta_2)}{(1-\beta)\rho(\theta_1)+\beta\rho(\theta_2)}$ and $J_F^{\gamma}$ is the skewed Jensen divergence.
Notice that this definition is implicit and it can be made explicit when the centroid $c^*_{M_\beta,D}$ is unique:
In particular, when $D=D_{\mathrm{KL}}$, the KLD, we obtain generalized skewed Jensen-Shannon divergences for $M_\beta$ a weighted mean with $\beta\in(0,1)$:
$$ D^{\mathrm{vJS}}_{\mathrm{KL},M_\beta}[p:q] := \min_{c\in\mathcal{D}}\ M_\beta\!\left(D_{\mathrm{KL}}[p:c],\,D_{\mathrm{KL}}[q:c]\right). $$
Example 3. Amari [31] obtained the $(A,D_\alpha)$-information radius and its corresponding unique centroid for $D=D_\alpha$, the $\alpha$-divergence of information geometry [10] (page 67).
Example 4. Brekelmans et al. [54] studied the geometric path $(p_0p_1)^{G_\lambda}$ between two distributions $p_0$ and $p_1$ of $\mathcal{D}$, where $G_\lambda(a,b):=a^{1-\lambda}b^\lambda$ (with $\lambda\in(0,1)$) is the weighted geometric mean. They proved the variational formula:
$$ (p_0p_1)^{G_\lambda} = \arg\min_{c\in\mathcal{D}}\left\{(1-\lambda)\,D_{\mathrm{KL}}[c:p_0]+\lambda\,D_{\mathrm{KL}}[c:p_1]\right\}. $$
That is, $(p_0p_1)^{G_\lambda}$ is a $(A_\lambda,D^*_{\mathrm{KL}})$-centroid, where $D^*_{\mathrm{KL}}[p:c]:=D_{\mathrm{KL}}[c:p]$ is the reverse KLD. The corresponding $(A_\lambda,D^*_{\mathrm{KL}})$-vJSD is studied in [23] and it is used in deep learning in [30]. It is interesting to study the link between the $M_\beta$-variational Jensen-Shannon symmetrization of $D$ and the $(M,N)$-JS symmetrization of [23]; in particular, the link between the mean used for averaging in the minimization and the mean used for generating abstract mixtures. More generally, Brekelmans et al. [55] considered the $\alpha$-divergences extended to positive measures (i.e., a separable divergence built as the difference between a weighted arithmetic mean and a geometric mean [56]):
$$ D_\alpha[\tilde p:\tilde q] := \frac{4}{1-\alpha^2}\int_{\mathcal{X}}\left(\frac{1-\alpha}{2}\,\tilde p(x)+\frac{1+\alpha}{2}\,\tilde q(x)-\tilde p(x)^{\frac{1-\alpha}{2}}\,\tilde q(x)^{\frac{1+\alpha}{2}}\right)\mathrm{d}\mu(x), $$
and proved that
$$ c^*_\lambda(x) \propto \left((1-\lambda)\,p_0(x)^{1-q}+\lambda\,p_1(x)^{1-q}\right)^{\frac{1}{1-q}} $$
is a density of a likelihood ratio $q$-exponential family for $q=\frac{1+\alpha}{2}$. That is, $c^*_\lambda$ is the $(A_\lambda,D_\alpha)$-generalized centroid, and the corresponding information radius is the variational JS symmetrization: $(D_\alpha)^{\mathrm{vJS}}_{A_\lambda}[p_0:p_1]=(1-\lambda)D_\alpha[p_0:c^*_\lambda]+\lambda D_\alpha[p_1:c^*_\lambda]$.
Example 5. The $q$-divergence [57] between two densities of a $q$-exponential family amounts to a Bregman divergence [10,57]. Thus, the $(A,D_q)$-information radius for densities of a $q$-exponential family is a generalized information radius that amounts to a Bregman information.
For the case $\alpha=\infty$ in Sibson's information radius, we find that the information radius is related to the total variation:
Proposition 3 (Lemma 2.4 [1]). We have:
$$ R_\infty(p,q) = \log\left(1+D_{\mathrm{TV}}[p,q]\right), $$
where $D_{\mathrm{TV}}[p,q]:=\frac{1}{2}\int_{\mathcal{X}}|p(x)-q(x)|\,\mathrm{d}\mu(x)$ denotes the total variation.
Proof. Because $\max(p(x),q(x))=\frac{p(x)+q(x)}{2}+\frac{|p(x)-q(x)|}{2}$, it follows that we have:
$$ \int_{\mathcal{X}}\max(p(x),q(x))\,\mathrm{d}\mu(x) = 1+\frac{1}{2}\int_{\mathcal{X}}|p(x)-q(x)|\,\mathrm{d}\mu(x) = 1+D_{\mathrm{TV}}[p,q]. $$
From Theorem 1, we have $R_\infty(p,q)=\log\int_{\mathcal{X}}\max(p(x),q(x))\,\mathrm{d}\mu(x)$ and, therefore, $R_\infty(p,q)=\log(1+D_{\mathrm{TV}}[p,q])$. □
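The order-$\infty$ case can also be probed numerically; the following Python sketch (added illustration, assuming the order-$\infty$ objective, the maximum of the two Rényi $\infty$-divergences, described in the proof of Theorem 1) checks the identity on a finite alphabet.

```python
import numpy as np

rng = np.random.default_rng(1)

p = np.array([0.5, 0.3, 0.2]); q = np.array([0.1, 0.4, 0.5])

def renyi_inf(a, b):
    """Renyi divergence of order infinity: log max_x a(x)/b(x)."""
    return float(np.log(np.max(a / b)))

def objective(c):
    """Order-infinity information radius objective: max of the two divergences to c."""
    return max(renyi_inf(p, c), renyi_inf(q, c))

tv = 0.5 * np.sum(np.abs(p - q))
c_star = np.maximum(p, q) / np.sum(np.maximum(p, q))     # normalized upper envelope

print(np.log(1.0 + tv))                                  # log(1 + TV)
print(objective(c_star))                                 # attained at the upper envelope
print(min(objective(c) for c in rng.dirichlet(np.ones(3), 5000)))  # never below log(1+TV)
```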
Notice that, when $M_\beta=M^g_\beta$ is a quasi-arithmetic mean induced by a strictly increasing generator $g$, we may consider the divergence $D_g[p:c]:=g(D[p:c])$, so that the centroid of the $M^g_\beta$-vJS symmetrization is:
$$ c^* = \arg\min_{c\in\mathcal{D}}\ g^{-1}\!\left((1-\beta)\,g(D[p:c])+\beta\,g(D[q:c])\right) = \arg\min_{c\in\mathcal{D}}\ (1-\beta)\,D_g[p:c]+\beta\,D_g[q:c]. $$
The generalized $\alpha$-skewed Bhattacharyya divergence [29] can also be considered with respect to a weighted mean $M_\alpha$:
$$ B_{M_\alpha}[p:q] := -\log\int_{\mathcal{X}} M_\alpha(p(x),q(x))\,\mathrm{d}\mu(x). $$
In particular, when $M_\alpha$ is a quasi-arithmetic weighted mean that is induced by a strictly continuous and monotone function $g$, we have
$$ B_{M^g_\alpha}[p:q] = -\log\int_{\mathcal{X}} g^{-1}\!\left((1-\alpha)\,g(p(x))+\alpha\,g(q(x))\right)\mathrm{d}\mu(x). $$
Because $\min(p,q)\leq M_\alpha(p,q)\leq\max(p,q)$, $\int_{\mathcal{X}}\min(p,q)\,\mathrm{d}\mu=1-D_{\mathrm{TV}}[p,q]$ and $\int_{\mathcal{X}}\max(p,q)\,\mathrm{d}\mu=1+D_{\mathrm{TV}}[p,q]$, we deduce that we have:
$$ -\log\left(1+D_{\mathrm{TV}}[p,q]\right) \leq B_{M_\alpha}[p:q] \leq -\log\left(1-D_{\mathrm{TV}}[p,q]\right). $$
The information radius of Sibson for $\alpha\neq 1$ may be interpreted as generalized scaled skewed Bhattacharyya divergences with respect to the power means $P_\alpha$, since we have $R_\alpha(p,q)=-\frac{\alpha}{\alpha-1}\,B_{P_\alpha}[p:q]$.
  5. Conclusions
To summarize, the ordinary Jensen-Shannon divergence has been defined in three equivalent ways in the literature:
$$ D_{\mathrm{JS}}[p:q] = \min_{c\in\mathcal{D}}\frac{1}{2}\left(D_{\mathrm{KL}}[p:c]+D_{\mathrm{KL}}[q:c]\right) \qquad (133) $$
$$ \phantom{D_{\mathrm{JS}}[p:q]} = \frac{1}{2}\left(D_{\mathrm{KL}}\!\left[p:\frac{p+q}{2}\right]+D_{\mathrm{KL}}\!\left[q:\frac{p+q}{2}\right]\right) \qquad (134) $$
$$ \phantom{D_{\mathrm{JS}}[p:q]} = h\!\left[\frac{p+q}{2}\right]-\frac{h[p]+h[q]}{2} \qquad (135) $$
The JSD Equation (133) was studied by Sibson in 1969 within the wider scope of information radius [1]: Sibson relied on the Rényi $\alpha$-divergences (relative Rényi $\alpha$-entropies [77]) and recovered the ordinary Jensen-Shannon divergence as a particular case of the $\alpha$-information radius when $\alpha=1$ and $n=2$ points. The $\alpha$-information radii are related to generalized Bhattacharyya distances with respect to power means and the total variation distance in the limit case of $\alpha=\infty$.
Lin [4] investigated the JSD Equation (134) in 1991 and its connection to the JSD defined in Equation (135). In Lin [4], the JSD is interpreted as the arithmetic symmetrization of the $K$-divergence [24]. Generalizations of the JSD based on Equation (134) were proposed in [23] using a generic mean instead of the arithmetic mean. One motivation was to obtain a closed-form formula for the geometric JSD between multivariate Gaussian distributions, which relies on the geometric mixture (see [30] for a use case of that formula in deep learning). Indeed, the ordinary JSD between Gaussians is not available in closed form (not analytic). However, the JSD between Cauchy distributions admits a closed-form formula [78], despite the calculation of a definite integral of a log-sum term. Instead of using an abstract mean to define a mid-distribution of two densities, one may also consider the mid-point of a geodesic linking these two densities (the arithmetic mean $\frac{p+q}{2}$ is interpreted as a geodesic midpoint). Recently, Li [79] investigated the transport Jensen-Shannon divergence as a symmetrization of the Kullback–Leibler divergence in the $L^2$-Wasserstein space. See Section 5.4 of [79] and the closed-form formula of Equation (18) obtained for the transport Jensen-Shannon divergence between two multivariate Gaussian distributions.
The generalization of the identity between the JSD of Equation (134) and the JSD of Equation (135) was studied while using a skewing vector in [18]. Although the JSD is an $f$-divergence [8,18], the Sibson-$M$ Jensen-Shannon symmetrization of a distance does not belong, in general, to the class of $f$-divergences. The variational JSD definition of Equation (133) is implicit, while the definitions of Equations (134) and (135) are explicit, because the unique optimal centroid $c^*=\frac{p+q}{2}$ has been plugged into the objective function that is minimized in Equation (133).
In this paper, we proposed a generalization of the Jensen-Shannon divergence based on the variational JSD definition of Equation (133): $D_{\mathrm{JS}}[p:q]=\min_{c\in\mathcal{D}}\frac{1}{2}\left(D_{\mathrm{KL}}[p:c]+D_{\mathrm{KL}}[q:c]\right)$. We introduced the Jensen-Shannon symmetrization of an arbitrary divergence $D$ by considering a generalization of the information radius with respect to an abstract weighted mean $M_\beta$: $D^{\mathrm{vJS}}_{M_\beta}[p:q]:=\min_{c\in\mathcal{D}} M_\beta\!\left(D[p:c],D[q:c]\right)$. Notice that, in the variational JSD, the mean $M_\beta$ is used for averaging divergence values, while the mean $M$ in the $(M,N)$ JSD is used to define generic statistical mixtures. We also considered relative variational JS symmetrizations when the centroid has to belong to a prescribed family of densities. For the case of an exponential family, we showed how to compute the relative centroid in closed form, thus extending the pioneering work of Sibson, who considered the relative normal centroid used to calculate the relative normal information radius. Figure 2 illustrates the three generalizations of the ordinary skewed Jensen-Shannon divergence. Notice that, in general, the $(M,N)$-JSDs and the variational JSDs are not $f$-divergences (except in the ordinary case).
In a similar vein, Chen et al. [80] considered the following minimax symmetrization of the scalar Bregman divergence [81]:
$$ B_f^{\mathrm{mm}}(\theta_1,\theta_2) := \min_{c}\max\left\{B_f(\theta_1:c),\,B_f(\theta_2:c)\right\}, $$
where $B_f$ denotes the scalar Bregman divergence induced by a strictly convex and smooth function $f$:
$$ B_f(\theta_1:\theta_2) := f(\theta_1)-f(\theta_2)-(\theta_1-\theta_2)\,f'(\theta_2). $$
They proved that the square root of this minimax symmetrization yields a metric distance in the scalar case, extended the definition to the vector case, and conjectured that the square-root metrization still holds in the multivariate case. In a sense, this definition geometrically highlights the notion of radius, since the minimax optimization amounts to finding the smallest enclosing ball [82] of the source distributions. The circumcenter, also called the Chebyshev center [83], is then the mid-distribution instead of the centroid for the information radius. The term "information radius" is well-suited to measure the distance between two points for an arbitrary distance $D$. Indeed, the JS-symmetrization of $D$ is defined by $D^{\mathrm{vJS}}[p:q]:=\min_c\frac{1}{2}\left(D[p:c]+D[q:c]\right)$. When $D$ is the Euclidean distance, we have $\min_c\frac{1}{2}\left(D(p,c)+D(q,c)\right)=\frac{1}{2}D(p,q)$, attained at the midpoint $c=\frac{p+q}{2}$ (i.e., the radius being half of the diameter $D(p,q)$). Thus, $D^{\mathrm{vJS}}(p,q)=\frac{1}{2}D(p,q)$; hence, the term chosen by Sibson [1] for $R_\alpha$: information radius. Besides providing another viewpoint, variational definitions of divergences have proven to be useful in practice (e.g., for estimation). For example, a variational definition of the Rényi divergence generalizing the Donsker–Varadhan variational formula of the KLD is given in [84], which is used to estimate the Rényi divergences.