On Hölder projective divergences

We describe a framework to build distances by measuring the tightness of inequalities, and introduce the notion of proper statistical divergences and improper pseudo-divergences. We then consider the Hölder ordinary and reverse inequalities, and present two novel classes of Hölder divergences and pseudo-divergences that both encapsulate the special case of the Cauchy-Schwarz divergence. We report closed-form formulas for those statistical dissimilarities when considering distributions belonging to the same exponential family provided that the natural parameter space is a cone (e.g., multivariate Gaussians), or affine (e.g., categorical distributions). Those new classes of Hölder distances are invariant to rescaling, and thus do not require distributions to be normalized. Finally, we show how to compute statistical Hölder centroids with respect to those divergences, and carry out center-based clustering toy experiments on a set of Gaussian distributions that demonstrate empirically that symmetrized Hölder divergences outperform the symmetric Cauchy-Schwarz divergence.


1.1 Statistical divergences from inequality gaps
An inequality is denoted mathematically by lhs ≤ rhs, where lhs and rhs denote respectively the left-hand side and right-hand side of the inequality. One can build dissimilarity measures from inequalities lhs ≤ rhs by measuring the inequality tightness. For example, we may quantify the tightness of an inequality by its difference gap: Δ = rhs − lhs ≥ 0. When lhs > 0, the inequality tightness can also be gauged by the log-ratio gap: Δ_log = log(rhs/lhs) = log rhs − log lhs ≥ 0. We may further compose this inequality tightness value measuring non-negative gaps with a strictly monotonically increasing function f (with f(0) = 0).
A bi-parametric inequality lhs(p, q) ≤ rhs(p, q) is called proper if it is strict for p ≠ q (i.e., lhs(p, q) < rhs(p, q), ∀p ≠ q) and tight if and only if (iff) p = q (i.e., lhs(p, q) = rhs(p, q) iff p = q). Thus a proper bi-parametric inequality allows one to define dissimilarities such that D(p, q) = 0 iff p = q. Such a dissimilarity is called proper; otherwise, an inequality or dissimilarity is said to be improper. Note that many equivalent terms are used in the literature instead of (dis-)similarity: distance (although often assumed to have metric properties), pseudo-distance, discrimination, proximity, information deviation, etc.
A statistical dissimilarity between two discrete or continuous distributions p(x) and q(x) on a support X can thus be defined from inequalities instantiated on the observation space X: setting D_x(p, q) = rhs(p(x), q(x)) − lhs(p(x), q(x)) for all x ∈ X, we sum up or take the integral:

D(p, q) = Σ_{x∈X} [rhs(p(x), q(x)) − lhs(p(x), q(x))]   (discrete case),
D(p, q) = ∫_X [rhs(p(x), q(x)) − lhs(p(x), q(x))] dx   (continuous case).   (4)

In such a case, we get a separable divergence. Some non-separable inequalities induce non-separable divergences. For example, the renowned Cauchy-Schwarz divergence [8] is not separable because in the Cauchy-Schwarz inequality ∫_X p(x)q(x)dx ≤ √(∫_X p(x)² dx) √(∫_X q(x)² dx), the rhs is not separable. Furthermore, a proper dissimilarity is called a divergence in information geometry [2] when it is C³ (i.e., three times differentiable, thus allowing one to define a metric tensor [34] and a cubic tensor [2]).
Many familiar distances can be reinterpreted as inequality gaps in disguise. For example, Bregman divergences [4] and Jensen divergences [28] (also called Burbea-Rao divergences [9,22]) can be reinterpreted as inequality difference gaps, and the Cauchy-Schwarz distance [8] as an inequality log-ratio gap.

Example 1 (Bregman divergence as a Bregman score-induced gap divergence). A proper score function [14] S(p : q) induces a gap divergence D(p : q) = S(p : q) − S(p : p) ≥ 0. A Bregman divergence [4] B_F(p : q) for a strictly convex and differentiable real-valued generator F(x) is induced by the Bregman score S_F(p : q) = −F(q) − ⟨p − q, ∇F(q)⟩, a proper score minimized for p = q. Then the Bregman divergence is a gap divergence: B_F(p : q) = S_F(p : q) − S_F(p : p) ≥ 0. When F is strictly convex, the Bregman score is proper, and so is the Bregman divergence.
Example 2 (Cauchy-Schwarz distance as a log-ratio gap divergence). Consider the Cauchy-Schwarz inequality ∫_X p(x)q(x)dx ≤ √(∫_X p(x)² dx) √(∫_X q(x)² dx). Then the Cauchy-Schwarz distance [8] between two continuous distributions is defined as the log-ratio gap: CS(p(x) : q(x)) = −log ( ∫_X p(x)q(x)dx / (√(∫_X p(x)² dx) √(∫_X q(x)² dx)) ). Note that we use the modern notation D(p(x) : q(x)) to emphasize that a divergence is potentially asymmetric: D(p(x) : q(x)) ≠ D(q(x) : p(x)), see [2]. In information theory [11], the older notation "||" is often used instead of the ":" used in information geometry [2].
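As a concrete numerical illustration (our own sketch, not from the paper; the function name is hypothetical), the Cauchy-Schwarz log-ratio gap can be computed for discrete densities:

```python
import numpy as np

def cs_divergence(p, q):
    """Cauchy-Schwarz divergence CS(p : q) as a log-ratio inequality gap.

    p, q: non-negative arrays representing discrete densities.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.log(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.3, 0.3, 0.4])
assert cs_divergence(p, p) < 1e-12                           # tight iff q = lambda * p
assert cs_divergence(p, q) > 0                               # positive gap otherwise
assert np.isclose(cs_divergence(p, q), cs_divergence(q, p))  # symmetric
```

The assertions check the three properties stated in the text: non-negativity, tightness under linear dependence, and symmetry.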
The Cauchy-Schwarz distance is a projective divergence (invariant to rescaling of its arguments). Another example of such a projective divergence is the parametric γ-divergence [13].
The γ-divergences have been proven useful for robust statistical inference [13] in the presence of heavy outlier contamination.

Pseudo-divergences and the axiom of indiscernibility
Consider a broader class of statistical pseudo-divergences based on improper inequalities, where the tightness of lhs(p, q) ≤ rhs(p, q) does not imply that p = q. This family of dissimilarity measures has interesting properties which have not been studied before.
Formally, statistical pseudo-divergences are defined with respect to density measures p(x) and q(x) with x ∈ X, where X denotes the support. By definition, pseudo-divergences satisfy the following fundamental properties: 1. Non-negativity: D(p(x) : q(x)) ≥ 0 for any p(x), q(x); 2. Reachable indiscernibility: as compared to statistical divergence measures such as the Kullback-Leibler (KL) divergence, pseudo-divergences do not require D(p(x) : p(x)) = 0; instead, any pair of distributions p(x) and q(x) with D(p(x) : q(x)) = 0 only has to be "positively correlated." Any proper divergence automatically satisfies this weaker condition, and therefore belongs to this broader class; the converse, however, is not true. As we shall describe in the remainder, the family of pseudo-divergences is not limited to proper divergence measures, and the term "pseudo-divergence" hereafter refers to such dissimilarities that are not proper divergence measures. We study two novel statistical dissimilarity families: one family of statistical improper pseudo-divergences and one family of proper statistical divergences. Within the class of pseudo-divergences, this work concentrates on a one-parameter family of dissimilarities called the Hölder log-ratio gap divergence, concisely abbreviated as HPD for "Hölder pseudo-divergence" in the remainder. We also study its proper divergence counterpart, termed HD for "Hölder divergence."

Prior work and contributions
The term "Hölder divergence" was first coined in 2014 based on the definition of the Hölder score [20,19]: the score-induced Hölder divergence D(p(x) : q(x)) is a proper gap divergence that is scale-invariant [20,19]. Let p_{a,σ}(x) = aσ p(σx) for a, σ > 0 be a rescaling transformation. Then a scale-invariant divergence [20,19] satisfies D(p_{a,σ}(x) : q_{a,σ}(x)) = κ(a, σ) D(p(x) : q(x)) for some function κ(a, σ) > 0. This gap divergence is proper since it is based on the so-called Hölder score [20,19], but it is not projective and does not include the Cauchy-Schwarz divergence. Due to these differences, the Hölder log-ratio gap divergence introduced here shall not be confused with the Hölder gap divergence induced by the Hölder score [19,20], which relies on both a scalar γ and a function φ(·).
We shall introduce two novel families of log-ratio projective gap divergences based on Hölder ordinary (or forward) and reverse inequalities that extend the Cauchy-Schwarz divergence, study their properties, and consider as an application clustering Gaussian distributions: We experimentally show better clustering results when using symmetrized Hölder divergences than using the Cauchy-Schwarz divergence. To contrast with the "Hölder composite score-induced divergences" of [19], our Hölder divergences admit closed-form expressions between distributions belonging to the same exponential families [23] provided that the natural parameter space is a cone or affine.
Our main contributions are summarized as follows: • Define the uni-parametric family of Hölder improper pseudo-divergences (HPDs) in §2 and the bi-parametric family of Hölder proper divergences (HDs) in §3 for positive and probability measures, and study their properties (including their relationships with skewed Bhattacharyya distances [22] via escort distributions); • Report closed-form expressions of those divergences for exponential families when the natural parameter space is a cone or affine (including but not limited to the cases of categorical distributions and multivariate Gaussian distributions) in §4; • Provide approximation techniques to compute those divergences between mixtures based on log-sum-exp inequalities in §4.6; • Describe a variational center-based clustering technique based on the convex-concave procedure for computing Hölder centroids, and report our experimental results in §5.

Organization
This paper is organized as follows: §2 introduces the definition and properties of Hölder pseudo-divergences (HPDs). It is followed by §3 that describes Hölder proper divergences (HDs). In §4, closed-form expressions for those novel families of divergences are reported for the categorical, multivariate Gaussian, Bernoulli, Laplace and Wishart distributions. §5 defines Hölder statistical centroids and presents a variational k-means clustering technique: We show experimentally that using Hölder divergences improves over the Cauchy-Schwarz divergence. Finally, §6 concludes this work and hints at further perspectives from the viewpoint of statistical estimation and manifold learning. In Appendix A, we recall the proof of the ordinary and reverse Hölder inequalities.

Hölder pseudo-divergence: Definition and properties
Hölder's inequality (see [18] and Appendix A for a proof) states that for positive real-valued functions p(x) and q(x) defined on the support X:

∫_X p(x)q(x) dx ≤ (∫_X p(x)^α dx)^{1/α} (∫_X q(x)^β dx)^{1/β},

where the exponents α and β satisfy αβ > 0 as well as the exponent conjugacy condition 1/α + 1/β = 1. We also write β = ᾱ = α/(α−1), meaning that α and β are conjugate Hölder exponents; one checks that α > 1 and β > 1. Hölder's inequality holds even if the lhs is infinite (meaning that the integral diverges), since the rhs is then also infinite. Measuring the log-ratio gap of this inequality yields the Hölder pseudo-divergence (HPD):

D^H_α(p(x) : q(x)) = −log ( ∫_X p(x)q(x) dx / ((∫_X p(x)^α dx)^{1/α} (∫_X q(x)^β dx)^{1/β}) ).
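A minimal numerical sketch (ours; the function name is hypothetical) of the Hölder log-ratio gap for discrete densities, checking Hölder's inequality and the Cauchy-Schwarz special case α = β = 2:

```python
import numpy as np

def holder_pseudo_divergence(p, q, alpha):
    """Hölder pseudo-divergence: log-ratio gap of Hölder's inequality (discrete)."""
    assert alpha > 1.0
    beta = alpha / (alpha - 1.0)  # conjugate exponent: 1/alpha + 1/beta = 1
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    lhs = np.sum(p * q)
    rhs = np.sum(p ** alpha) ** (1 / alpha) * np.sum(q ** beta) ** (1 / beta)
    return -np.log(lhs / rhs)

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.1, 0.5])
assert holder_pseudo_divergence(p, q, 3.0) >= 0.0  # Hölder's inequality
# alpha = beta = 2 recovers the Cauchy-Schwarz divergence:
cs = -np.log(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))
assert np.isclose(holder_pseudo_divergence(p, q, 2.0), cs)
```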
When α = β = 2, the HPD becomes the Cauchy-Schwarz divergence CS [16], which has proved useful to get closed-form divergence formulas between mixtures of exponential families with conic or affine natural parameter spaces [21]. The Cauchy-Schwarz divergence is proper for probability densities since the Cauchy-Schwarz inequality becomes an equality iff q(x) = λp(x)^{α−1} = λp(x) (for α = 2), and then λ = ∫_X λp(x)dx = ∫_X q(x)dx = 1 for probability densities. It is however not proper for positive densities.
Fact 1 (CS is only proper for probability densities). The Cauchy-Schwarz divergence CS(p(x) : q(x)) is proper for square-integrable probability densities p(x), q(x) ∈ L²(X, µ) but not proper for positive square-integrable densities.

Properness and improperness
In the general case, when α ≠ 2, the divergence D^H_α is not even proper for normalized (probability) densities, not to mention general unnormalized (positive) densities. Indeed, when p(x) = q(x), Hölder's inequality is in general not tight, so that D^H_α(p(x) : p(x)) > 0. Let us consider the general case: for unnormalized positive distributions p̃(x) and q̃(x) (the tilde notation stems from the notation of homogeneous coordinates in projective geometry), the inequality becomes an equality when q̃(x)^β ∝ p̃(x)^α, i.e., when q̃(x) ∝ p̃(x)^{α−1}. Fact 2 (HPD is improper). The Hölder pseudo-divergences are improper statistical distances.
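A quick numeric check (ours) that D^H_α(p : p) > 0 for a probability density when α ≠ 2, while the divergence vanishes on the equality case q ∝ p^{α−1}:

```python
import numpy as np

def hpd(p, q, alpha):
    """Hölder pseudo-divergence (discrete), log-ratio gap of Hölder's inequality."""
    beta = alpha / (alpha - 1.0)
    lhs = np.sum(p * q)
    rhs = np.sum(p ** alpha) ** (1 / alpha) * np.sum(q ** beta) ** (1 / beta)
    return -np.log(lhs / rhs)

p = np.array([0.2, 0.3, 0.5])      # probability density
alpha = 3.0
assert hpd(p, p, alpha) > 1e-3     # improper: D(p : p) > 0 for alpha != 2
q = p ** (alpha - 1.0)             # equality case: q proportional to p^(alpha - 1)
assert hpd(p, q, alpha) < 1e-12
```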
An arithmetic symmetrization of the HPD yields a symmetric HPD S^H_α, given by: S^H_α(p(x) : q(x)) = (1/2) ( D^H_α(p(x) : q(x)) + D^H_α(q(x) : p(x)) ).

HPD is a projective divergence
In the above definition, densities p(x) and q(x) can either be positive or normalized probability distributions. Let p̃(x) and q̃(x) denote positive (not necessarily normalized) measures, and w(p̃) = ∫_X p̃(x)dx the overall mass, so that p(x) = p̃(x)/w(p̃) is the corresponding normalized probability measure. Then we check that HPD is a projective divergence [13] since D^H_α(p̃(x) : q̃(x)) = D^H_α(p(x) : q(x)), or in general D^H_α(λp̃(x) : λ′q̃(x)) = D^H_α(p̃(x) : q̃(x)) for all prescribed constants λ, λ′ > 0. Projective divergences may also be called "angular divergences" or "cosine divergences" since they do not depend on the total mass of the measure densities.
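The projectivity property can be checked numerically (our sketch): rescaling either argument by a positive constant leaves the HPD unchanged.

```python
import numpy as np

def hpd(p, q, alpha):
    beta = alpha / (alpha - 1.0)
    lhs = np.sum(p * q)
    rhs = np.sum(p ** alpha) ** (1 / alpha) * np.sum(q ** beta) ** (1 / beta)
    return -np.log(lhs / rhs)

rng = np.random.default_rng(0)
p_t = rng.random(5)   # unnormalized positive "densities"
q_t = rng.random(5)
# Projectivity: D(lambda * p : lambda' * q) = D(p : q) for lambda, lambda' > 0.
assert np.isclose(hpd(p_t, q_t, 3.0), hpd(2.7 * p_t, 0.4 * q_t, 3.0))
```

The invariance follows since the numerator scales by λλ′ and the two norms in the denominator scale by λ and λ′ respectively.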

Escort distributions and skew Bhattacharyya divergences
Let us define, with respect to the probability measures p(x) ∈ L^{1/α}(X, µ) and q(x) ∈ L^{1/β}(X, µ), the following escort probability distributions [2]:

p^E_α(x) = p(x)^{1/α} / ∫_X p(x)^{1/α} dx   and   q^E_β(x) = q(x)^{1/β} / ∫_X q(x)^{1/β} dx.

Since HPD is a projective divergence, we compute with respect to the conjugate exponents α and β the Hölder Escort Divergence (HED):

D^{HE}_α(p(x) : q(x)) = D^H_α(p^E_α(x) : q^E_β(x)) = −log ∫_X p(x)^{1/α} q(x)^{1/β} dx,

which turns out to be the familiar skew Bhattacharyya divergence B_{1/α}(p(x) : q(x)), see [22].
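This identity is easy to verify numerically (our sketch, with hypothetical helper names): applying the HPD to the escort distributions recovers the skew Bhattacharyya divergence with skew parameter 1/α.

```python
import numpy as np

def hpd(p, q, alpha):
    beta = alpha / (alpha - 1.0)
    lhs = np.sum(p * q)
    rhs = np.sum(p ** alpha) ** (1 / alpha) * np.sum(q ** beta) ** (1 / beta)
    return -np.log(lhs / rhs)

def skew_bhattacharyya(p, q, lam):
    return -np.log(np.sum(p ** lam * q ** (1.0 - lam)))

alpha = 3.0
beta = alpha / (alpha - 1.0)
p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.1, 0.5])
# Escort distributions with exponents 1/alpha and 1/beta:
pe = p ** (1 / alpha); pe /= pe.sum()
qe = q ** (1 / beta); qe /= qe.sum()
assert np.isclose(hpd(pe, qe, alpha), skew_bhattacharyya(p, q, 1 / alpha))
```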

Definition
Let p(x) and q(x) be positive measures in L^γ(X, µ) for a prescribed scalar value γ > 0. Plugging the positive measures p(x)^{γ/α} and q(x)^{γ/β} into the definition of HPD D^H_α, we get the following definition:

Definition 2 (Proper Hölder Divergence, HD). For conjugate exponents α, β > 0 and γ > 0, the proper Hölder divergence between two densities p(x) and q(x) is defined by:

D^H_{α,γ}(p(x) : q(x)) = −log ( ∫_X p(x)^{γ/α} q(x)^{γ/β} dx / ((∫_X p(x)^γ dx)^{1/α} (∫_X q(x)^γ dx)^{1/β}) ).

By definition, D^H_{α,γ}(p : q) is a two-parameter family of statistical dissimilarity measures. Following Hölder's inequality, we can check that D^H_{α,γ}(p : q) ≥ 0, with equality iff p(x) ∝ q(x), i.e., iff p(x) = q(x) for probability measures. This says that HD is a proper divergence for probability measures, and it becomes a pseudo-divergence for positive measures. Note that we have abused the notation D^H to denote both the Hölder pseudo-divergence (with one subscript) and the Hölder divergence (with two subscripts).
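A numeric check (ours) of the properness of Definition 2 for probability measures, under the discrete analogue of the formula:

```python
import numpy as np

def hd(p, q, alpha, gamma):
    """Proper Hölder divergence D_{alpha,gamma}^H (discrete sketch, our notation)."""
    beta = alpha / (alpha - 1.0)
    num = np.sum(p ** (gamma / alpha) * q ** (gamma / beta))
    den = np.sum(p ** gamma) ** (1 / alpha) * np.sum(q ** gamma) ** (1 / beta)
    return -np.log(num / den)

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.1, 0.5])
assert hd(p, p, alpha=3.0, gamma=1.5) < 1e-12  # proper: D(p : p) = 0
assert hd(p, q, alpha=3.0, gamma=1.5) > 0.0    # and positive otherwise
```

Note that for q = p the numerator equals ∫p^γ while the denominator equals (∫p^γ)^{1/α+1/β} = ∫p^γ, so the ratio is 1 and the divergence vanishes.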
Similar to HPD, HD is asymmetric when α ≠ β, with the reference duality D^H_{α,γ}(p(x) : q(x)) = D^H_{β,γ}(q(x) : p(x)). HD can be symmetrized as S^H_{α,γ}(p(x) : q(x)) = (1/2)(D^H_{α,γ}(p(x) : q(x)) + D^H_{α,γ}(q(x) : p(x))). Furthermore, one can easily check that HD is a projective divergence. For conjugate exponents α, β > 0 and γ > 0, we can rewrite the definition of HD using the escort distributions p^E_γ(x) = p(x)^γ/∫_X p(x)^γ dx and q^E_γ(x) = q(x)^γ/∫_X q(x)^γ dx as D^H_{α,γ}(p(x) : q(x)) = −log ∫_X p^E_γ(x)^{1/α} q^E_γ(x)^{1/β} dx. Therefore HD can be reinterpreted as the skew Bhattacharyya divergence [22] between the escort distributions: D^H_{α,γ}(p(x) : q(x)) = B_{1/α}(p^E_γ(x) : q^E_γ(x)). In particular, when γ = 1, we get D^H_{α,1}(p(x) : q(x)) = B_{1/α}(p(x) : q(x)) for probability densities. Fact 6. The two-parametric family of statistical Hölder divergences D^H_{α,γ} passes through the one-parametric family of skew Bhattacharyya divergences when γ = 1.

Special case: The Cauchy-Schwarz divergence
We consider the intersection of the uni-parametric class of Hölder pseudo-divergences (HPD) with the bi-parametric class of proper Hölder divergences (HD): That is, the class of divergences which belong to both HPD and HD. Then we must have γ/α = γ/β = 1. Since 1/α + 1/β = 1, we get α = β = γ = 2. Therefore the Cauchy-Schwarz (CS) divergence is the unique divergence belonging to both HPD and HD classes. In fact, the CS divergence is the intersection of the four classes HPD, symmetric HPD, HD, and symmetric HD. Figure 1 displays a diagram of those divergence classes with their inclusion relationships. Figure 1: Hölder proper divergence (bi-parametric) and Hölder improper pseudo-divergence (uni-parametric) intersect at the unique non-parametric Cauchy-Schwarz divergence. By using escort distributions, Hölder divergences encapsulate the skew Bhattacharyya distances.
As stated earlier, notice that the Cauchy-Schwarz inequality ∫_X p(x)q(x)dx ≤ √(∫_X p(x)²dx) √(∫_X q(x)²dx) is not proper, as it becomes an equality when p(x) and q(x) are linearly dependent (i.e., p(x) = λq(x) for some λ > 0). The arguments of the CS divergence are square-integrable real-valued density functions p(x), q(x) ∈ L²(X, µ). Thus the Cauchy-Schwarz divergence is not proper for positive measures but is proper for normalized probability distributions, since ∫p(x)dx = λ∫q(x)dx = 1 implies λ = 1.

Limit cases of Hölder divergences and statistical estimation
Let us define the inner product of unnormalized densities as ⟨p̃, q̃⟩ = ∫_X p̃(x)q̃(x)dx (for L²(X, µ) integrable functions), and the L^α-norm of densities as ‖p̃‖_α = (∫_X p̃(x)^α dx)^{1/α} for α ≥ 1. Then the CS divergence can be concisely written as CS(p̃ : q̃) = −log ( ⟨p̃, q̃⟩ / (‖p̃‖₂ ‖q̃‖₂) ), and the Hölder pseudo-divergence writes as D^H_α(p̃ : q̃) = −log ( ⟨p̃, q̃⟩ / (‖p̃‖_α ‖q̃‖_ᾱ) ). When α → 1⁺, we have ᾱ = α/(α−1) → +∞, and it follows that D^H_α(p̃ : q̃) → −log ( ⟨p̃, q̃⟩ / (‖p̃‖₁ ‖q̃‖_∞) ). When α → +∞ and ᾱ → 1⁺, we have D^H_α(p̃ : q̃) → −log ( ⟨p̃, q̃⟩ / (‖p̃‖_∞ ‖q̃‖₁) ). Now consider a pair of probability densities p(x) and q(x), so that ‖p‖₁ = ‖q‖₁ = 1. In an estimation scenario, p(x) is fixed and q(x | θ) = q_θ(x) is free along a parametric manifold M; minimizing the Hölder divergence then reduces to min_θ −log⟨p, q_θ⟩ + log‖q_θ‖_ᾱ. Therefore when α varies from 1 to +∞, only the regularizer log‖q_θ‖_ᾱ in the minimization problem changes. In any case, the Hölder divergence always contains the term −log⟨p(x), q(x)⟩, which shares a similar form with the Bhattacharyya distance [6]: B(p : q) = −log ∫_X √(p(x)q(x)) dx. HPD between p̃(x) and q̃(x) is also closely related to their cosine similarity ⟨p̃, q̃⟩/(‖p̃‖₂‖q̃‖₂): when α = 2, HPD is exactly this cosine similarity after a non-linear (−log) transformation.

Closed-form expressions of HPD and HD for conic and affine exponential families
We report closed-form formulas for the HPD and HD between two distributions belonging to the same exponential family provided that the natural parameter space is a cone or affine. A cone Ω is a convex domain such that for P, Q ∈ Ω and any λ > 0, we have P + λQ ∈ Ω. For example, the set of positive measures absolutely continuous with respect to a base measure µ is a cone. Recall that an exponential family [23] has a density function p(x; θ) that can be written canonically as p(x; θ) = exp(⟨t(x), θ⟩ − F(θ) + k(x)), where t(x) denotes the sufficient statistic. In this work, we consider the auxiliary carrier measure term k(x) = 0. The base measure is either the Lebesgue measure µ or the counting measure µ_C. A Conic or Affine Exponential Family (CAEF) is an exponential family whose natural parameter space Θ is a cone or affine. The log-normalizer F(θ) is a strictly convex function, also called the cumulant generating function [2].
Lemma 1 (HPD and HD for CAEFs). For distributions p(x; θ_p) and p(x; θ_q) belonging to the same exponential family with conic or affine natural parameter space [21], both the HPD and HD are available in closed form:

D^H_α(p(x; θ_p) : p(x; θ_q)) = (1/α) F(αθ_p) + (1/β) F(βθ_q) − F(θ_p + θ_q),
D^H_{α,γ}(p(x; θ_p) : p(x; θ_q)) = (1/α) F(γθ_p) + (1/β) F(γθ_q) − F((γ/α) θ_p + (γ/β) θ_q).

Proof. Consider k(x) = 0 and a conic or affine natural parameter space Θ (see [21]). Then for all a > 0, we have aθ_p ∈ Θ, and indeed:

∫_X p(x; θ_p)^a dx = ∫_X exp(a⟨t(x), θ_p⟩ − aF(θ_p)) dx = exp(F(aθ_p) − aF(θ_p)).

Similarly, we have for all a, b > 0, since aθ_p + bθ_q ∈ Θ:

∫_X p(x; θ_p)^a p(x; θ_q)^b dx = exp(F(aθ_p + bθ_q) − aF(θ_p) − bF(θ_q)).

Plugging these expressions into the definitions of HPD and HD yields the claimed formulas. When 0 < α < 1, we have β = α/(α−1) < 0. To get similar results for the reverse Hölder divergence, we need the natural parameter space to be affine (e.g., isotropic Gaussians or multinomials, see [27]).
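As a numerical sanity check (ours), the closed form (1/α)F(αθ_p) + (1/β)F(βθ_q) − F(θ_p + θ_q) can be compared against a direct HPD computation for the categorical family, whose natural parameter space is affine:

```python
import numpy as np

def F_cat(theta):
    """Log-normalizer of the categorical exponential family (affine natural space)."""
    return np.log(1.0 + np.sum(np.exp(theta)))

def theta_to_p(theta):
    e = np.concatenate(([1.0], np.exp(theta)))  # first coordinate is the redundant p_0
    return e / e.sum()

def hpd(p, q, alpha):
    beta = alpha / (alpha - 1.0)
    return -np.log(np.sum(p * q) /
                   (np.sum(p ** alpha) ** (1 / alpha) * np.sum(q ** beta) ** (1 / beta)))

alpha = 3.0
beta = alpha / (alpha - 1.0)
theta_p = np.array([0.5, -1.0])
theta_q = np.array([-0.3, 0.8])
closed_form = (F_cat(alpha * theta_p) / alpha + F_cat(beta * theta_q) / beta
               - F_cat(theta_p + theta_q))
direct = hpd(theta_to_p(theta_p), theta_to_p(theta_q), alpha)
assert np.isclose(closed_form, direct)
```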
Note that the Hölder score-induced divergence [19] does not admit in general closed-form formulas for exponential families since it relies on a function φ(·) (see Definition 4 of [19]).
Note that CAEF convex log-normalizers satisfy (1/α)F(αθ_p) + (1/β)F(βθ_q) ≥ F(θ_p + θ_q). A necessary condition is that F(λθ) ≥ λF(θ) for λ > 1 (take θ_p = θ, θ_q = 0 and F(0) = 0 in the above inequality). The escort distribution of an exponential family density stays within the family, since p(x; θ)^{1/α} ∝ exp(⟨t(x), θ/α⟩). The Hölder equality holds when p(x)^α ∝ q(x)^β, or p(x)^α q(x)^{−β} ∝ 1. For exponential families, this condition is satisfied when αθ_p − βθ_q ∈ Θ. Thus we may choose a small enough α = 1 + ε > 1 so that the condition is not satisfied for fixed θ_p and θ_q for many exponential distributions. Since multinomials have an affine natural space [27], this condition is always met there, but not for non-affine natural parameter spaces like that of normal distributions.
Notice the following fact: Fact 7 (Density of a CAEF in L γ (X , µ)). The density of exponential families with conic or affine natural parameter space belongs to L γ (X , µ) for any γ > 0.
In practice, even when the log-normalizer is computationally intractable, we may still estimate the HD by Monte Carlo techniques: Indeed, we can sample a distribution p̃(x) either by rejection sampling [29] or by the Markov Chain Monte Carlo (MCMC) Metropolis-Hastings technique: It only requires being able to sample a proposal distribution with the same support.
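A Monte Carlo sketch (ours, not the paper's procedure): we estimate the CS divergence (the α = 2 Hölder divergence) between two Gaussians by importance sampling under a wide proposal with the same support, and compare with the Gaussian closed form for the cross integral.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_p, s_p, mu_q, s_q = 0.0, 1.0, 1.0, 1.5

def pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# Importance-sample the three integrals of the CS divergence under proposal r:
N = 2_000_000
xr = rng.normal(0.5, 3.0, N)      # proposal N(0.5, 3^2), support = R
r = pdf(xr, 0.5, 3.0)
p, q = pdf(xr, mu_p, s_p), pdf(xr, mu_q, s_q)
est = -np.log(np.mean(p * q / r) /
              np.sqrt(np.mean(p ** 2 / r) * np.mean(q ** 2 / r)))

# Exact value using the Gaussian identity: int N1 N2 dx = N(mu1 - mu2; 0, s1^2 + s2^2)
def cross(mu1, s1, mu2, s2):
    v = s1 ** 2 + s2 ** 2
    return np.exp(-0.5 * (mu1 - mu2) ** 2 / v) / np.sqrt(2 * np.pi * v)

exact = -np.log(cross(mu_p, s_p, mu_q, s_q) /
                np.sqrt(cross(mu_p, s_p, mu_p, s_p) * cross(mu_q, s_q, mu_q, s_q)))
assert abs(est - exact) < 2e-2
```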
We shall now instantiate the HPD and HD formulas for several exponential families with conic or affine natural parameter spaces.

Case study: Categorical distributions
Let p = (p_0, ..., p_m) and q = (q_0, ..., q_m) be two categorical distributions in the m-dimensional probability simplex Δ_m. We rewrite p in the canonical form of exponential families [23] with natural parameter θ = (θ_1, ..., θ_m), θ_i = log(p_i/p_0), treating p_0 = 1 − Σ_{i=1}^m p_i as the redundant parameter. The convex cumulant generating function has the form F(θ) = log(1 + Σ_{i=1}^m exp(θ_i)), and the inverse transformation from p to θ is therefore given by θ_i = log(p_i/p_0). The natural parameter space Θ = R^m is affine (hence conic), and by applying Lemma 1, we get closed-form formulas for the Hölder divergences between categorical distributions. To get some intuition, Fig. 2 shows the Hölder divergence from any categorical distribution (p_0, p_1, p_2) to a given reference distribution p_r in the 2D probability simplex Δ_2. A main observation is that the Kullback-Leibler (KL) divergence exhibits a barrier with large values near the boundary ∂Δ_2. This is not the case for Hölder divergences: D^H_α(p : p_r) does not have a sharp increase near the boundary (although it still penalizes the corners of Δ_2). For example, let p = (0, 1/2, 1/2) and p_r = (1/3, 1/3, 1/3); then KL(p_r : p) → ∞ but D^H_2(p_r : p) = 2/3. Another observation is that the minimum of D(p : p_r) can be reached at some point p ≠ p_r (see for example D^H_4(p : p_r) in Fig. 2b: the bluest area corresponding to the minimum of D(p : p_r) is not at the location of the reference point).
Consider a HPD ball of center c and prescribed radius r wrt the HPD. Since c(x)^{α−1} for α ≠ 2 does not belong to the probability manifold but to the positive measure manifold, and since the distance is projective, we deduce that the displaced center c′ of a ball with center c lying on the probability manifold can be computed as the intersection of the ray 0c^{α−1} with the probability manifold. For the discrete probability simplex Δ, since we require λ Σ_{x∈X} c(x)^{α−1} = 1, we deduce that the displaced ball center is at c′(x) = c(x)^{α−1} / Σ_{x′∈X} c(x′)^{α−1}. This center is displayed as "•" in Figure 2.
In general, the HPD bisector [26] between two distributions p(x; θ_1) and p(x; θ_2) belonging to the same CAEF is defined as the locus of parameters at equal divergence from both: {θ ∈ Θ : D^H_α(p(x; θ_1) : p(x; θ)) = D^H_α(p(x; θ_2) : p(x; θ))}.

Case study: Bernoulli distribution
The Bernoulli distribution is a special case of the categorical distribution with two categories (i.e., m = 1). To be consistent with the previous section, we rewrite a Bernoulli distribution p = (p_0, p_1) in the canonical form with scalar natural parameter θ_p = log(p_1/p_0), so that p_1 = exp(θ_p)/(1 + exp(θ_p)). The cumulant generating function then becomes F(θ_p) = log(1 + exp(θ_p)), and the closed-form Hölder divergences follow by Lemma 1.

Case study: MultiVariate Normal distributions (MVNs)
Let us now instantiate the formulas for multivariate normals (Gaussian distributions). Using the usual parameters (µ, Σ), the log-normalizer reads [25]: F(µ, Σ) = (1/2) µ^⊤ Σ^{−1} µ + (1/2) log((2π)^d |Σ|). Applying Lemma 1 then yields closed-form formulas for the HPD and HD between p ∼ N(µ_p, Σ_p) and q ∼ N(µ_q, Σ_q). Figure 3 shows HPD and HD for univariate Gaussian distributions as compared to the KL divergence.
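For the univariate case, the closed form of Lemma 1 can be checked numerically (our sketch) using the natural coordinates θ = (µ/σ², −1/(2σ²)) and the corresponding log-normalizer F(θ) = −θ₁²/(4θ₂) + (1/2)log(−π/θ₂):

```python
import numpy as np

def F_gauss(theta1, theta2):
    """Log-normalizer of the univariate Gaussian in natural coordinates
    theta1 = mu / sigma^2, theta2 = -1 / (2 sigma^2)."""
    return -theta1 ** 2 / (4.0 * theta2) + 0.5 * np.log(-np.pi / theta2)

def natural(mu, sigma):
    return np.array([mu / sigma ** 2, -0.5 / sigma ** 2])

def hpd_closed_form(tp, tq, alpha):
    beta = alpha / (alpha - 1.0)
    return (F_gauss(*(alpha * tp)) / alpha + F_gauss(*(beta * tq)) / beta
            - F_gauss(*(tp + tq)))

mu_p, s_p, mu_q, s_q, alpha = 0.0, 1.0, 1.5, 2.0, 3.0
beta = alpha / (alpha - 1.0)
# Direct numerical integration of the HPD on a fine grid:
x = np.linspace(-30, 30, 400001)
dx = x[1] - x[0]
p = np.exp(-0.5 * ((x - mu_p) / s_p) ** 2) / (s_p * np.sqrt(2 * np.pi))
q = np.exp(-0.5 * ((x - mu_q) / s_q) ** 2) / (s_q * np.sqrt(2 * np.pi))
direct = -np.log(np.sum(p * q) * dx /
                 ((np.sum(p ** alpha) * dx) ** (1 / alpha) *
                  (np.sum(q ** beta) * dx) ** (1 / beta)))
assert np.isclose(hpd_closed_form(natural(mu_p, s_p), natural(mu_q, s_q), alpha),
                  direct, atol=1e-4)
```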

Case study: Wishart distribution
The Wishart distribution is defined on the d × d positive definite cone with density p(X; n, S_0) ∝ |X|^{(n−d−1)/2} exp(−(1/2) tr(S_0^{−1} X)), where n > d − 1 is the number of degrees of freedom and S_0 is a scale matrix. Rewriting it into canonical form, we can read off the natural parameter θ = (θ_1, θ_2) (paired with the sufficient statistics (X, log|X|)) and the log-normalizer F(θ) from the Wishart normalizing constant. The resulting D^H_α(p(x) : q(x)) and D^H_{α,γ}(p(x) : q(x)) are straightforward from the expression of F(θ) and Lemma 1. We omit these tedious expressions for brevity.

Approximating Hölder projective divergences for statistical mixtures
Given two finite mixture models m(x) = Σ_{i=1}^k w_i p_i(x) and m′(x) = Σ_{j=1}^{k′} w′_j p′_j(x), we derive analytic bounds on their Hölder divergences. When only an approximation is needed, one may compute Hölder divergences by Monte Carlo stochastic sampling.
We rewrite the Hölder divergence into the form: D^H_α(m(x) : m′(x)) = −log ∫_X m(x)m′(x)dx + (1/α) log ∫_X m(x)^α dx + (1/β) log ∫_X m′(x)^β dx. To compute the first term, we observe that a product of mixtures is also an (unnormalized) mixture: m(x)m′(x) = Σ_{i=1}^k Σ_{j=1}^{k′} w_i w′_j p_i(x) p′_j(x), which can be computed in O(kk′) time.
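For Gaussian mixtures, the O(kk′) cross term is exact since each component product integrates in closed form (our sketch; the helper names are hypothetical):

```python
import numpy as np

def gauss_cross_integral(mu1, s1, mu2, s2):
    """Closed form for int N(x; mu1, s1^2) N(x; mu2, s2^2) dx."""
    v = s1 ** 2 + s2 ** 2
    return np.exp(-0.5 * (mu1 - mu2) ** 2 / v) / np.sqrt(2 * np.pi * v)

# Two Gaussian mixtures m and m' given by (weights, means, stddevs):
w,  mu,  s  = np.array([0.4, 0.6]), np.array([-1.0, 2.0]), np.array([1.0, 0.5])
w2, mu2, s2 = np.array([0.7, 0.3]), np.array([0.0, 3.0]),  np.array([2.0, 1.0])

# O(k k') exact computation of the first term int m(x) m'(x) dx:
term = sum(w[i] * w2[j] * gauss_cross_integral(mu[i], s[i], mu2[j], s2[j])
           for i in range(len(w)) for j in range(len(w2)))

# Numerical cross-check on a fine grid:
x = np.linspace(-20.0, 20.0, 200001)
dx = x[1] - x[0]
def mix(x, w, mu, s):
    return sum(wi * np.exp(-0.5 * ((x - m_) / s_) ** 2) / (s_ * np.sqrt(2 * np.pi))
               for wi, m_, s_ in zip(w, mu, s))
assert np.isclose(term, dx * np.sum(mix(x, w, mu, s) * mix(x, w2, mu2, s2)), atol=1e-6)
```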
The second and third terms in the decomposition are not straightforward to calculate and shall be bounded. Based on computational geometry, we adopt the log-sum-exp bounding technique of [30] and divide the support X into L pieces of elementary intervals X = ∪_{l=1}^L I_l. In each interval I_l, we identify the indices of the unique dominating component and of the dominated component of the mixture. These yield lower and upper bounds whose terms can all be computed exactly, by noticing that when αθ ∈ Θ, where Θ denotes the natural parameter space, the integral ∫_{I_l} p(x; αθ_i)dx converges; see [30] for further details.
Then the bounds of ∫_X m(x)^α dx can be obtained by summing the bounds over all elementary intervals. Thus D^H_α(m(x) : m′(x)) can be both lower and upper bounded.

Hölder centroids
We study center-based clustering of a finite set of distributions: Given a list of distributions belonging to the same conic exponential family with natural parameters {θ_1, ..., θ_n} and associated positive weights {w_1, ..., w_n} with Σ_{i=1}^n w_i = 1, consider their centroids based on HPD and HD:

C_α = arg min_θ Σ_{i=1}^n w_i D^H_α(p(x; θ_i) : p(x; θ)),   C_{α,γ} = arg min_θ Σ_{i=1}^n w_i D^H_{α,γ}(p(x; θ_i) : p(x; θ)).

By abuse of notation, C denotes both the HPD centroid and the HD centroid. When the context is clear, the parameters in parentheses are omitted, so that these centroids are simply denoted C_α and C_{α,γ}. Both are defined as right-sided centroids; the corresponding left-sided centroids are obtained through reference duality, i.e., by swapping the divergence arguments (equivalently, exchanging α and β). By Lemma 1, these centroids can be obtained for distributions belonging to the same exponential family as the minimizers of the energies (dropping terms independent of θ): E_α(θ) = Σ_i w_i ((1/β)F(βθ) − F(θ_i + θ)) and E_{α,γ}(θ) = Σ_i w_i ((1/β)F(γθ) − F((γ/α)θ_i + (γ/β)θ)). Setting γ = α, we get E_{α,α}(θ) = E_α((α−1)θ), meaning that the HPD centroid is just a special case of the HD centroid up to a scaling transformation in the natural parameter space. Setting γ = β, we similarly recover the HPD centroid of the rescaled natural parameters θ_i/(α−1). Let us consider the general HD centroid C_{α,γ}. Since F is convex, the minimization energy is a difference of convex functions. We can therefore use the concave-convex procedure (CCCP) [22] that optimizes Differences of Convex Programs (DCPs): We start with C⁰_{α,γ} = Σ_{i=1}^n w_i θ_i (the barycenter, belonging to Θ) and then update

θ^{t+1} = (1/γ) (∇F)^{−1} ( Σ_{i=1}^n w_i ∇F((γ/α)θ_i + (γ/β)θ^t) )

for t = 0, 1, ... until convergence. This can be done by noting that η = ∇F(θ) are the dual parameters, also known as the expectation (or moment) parameters. Therefore ∇F and (∇F)^{−1} can be computed through the Legendre transformation between the natural and dual parameter spaces. This iterative optimization is guaranteed to converge to a local minimum, with the main advantage of bypassing the learning-rate parameter of gradient-descent algorithms. Since F is strictly convex, ∇F is monotonous, and the rhs expression can be interpreted as a multi-dimensional quasi-arithmetic mean. In fact, it is a barycenter on non-normalized weights scaled by β = ᾱ.
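The CCCP fixed-point update can be sketched for the categorical family, where ∇F and its inverse are available in closed form (our implementation; the fixed-point rule θ ← (1/γ)(∇F)^{−1}(Σᵢ wᵢ ∇F((γ/α)θᵢ + (γ/β)θ)) is our reading of the update derived above):

```python
import numpy as np

def F(theta):                      # categorical log-normalizer
    return np.log(1.0 + np.sum(np.exp(theta)))

def gradF(theta):                  # eta_i = exp(theta_i) / (1 + sum_j exp(theta_j))
    e = np.exp(theta)
    return e / (1.0 + e.sum())

def gradF_inv(eta):                # theta_i = log(eta_i / (1 - sum_j eta_j))
    return np.log(eta / (1.0 - eta.sum()))

def hd_energy(thetas, w, theta, alpha, gamma):
    beta = alpha / (alpha - 1.0)
    return sum(wi * (F(gamma * ti) / alpha + F(gamma * theta) / beta
                     - F(gamma / alpha * ti + gamma / beta * theta))
               for wi, ti in zip(w, thetas))

def hd_centroid(thetas, w, alpha, gamma, iters=100):
    beta = alpha / (alpha - 1.0)
    theta = sum(wi * ti for wi, ti in zip(w, thetas))   # start at the barycenter
    for _ in range(iters):                              # CCCP fixed-point updates
        eta = sum(wi * gradF(gamma / alpha * ti + gamma / beta * theta)
                  for wi, ti in zip(w, thetas))
        theta = gradF_inv(eta) / gamma
    return theta

thetas = [np.array([0.5, -1.0]), np.array([-0.3, 0.8]), np.array([1.0, 1.0])]
w = np.array([0.5, 0.3, 0.2])
alpha, gamma = 3.0, 2.0
c = hd_centroid(thetas, w, alpha, gamma)
e0 = hd_energy(thetas, w, sum(wi * ti for wi, ti in zip(w, thetas)), alpha, gamma)
assert hd_energy(thetas, w, c, alpha, gamma) <= e0 + 1e-12  # energy did not increase
```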
For exponential families, the symmetric HPD centroid minimizes Σ_i w_i S^H_α(p(x; θ_i) : p(x; θ)). In this case, the CCCP update rule is not in closed form because we cannot easily invert the sum of gradients (but when α = β, the two terms collapse, so the CS centroid can be computed by CCCP). Nevertheless, we can implement the reciprocal operation numerically. Interestingly, the symmetric HD centroid can be solved by CCCP: one applies CCCP to iteratively update the centroid, following a procedure similar to the update of C^t_{α,γ}. Once the centroid, say O_{α,γ}, has been computed, we calculate the associated Hölder information, i.e., the weighted sum of divergences from the input distributions to their centroid. The Hölder information generalizes the notions of variance and of Bregman information [4] to the case of Hölder distances.

Clustering based on symmetric Hölder divergences
Given a set of fixed densities {p_1, ..., p_n}, we can perform variational k-means [28] with respect to the Hölder divergence to minimize the cost function E = Σ_{i=1}^n S^H_{α,γ}(p_i : O_{l_i}), where O_1, ..., O_L are the cluster centers and l_i ∈ {1, ..., L} is the cluster label of p_i. The algorithm is given in Algorithm 1. Notice that one does not need to wait for the CCCP iterations to converge: it suffices to improve the cost function E before updating the assignment. We implemented the algorithm based on the symmetric HD; one can easily modify it to use HPD and other variants.
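The clustering loop can be sketched as follows (our toy version, not Algorithm 1: for brevity the center step is a medoid update, a stand-in for the CCCP centroid computation, and initialization is fixed for reproducibility):

```python
import numpy as np

def hpd(p, q, alpha):
    beta = alpha / (alpha - 1.0)
    return -np.log(np.sum(p * q) /
                   (np.sum(p ** alpha) ** (1 / alpha) * np.sum(q ** beta) ** (1 / beta)))

def sym_hpd(p, q, alpha):
    return 0.5 * (hpd(p, q, alpha) + hpd(q, p, alpha))

def cluster(ps, init, alpha, iters=10):
    """k-means-style loop with symmetric Hölder dissimilarity; medoid center step."""
    centers = [ps[i] for i in init]
    labels = [0] * len(ps)
    for _ in range(iters):
        labels = [int(np.argmin([sym_hpd(p, c, alpha) for c in centers])) for p in ps]
        for l in range(len(centers)):
            members = [p for p, li in zip(ps, labels) if li == l]
            if members:
                costs = [sum(sym_hpd(m, q, alpha) for q in members) for m in members]
                centers[l] = members[int(np.argmin(costs))]
    return labels

# Two well-separated groups of categorical distributions:
rng = np.random.default_rng(1)
g1 = [np.array([0.7, 0.2, 0.1]) + rng.uniform(-0.05, 0.05, 3) for _ in range(10)]
g2 = [np.array([0.1, 0.2, 0.7]) + rng.uniform(-0.05, 0.05, 3) for _ in range(10)]
ps = [p / p.sum() for p in g1 + g2]
labels = cluster(ps, init=[0, len(ps) - 1], alpha=2.5)
assert len(set(labels[:10])) == 1 and len(set(labels[10:])) == 1  # groups recovered
assert labels[0] != labels[10]
```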
To reduce the number of parameters that have to be tuned, we only investigate the case α = γ. If we choose α = γ = 2, then S^H_{α,γ} becomes the CS divergence and Algorithm 1 reduces to traditional CS clustering. From Fig. 4, we observe that the clustering result does vary with the settings of α and γ. We performed clustering experiments with two different sample sizes. Table 1 reports the clustering accuracy measured by the percentage of correctly clustered Gaussians, i.e., Gaussians whose label output by the clustering algorithm coincides with the true label of the data-generating process. We see that the symmetric Hölder divergence can give better clustering results than CS clustering. This hints that one could consider the general Hölder divergence to replace CS in similar clustering applications [16,33]. Although one faces the problem of tuning the parameters α and γ, Hölder divergences can potentially give better results. This is expected because CS is just one particular member of the class of Hölder divergences.

Conclusion and perspectives
We introduced the notion of pseudo-divergences, which generalizes the concept of divergences in information geometry [2]: smooth non-metric statistical distances that are not required to obey the law of the indiscernibles. Pseudo-divergences can be built from inequalities by considering the inequality difference gap or its log-ratio gap. We then defined two classes of statistical measures based on Hölder's ordinary and reverse inequalities: the singly parametric family of Hölder pseudo-divergences and the doubly parametric family of Hölder divergences. By construction, the Hölder divergences are proper divergences between probability densities. Both statistical Hölder distance families are projective divergences that do not require distributions to be normalized, and admit closed-form expressions when considering exponential families with conic or affine natural parameter spaces (like multinomials or multivariate normals). Those two families of distances can be symmetrized and intersect at the unique Cauchy-Schwarz divergence. Since the Cauchy-Schwarz divergence is often used in distribution clustering applications [16], we carried out preliminary experiments demonstrating experimentally that the symmetrized Hölder divergences improve over the Cauchy-Schwarz divergence for a toy dataset of Gaussians. We briefly touched upon the use of these novel divergences in statistical estimation theory. These projective Hölder (pseudo-)divergences differ from the recently introduced composite score-induced Hölder divergences [20,19], which are not projective divergences and do not admit closed-form expressions for exponential families in general.
We elicited the special role of escort distributions [2] for Hölder divergences in our framework: Escort distributions transform distributions to allow one: • To reveal that Hölder pseudo-divergences on escort distributions amount to skew Bhattacharyya divergences [22], • To transform the improper Hölder pseudo-divergences into proper Hölder divergences, and vice versa.
Let us conclude with a perspective note on pseudo-divergences, statistical estimators, and manifold learning. Proper divergences have been widely used in statistical estimators to build families of estimators [32,5]. Similarly, given a prescribed density p 0 (x), a pseudo-divergence yields a corresponding estimator by minimizing D(p 0 (x) : q(x)) with respect to q(x). However in this case the resulting q(x) is potentially biased and is not guaranteed to recover the optimal input p 0 (x). Furthermore, the minimizer of D(p 0 (x) : q(x)) may not be unique, i.e., there could be more than one probability density q(x) yielding D(p 0 (x) : q(x)) = 0.
How can pseudo-divergences be useful? We have the following two simple arguments: • In an estimation scenario, we can usually pre-compute from p_0(x) a density p_1(x) satisfying D(p_1(x) : p_0(x)) = 0. Then the estimation q̂(x) = arg min_q D(p_1(x) : q(x)) will automatically target p_0(x). We call this technique pre-aim.
This means that the pre-aim technique of HPD is equivalent to HD D^H_{α,γ} when we set γ = β. As an alternative implementation of pre-aim, since D^H_α(p(x) : p(x)^{α−1}) = 0, a proper divergence between p(x) and q(x) can be constructed by measuring D^H_α(p(x) : q(x)^{α−1}), which turns out again to belong to the class of HD.
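A small numerical check (ours) of this alternative construction on two-category distributions: the surrogate D^H_α(p_0 : q^{α−1}) attains its minimum (zero) exactly at q = p_0.

```python
import numpy as np

def hpd(p, q, alpha):
    beta = alpha / (alpha - 1.0)
    return -np.log(np.sum(p * q) /
                   (np.sum(p ** alpha) ** (1 / alpha) * np.sum(q ** beta) ** (1 / beta)))

alpha = 3.0
p0 = np.array([0.3, 0.7])
# Scan q = (t, 1 - t) on a fine grid; the surrogate vanishes only at q = p0:
ts = np.linspace(0.01, 0.99, 9801)
vals = [hpd(p0, np.array([t, 1.0 - t]) ** (alpha - 1.0), alpha) for t in ts]
assert abs(ts[int(np.argmin(vals))] - 0.3) < 1e-3  # minimizer recovers p0
assert min(vals) < 1e-10                           # and the minimum is zero
```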
In practice, HD as a two-parameter family is less used than HPD with pre-aim because of the difficulty of choosing the parameter γ, and because HD has a slightly more complicated expression. The family of HD connecting the CS divergence with the skew Bhattacharyya divergence [22] is nevertheless of theoretical importance.
• In manifold learning [17,10,38], an essential task is to align two categorical distributions p_0(x) and q(x) corresponding respectively to the input and the output [38], both for learning and for performance evaluation. In this case, the dimensionality of the statistical manifold encompassing p_0(x) and q(x) is so high that monotonically preserving p_0(x) in the resulting q(x) is already a difficult non-linear optimization and could be sufficient for the application, while perfectly preserving the input p_0(x) is not so meaningful because of input noise. It is then much easier to define pseudo-divergences from inequalities that need not be proper, offering potentially more choices.
On the other hand, projective divergences, including the Hölder divergences introduced in this work, are more meaningful in manifold learning than the widely used KL divergence because they are invariant to the scale of the probability densities: one can define positive similarities and then directly align these similarities, which is guaranteed to be equivalent to aligning the corresponding distributions. This could potentially give a unified perspective bridging the two approaches of similarity-based manifold learning [10] and the probabilistic approach [17].