α-Geodesical Skew Divergence

The asymmetric skew divergence smooths one of the two distributions by mixing it, to a degree determined by the parameter λ, with the other distribution. This divergence is an approximation of the KL-divergence that does not require the target distribution to be absolutely continuous with respect to the source distribution. In this paper, an information-geometric generalization of the skew divergence, called the α-geodesical skew divergence, is proposed, and its properties are studied.


Introduction
Let (X, F, µ) be a measure space, where X denotes the sample space, F the σ-algebra of measurable events, and µ a positive measure. The set of strictly positive probability measures P is defined as

P := { f | f(x) > 0 (∀x ∈ X), ∫_X f(x) dµ(x) = 1 },   (1)

and the set of nonnegative probability measures P+ is defined as

P+ := { f | f(x) ≥ 0 (∀x ∈ X), ∫_X f(x) dµ(x) = 1 }.   (2)
Then a number of divergences that appear in statistics and information theory [1,2] are introduced.
Definition 1 (Kullback-Leibler divergence [3]). The Kullback-Leibler divergence or KL-divergence D_KL : P+ × P → [0, ∞] is defined between two Radon-Nikodym densities p and q of µ-absolutely continuous probability measures by

D_KL[p‖q] := ∫_X p(x) log (p(x)/q(x)) dµ(x).

The KL-divergence is a measure of the difference between two probability distributions in statistics and information theory [4-7]. It is also called the relative entropy and is known not to satisfy the axioms of a distance. Because the KL-divergence is asymmetric, several symmetrizations have been proposed in the literature [8-10].
Definition 2 (Jensen-Shannon divergence [8]). The Jensen-Shannon divergence or JS-divergence D_JS : P × P → [0, ∞) is defined between two Radon-Nikodym densities p and q of µ-absolutely continuous probability measures by

D_JS[p‖q] := (1/2) D_KL[p ‖ (p + q)/2] + (1/2) D_KL[q ‖ (p + q)/2].

The JS-divergence is a symmetrized and smoothed version of the KL-divergence, and it is bounded as 0 ≤ D_JS[p‖q] ≤ ln 2.
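As a minimal numerical sketch of this boundedness (the helper names kl and js are ours, for discrete distributions with natural logarithms):

```python
import numpy as np

def kl(p, q):
    """KL-divergence between two discrete distributions (0*log 0 := 0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    """Jensen-Shannon divergence: symmetrized, smoothed KL."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [1.0, 0.0]      # point mass on the first outcome
q = [0.0, 1.0]      # point mass on the second outcome
print(js(p, q))     # attains the upper bound ln 2
```

Even for these two disjoint point masses, for which the KL-divergence is infinite, the JS-divergence stays at its maximum ln 2.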
This property contrasts with the fact that KL-divergence is unbounded.
Definition 3 (Jeffreys divergence [11]). The Jeffreys divergence D_J : P × P → [0, ∞] is defined between two Radon-Nikodym densities p and q of µ-absolutely continuous probability measures by

D_J[p‖q] := D_KL[p‖q] + D_KL[q‖p].

Such symmetrized KL-divergences have appeared in various pieces of literature [12-18]. For continuous distributions, the KL-divergence is known to pose computational difficulty. To be more specific, if q takes a small value relative to p, the value of D_KL[p‖q] may diverge to infinity. The simplest idea to avoid this is to choose a very small ε > 0 and modify D_KL[p‖q] to D_KL[p ‖ q + ε]. However, such an extension is unnatural in the sense that q + ε no longer satisfies the condition for a probability measure: ∫_X (q(x) + ε) dµ(x) ≠ 1. As a more natural way to stabilize the KL-divergence, the following skew divergence has been proposed.

Definition 4 (Skew divergence [8,19]). The skew divergence D_S^(λ) : P × P → [0, ∞] is defined between two Radon-Nikodym densities p and q of µ-absolutely continuous probability measures by

D_S^(λ)[p‖q] := D_KL[p ‖ (1 − λ)p + λq],

where λ ∈ [0, 1].
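The stabilizing effect of the mixture can be seen in a small discrete sketch (the helper names kl and skew are ours): wherever p has mass, the mixture (1 − λ)p + λq also has mass, so the divergence stays finite for λ < 1 even when KL(p‖q) diverges.

```python
import numpy as np

def kl(p, q):
    """KL-divergence between discrete distributions (0*log 0 := 0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def skew(p, q, lam):
    """Skew divergence: KL from p to the mixture (1-lam)*p + lam*q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return kl(p, (1 - lam) * p + lam * q)

p = [0.5, 0.5, 0.0]
q = [0.5, 0.0, 0.5]       # q vanishes where p > 0, so KL(p||q) diverges
print(skew(p, q, 0.99))   # finite: the mixture keeps mass wherever p does
```

With λ = 1 the mixture reduces to q and the plain KL-divergence is recovered.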
Skew divergences have been experimentally shown to perform better in applications such as natural language processing [20,21], image recognition [22,23] and graph analysis [24,25]. In addition, there is research on quantum generalization of skew divergence [26].
The main contributions of this paper are summarized as follows:
• Several symmetrized divergences or skew divergences are generalized from an information geometry perspective.
• It is proved that the natural skew divergence for the exponential family is equivalent to the scaled KL-divergence.
• Several properties of the geometrically generalized skew divergence are proved. Specifically, the functional space associated with the proposed divergence is shown to be a Banach space.

α-Geodesical Skew Divergence
The skew divergence is generalized based on the following function.
Definition 5 (f-interpolation). For p, q ∈ P+, λ ∈ [0, 1] and α ∈ R, the f-interpolation is defined as

m_f^(λ,α)(p(x), q(x)) := f_α^{-1}((1 − λ) f_α(p(x)) + λ f_α(q(x))),   (8)

where f_α(x) := x^((1−α)/2) for α ≠ 1 and f_α(x) := log x for α = 1.

The f-mean function satisfies m_f^(λ,α)(p, p) = p, m_f^(0,α)(p, q) = p and m_f^(1,α)(p, q) = q. It is easy to see that this family includes various known weighted means, including the e-mixture and m-mixture for α = ±1 in the literature of information geometry [28]:

m_f^(λ,1)(p, q) = exp{(1 − λ) log p + λ log q}   (e-mixture),
m_f^(λ,−1)(p, q) = (1 − λ)p + λq   (m-mixture).

The inverse function f_α^{-1} is convex when α ∈ [−1, 1], and concave when α ∈ (−∞, −1] ∪ (1, ∞). It is worth noting that the f-interpolation is a special case of the Kolmogorov-Nagumo average [29-31] when α is restricted to the interval [−1, 1].
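A minimal numerical sketch of the f-interpolation and its two classical special cases (the helper names f_alpha, f_alpha_inv and f_interp are ours):

```python
import numpy as np

def f_alpha(x, alpha):
    """alpha-representation map: x^((1-alpha)/2) for alpha != 1, log x for alpha = 1."""
    x = np.asarray(x, float)
    return np.log(x) if alpha == 1 else x ** ((1 - alpha) / 2)

def f_alpha_inv(y, alpha):
    """Inverse of f_alpha."""
    y = np.asarray(y, float)
    return np.exp(y) if alpha == 1 else y ** (2 / (1 - alpha))

def f_interp(p, q, lam, alpha):
    """f-interpolation m_f^(lam,alpha)(p, q): weighted mean in the alpha-representation."""
    return f_alpha_inv((1 - lam) * f_alpha(p, alpha) + lam * f_alpha(q, alpha), alpha)

p = np.array([0.6, 0.4])
q = np.array([0.2, 0.8])
print(f_interp(p, q, 0.5, -1))  # m-mixture: (p + q) / 2
print(f_interp(p, q, 0.5, 1))   # e-mixture: sqrt(p * q), unnormalized
```

Note that the e-mixture output is an unnormalized positive measure; normalization is deferred, as discussed around the α-geodesic below.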
In order to consider the geometric meaning of this function, the notion of the statistical manifold is introduced.

Statistical Manifold
Let S = { p_ξ | ξ = (ξ^1, . . . , ξ^n) ∈ Ξ ⊂ R^n } be a family of probability distributions on X, where each element p_ξ is parameterized by n real-valued variables. The set S is called a statistical model and is a subset of P. We also write (S, g_ij) for a statistical model equipped with a Riemannian metric g_ij. In particular, let g_ij be the Fisher-Rao metric, which is the Riemannian metric induced from the Fisher information matrix [32].
In the rest of this paper, the abbreviations ∂_i := ∂/∂ξ^i and ℓ(x; ξ) := log p(x; ξ) are used, and E[·] denotes the expectation with respect to p_ξ.

Definition 6 (Christoffel symbols). Let g_ij be a Riemannian metric, particularly the Fisher information matrix. Then the Christoffel symbols are given by

Γ_{ij,k} := (1/2)(∂_i g_{jk} + ∂_j g_{ik} − ∂_k g_{ij}).

Definition 7 (Levi-Civita connection). Let g be a Fisher-Riemannian metric on S, which is a 2-covariant tensor defined locally by vector fields in the 0-representation on S. Then, its associated Levi-Civita connection ∇^(0) is defined by

g(∇^(0)_{∂_i} ∂_j, ∂_k) = Γ_{ij,k}.

The fact that ∇^(0) is a metrical connection can be written locally as ∂_k g_{ij} = Γ_{ki,j} + Γ_{kj,i}. It is worth noting that the superscript α of ∇^(α) corresponds to a parameter of the connection. Based on the above definitions, several connections parameterized by the parameter α are introduced. The case α = 0 corresponds to the Levi-Civita connection induced by the Fisher metric.

Definition 8 (∇^(1)-connection). Let g be the Fisher-Riemannian metric on S, which is a 2-covariant tensor. Then, the ∇^(1)-connection is defined by the Christoffel coefficients

Γ^(1)_{ij,k} := E[(∂_i ∂_j ℓ)(∂_k ℓ)].

Definition 9 (∇^(−1)-connection). Let g be the Fisher-Riemannian metric on S, which is a 2-covariant tensor. Then, the ∇^(−1)-connection is defined by the Christoffel coefficients

Γ^(−1)_{ij,k} := E[(∂_i ∂_j ℓ + ∂_i ℓ ∂_j ℓ)(∂_k ℓ)].

In the following, the ∇-flatness is considered with respect to the corresponding coordinate system. More details can be found in [28].
Proof. It suffices to show that Γ^(1)_{ij,k} + Γ^(−1)_{ij,k} = 2Γ^(0)_{ij,k}, which follows directly from the definitions of Γ^(−1) and Γ^(1). This proves the proposition.
The connections ∇ (−1) and ∇ (1) are two special connections on S with respect to the mixture family and the exponential family, respectively. Moreover, they are related by the duality condition, and the following 1-parameter family of connections is defined.
Definition 10 (∇^(α)-connection). For α ∈ R, the ∇^(α)-connection on the statistical model S is defined as

∇^(α) := ((1 + α)/2) ∇^(1) + ((1 − α)/2) ∇^(−1),

and its Christoffel coefficients Γ^(α)_{ij,k} can be written as

Γ^(α)_{ij,k} = E[(∂_i ∂_j ℓ + ((1 − α)/2) ∂_i ℓ ∂_j ℓ)(∂_k ℓ)].

The α-coordinate system associated with the ∇^(α)-connection is endowed with the α-geodesic, which is a straight line in the corresponding coordinate system. Then, some relevant notions are introduced.
Definition 11 (α-divergence [33]). Let α be a real parameter. The α-divergence between two probability vectors p and q is defined as

D^(α)[p‖q] := (4/(1 − α²))(1 − Σ_i p_i^((1+α)/2) q_i^((1−α)/2))   (α ≠ ±1),

with the cases α = ±1 defined by continuity. The KL-divergence, which is the special case α = 1, induces the linear connection ∇^(1) as follows.
Proposition 7. The diagonal part of the third mixed derivatives of the KL-divergence is the negative of the Christoffel symbol. Proof. The second derivative in the argument ξ is computed first; then, restricting to the diagonal part yields the claim. More generally, the α-divergence with α ∈ R induces the ∇^(α)-connection.
Definition 12 (α-representation [34]). For a positive measure m, the function f_α(m(x)), i.e., m(x)^((1−α)/2) for α ≠ 1 and log m(x) for α = 1, is called the α-representation of the positive measure m.

Definition 13 (α-geodesic [28]). The α-geodesic connecting two probability vectors p(x) and q(x) is defined as

r(x; t) := c(t) f_α^{-1}((1 − t) f_α(p(x)) + t f_α(q(x))),   t ∈ [0, 1],

where c(t) is the normalizing factor determined so that r(·; t) is a probability vector. It is known that appropriate reparameterizations of the parameter t are necessary for a rigorous discussion in the space of probability measures [35,36]. However, as mentioned in the literature [35], an explicit expression for the reparametrizations τ_{p,a} and τ_{p,q} is unknown. A similar discussion has been made in the derivation of the φ_β-path [37], where it is mentioned that the normalizing factor is unknown in general. Furthermore, the f-mean is not convex for some values of α. For these reasons, it is generally difficult to discuss α-geodesics in probability measures by normalization or reparameterization, and to avoid unnecessary complexity, the parameter t is assumed to be appropriately reparameterized.

Generalization of Skew Divergences
From Definition 13, the f-interpolation can be regarded as an unnormalized version of the α-geodesic. Using the notion of geodesics, the skew divergence is generalized in terms of information geometry as follows.
Definition 14 (α-geodesical skew divergence). For α ∈ R and λ ∈ [0, 1], the α-geodesical skew divergence D_GS^(α,λ) : P+ × P+ → [0, ∞] is defined as

D_GS^(α,λ)[p‖q] := D_KL[p ‖ m_f^(λ,α)(p, q)].

Some special cases of the α-geodesical skew divergence are listed below: D_GS^(−1,λ)[p‖q] = D_S^(λ)[p‖q] (the skew divergence), D_GS^(α,1)[p‖q] = D_KL[p‖q], and D_GS^(α,0)[p‖q] = 0. Furthermore, the α-geodesical skew divergence is a special form of the generalized skew K-divergence [10,38], which is a family of abstract means-based divergences. In this paper, the skew K-divergence touched upon in [10] is characterized in terms of the α-geodesic on positive measures, and its geometric and functional analytic properties are investigated. When the mean is a Kolmogorov-Nagumo average (i.e., when the function f^{-1} in Equation (8) is a strictly monotone convex function), the geodesic has been shown to be well-defined [37].

Symmetrization of α-Geodesical Skew Divergence
It is easy to symmetrize the α-geodesical skew divergence as follows.
The symmetrized α-geodesical skew divergence is defined as

D̄_GS^(α,λ)[p‖q] := (1/2)(D_GS^(α,λ)[p‖q] + D_GS^(α,λ)[q‖p]).

It is seen that D̄_GS^(α,λ) recovers several known symmetrized divergences for particular choices of α and λ; the last of these is the λ-JS-divergence [39], which is a generalization of the JS-divergence.

Properties of α-Geodesical Skew Divergence
In this section, the properties of the α-geodesical skew divergence are studied.
Proof. When λ is fixed, the f-interpolation has the following inverse monotonicity with respect to α: for α ≤ β,

m_f^(λ,α)(p, q) ≥ m_f^(λ,β)(p, q).   (29)

From Gibbs' inequality [40] and Equation (29), one obtains the non-negativity of the α-geodesical skew divergence.

Proposition 9 (Asymmetry of the α-geodesical skew divergence). The α-geodesical skew divergence is not symmetric in general: D_GS^(α,λ)[p‖q] ≠ D_GS^(α,λ)[q‖p].

Proof. For example, if λ = 1, then for all α ∈ R it holds that D_GS^(α,1)[p‖q] = D_KL[p‖q], and the asymmetry of the KL-divergence results in the asymmetry of the α-geodesical skew divergence.
Proof. Obvious from the inverse monotonicity of the f-interpolation (Equation (29)) and the monotonicity of the logarithmic function. Figure 1 shows the monotonicity of the α-geodesical skew divergence with respect to α; in this figure, the divergence is calculated between two binomial distributions.

Proposition 12 (Subadditivity of the α-geodesical skew divergence with respect to α). The α-geodesical skew divergence satisfies the following inequality for all α, β ∈ R and λ ∈ [0, 1]:

D_GS^(α+β,λ)[p‖q] ≤ D_GS^(α,λ)[p‖q] + D_GS^(β,λ)[p‖q].

Proof. For any α and λ, m_f^(λ,α) takes the form of the Kolmogorov mean [29], which is obvious from its continuity, monotonicity and self-distributivity.
Proof. This follows from the continuity of the KL-divergence and of the Kolmogorov mean. Figure 2 shows the continuity of the α-geodesical skew divergence with respect to α and λ; both source and target distributions are binomial. From this figure, it can be seen that the divergence changes smoothly as the parameters change.

Proof. Let u = (1 − α)/2. Then lim_{α→−∞} u = ∞. Assuming p_0 ≤ p_1, it holds that the following equality holds.
Proof. It follows from the inverse monotonicity of the f-interpolation (Equation (29)) and Lemma 2.
Proof. It follows from the definition of the f-interpolation, Equation (29), and Lemma 1.
Theorem 1 (Strong convexity of the α-geodesical skew divergence). α-Geodesical skew divergence D (α,λ) GS [p q] is strongly convex in p with respect to the total variation norm.

Natural α-Geodesical Skew Divergence for Exponential Family
In this section, the exponential family is considered, in which the probability density function is given by

p(x; θ) = exp{ Σ_i θ^i x_i + k(x) − ψ(θ) },

where x is a random variable. In the above equation, θ = (θ^1, . . . , θ^n) is an n-dimensional vector parameter specifying the distribution, k(x) is a function of x, and ψ(θ) corresponds to the normalization factor. In the skew divergence, the target probability distribution is a weighted average of the two distributions. This implicitly assumes that interpolation of the two probability distributions is properly given by linear interpolation. Here, in the exponential family, interpolation between the natural parameters rather than interpolation between the probability distributions themselves is considered. Namely, the geodesic connecting two distributions p(x; θ_p) and q(x; θ_q) on the θ-coordinate system is considered:

θ(λ) = (1 − λ)θ_p + λθ_q,

where λ ∈ [0, 1] is the parameter. The probability distributions on the geodesic θ(λ) are

p(x; θ(λ)) = exp{ Σ_i ((1 − λ)θ_p^i + λθ_q^i) x_i + k(x) − ψ(θ(λ)) }.

Hence, the geodesic itself is a one-dimensional exponential family, where λ is the natural parameter. The geodesic consists of a linear interpolation of the two distributions on the logarithmic scale because

log p(x; θ(λ)) = (1 − λ) log p(x; θ_p) + λ log q(x; θ_q) + (1 − λ)ψ(θ_p) + λψ(θ_q) − ψ(θ(λ)).

This corresponds to the case α = 1 of the f-interpolation with normalization factor exp{(1 − λ)ψ(θ_p) + λψ(θ_q) − ψ(θ(λ))}. This induces the natural geodesic skew divergence with α = 1 as

D_GS^(1,λ)[p‖q] = D_KL[p ‖ exp{(1 − λ) log p + λ log q}] = λ D_KL[p‖q],

and this is equal to the scaled KL-divergence.
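The identity that the divergence along the unnormalized e-geodesic (α = 1) equals the λ-scaled KL-divergence can be checked numerically; the sketch below uses the log-ratio sum Σ p log(p/m) applied to the unnormalized mixture m = p^(1−λ) q^λ (the helper name kl is ours):

```python
import numpy as np

def kl(p, q):
    """Log-ratio term sum p*log(p/q); the second argument need not be normalized."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
lam = 0.3

m = p ** (1 - lam) * q ** lam   # unnormalized e-geodesic point (alpha = 1)
print(kl(p, m))                 # equals lam * KL(p||q)
print(lam * kl(p, q))           # same value
```

The cancellation is purely algebraic: log(p/m) = λ log(p/q), so the scaling by λ falls out of the sum regardless of the distributions chosen.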
More generally, let θ_P^(α) and θ_Q^(α) be the parameter representations of probability distributions P and Q in the α-coordinate system. Then, the geodesics between them are represented as in Figure 3, and they induce the α-geodesical skew divergence.

Function Space Associated with the α-Geodesical Skew Divergence
To discuss the functional nature of the α-geodesical skew divergence in more depth, the function space it constitutes is considered. For the α-geodesical skew divergence D_GS^(α,λ)[p‖q] with one side of the distribution fixed, let V be the set of all functions for which the divergence is finite. To verify completeness, take a Cauchy sequence {f_n} ⊂ V and a subsequence {f_{n(k)}} with a dominating sequence {g_k} such that the subsequence converges absolutely almost everywhere and |lim_{k→∞} f_{n(k)}| ≤ lim_{k→∞} g_k a.e.; that is, lim_{k→∞} f_{n(k)} ∈ V. Then ‖f_n − f_{n(k)}‖ ≤ lim_{k→∞} g_k, and from the dominated convergence theorem, one obtains lim_{n→∞} lim_{k→∞} ‖f_n − f_{n(k)}‖_p = 0. This confirms the completeness of V.
Proof. If λ is restricted to (0, 1], then D_GS^(α,λ)[u‖q] = 0 if and only if u = q. Then V+ has the unique identity element, and hence V+ is a complete normed space.
Consider the case where the second argument Q of D_GS^(α,λ)[P‖Q] is fixed; it is referred to as the reference distribution. Figure 4 shows values of the α-geodesical skew divergence for a fixed reference Q, where both P and Q are restricted to be Gaussian. In this figure, the reference distribution is N(0, 0.5) and the parameters of the input distributions are varied over µ ∈ [0, 4.5] and σ² ∈ [0.5, 2.3]. From this figure, one can see that a larger value of α emphasizes the discrepancy between the distributions P and Q. Figure 5 illustrates a coordinate system associated with the α-geodesical skew divergence for different α. As seen from the figure, for the same pair of distributions P and Q, the value of the divergence with α = 3 is larger than that with α = −1.

Figure 4. α-geodesical skew divergence between two normal distributions. The reference distribution is Q = N(0, 0.5). For P_1, P_2, . . . , P_j (j = 1, 2, . . . , 10), let their mean and variance be µ_j and σ_j², respectively, where µ_{j+1} − µ_j = 0.5 and σ_{j+1}² − σ_j² = 0.2.

Figure 5. Coordinate system of F_q or F_+. Such a coordinate system is not Euclidean.
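The qualitative observation of Figure 4, that a larger α emphasizes the discrepancy between P and Q, can be sketched with discretized Gaussians on a grid (a crude stand-in for the continuous densities; the grid bounds, discretization and helper names are our own choices):

```python
import numpy as np

x = np.linspace(-6.0, 6.0, 2001)

def gaussian(mu, var):
    """Gaussian density discretized and normalized on the grid."""
    g = np.exp(-(x - mu) ** 2 / (2.0 * var))
    return g / g.sum()

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def gs_div(p, q, lam, alpha):
    """alpha-geodesical skew divergence on the grid."""
    if alpha == 1:
        m = np.exp((1 - lam) * np.log(p) + lam * np.log(q))
    else:
        e = (1 - alpha) / 2
        m = ((1 - lam) * p ** e + lam * q ** e) ** (1 / e)
    return kl(p, m)

q_ref = gaussian(0.0, 0.5)   # reference distribution Q = N(0, 0.5)
p_in = gaussian(1.0, 1.0)    # one input distribution P
for a in (-1.0, 1.0, 3.0):
    print(a, gs_div(p_in, q_ref, 0.5, a))  # value grows with alpha
```

As in the figures, the divergence for α = 3 exceeds that for α = −1 for the same pair of distributions.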

Conclusions and Discussion
In this paper, a new family of divergences is proposed to address the computational difficulty of the KL-divergence. The proposed α-geodesical skew divergence is derived naturally from the concept of the α-geodesic in information geometry and generalizes many existing divergences.
Furthermore, α-geodesical skew divergence leads to several applications. For example, the new divergence can be applied to the annealed importance sampling by the same analogy as in previous studies using q-paths [41]. It could also be applied to linguistics, a field in which skew divergence was originally used [19].