Extended Divergence on a Foliation by Deformed Probability Simplexes

This study considers a new decomposition of an extended divergence on a foliation by deformed probability simplexes from the information-geometry perspective. In particular, we treat the case where each deformed probability simplex corresponds to a set of q-escort distributions. For the foliation, different q-parameters, and the corresponding α-parameters of the dualistic structures, are defined on the various leaves. We propose a divergence decomposition theorem that measures the proximity of q-escort distributions with different q-parameters, and compare the new theorem to the previous theorem on the standard divergence on a Hessian manifold with a fixed α-parameter.

On a set of probability distributions, divergences are usually defined for a fixed α-parameter of the dualistic structure. Building on those results, we previously defined an extended divergence on a foliation by sets of probability distributions, setting a different α-parameter on each leaf. In particular, we treated a foliation by deformed probability simplexes [15].
In this paper, we also study deformed probability simplexes corresponding to sets of escort distributions with q-parameters, which satisfy q = (1 − α)/2 for the α-parameters of information geometry. We clarify the relationship among affine spaces, affine immersions, and the extended divergence in more detail than in our previous paper. A comparison of the extended divergence with the duo Bregman divergence used in machine learning is also given [16].
First, we explain the dualistic structures, α-divergences, and the Tsallis relative entropy on the probability simplex, using concepts from affine geometry and information geometry. The relationship between an α-parameter and the Tsallis q-parameter is stated. Next, we describe the dualistic structures and the divergences generated by affine immersions on the deformed probability simplexes corresponding to sets of escort distributions; this part also covers Hessian manifolds and their level surfaces. We then define an extended divergence on a foliation by deformed probability simplexes. Finally, we propose a new decomposition of the extended divergence on the foliation.

The Tsallis Relative Entropy and the Kullback-Leibler Divergence on the Probability Simplex
In this section, we explain dualistic structures, α-divergences, and the Tsallis relative entropy on the probability simplex [4,5,12].
Let A^{n+1} be an (n + 1)-dimensional real affine space and {x_1, . . . , x_{n+1}} the canonical affine coordinate system on A^{n+1}, i.e., D̃dx = 0, where D̃ is the canonical flat affine connection on A^{n+1}. Let S^n be the simplex in A^{n+1}_+ defined by

S^n = { p ∈ A^{n+1} | Σ_{i=1}^{n+1} x_i(p) = 1, x_i(p) > 0 for i = 1, . . . , n + 1 }.

If x_1(p), . . . , x_{n+1}(p) are regarded as the probabilities of n + 1 states, S^n is called the n-dimensional probability simplex. Let {p̃_1, . . . , p̃_n} be the affine coordinate system on S^n defined by p̃_i(p) = x_i(p) − x_{n+1}(p) for i = 1, . . . , n, and let {∂/∂p̃_1, . . . , ∂/∂p̃_n} be the corresponding frame of tangent vector fields on S^n. The Fisher metric g = (g_ij) on S^n is defined accordingly, where δ_ij denotes the Kronecker delta, and an α-connection ∇^(α) on S^n is defined with δ^k_ij = 1 if i = j = k and δ^k_ij = 0 otherwise. Then, the Levi-Civita connection ∇ of g coincides with ∇^(0). For α ∈ R, we have

Xg(Y, Z) = g(∇^(α)_X Y, Z) + g(Y, ∇^(−α)_X Z) for X, Y, Z ∈ X(S^n),

where X(S^n) is the set of all smooth tangent vector fields on S^n. Then, ∇^(−α) is called the dual connection of ∇^(α). For each α, ∇^(α) is torsion-free and ∇^(α)g is symmetric. Therefore, the triple (S^n, ∇^(α), g) is a statistical manifold, and (S^n, ∇^(−α), g) is its dual statistical manifold. Note that the affine connections ∇^(1) and ∇^(−1) in Equations (4)-(6) are the dual connection and the canonical connection, respectively.
If q = (1 − α)/2, the α-divergence D^(α) corresponds to the Tsallis relative entropy K_q on S^n, where ln_q is the q-logarithmic function defined by ln_q x = (x^{1−q} − 1)/(1 − q) for x > 0 and q ≠ 1 [1,2]. The Tsallis relative entropy K_q converges to the Kullback-Leibler divergence as q → 1, because lim_{q→1} ln_q x = log x. From the information-geometric viewpoint, the α-divergence D^(α) converges to the Kullback-Leibler divergence as α → −1.
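As a quick numerical illustration of this limit, the following sketch checks that K_q approaches the Kullback-Leibler divergence as q → 1. It assumes the common convention K_q(p‖r) = (1 − Σ_i p_i^q r_i^{1−q})/(1 − q), which is one standard form of the Tsallis relative entropy; the paper's exact normalization may differ by a constant factor.

```python
import math

def ln_q(x, q):
    """q-logarithm ln_q(x) = (x^(1-q) - 1)/(1 - q); recovers log(x) as q -> 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_relative_entropy(p, r, q):
    """K_q(p||r) = (1 - sum_i p_i^q r_i^(1-q)) / (1 - q): one common convention."""
    return (1.0 - sum(pi ** q * ri ** (1.0 - q) for pi, ri in zip(p, r))) / (1.0 - q)

def kl_divergence(p, r):
    """Kullback-Leibler divergence sum_i p_i log(p_i / r_i)."""
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r))

p = [0.5, 0.3, 0.2]
r = [0.2, 0.3, 0.5]
# K_q approaches the Kullback-Leibler divergence as q -> 1.
gap = abs(tsallis_relative_entropy(p, r, 0.9999) - kl_divergence(p, r))
```

The gap shrinks linearly in |1 − q|, matching the limit lim_{q→1} ln_q x = log x stated above.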

Divergences Generated by Affine Immersions as Level Surfaces
In this section, we describe the general theory of affine immersions and divergences related to level surfaces of a Hessian domain.
If the Hessian D̃dϕ = Σ_{i,j} (∂²ϕ/∂x_i∂x_j) dx_i dx_j of a function ϕ on a domain Ω ⊆ A^{n+1} is non-degenerate, the triple (Ω, D̃, g̃ = D̃dϕ) is called a Hessian domain. A statistical manifold is said to be flat if the curvature tensor of its affine connection vanishes. A flat statistical manifold is locally a Hessian domain; conversely, a Hessian domain is a flat statistical manifold [12,17].
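A minimal numeric sketch of a Hessian domain, using an assumed example potential (not one from this paper): the negative-entropy function ϕ(x) = Σ_i x_i log x_i on the positive orthant has Hessian diag(1/x_1, . . . , 1/x_{n+1}), which is positive definite, so (Ω, D̃, D̃dϕ) is a Hessian domain. A finite-difference check confirms the Hessian numerically.

```python
import math

def phi(x):
    """Assumed example potential: negative entropy on the positive orthant."""
    return sum(xi * math.log(xi) for xi in x)

def hessian_fd(f, x, h=1e-4):
    """Central finite-difference Hessian of f at the point x."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            xpp = list(x); xpp[i] += h; xpp[j] += h
            xpm = list(x); xpm[i] += h; xpm[j] -= h
            xmp = list(x); xmp[i] -= h; xmp[j] += h
            xmm = list(x); xmm[i] -= h; xmm[j] -= h
            H[i][j] = (f(xpp) - f(xpm) - f(xmp) + f(xmm)) / (4.0 * h * h)
    return H

x = [0.2, 0.3, 0.5]
H = hessian_fd(phi, x)
# Diagonal entries approximate 1/x_i; off-diagonal entries vanish,
# so the Hessian metric g = Dd(phi) is positive definite here.
```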
In a previous study, we showed the following theorem on the level surfaces of a Hessian domain.

Theorem 1 ([18]). Let M be a simply connected n-dimensional level surface of ϕ on an (n + 1)-dimensional Hessian domain (Ω, D̃, g̃ = D̃dϕ) with a Riemannian metric g̃, and suppose that n ≥ 2. If we consider (Ω, D̃, g̃) a flat statistical manifold, then (M, D, g) is a 1-conformally flat statistical submanifold of (Ω, D̃, g̃), where D and g denote the connection and the Riemannian metric on M induced by D̃ and g̃, respectively.
Here, "1-conformally flat" characterizes the surfaces obtained by projecting a flat statistical manifold along dual coordinates. We continue by explaining the terms used in Theorem 1 and outlining the proof. If a statistical manifold (N, ∇̄, h̄) is 1-conformally equivalent to a flat statistical manifold (N, ∇, h), then (N, ∇̄, h̄) is called a 1-conformally flat statistical manifold. A statistical manifold (N, ∇, h) is 1-conformally flat iff the dual statistical manifold (N, ∇*, h) is (−1)-conformally flat [19].
For an (n + 1)-dimensional Hessian domain (Ω, D̃, g̃ = D̃dϕ), an n-dimensional level surface of ϕ carries a dualistic structure as a statistical submanifold. On the other hand, the level surface also carries the structure induced by an affine immersion. It is essential for Theorem 1 that, on a level surface of ϕ, the statistical submanifold structure coincides with the dualistic structure given by the affine immersion.
For (Ω, D̃, g̃ = D̃dϕ), let x be the canonical immersion of an n-dimensional level surface M into Ω. Let E be the transversal vector field on M defined in terms of the gradient vector field Ẽ of ϕ on Ω, where

Ẽ = Σ_{i,j} g̃^{ij} (∂ϕ/∂x_j) ∂/∂x_i.

For the affine immersion (x, E) and the canonical flat affine connection D̃ on Ω ⊆ A^{n+1}, the induced affine connection D^E, the affine fundamental form g^E, the shape operator S^E, and the transversal connection form τ^E on M are defined by

D̃_X Y = D^E_X Y + g^E(X, Y)E,  D̃_X E = −S^E X + τ^E(X)E,

for tangent vector fields X, Y on M; see [21,22]. Then, D^E and g^E coincide with the restriction of the affine connection D̃ and the restriction of the Riemannian metric g̃, respectively. For the level surface M, the transversal connection form satisfies τ^E ≡ 0; therefore, (x, E) is called an equiaffine immersion. It is known that a simply connected statistical manifold can be realized in A^{n+1} by a non-degenerate equiaffine immersion iff it is 1-conformally flat [19]. Thus, Theorem 1 holds. Next, we introduce a divergence on a Hessian domain, treating it as a flat statistical manifold.
The canonical divergence ρ of a Hessian domain (Ω, D̃, g̃ = D̃dϕ) is defined via the gradient mapping ι̃ from Ω to the dual affine space A*_{n+1}, where {x*_1, . . . , x*_{n+1}} is the dual affine coordinate system of {x_1, . . . , x_{n+1}}, together with the Legendre transform ϕ* of ϕ; see [12]. Let ι be the conormal immersion for the affine immersion (x, E) defined by Equations (11) and (12). By the definition of a conormal immersion, ι satisfies

⟨ι(p), E(p)⟩ = 1,  ⟨ι(p), X⟩ = 0 for X ∈ T_p M,

where ⟨a, b⟩ is the pairing of a ∈ A*_{n+1} and b ∈ A_{n+1}. It is known that the conormal immersion ι coincides with the restriction of the gradient mapping ι̃ to the level surface M.
The next definition is given in relation to affine immersions and divergences.

Definition 1 ([19]). Let (N, ∇, h) be a 1-conformally flat statistical manifold realized by a non-degenerate affine immersion (v, ξ) into A^{n+1}, and let w be the conormal immersion for v. Then the divergence ρ_conf of (N, ∇, h) is defined by

ρ_conf(p, r) = ⟨w(r), v(p) − v(r)⟩ for p, r ∈ N.

The definition of ρ_conf is independent of the choice of realization of (N, ∇, h).
The divergence ρ_conf is referred to as the Kurose geometric divergence in affine geometry and as the Fenchel-Young divergence in the machine learning community [23,24]. Since an n-dimensional level surface M of (Ω, D̃, g̃ = D̃dϕ) is a 1-conformally flat statistical manifold realized by the non-degenerate affine immersion (x, E), the divergence ρ_conf is defined on M. Let ρ_sub be the restriction of the canonical divergence ρ to (M, D, g) as a statistical submanifold of (Ω, D̃, g̃). From Equations (15), (17) and (18), the next theorem, stating that ρ_conf coincides with ρ_sub on M, holds.

Deformed Probability Simplexes and Escort Distributions Generated by Affine Immersions
In this section, we explain dualistic structures on deformed probability simplexes, which correspond to sets of escort distributions via affine immersion.
We set p_i = x_i(p), i = 1, . . . , n + 1 for p ∈ S^n, where S^n and {x_1, . . . , x_{n+1}} are the probability simplex and the canonical affine coordinate system on A^{n+1}, respectively. For the n + 1 states p_1, . . . , p_{n+1} on S^n and 0 < q < 1, if each probability P(p_i) satisfies

P(p_i) = (p_i)^q / Σ_{j=1}^{n+1} (p_j)^q,

the probability distribution P is called the escort distribution [1,2], where (p_i)^q is p_i raised to the power q. The dualistic structure of a set of escort distributions is realized via an affine immersion into A^{n+1}_+ [4,5]. For 0 < q < 1, let f_q be the corresponding affine immersion of S^n into A^{n+1}_+; then the escort distribution P can also be represented in terms of the image coordinates of f_q. For a suitable function ψ_q on A^{n+1}_+, the image f_q(S^n) is a level surface of ψ_q satisfying ψ_q = 1/(1 − q). For 0 < q < 1, the Hessian matrix of the function ψ_q is positive definite on A^{n+1}_+. Then, ψ_q induces the Hessian structure (A^{n+1}_+, D̃, g̃_q ≡ (∂²ψ_q/∂x_i∂x_j)). By definition, the quadruple (A^{n+1}_+, D̃, D̃^(−1), g̃_q) is a dually flat structure. The connection D̃^(0) coincides with the Levi-Civita connection of the Riemannian metric g̃_q.
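The escort construction above is easy to compute directly. A minimal sketch of the q-escort map on a probability vector, using the standard normalization P_i = p_i^q / Σ_j p_j^q:

```python
def escort(p, q):
    """q-escort distribution P_i = p_i^q / sum_j p_j^q of a probability vector p."""
    w = [pi ** q for pi in p]
    s = sum(w)
    return [wi / s for wi in w]

p = [0.7, 0.2, 0.1]
P = escort(p, 0.5)
# P is again a probability vector; q = 1 returns p itself, while q < 1
# flattens the distribution, raising the relative weight of rare states.
```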
We denote by D and g_q the restrictions of D̃ and g̃_q to f_q(S^n), which induce the dualistic structure of (f_q(S^n), D, g_q) as the submanifold structure of (A^{n+1}_+, D̃, g̃_q). From the discussion in Section 3, (f_q(S^n), D, g_q) coincides with the dualistic structure induced by the equiaffine immersion (f_q, E_q), where E_q is obtained from the gradient vector field Ẽ_q of ψ_q on A^{n+1}_+, defined by

Ẽ_q = Σ_{i,j} g̃^{ij}_q (∂ψ_q/∂x_j) ∂/∂x_i.

The pullback of (f_q(S^n), D, g_q) to S^n is (−1)-conformally equivalent to (S^n, ∇^(α), g) defined by Equations (3)-(5). In addition, (f_q(S^n), D, g_q) has constant curvature κ = q(1 − q) = (1 − α²)/4 [5]. On (f_q(S^n), D, g_q), the divergence ρ_q obtained by restricting the canonical divergence of (A^{n+1}_+, D̃, g̃_q) coincides with the geometric divergence of Equation (18) for the affine immersion (f_q, E_q). In terms of an affine coordinate system {x'_1, . . . , x'_{n+1}} on A^{n+1}, the divergence ρ_q of (f_q(S^n), D, g_q) can be described explicitly. In addition, the pullback divergence of ρ_q to S^n coincides with D^(α) and the Tsallis relative entropy K_q [4]. At the end of this section, we mention the divergence of (A^{n+1}_+, D̃, g̃_q). By Equation (17), the Legendre transform ψ*_q of ψ_q can be computed, and by Equations (15) and (16), the canonical divergence of (A^{n+1}_+, D̃, g̃_q) is defined; we represent it by the same symbol ρ_q as the divergence of (f_q(S^n), D, g_q).
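The parameter correspondence used throughout can be sanity-checked numerically. This short loop verifies that α = 1 − 2q inverts q = (1 − α)/2 and that the stated constant curvature satisfies κ = q(1 − q) = (1 − α²)/4 over a grid of q values:

```python
# Numeric check of the q <-> alpha correspondence and the curvature identity
# kappa = q(1 - q) = (1 - alpha^2)/4 for the leaf (f_q(S^n), D, g_q).
for k in range(1, 10):
    q = k / 10.0
    alpha = 1.0 - 2.0 * q                    # inverse of q = (1 - alpha)/2
    kappa = q * (1.0 - q)
    assert abs(kappa - (1.0 - alpha ** 2) / 4.0) < 1e-12
```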

Extended Divergence on a Foliation by Deformed Probability Simplexes
Previous sections described the divergence for each fixed q and each fixed α. This section defines an extended divergence on the foliation by the deformed probability simplexes (f_q(S^n), D, g_q) for all 0 < q < 1, and shows a divergence decomposition theorem.
The contents of our paper [15] are included, but they are explained in greater detail in the setting of affine geometry.
To measure the proximity of q-escort distributions with different q-parameters, we define an extended divergence on a foliation by deformed probability simplexes as follows.
We call the function ρ_fol on S_fol × S_fol defined by Equation (31), where S_fol ≡ ∪_{0<q<1} f_q(S^n), an extended divergence on a foliation by deformed probability simplexes.
The i-th component of the conormal immersion of (f_q, E_q) is −∂ψ_q/∂x_i. By the right-hand side of Equation (27), the dual coordinate of b, denoted by x'(b), satisfies −x'(b) ∈ f_{1−q}(S^n). Therefore, we consider f_{1−q}(S^n) as the dual simplex of f_q(S^n) for 0 < q < 1. For q = 1/2, f_q(S^n) is self-dual [4]. Note that the i-th component of the dual coordinate of b is denoted differently in [15].
On the extended divergence, the next proposition holds.

Proposition 1.
The extended divergence ρ_fol on S_fol satisfies the following:
(i) If q(a) = q(b), then ρ_fol(a, b) coincides with ρ_{q(a)}(a, b), where ρ_q is the divergence of (f_q(S^n), D, g_q) given by Equation (28); its pullback to S^n is the α-divergence D^(α) defined by Equation (7), with α(a) = 1 − 2q(a).
(ii) In the case of q(a) ≥ q(b), ρ_fol(a, b) ≥ 0, and ρ_fol(a, b) = 0 if and only if a = b.

Proof. Statement (i) follows from Equations (28) and (31), the definitions of ρ_q(a, b) and ρ_fol(a, b).
The remaining relations are induced by the definition of f_q(S^n). In addition, f_{q(a)}(S^n) and f_{q(b)}(S^n) are convex surfaces centered on the origin of A^{n+1}_+, and the surface f_{q(a)}(S^n) lies closer to the origin than f_{q(b)}(S^n).

We define the extended dual divergence ρ*_fol of ρ_fol analogously, using the Legendre transform ψ*_q of ψ_q for 0 < q < 1. Then, the following holds.

Proposition 2.
The functions ρ_fol and ρ*_fol satisfy ρ*_fol(a, b) = ρ_fol(b, a).

Proof. This follows from the definition of the Legendre transform.

The extended divergence is related to the duo Bregman (pseudo-)divergence, in which the parameters also define the convex functions [16]. To work with entire parametrized families of probability distributions and to explore applications of divergences, the relationship between these two constructions must be investigated.
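The duality between a divergence and its Legendre-transformed counterpart mirrors a classical fact about Bregman divergences: passing to the dual potential reverses the arguments. A self-contained sketch with the illustrative potential φ(x) = Σ_i x_i log x_i on the positive orthant (an assumption for demonstration, not the paper's ψ_q), whose Legendre transform is φ*(y) = Σ_i exp(y_i − 1):

```python
import math

def grad_phi(x):
    """Dual coordinates y_i = d(phi)/d(x_i) = log x_i + 1 for phi = sum x log x."""
    return [math.log(xi) + 1.0 for xi in x]

def bregman_phi(a, b):
    """B_phi(a, b) = phi(a) - phi(b) - <grad phi(b), a - b>, simplified."""
    return sum(ai * math.log(ai / bi) - ai + bi for ai, bi in zip(a, b))

def bregman_phi_star(u, v):
    """B_{phi*}(u, v) for the Legendre transform phi*(y) = sum_i exp(y_i - 1)."""
    return sum(math.exp(ui - 1.0) - math.exp(vi - 1.0)
               - math.exp(vi - 1.0) * (ui - vi) for ui, vi in zip(u, v))

a = [0.5, 0.4, 0.3]
b = [0.3, 0.6, 0.2]
lhs = bregman_phi(a, b)
rhs = bregman_phi_star(grad_phi(b), grad_phi(a))  # arguments reversed in the dual
# lhs and rhs agree up to rounding: the dual divergence swaps its arguments.
```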

Decomposition of an Extended Divergence
In this section, we explain the orthogonal foliation of F. Next, we give a decomposition of the extended divergence along an orthogonal leaf and an original leaf.
For the foliation F = { f_q(S^n) | 0 < q < 1 }, we consider the flow on S_fol defined by the following equation,
where the function x'_i on S_fol takes the i-th component of the dual coordinate on f_q(S^n), as in Equation (27), for each 0 < q < 1. An integral curve of Equation (35) is orthogonal to f_q(S^n) for each q with respect to the pairing ⟨ , ⟩. The set of integral curves forms the orthogonal foliation of F, which we denote by F⊥. Translating into the primal coordinate system, we obtain the next equation,
where (g̃^{ij}_q) is the inverse matrix of (g̃_{q,ij}). The right-hand side of Equation (37) is calculated using Equations (11) and (12) for ψ_q. A leaf of F⊥ is an integral curve of the vector field Ẽ that takes the value Ẽ_q on f_q(S^n) for each q.
The following theorem is on the decomposition of the extended divergence.
Theorem 3. Let S^n be the probability simplex, and (f_q(S^n), D, g_q) the 1-conformally flat statistical manifold generated by the affine immersion (f_q, E_q), where f_q is defined as above, E_q = Σ_{i,j} g̃^{ij}_q (∂ψ_q/∂x_j) ∂/∂x_i, and g_q is the restriction of (g̃_{q,ij}) = D̃dψ_q to f_q(S^n). Let a, b ∈ f_{q(a)}(S^n) with 0 < q(a) < 1, and let c ∈ S_fol ≡ ∪_{0<q<1} f_q(S^n). If there exists an orthogonal leaf L⊥ ∈ F⊥ that includes b and c, we have the decomposition of Equation (39), where x'(·) is the dual coordinate of f_q(S^n) for each q.
Proof. The theorem follows from the definitions in Equations (22) and (23). See Figure 1 for a decomposition of the extended divergence and graphs of the deformed simplexes f_q(S^n).
A decomposition similar to Equation (39) is also available on a foliation by the Hessian level surfaces of a single Hessian manifold [20]. Theorem 3 generalizes that previous decomposition.
Finally, we describe the gradient flow on a leaf f q (S n ) using the extended divergence.

Theorem 4.
For a submanifold (f_q(S^n), D, g_q) of S_fol, we denote by {x_1, . . . , x_n} an affine coordinate system on f_q(S^n) such that Ddx_i = 0, i = 1, . . . , n, and set g_{q,ij} = g_q(∂/∂x_i, ∂/∂x_j) and (g^{ij}_q) = (g_{q,ij})^{-1}. For a fixed point c ∈ L⊥, the gradient flow on f_q(S^n) defined by Equation (40) converges to the unique point b ∈ L⊥ ∩ f_q(S^n), where a_x is a variable point parametrized as {x_1(t), . . . , x_n(t)}.
Proof. By Theorem 3, for any a_x ∈ f_q(S^n), there exists µ > 0 such that Equation (40) can be described in the dual coordinate system {x'_1, . . . , x'_n} on f_q(S^n). On f_q(S^n), by Proposition 1 (i), ρ_fol coincides with the geometric divergence ρ_q generated by the affine immersion (f_q, E_q). The geometric divergence generates the dual coordinates x'_i, satisfying D*dx'_i = 0, i = 1, . . . , n, derived from the x_i [19]. With a|_{t=0} the initial point of Equation (40), the gradient flow of Equation (40) converges to b ∈ L⊥ ∩ f_q(S^n) along a geodesic with respect to the dual coordinate system.
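The mechanism of Theorem 4, a gradient flow that straightens into a dual geodesic, can be illustrated in the simplest dually flat setting. The sketch below uses an assumed toy potential φ(x) = Σ_i x_i log x_i (not the paper's (f_q(S^n), D, g_q)): the gradient flow of a Bregman divergence x ↦ B(x, b) becomes linear in the dual coordinates y = ∇φ(x), so the discretized flow runs straight to b.

```python
import math

# Dual coordinates for the toy potential phi(x) = sum x_i log x_i:
# y_i = log x_i + 1, with inverse x_i = exp(y_i - 1).
def to_dual(x):
    return [math.log(xi) + 1.0 for xi in x]

def from_dual(y):
    return [math.exp(yi - 1.0) for yi in y]

b = [0.2, 0.3, 0.5]                      # target point on the leaf
x = [0.6, 0.3, 0.1]                      # initial point
y_b = to_dual(b)
y = to_dual(x)
eta = 0.1
for _ in range(500):                     # discretized dual-coordinate gradient flow
    y = [yi - eta * (yi - ybi) for yi, ybi in zip(y, y_b)]
x = from_dual(y)
# x has converged to b along a straight line (a dual geodesic) in y-coordinates.
```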
A gradient flow similar to Equation (40) has been provided on a flat statistical submanifold [25]. A similar one on a Hessian level surface, i.e., a 1-conformally flat statistical submanifold, has been given in [20]. Theorem 4 generalizes these previous theorems on gradient flows.

Figure 1. A decomposition of the extended divergence ρ_fol(a, c), with graphs of the standard simplex (q → 1) and the deformed simplexes for q = 0.75, 0.6, 0.5, 0.4, 0.25 in A^2_+. For the primal coordinates a, b ∈ f_0.75(S^1) and c ∈ f_0.6(S^1), the dual coordinates satisfy −x'(a), −x'(b) ∈ f_0.25(S^1) and −x'(c) ∈ f_0.4(S^1). The primal geodesic between a and b is orthogonal to the dual one between b and c with respect to the metric g_0.75.

Conclusions
This study considered a foliation by deformed probability simplexes corresponding to sets of escort distributions with q-parameters, allowing a continuous transition of α-parameters in information geometry. Since sets of escort distributions are typical q-exponential families, it remains to provide details of the extended divergence and a natural definition of a foliation of q-exponential families.
The extended divergence measures the proximity of q-exponential distributions with different q-parameters. Therefore, our theory provides a mathematical basis for generalizing methods of machine learning and statistical mechanics to q-distribution families in which different q-parameters are mixed. The decomposition theorem can be applied to the problem of the optimal choice of the q-parameter, although concrete application methods remain open for consideration. It also remains to investigate the relationship with the new λ-duality in nonextensive statistical mechanics with mixed parameters [26,27].