Information Geometric Duality of ϕ-Deformed Exponential Families

In the world of generalized entropies (which, for example, play a role in physical systems with sub- and super-exponential phase space growth per degree of freedom), there are two ways of implementing constraints in the maximum entropy principle: linear and escort constraints. Both appear naturally in different contexts. Linear constraints appear, e.g., in physical systems, when additional information about the system is available through higher moments. Escort distributions appear naturally in the context of multifractals and information geometry. It was shown recently that there exists a fundamental duality that relates both approaches on the basis of the corresponding deformed logarithms (deformed-log duality). Here, we show that there exists another duality that arises in the context of information geometry, relating the Fisher information of ϕ-deformed exponential families that correspond to linear constraints (as studied by J. Naudts) to those that are based on escort constraints (as studied by S.-I. Amari). We explicitly demonstrate this information geometric duality for the case of (c,d)-entropy, which covers all situations that are compatible with the first three Shannon–Khinchin axioms and includes Shannon, Tsallis, and Anteneodo–Plastino entropy, among many others, as special cases. Finally, we discuss the relation between the deformed-log duality and the information geometric duality and mention that the escort distributions arising in these two dualities are generally different and only coincide for the case of the Tsallis deformation.


Introduction
Entropy is one word for several distinct concepts [1]. It was originally introduced in thermodynamics, then in statistical physics, information theory, and last in the context of statistical inference. One important application of entropy in statistical physics, and in statistical inference in general, is the maximum entropy principle, which allows us to estimate probability distribution functions from limited information sources, i.e., from data [2,3]. The formal concept of entropy was generalized to also account for power laws that occur frequently in complex systems [4]. Literally dozens of generalized entropies have been proposed in various contexts, such as relativity [5], multifractals [6], or black holes [7]; see [8] for an overview. All these generalized entropies, whenever they fulfil the first three Shannon-Khinchin axioms (and violate the composition axiom) are special cases of the (c, d)-entropy asymptotically [9]. Generalized entropies play a role in non-multinomial, sub-additive systems (whose phase space volume grows sub-exponentially with the degrees of freedom) [10,11] and in systems, whose phase space grows super-exponentially [12]. All generalized entropies, for sub-, and super-exponential systems, can be treated within a single, unifying framework [13].
With the advent of generalized entropies, two types of constraint are used in the maximum entropy principle, depending on context: traditional linear constraints (typically moments), $\langle E \rangle = \sum_i p_i E_i$, motivated by physical measurements, and the so-called escort constraints, $\langle E \rangle_u = \sum_i u(p_i) E_i / \sum_i u(p_i)$, where $u$ is some nonlinear function. Originally, the latter were introduced with multifractals in mind [4]. Different types of constraint arise from different applications of relative entropy. While linear constraints are normally used in physics-related contexts (such as thermodynamics), in other applications, such as non-linear dynamical systems or information geometry, it might be more natural to consider escort constraints. The question of their correct use and of the appropriate form of constraints has caused a heated debate in the past decade [14–19]. To introduce escort distributions in the maximum entropy principle in a consistent way, two approaches have been discussed. The first [20] appears in the context of deformed entropies that are motivated by superstatistics [21]. It was later observed in [22] that this approach is linked to other deformed entropies with linear constraints through a fundamental duality (deformed-log duality), such that both entropies lead to the same functional form of MaxEnt distributions. The second way to obtain escort distributions was studied by Amari et al. and is motivated by information geometry and the theory of statistical estimation [23,24]. There, escort distributions represent natural coordinates on a statistical manifold [24,25].
In this paper, we show that there exists another duality relation between this information geometric approach with escort distributions and an approach that uses linear constraints. The relation can be given a precise information geometric meaning on the basis of the Fisher information. We show this within the framework of φ-deformations [26–28]. We establish the duality relation for both cases in the relevant information geometric quantities. As an example, we explicitly show the duality relation for the class of (c, d)-exponentials, introduced in [9,10]. Finally, we discuss the relation between the deformed-log duality and the information geometric duality and show that these have fundamental differences. Each type of duality is suitable for different applications. We hope that this paper helps to avoid confusion in the use of escort distributions in the various contexts.
Let us start by reviewing central concepts of (non-deformed) information geometry, in particular relative entropy and its relation to the exponential family of distributions through the maximum entropy principle. Let us consider a probability simplex, $S_n$, with $n$ independent probabilities, $p_i$, and one remaining probability, $p_0$, whose value is not independent but determined by the normalization condition, $p_0 = 1 - \sum_i p_i$. Further, consider a parametric family of distributions $p(\theta)$ with parameter vector $\theta \in M$, where $M$ is a parametric space. In this paper, we focus on probabilities over discrete sample spaces only. For the continuous case, see the mathematical formulation of Pistone and Sempi [29]. For the sake of simplicity, we sometimes do not explicitly write the parameter vector $\theta$ and consider the $p_i$ as the independent parameters of the distribution. It is easy to show that the choice of $p_i$ determines the choice of the parameters $\theta_i$.
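Relative entropy, or Kullback-Leibler divergence, is defined as:
$$D_{KL}(p \| q) = \sum_i p_i \log \frac{p_i}{q_i}. \tag{1}$$
For the uniform distribution $q = u_n$, i.e., $q_i = 1/n$, we have:
$$D_{KL}(p \| u_n) = \log n - S(p), \tag{2}$$
where $S(p)$ is the Shannon entropy, $S(p) = -\sum_i p_i \log p_i$. Let us consider a set of linear constraints, $\sum_j p_j E_{ij} = \langle E_i \rangle$, and denote the configuration vector as $E_i$. Shannon entropy is maximized under this set of linear constraints and the normalization condition, by functions belonging to the exponential family of probability distributions, which can be written as:
$$p_j(\theta) = \exp\Big(-\Psi(\theta) - \sum_i \theta_i E_{ij}\Big), \tag{3}$$
where $\Psi(\theta)$ guarantees normalization. Fisher information defines the metric on the parametric manifold $M$ by taking two infinitesimally separated points, $\theta_0$ and $\theta = \theta_0 + \delta\theta$, and by expanding $D_{KL}(p(\theta_0) \| p(\theta))$:
$$D_{KL}(p(\theta_0) \| p(\theta)) = \frac{1}{2} \sum_{i,j} g_{ij}(\theta_0)\, \delta\theta_i\, \delta\theta_j + O(\|\delta\theta\|^3). \tag{4}$$
For the exponential family of distributions, it is a well-known fact that Fisher information is equal to the inverse of the probability in Equation (3):
$$g_{ij}(p) = \frac{\delta_{ij}}{p_i} + \frac{1}{p_0}. \tag{5}$$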
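The two statements about the Kullback-Leibler divergence can be checked numerically. The following is a minimal sketch rather than part of the original derivation: the function names and the chosen test point are illustrative assumptions. It compares the divergence from the uniform distribution with log n − S(p), and it evaluates the Hessian of q ↦ D_KL(p‖q) at q = p in the independent coordinates (with p_0 fixed by normalization), which should have the elements δ_ij/p_i + 1/p_0.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) = sum_i p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def shannon_entropy(p):
    """Shannon entropy S(p) = -sum_i p_i log p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_hessian(p_indep, eps=1e-4):
    """Finite-difference Hessian of q -> D_KL(p || q) at q = p.

    p_indep holds the independent probabilities (p_1, ..., p_n);
    p_0 = 1 - sum(p_indep) is fixed by normalization.
    """
    n = len(p_indep)

    def full(v):
        # Prepend the dependent probability p_0.
        return [1.0 - sum(v)] + list(v)

    p_full = full(p_indep)

    def f(v):
        return kl_divergence(p_full, full(v))

    hessian = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(si, sj):
                v = list(p_indep)
                v[i] += si * eps
                v[j] += sj * eps
                return f(v)
            # Central second difference (also valid for i == j).
            hessian[i][j] = (shifted(1, 1) - shifted(1, -1)
                             - shifted(-1, 1) + shifted(-1, -1)) / (4 * eps * eps)
    return hessian
```

For example, `kl_hessian([0.2, 0.3])` corresponds to p_0 = 0.5 and can be compared entry by entry with δ_ij/p_i + 1/p_0.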

Deformed Exponential Family
We briefly recall the definition of φ-deformed logarithms and exponentials as introduced by Naudts [26]. The deformed logarithm is defined as:
$$\log_\phi(x) = \int_1^x \frac{\mathrm{d}y}{\phi(y)}, \tag{6}$$
for some positive, strictly increasing function, $\phi(x)$, defined on $(0, +\infty)$. Then, $\log_\phi$ is an increasing, concave function with $\log_\phi(1) = 0$. $\log_\phi(x)$ is negative on $(0, 1)$ and positive on $(1, +\infty)$. Naturally, the derivative of $\log_\phi(x)$ is $1/\phi(x)$. The inverse function of $\log_\phi(x)$ exists; we denote it by $\exp_\phi(x)$. Finally, the φ-exponential family of probability distributions is defined as a generalization of Equation (3):
$$p_i(\theta) = \exp_\phi\Big(-\Psi(\theta) - \sum_j \theta_j E_{ji}\Big). \tag{7}$$
We can express $\Psi(\theta)$ in the form:
$$\Psi(\theta) = -\log_\phi p_i(\theta) - \sum_j \theta_j E_{ji}, \tag{8}$$
which allows us to introduce dual coordinates to $\theta$. This is nothing but the Legendre transform of $\Psi(\theta)$, which is defined as:
$$\varphi(\eta) = \theta \cdot \eta - \Psi(\theta), \tag{9}$$
where:
$$\eta_j = \frac{\partial \Psi(\theta)}{\partial \theta_j}. \tag{10}$$
Because:
$$\frac{\partial}{\partial \theta_j} \log_\phi p_i(\theta) = \frac{1}{\phi(p_i(\theta))} \frac{\partial p_i(\theta)}{\partial \theta_j} \tag{11}$$
holds, and using $\sum_i \partial_{\theta_j} p_i(\theta) = 0$, we obtain that:
$$\eta_j = -\sum_i P_{\phi,i}\, E_{ji}, \tag{12}$$
where $P_\phi$ is the so-called escort distribution. With $\exp_\phi'(\log_\phi(x)) = \phi(x)$, the elements of $P_\phi$ are given by:
$$P_{\phi,i} = \frac{\phi(p_i)}{h_\phi(p)}, \tag{13}$$
where we define $h_\phi(p) \equiv \sum_i \phi(p_i)$. The Legendre transform provides a connection between the exponential family and the escort family of probability distributions, where the coordinates are obtained in the form of escort distributions. This generalizes the results for the ordinary exponential family of distributions, where the dual coordinates form a mixture family, which can be obtained as the superposition of the original distribution. The importance of dual coordinates in information geometry comes from the existence of a dually-flat geometry for the pair of coordinates. This means that there exist two affine connections with vanishing coefficients (Christoffel symbols). For the exponential family of distributions, the connection determined by the exponential distribution is called e-connection, and the dual connection leading to a mixture family is called m-connection [25]. For more details, see, e.g., [24]. We next look at generalizations of the Kullback-Leibler divergence and the Fisher information for the case of φ-deformations.
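A small numerical sketch may help to fix ideas. It assumes Naudts' integral definition, log_φ(x) = ∫_1^x dy/φ(y), and the escort distribution P_i = φ(p_i)/h_φ(p) with h_φ(p) = ∑_i φ(p_i). The helper names and the power-law choice φ(x) = x^q in the usage note are illustrative assumptions, and Simpson's rule stands in for the exact integral.

```python
import math

def log_phi(x, phi, steps=2000):
    """Deformed logarithm: integral of 1/phi(y) from 1 to x (Simpson's rule).

    phi is assumed positive and strictly increasing on (0, +inf).
    """
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    h = (x - 1.0) / steps  # negative for x < 1, which flips the sign as required
    total = 1.0 / phi(1.0) + 1.0 / phi(x)
    for k in range(1, steps):
        y = 1.0 + k * h
        total += (4.0 if k % 2 else 2.0) / phi(y)
    return total * h / 3.0

def escort(p, phi):
    """Escort distribution P_i = phi(p_i) / sum_k phi(p_k)."""
    weights = [phi(pi) for pi in p]
    h_phi = sum(weights)
    return [w / h_phi for w in weights]
```

With `phi = lambda x: x ** 0.5`, the integral has the closed form (x^{1−q} − 1)/(1 − q), which gives a direct check of `log_phi`; with `phi = lambda x: x`, both the ordinary logarithm and the original distribution are recovered.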

Deformed Divergences, Entropies, and Metrics
For the φ-deformed exponential family of distributions, we have to define the proper generalizations of the relevant quantities, such as the entropy, divergence, and metric. A natural approach is to start with the deformed Kullback-Leibler divergence, denoted by $D_\phi(p \| q)$. φ-entropy can then be defined as:
$$S_\phi(p) \sim -D_\phi(p \| u_n), \tag{14}$$
where $\sim$ means that the relation holds up to a multiplicative constant depending only on $n$. Similarly, the φ-deformed Fisher information is:
$$g_{\phi,ij}(\theta_0) = \left. \frac{\partial^2 D_\phi(p(\theta_0) \| p(\theta))}{\partial \theta_i\, \partial \theta_j} \right|_{\theta = \theta_0}. \tag{15}$$
There is more than one way to generalize the ordinary Kullback-Leibler divergence. The first is Csiszár's divergence [30]:
$$D_f(p \| q) = \sum_i q_i\, f\!\left(\frac{p_i}{q_i}\right), \tag{16}$$
where $f$ is a convex function. For $f(x) = x \ln x$, we obtain the Kullback-Leibler divergence. Note, however, that the related information geometry based on the generalized Fisher information is trivial, because we have:
$$g^f_{ij}(p) = f''(1)\, g_{ij}(p), \tag{17}$$
i.e., the rescaled Fisher information metric; see [27]. The second possibility is to use the divergence of Bregman type, usually defined as:
$$D_f(p \| q) = f(p) - f(q) - \nabla f(q) \cdot (p - q), \tag{18}$$
where the symbol "·" denotes the scalar product. This type of divergence can be understood as the difference between $f(p)$ and the first-order Taylor expansion of $f$ around $q$, evaluated at $p$. Let us next discuss the two possible types of the Bregman divergence, which naturally correspond to the φ-deformed family of distributions. For both, the φ-exponential family of distributions is obtained from the maximum entropy principle of the corresponding φ-entropy, however, under different constraints. Note that the maximum entropy principle is just a special version of the more general minimal relative entropy principle, which minimizes the divergence functional $D(p \| q)$ w.r.t. $p$, for some given prior distribution $q$.
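Both constructions can be illustrated with a short sketch. It assumes the standard textbook forms of the two divergences, namely ∑_i q_i f(p_i/q_i) for the Csiszár type and F(p) − F(q) − ∇F(q)·(p − q) for the Bregman type; the function and generator names are hypothetical. For normalized distributions, the generators f(x) = x ln x and F(p) = ∑_i p_i ln p_i should both reproduce the Kullback-Leibler divergence.

```python
import math

def csiszar_divergence(p, q, f):
    """Csiszar f-divergence: sum_i q_i * f(p_i / q_i), with f convex."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def bregman_divergence(p, q, F, grad_F):
    """Bregman divergence: F(p) - F(q) - grad F(q) . (p - q)."""
    gradient = grad_F(q)
    linear_term = sum(gi * (pi - qi) for gi, pi, qi in zip(gradient, p, q))
    return F(p) - F(q) - linear_term

def kl_divergence(p, q):
    """Reference value: sum_i p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Generators that reproduce the Kullback-Leibler divergence.
f_kl = lambda x: x * math.log(x)
F_kl = lambda p: sum(pi * math.log(pi) for pi in p)
grad_F_kl = lambda p: [math.log(pi) + 1.0 for pi in p]
```

The Bregman form agrees with the Kullback-Leibler divergence only when both arguments are normalized, because the linear term contributes ∑_i q_i − ∑_i p_i otherwise.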

Linear Constraints: Divergence à la Naudts
One generalization of the Kullback-Leibler divergence was introduced by Naudts [26] by The corresponding entropy can be expressed as: S N φ (p) is maximized by the φ-exponential family of distributions under linear constraints. The Lagrange functional is: which leads to: and we get: which is just Equation (8), averaged over the distribution p i . Note that Equation (23) provides the connection to thermodynamics, because Ψ(θ) is a so-called Massieu function. For a canonical ensemble, i.e., one constraint on the average energy E , parameter θ plays the role of an inverse temperature, and Ψ can be related to the free energy, F(θ) = θΨ(θ). Thus, the term log φ (p) can be interpreted as the thermodynamic entropy, which is determined from Equation (23). This is a consequence of the Legendre structure of thermodynamics. The corresponding MaxEnt distribution can be written as: Finally, the Fisher information metric can be obtained in the following form:

Escort Constraints: Divergence à la Amari
Amari et al. [23,24] used a different divergence introduced in [31], which is based on the choice of . This choice is motivated by the fact that the corresponding entropy is just the dual function of Ψ(θ), i.e., ϕ(η). This is easy to show, because: Thus, the divergence becomes: and the corresponding entropy can be expressed from Equation (26) as: so it is a dual function of Ψ(θ). For this reason, the entropy is called "canonical", because it is obtained by the Legendre transform from the Massieu function Ψ. Interestingly, the entropy is maximized by the φ-exponential family of distributions under escort constraints. The Lagrange function is: After a straightforward calculation, we get: and the corresponding MaxEnt distribution can be expressed as: where: Here, · φ denotes the average under the escort probability measure, P φ . Interestingly, in the escort constraints scenario, the "MaxEnt" entropy is the same as the "thermodynamic" entropy in the case of linear constraints. We call this entropy, S A φ (p), the dual entropy. Finally, one obtains the corresponding metric: Note that the metric can be obtained from Ψ(θ) as: , which is the consequence of the Legendre structure of escort coordinates [24]. For a summary of the φ-deformed divergence, entropy, and metric, see Table 1.

Cramér-Rao Bound of Naudts Type
One of the important applications of the Fisher metric is the so-called Cramér-Rao bound, which is the lower bound for the variance of an unbiased estimator. The generalization of the Cramér-Rao bound for two families of distributions was given in [26,27]. Assume these two families of distributions to be denoted by $p(\theta)$ and $P(\theta)$, with their corresponding expectation values, $\langle \cdot \rangle_{p(\theta)}$ and $\langle \cdot \rangle_{P(\theta)}$. Let $c_k$ denote the estimator of the family $p(\theta)$ that fulfills $\langle c_k \rangle_{p(\theta)} = \frac{\partial}{\partial \theta_k} f(\theta)$, for some function $f$, and let us consider a mild regularity condition $\left\langle \frac{1}{P(\theta)} \frac{\partial}{\partial \theta_k} p(\theta) \right\rangle_{P(\theta)} = 0$. Then: $\langle c_k c_l \rangle_{P(\theta)} - \langle c_k \rangle_{P(\theta)} \langle c_l \rangle_{P(\theta)}$ where: If $p(\theta) = p_\phi(\theta)$ is the φ-exponential family of distributions, in Equation (34), equality holds for the escort distribution $P(\theta) = P_\phi(\theta)$ [28]. It is easy to see that for this case, i.e., for the φ-exponential family and the corresponding escort distribution, the following is true: This provides a connection between the Cramér-Rao bound and the φ-deformed Fisher metric. In the next section, we show that the Cramér-Rao bound can also be estimated for the case of the Fisher metric of the "Amari type".

The Information Geometric "Amari-Naudts" Duality
In the previous sections, we have seen that there are at least two natural ways to generalize divergence, such that the φ-exponential family of distributions maximizes the associated entropy functional, however under different constraint types. These two ways result in two different geometries on the parameter manifold. The relation between the metrics $g^A_{\phi,ij}$ and $g^N_{\phi,ij}$ can be expressed by the operator $T$:
$$g^A_{\phi,ij} = T\big(g^N_{\phi,ij}\big),$$
where $T(g(x)) = -N_g\, (\log g(x))'$, with the normalization factor $N_g = \sum_i 1/g(p_i)$. Note that the operator acts locally on the elements of the metric. In order to establish the connection to the Cramér-Rao bound, let us focus on the transformation of $g^A$.
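The action of the operator T can be sketched numerically. The sketch below evaluates T exactly as written above, T(g)(x) = −N_g (log g(x))′ with N_g = ∑_i 1/g(p_i), and rests on two assumptions that are not spelled out here: the prime denotes the derivative with respect to x, and the element of the Naudts-type metric is taken to be g(x) = 1/φ(x), i.e., the derivative of log_φ. The names `T`, `phi`, and `g` are illustrative.

```python
import math

def T(g, p, x, eps=1e-6):
    """Evaluate T(g)(x) = -N_g * d/dx log g(x), with N_g = sum_i 1 / g(p_i).

    g   : positive function giving the metric element (assumed g = 1/phi)
    p   : probabilities entering the normalization factor N_g
    x   : point at which the transformed element is evaluated
    eps : step of the central finite difference
    """
    N_g = sum(1.0 / g(pi) for pi in p)
    dlog_g = (math.log(g(x + eps)) - math.log(g(x - eps))) / (2.0 * eps)
    return -N_g * dlog_g
```

Under these assumptions, the power-law choice φ(x) = x^q yields T(g)(x) = h_q(p) q / x with h_q(p) = ∑_i p_i^q, i.e., a multiple of the ordinary Fisher element 1/x.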

Cramér-Rao Bound of the Amari Type
The metric of the "Amari case" can be seen as a conformal transformation [32] of the metric that is obtained in the "Naudts case", for a different deformation of the logarithm. Two metric tensors are connected by a conformal transformation if they have the same form, except for the global conformal factor, $\Omega(p)$, which depends only on the point $p$. Our aim is to connect the Amari metric with the Cramér-Rao bound and obtain another type of bound for the estimates that are based on escort distributions. To this end, let us consider a general metric of the Naudts type, corresponding to χ-deformation, and a metric of the Amari type, corresponding to ξ-deformation. They are connected through the conformal transformation, which acts globally on the whole metric. The relation can be expressed as:
$$g^A_{\xi,ij}(p) = \Omega(p)\, g^N_{\chi,ij}(p).$$
By using previous results in this relation, we obtain: from which we see that $\Omega(p) = h_\xi(p)$ and $\log_\chi(x) = \log(\xi(x))$, i.e.,
$$\chi(x) = \frac{\xi(x)}{\xi'(x)}.$$
Note that $\log_\chi$ might not be concave because:
$$\log_\chi''(x) = \frac{\xi''(x)\, \xi(x) - \xi'(x)^2}{\xi(x)^2}$$
is not necessarily negative. To now make the connection with the Cramér-Rao bound, let us take $\chi(x) = \phi(x)$, so $\xi(x) = \exp(\log_\phi(x))$, and: As a consequence, there exist two types of Cramér-Rao bound for a given escort distribution, which might be used to estimate the lower bound of the variance of an unbiased estimator, obtained from two types of Fisher information.
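Two worked examples illustrate the relation between the two deformations and the remark on concavity. They use only the identification $\log_\chi(x) = \log \xi(x)$ together with the fact that the derivative of $\log_\chi$ is $1/\chi$, so that $\chi = \xi/\xi'$; the specific choices of $\xi$ are illustrative assumptions.

```latex
\begin{align*}
\text{(i)}\quad \xi(x) = x^{q}
  \;&\Longrightarrow\; \chi(x) = \frac{\xi(x)}{\xi'(x)} = \frac{x}{q},
  \qquad \log_\chi(x) = q \log x,
  \qquad \log_\chi''(x) = -\frac{q}{x^{2}}, \\[4pt]
\text{(ii)}\quad \xi(x) = e^{x^{2}-1}
  \;&\Longrightarrow\; \chi(x) = \frac{1}{2x},
  \qquad \log_\chi(x) = x^{2} - 1,
  \qquad \log_\chi''(x) = 2 .
\end{align*}
```

In case (i), $\log_\chi$ is concave for $q > 0$. In case (ii), the sign of $\xi''\xi - \xi'^2$ is positive, $\log_\chi$ is convex, and $\chi$ is decreasing, so it does not define an admissible deformation in the sense of a positive, strictly increasing function.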

Example: Duality of (c, d)-Entropy
We demonstrate the "Amari-Naudts" duality on the general class of (c, d)-entropies [9,10], which include all deformations associated with statistical systems that fulfil the first three Shannon-Khinchin axioms. These include most of the popular deformations, including Tsallis q-exponentials [4] and stretched exponentials studied in connection with the entropies by Anteneodo and Plastino [33]. The generalized (c, d)-logarithm is defined as: where c and d are the scaling exponents [8,9] and r is a free scale parameter (that does not influence the asymptotic behavior). The associated φ-deformation is: The inverse function of $\log_{(c,d)}$, the deformed (c, d)-exponential, can be expressed in terms of the Lambert W-function, which is the solution of the equation $W(z)\, e^{W(z)} = z$. The deformed (c, d)-exponential is: The corresponding entropy that is maximized by (c, d)-exponentials (see [8] for their properties) is the (c, d)-entropy: where $A = \frac{cdr}{1 - (1 - c) r}$. This is an entropy of "Naudts type", since it is maximized with (c, d)-exponentials under linear constraints. We can immediately write the metric as: The corresponding entropy of "Amari type", i.e., maximized with (c, d)-exponentials under escort constraints: is: and its metric is: The metric of the Amari type of the (c, d)-entropy was already discussed in [34] based on (c, d)-logarithms. However, as demonstrated above, the metric can be found without using the inverse φ-deformed logarithms, which in the case of (c, d)-logarithms lead to Lambert W-functions. The Fisher metric of Naudts and Amari type and the corresponding Cramér-Rao bound are shown in Figure 1. The scaling parameter is set (following [9]) to $r = 1/(1 - c + cd)$ for $d > 0$, and $r = \exp(-d)/(1 - c)$ for $d < 0$. The Fisher metric of both types is displayed in Figure 2 as a function of the parameters c and d for a given point, P = (1/3, 2/3). We see that both types of metric have a singularity for (c, d) = (1, 0). This point corresponds to distributions with compact support. For one-dimensional distributions, the singularity corresponds to the transition between distributions with support on the real line and distributions with support on a finite interval.
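Since the deformed (c, d)-exponential involves the Lambert W-function, a minimal sketch of its numerical evaluation may be useful. The routine below solves W(z) e^{W(z)} = z by Newton's method and covers only the principal branch for z ≥ 0; the function name, tolerance, and starting guess are illustrative choices, and the branch and argument actually needed for a given (c, d) are not addressed here.

```python
import math

def lambert_w(z, tol=1e-12, max_iter=100):
    """Principal branch of the Lambert W-function for z >= 0.

    Solves w * exp(w) = z with Newton's method. The starting guess
    log(1 + z) lies at or above the root, where w * exp(w) - z is
    convex and increasing, so the iteration decreases monotonically.
    """
    if z < 0:
        raise ValueError("this sketch only handles z >= 0")
    w = math.log1p(z)
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol * (1.0 + abs(w)):
            break
    return w
```

A quick consistency check is `lambert_w(math.e)`, which should be close to 1 because 1 · e^1 = e.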
Interestingly, for (c, d) = (q, 0), the metric simplifies to: which corresponds to the Tsallis q-exponential family of distributions. Therefore, $g^A_{(q,0),ij}(p)$ is just a conformal transformation of the Fisher information metric for the exponential family of distributions, as shown in [24]. Note that only for Tsallis q-exponentials, the relation between $S^N_q(p)$ and $S^A_q(p)$ can be expressed as (see also Table 2): where $f(x) = (2 - q)/x$ and $\tilde{q} = 2 - q$. This is nothing but the well-known additive duality $q \leftrightarrow 2 - q$ of Tsallis entropy [11]. Interestingly, q-escort distributions form a group with $\phi_q(\phi_{q'}(x)) = \phi_{q q'}(x)$ and $\phi_q^{-1}(x) = \phi_{1/q}(x)$, where $q \leftrightarrow 1/q$ is the multiplicative duality [35]. This is not the case for more general deformations, because typically, the inverse does not belong to the class of escort distributions. Popular deformations belonging to the (c, d)-family, such as the Tsallis q-exponential family or the stretched exponential family, are summarized in Table 2.
Table 2. Two important special cases of (c, d)-deformations and related quantities: power laws (Tsallis) and stretched exponentials [33].
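The group structure and the two Tsallis dualities can be made concrete in a few lines. The sketch assumes the power-law deformation φ_q(x) = x^q; the function names are hypothetical.

```python
def phi_q(q):
    """Power-law deformation phi_q(x) = x**q (assumed Tsallis form)."""
    return lambda x: x ** q

def additive_dual(q):
    """Additive duality q <-> 2 - q."""
    return 2.0 - q

def multiplicative_dual(q):
    """Multiplicative duality q <-> 1/q."""
    return 1.0 / q
```

Composition multiplies the indices, since (x^{q'})^q = x^{q q'}, so `phi_q(1.0 / q)` undoes `phi_q(q)`; both dualities are involutions, i.e., applying either one twice returns the original q.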

Connection to the Deformed-Log Duality
A different duality of entropies and their associated logarithms under linear and escort averages has been discussed in [22]. There, two approaches were discussed. The first uses the generalized entropy of trace form under linear constraints. It was denoted by: It corresponds to the Naudts case, log HT (x) = log N φ HT (x). The second approach, originally introduced by Tsallis and Souza [20], uses the trace form entropy: Note that in [20], the notion of the deformed logarithm is not used (as in Equation (55)). However, it is again an entropy of the Naudts type with the deformed logarithm log TS (x) = log N φ TS (x). Equation (55) is maximized under the escort constraints: where u(p j ) = p j + νs TS (p j ). The linear constraints are recovered for ν = 0. This form is dictated by the Shannon-Khinchin axioms, as discussed in the next section. Let us assume that the maximization of both approaches-Equation (54) under linear, and Equation (55) under escort constraints-leads to the same MaxEnt distribution. One can then show that there exists the following duality (deformed-log duality) between log HT (x) and log TS (x): Let us focus on specific φ-deformations, so that log It is straightforward to calculate the metric corresponding to the entropy S TS (p): where: Thus, the Tsallis-Souza approach results in yet another information matrix. We may also start from the other direction and look at the situation when the escort distribution for the information geometric approach is the same as the escort distribution for the Tsallis-Souza approach. In this case, we get: We find that the entropy must be expressed as: Note that for φ(x) = x q and ν = 1 − q, we obtain the Tsallis entropy: which corresponds to S TS (p) for q = 2 − q, which is nothing but the mentioned Tsallis additive duality. It turns out that Tsallis entropy is the only case where the deformed-log duality and the information geometric duality result in the same class of functionals. 
In general, the two dualities have different escort distributions.

Discussion
We discussed the information geometric duality of entropies that are maximized by φ-exponential distributions under two types of constraint: linear constraints, which are known from contexts such as thermodynamics, and escort constraints, which appear naturally in the theory of statistical estimation and information geometry. This duality implies two different entropy functionals: For φ(x) = x, they both boil down to Shannon entropy. The connection between the entropy of Naudts type and the one of Amari type can be established through the corresponding Fisher information through the Cramér-Rao bound. Contrary to the deformed-log duality introduced in [22], the information theoretic duality introduced here cannot be established within the framework of φ-deformations, since S A (p) is not a trace form entropy. We demonstrated the duality between the Naudts approach with linear constraints and the Amari approach with escort constraints, within the example of (c, d)-entropies, which include a wide class of popular deformations, including Tsallis and Anteneodo-Plastino entropy as special cases. Finally, we compared the information geometric duality to the deformed-log duality and showed that they are fundamentally different and result in other types of Fisher information.
Let us now discuss the role of information geometric duality and possible applications in information theory and thermodynamics. Recall that the Shannon entropic functional is determined by the four Shannon-Khinchin (SK) axioms. In many contexts, at least three of the axioms should hold: • (SK1) Entropy is a continuous function of the probabilities p i only and should not explicitly depend on any other parameters. • (SK2) Entropy is maximal for the equi-distribution p i = 1/W. • (SK3) Adding a state W + 1 to a system with p W+1 = 0 does not change the entropy of the system.
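Axioms SK2 and SK3 are easy to probe numerically. The sketch below does this for Shannon entropy and, as one example of a generalized entropy, for Tsallis entropy in the common form S_q(p) = (1 − ∑_i p_i^q)/(q − 1); this parametrization and the function names are assumptions made for illustration, and a numerical spot check is of course no substitute for a proof.

```python
import math

def shannon_entropy(p):
    """Shannon entropy S(p) = -sum_i p_i log p_i (states with p_i = 0 drop out)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def tsallis_entropy(p, q):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i**q) / (q - 1), for q > 0, q != 1."""
    return (1.0 - sum(pi ** q for pi in p if pi > 0)) / (q - 1.0)
```

SK2 can be checked by comparing the equi-distribution over W states with any other distribution on the same states, and SK3 by appending a state of probability zero, which leaves both functionals unchanged.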
The original set of Shannon-Khinchin axioms contains a fourth axiom, which describes the "composition rule" for the entropy of a joint system, $S(A + B) = S(A) + S(B|A)$. The only entropy satisfying all four SK axioms is Shannon entropy. However, Shannon entropy is not sufficient to describe the statistics of complex systems [10] and can lead to paradoxes in applications in thermodynamics [12]. Therefore, instead of imposing the fourth axiom in situations where it does not apply, it is convenient to consider a weaker requirement, such as generic scaling relations of entropy in the thermodynamic limit [9,13]. It is possible to show that the only type of duality satisfying the first three Shannon-Khinchin axioms is the deformed-log duality of [22]. Moreover, entropies that are neither trace-form nor sum-form (sum-form entropies are of the form $S(p) = f(\sum_i g(p_i))$) might be problematic from the view of information theory and coding. For example, it is then not possible to introduce a conditional entropy consistently [36]. This is related to the fact that the Kolmogorov definition of conditional probability is not generally valid for escort distributions [37]. Additional issues arise from the theory of statistical estimation, since only sum-form entropies can fulfil the consistency axioms [38]. From this point of view, the deformed-log duality using the class of Tsallis-Souza escort distributions can play a role in thermodynamical applications [39], because the corresponding entropy fulfils the SK axioms. On the other hand, the importance of escort distributions considered by Amari and others is in the realm of information geometry (e.g., dually-flat geometry or generalized Cramér-Rao bound), and their applications in thermodynamics might be limited.
Finally, for the case of the Tsallis q-deformation, both dualities, the information geometric and the deformed-log duality, reduce to the well-known additive duality q ↔ 2 − q.