ϕ-Informational Measures: Some Results and Interrelations

In this paper, we focus on extended informational measures based on a convex function ϕ: entropies, extended Fisher information, and generalized moments. Both the generalization of the Fisher information and of the moments rely on the definition of an escort distribution linked to the (entropic) functional ϕ. We revisit the usual maximum entropy principle—more precisely, its inverse problem, starting from the distribution and the constraints, which leads to the introduction of state-dependent ϕ-entropies. Then, we examine interrelations between the extended informational measures and generalize relationships such as the Cramér–Rao inequality and the de Bruijn identity in this broader context. In this framework, the maximum entropy distributions play a central role. Of course, all the results derived in the paper include the usual ones as special cases.


Introduction
Since the pioneering works of von Neumann [1], Shannon [2], Boltzmann, Maxwell, Planck, and Gibbs [3][4][5][6][7][8][9], many investigations have been devoted to the generalization of the so-called Shannon entropy and its associated measures [10][11][12][13][14][15][16][17][18][19][20][21][22]. While the Shannon measures are compelling, especially in the communication domain for compression purposes, many generalizations proposed later on have also shown promising interpretations and applications (e.g., the Panter–Dite formula in quantization, where the Rényi or Havrda–Charvát entropy emerges [23][24][25], or encoding penalizing long codewords, where the Rényi entropy appears [26,27]). The great majority of the extended entropies found in the literature belong to a very general class of entropic measures called (h, φ)-entropies [13,19,20,[28][29][30]. This general class (or, more precisely, the subclass of φ-entropies) can be traced back to the work of Burbea and Rao [28]. These entropies not only offer a general framework to study properties shared by specific entropies, but they also have many potential applications, as described, for instance, in [30]. Note that although a large amount of work deals with divergences, entropies occur as special cases when one takes a uniform reference measure.
In the framework of these generalized entropies, the so-called maximum entropy principle takes a special place. This principle, advocated by Jaynes, states that the statistical distribution that describes a system in equilibrium maximizes the entropy while satisfying the system's physical constraints (e.g., the center of mass and energy) [31][32][33][34][35]. In other words, it is the least informative law given the constraints of the system. In the Bayesian approach, dealing with the stochastic modeling of a parameter, such a principle (or a minimum divergence principle) is often used to choose a prior distribution for the parameter [22,[36][37][38][39]. It also finds its counterpart in communication, clustering, and pattern recognition problems, among many others [32,33,[40][41][42][43]. In statistics, some goodness-of-fit tests are based on entropic criteria derived from the same idea of a constrained maximal entropic law [44][45][46][47][48][49]. The principle behind such entropic tests lies in the Bregman divergence, which measures a kind of distance between probability distributions, i.e., between the empirical distribution given by the data and the distribution we assume for the data (the reference). It appears that if the empirical distribution and the reference share the same moments, and if the latter is of maximum entropy with these moments as constraints, the divergence reduces to a difference of entropies. In a large number of works using the maximum entropy principle, the entropy used is the Shannon entropy. However, if for some reason a generalized entropy is considered, the approach used in the Shannon case does not fundamentally change [50][51][52][53].
One can consider the inverse problem, which consists in finding the moment constraints leading to the observed distribution as a maximal entropy distribution [50]. Kesavan and Kapur also envisaged a second inverse problem, where both the distribution and the moments are given. The question is then to determine the entropy so that the distribution is its maximizer. As a matter of fact, dealing with the Shannon entropy, whatever the constraints considered, the maximum entropy distribution falls in the exponential family [33,34,52,54]. Recall that the exponential family is the set of parametric densities (with respect to a measure µ independent of the parameter) of the form p(x) = C(θ)h(x) exp(R(θ)^t S(x)), where S(x) is the sufficient statistic [39,[55][56][57][58][59][60]. When R(θ) = θ, the family is said to be natural, and Z(θ) = 1/C(θ) is the partition function, the log-partition function log Z(θ) being the cumulant generating function. Now, resolving the maximum entropy problem given later on by Equation (6) in the context of the Shannon entropy, it appears indeed that the maximum entropy distribution falls in the natural exponential family, where the sufficient statistic is given by the moment constraints. Considering more general entropies allows one to escape from this limitation. Moreover, if the Shannon entropy (or the Gibbs entropy in physics) is well adapted to the study of systems in equilibrium (or in the thermodynamic limit), extended entropies allow a finer description of systems out of equilibrium [17,[61][62][63][64][65], exhibiting their importance. While the problem was considered mainly in the discrete setting by Kesavan and Kapur in [50], we will recall it in the general framework of the φ-entropies of probability densities with respect to any reference measure, and take a further step by considering an extended class of these entropies.
Resolving the inverse problem can find applications in goodness-of-fit tests, for instance, allowing one to design entropies adapted to such tests, along the same lines as the approaches mentioned above [44][45][46][47][48][49].
While the entropy is a widely used tool for quantifying the information (or uncertainty) attached to a random variable or to a probability distribution, other quantities are used as well, such as the moments of the variable (giving information, for instance, on center of mass, dispersion, skewness, or impulsive character), or the Fisher information. In particular, the Fisher information appears in the context of estimation [66,67], in Bayesian inference through the Jeffreys prior [39,68], but also in the description of complex physical systems [67,[69][70][71][72][73].
Although coming from different worlds (information theory and communication, estimation, statistics, and physics), these informational quantities are linked by well-known relations such as the Cramér–Rao inequality, the de Bruijn identity, and the Stam inequality [34,[74][75][76]. These relationships have proved very useful in various areas, for instance, in communications [34,74,75], in estimation [66], or in physics [77,78], among others. When generalized entropies are considered, it is natural to question the generalization of the other informational measures and of the associated identities or inequalities. This question gave birth to a large amount of work and is still an active field of research [28,[79][80][81][82][83][84][85][86][87][88][89][90]. For instance, the Cramér–Rao inequality is very important as it gives the ultimate precision in terms of the mean square error of an estimator of a parameter (i.e., the minimal error we can achieve). However, there is no reason for choosing a quadratic error in general. This choice is often made because it simplifies the algebra or allows estimators to be derived quite easily (e.g., of minimum mean square error). One may wish to choose other error criteria (the mean of another norm of the error) and/or to stress parts of the distribution of the data in the mathematical average. It is thus of great interest to be able to derive Cramér–Rao inequalities in as broad a context as possible.
In this paper, we show that it is possible to build a whole framework, which associates a target maximum entropy distribution to generalized entropies, generalized moments, and generalized Fisher information. In this setting, we derive generalized inequalities and identities relating these quantities, which are all linked in some sense to the maximum entropy distribution.
The paper is organized as follows. In Section 2, we recall the definition of the generalized φ-entropy. Then, we come back to the maximum entropy problem in this general setting. Following the sketch of [50], we present a sufficient condition linking the entropic functional and the maximizing distribution, allowing both the direct and the inverse problems to be solved. When the sufficient conditions linking the entropic function and the distribution cannot be satisfied, the problem can be solved by introducing state-dependent generalized entropies, which is the purpose of Section 3. In Section 4, we introduce informational quantities associated to the generalized entropies of the previous sections, such as a generalized escort distribution, generalized moments, and generalized Fisher information. These generalized informational quantities allow us to extend the usual informational relations, such as the Cramér–Rao inequality, relations precisely saturated (or valid) for the generalized maximum entropy distribution. Finally, in Section 5, we show that the extended quantities allow us to obtain an extended de Bruijn identity, provided the distribution follows a nonlinear heat equation. Some examples of φ-entropies solving the inverse maximum entropy problem are provided in a short series of appendices, showing, in particular, that the usual quantities are recovered as particular cases (Gaussian distribution, Shannon entropy, Fisher information, and variance).
In the following, we will define a series of generalized information quantities relative to a probability density defined with respect to a given reference measure µ (e.g., the Lebesgue measure when dealing with continuous random variables, a discrete measure for discrete-state random variables, etc.). Therefore, rigorously, all these quantities depend on the particular choice of this reference measure. However, for simplicity, we will omit this dependence from the notation throughout the paper.

φ-Entropies: Direct and Inverse Maximum Entropy Problems
The direct problem, i.e., finding the probability distribution of maximum entropy given moment constraints, is a common problem and can find application, for instance, in the Bayesian framework, searching for a prior probability distribution that is as little informative as possible, given some moments [22,[36][37][38][39]. It also finds many other applications, as mentioned in the introduction.
Let us first recall the definition of the generalized φ-entropies, introduced by Csiszár in terms of divergence and by Burbea and Rao in terms of entropy:

Definition 1 (φ-entropy [28]). Let φ : Y ⊆ R+ → R be a convex function defined on a convex set Y. Then, if f is a probability distribution defined with respect to a general measure µ on a set X ⊆ R^d such that f(X) ⊆ Y, the quantity

H_φ[f] = −∫_X φ(f(x)) dµ(x),    (1)

when it exists, is the φ-entropy of f.

This definition extends to the (h, φ)-entropies H^(h,φ)[f] = h(H_φ[f]), where h is a nondecreasing function. The definition is further extended by allowing φ to be concave, together with h nonincreasing [13,19,20,29,30]. If, additionally, h is concave, then the entropy functional H^(h,φ)[f] is concave.
As we are interested in the maximum entropy problem, and because h is monotone, we can restrict our study to the φ-entropies. Additionally, we will assume that φ is strictly convex and differentiable.
A related quantity is the Bregman divergence associated with the convex function φ:

Definition 2 (Bregman divergence and functional Bregman divergence [22,91]). With the same assumptions as in Definition 1, the Bregman divergence associated with φ, defined on the convex set Y, is given by the function defined on Y × Y,

B_φ(y_1, y_2) = φ(y_1) − φ(y_2) − φ′(y_2) (y_1 − y_2).

Applied to two functions f_i : X → Y, i = 1, 2, the functional Bregman divergence writes

B_φ(f_1, f_2) = ∫_X B_φ(f_1(x), f_2(x)) dµ(x).

A direct consequence of the strict convexity of φ is the non-negativity of the (functional) Bregman divergence: B_φ(y_1, y_2) ≥ 0 and B_φ(f_1, f_2) ≥ 0, with equality if and only if y_1 = y_2 and f_1 = f_2 almost everywhere, respectively.
Being non-negative and vanishing only when the distributions are (almost everywhere) equal, this divergence defines a kind of distance (though not a true one, being non-symmetrical) in which f_2 serves as a reference.
More generally, the Bregman divergence is defined for multivariate convex functions, where the derivative is replaced by the gradient operator [91]. Extensions for convex functionals of functions also exist, where the derivative is taken in the sense of Gâteaux [92]. Such general extensions are not useful for our purposes; thus, we restrict ourselves to the above definition, where Y ⊆ R+.
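As a numerical sanity check of Definition 2, the following sketch evaluates the functional Bregman divergence for the Shannon functional φ(y) = y log y between two standard densities, and verifies its pointwise and global non-negativity. The function names, grid, and test densities are ours, chosen only for illustration.

```python
import numpy as np

# Hedged sketch: functional Bregman divergence for phi(y) = y*log(y)
# (the Shannon case), checked on a Gaussian vs. a Laplace density.
def bregman(phi, dphi, y1, y2):
    """Pointwise Bregman divergence B_phi(y1, y2)."""
    return phi(y1) - phi(y2) - dphi(y2) * (y1 - y2)

phi  = lambda y: y * np.log(y)
dphi = lambda y: np.log(y) + 1.0

x  = np.linspace(-8, 8, 8001)
dx = x[1] - x[0]
f1 = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # standard Gaussian density
f2 = 0.5 * np.exp(-np.abs(x))                 # Laplace density, unit scale

# Functional divergence: integrate the pointwise one against Lebesgue measure.
B = np.sum(bregman(phi, dphi, f1, f2)) * dx

assert np.all(bregman(phi, dphi, f1, f2) >= -1e-12)  # pointwise non-negativity
assert B > 0.0                                        # f1 != f2 almost everywhere
```

For φ(y) = y log y the functional divergence essentially reduces to the Kullback–Leibler divergence, which is why B is small but strictly positive here.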

Maximum Entropy Principle: The Direct Problem
Let us first recall the maximum entropy problem, which consists in searching for the distribution maximizing the φ-entropy (1) subject to constraints on some moments,

max_{f ∈ D_{T,t}} H_φ[f],  with  D_{T,t} = { f : ∫_X T_i(x) f(x) dµ(x) = t_i, i = 0, . . . , n },

where T_0(x) = 1 and t_0 = 1 (normalization constraint), T = (T_0, . . . , T_n), t = (t_0, . . . , t_n).
We are faced with a strictly concave optimization problem (the functional to maximize is concave w.r.t. f and the constraints are linear w.r.t. f, so that the functional restricted to a linear subspace is still concave). Therefore, the solution exists and is unique. A technique to solve the problem is to use the classical Lagrange multipliers and to solve the Euler–Lagrange equation of the variational problem, but this approach requires mild regularity conditions [50,51,53,[93][94][95]. In the following proposition, we recall a sufficient condition relating f and φ so that f is the problem's solution. This result is proven without the use of the Lagrange technique.

Proposition 1 (Maximum φ-entropy solution). Suppose that there exists a probability distribution f ∈ D_{T,t} satisfying

φ′(f(x)) = ∑_{i=0}^{n} λ_i T_i(x)    (6)

for some (λ_0, . . . , λ_n) ∈ R^{n+1}; then f is the unique solution of the maximum entropy problem.

Proof.
Suppose that distribution f satisfies Equation (6) and consider any distribution g ∈ D_{T,t}. The functional Bregman divergence between g and f writes

B_φ(g, f) = ∫_X [φ(g(x)) − φ(f(x)) − φ′(f(x)) (g(x) − f(x))] dµ(x) = H_φ[f] − H_φ[g] − ∑_{i=0}^{n} λ_i (t_i − t_i) = H_φ[f] − H_φ[g],

where we used Equation (6) and the fact that g and f are both probability distributions with the same moments E[T_i(X)] = t_i. By non-negativity of the functional Bregman divergence, we finally get that, for all distributions g with the same moments as f, H_φ[g] ≤ H_φ[f], with equality if and only if g = f almost everywhere. In other words, this shows that if f satisfies Equation (6), then it is the desired solution.
Therefore, given an entropic functional φ and moment constraints T_i, Equation (6) leads to the maximum entropy distribution f. This distribution is parameterized by the λ_i s or, equivalently, by the moments t_i s.
Note that the reciprocal is not necessarily true, i.e., the maximum entropy distribution does not necessarily satisfy Equation (6) (i.e., Equation (6) does not necessarily have a solution), as shown, for instance, in [53]. However, the reciprocal is true (i.e., Equation (6) has a solution) when X is compact [95], or for any X provided that φ is locally bounded on X [96].
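Proposition 1 can be illustrated numerically in the Shannon case. The sketch below (our own construction, not from the paper) compares the entropy of the standard Gaussian, which satisfies Equation (6) with φ(y) = y log y and T_1(x) = x², against a Laplace density sharing the same second moment.

```python
import numpy as np

# Hedged illustration of Proposition 1 (Shannon case): among densities
# with the same second moment, the Gaussian has the largest H_phi.
x  = np.linspace(-10, 10, 8001)
dx = x[1] - x[0]

def entropy(f):
    # H_phi[f] = -int f log f, for phi(y) = y log y
    return -np.sum(f * np.log(np.maximum(f, 1e-300))) * dx

f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # variance 1; satisfies Eq. (6)
b = 1 / np.sqrt(2)                            # Laplace scale: variance 2*b**2 = 1
g = np.exp(-np.abs(x) / b) / (2 * b)

m2_f = np.sum(x**2 * f) * dx
m2_g = np.sum(x**2 * g) * dx
assert abs(m2_f - 1) < 1e-3 and abs(m2_g - 1) < 1e-3   # same constraint t_1
assert entropy(f) > entropy(g)                          # maximum entropy property
```

The Gaussian entropy agrees with the closed form (1/2) log(2πe), while the same-variance Laplace density scores strictly lower, as the proposition predicts.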

Maximum Entropy Principle: The Inverse Problems
As stated in the introduction, two inverse problems can be considered starting from a given distribution f . These problems were considered by Kesavan and Kapur in [50] in the discrete framework.
The first inverse problem consists in searching for the adequate moments so that a desired distribution f is the maximum entropy distribution of a given φ-entropy. This amounts to finding functions T_i and coefficients λ_i satisfying Equation (6). This is not always an easy task, and not even always possible. For instance, it is well known that, given moment constraints, the maximum Shannon entropy distribution falls in the exponential family [33,34,52,54]. Therefore, if f does not belong to this family, the problem has no solution.
The second inverse problem consists in designing the entropy itself, given a target distribution f and given the T_i s. In other words, given a distribution f, Equation (6) may allow us to determine the entropic functional φ so that f is its maximizer. As mentioned in the introduction, solving this inverse problem can find applications, for instance, in goodness-of-fit tests. In such tests, we would like to determine whether data fit a given distribution, say f. A natural criterion of fit between an empirical distribution and the distribution f can be a Bregman divergence, where the distribution f serves as a reference. As shown in the proof of Proposition 1, when both distributions (empirical, reference) share the same moments and when the reference f is of maximum entropy subject to these moments, the divergence turns out to be a difference of entropies, and approaches in the line of [44][45][46][47][48][49] can be applied. The distribution f and some moments being given/fixed, the problem is thus to determine the adequate entropy so that f is of maximum entropy. This is precisely the inverse problem we deal with now.
As for the direct problem, in the second inverse problem the solution is parameterized by the λ_i s. Here, the required properties of φ will shape the domain in which the λ_i s live. In particular, φ must satisfy the following: • the domain of definition of φ must include f(X); this will be satisfied by construction; • from the strict convexity of φ, its derivative φ′ must be strictly increasing.
Therefore, because φ′ must be strictly increasing, it is clear that solving Equation (6) requires the following two conditions:

(C1) the λ_i s must be such that ∑_i λ_i T_i(x) increases with f(x), i.e., f(x_1) < f(x_2) implies ∑_i λ_i T_i(x_1) < ∑_i λ_i T_i(x_2);
(C2) f(x_1) = f(x_2) must imply ∑_i λ_i T_i(x_1) = ∑_i λ_i T_i(x_2), so that φ′ is well defined on f(X).

For instance, in the univariate case, for one moment constraint, • for X = R+, T_1(x) = x, λ_1 must be negative and f(x) must be decreasing; • for X = R, T_1(x) = x² or T_1(x) = |x|, λ_1 must be negative and f(x) must be even and unimodal.
Under conditions (C1) and (C2), the solutions of Equation (6) are given by

φ′(y) = ∑_{i=0}^{n} λ_i T_i(f^{−1}(y)),    (7)

where f^{−1} can be multivalued. However, even if f^{−1} is multivalued, because of condition (C2), φ′ is defined univocally. Equation (7) thus provides an effective way to solve the inverse problem. However, there exist situations where no set of λ_i s satisfies conditions (C1)–(C2) (e.g., T_1(x) = x² with f not even). In such a case, we look for a solution for φ in a larger class, i.e., by extending the definition of the φ-entropy. This will be the purpose of Section 3. Before focusing on this, let us illustrate the previous result on some examples.
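The inverse construction of Equation (7) can be sketched numerically. In the sketch below (our own illustration), we take f Gaussian, T_1(x) = x², and λ_1 < 0, and verify that φ′(y) = λ_0 + λ_1 (f^{−1}(y))² is affine in log y, i.e., that the Shannon functional is recovered up to scaling and shift.

```python
import numpy as np

# Hedged sketch of Equation (7), Gaussian / T_1(x) = x^2 case.
# lambda_0, lambda_1 are arbitrary choices with lambda_1 < 0 (condition (C1)).
sigma, lam0, lam1 = 1.0, 0.0, -0.5

# y ranges over f(X), strictly below the Gaussian's maximum 1/sqrt(2*pi*sigma^2)
y = np.linspace(1e-3, 1 / np.sqrt(2 * np.pi * sigma**2) - 1e-3, 500)
x_of_y = sigma * np.sqrt(-2 * np.log(np.sqrt(2 * np.pi) * sigma * y))  # |f^{-1}(y)|
dphi = lam0 + lam1 * x_of_y**2                                         # Eq. (7)

# dphi should be a*log(y) + b: fit and check exact linearity in log y.
a, b = np.polyfit(np.log(y), dphi, 1)
assert a > 0                                           # phi' increasing => phi convex
assert np.max(np.abs(dphi - (a * np.log(y) + b))) < 1e-8
```

Since (f^{−1}(y))² = −2σ² log(√(2π) σ y), the fitted slope is a = −2λ_1 σ², which is positive precisely when λ_1 < 0, in agreement with condition (C1).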

Second Inverse Maximum Entropy Problem: Some Examples
To illustrate the previous subsection, let us briefly analyze three examples: the famous Gaussian distribution (Example 1), the intensively studied q-Gaussian distribution (Example 2), and the arcsine distribution (Example 3). These three distributions will serve as a guideline throughout the paper. The details of the calculations, together with a deeper study related to the sequel of the paper, are presented in the appendix, where other examples are also given. In all three examples (except in the next section), we consider the second-order moment constraint T_1(x) = x².

Example 1. Let us consider the well-known Gaussian distribution f_X(x) = (1/√(2πσ²)) exp(−x²/(2σ²)), defined over X = R, and let us search for the φ-entropy so that the Gaussian is its maximizer subject to the constraint T_1(x) = x². To satisfy condition (C1), we must have λ_1 < 0, whereas condition (C2) is always satisfied. Rapid calculations, detailed in Appendix A.1, and a reparameterization of the λ_i s give the entropic functional

φ(y) = α y log y + β y + γ,  α > 0.

This is nothing but the Shannon entropy, up to the scaling factor α and a shift (to avoid the divergence of the entropy when X is unbounded, one will take γ = 0). One thus recovers the long-standing fact that the Gaussian is the maximum Shannon entropy distribution under the second-order moment constraint.

Example 2.
Let us consider the q-Gaussian distribution, also known as the Tsallis distribution or Student distribution [97,98], f_X(x) = A_q (1 − (q − 1) x²/σ²)_+^{1/(q−1)}, where q > 0, q ≠ 1, x_+ = max(x, 0) and A_q is the normalization coefficient, defined over X = R when q < 1 or over X = [−σ/√(q−1) ; σ/√(q−1)] when q > 1, and let us search for the φ-entropy so that the q-Gaussian is its maximizer with the constraint T_1(x) = x². Here, again, condition (C1) is satisfied if and only if λ_1 < 0, whereas condition (C2) is always satisfied. Rapid calculations, detailed in Appendix A.2, lead to the entropic functional, after a reparameterization of the λ_i s,

φ(y) = α (y^q − y)/(q − 1) + β y + γ,  α > 0,

where q is thus an additional parameter of the family. This entropy is nothing but the Havrda–Charvát, Daróczy, or Tsallis entropy [12,14,17,97], up to the scaling factor α and a shift (here also, to avoid the divergence of the entropy when X is unbounded, one will take γ = 0). This entropy is also closely related to the Rényi entropy [10] via a one-to-one logarithmic mapping. One recovers the also well-known fact that the q-Gaussian is the maximum Havrda–Charvát–Rényi–Tsallis entropy distribution under the second-order moment constraint [97]. In the limit case q → 1, the distribution f_X tends to the Gaussian, whereas the Havrda–Charvát–Rényi–Tsallis entropy tends to the Shannon entropy.
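The q-Gaussian case of Example 2 can be checked numerically: for the Havrda–Charvát/Tsallis functional, condition (6) requires φ′(f(x)) to be affine in x², which holds because f^{q−1} is itself affine in x². The parameterization below is ours (chosen consistently with the example), and the normalization A_q is omitted since it only amounts to an affine reparameterization of the λ_i s.

```python
import numpy as np

# Hedged check: for phi(y) = (y^q - y)/(q - 1), phi'(f(x)) is affine in x^2
# when f is a q-Gaussian, with negative slope lambda_1 (condition (C1)).
q, sigma = 1.5, 1.0
xmax = 0.99 * sigma / np.sqrt(q - 1)            # stay inside the bounded support (q > 1)
x = np.linspace(-xmax, xmax, 1001)
base = 1 - (q - 1) * x**2 / sigma**2
f = base ** (1 / (q - 1))                        # unnormalized q-Gaussian

dphi = (q * f**(q - 1) - 1) / (q - 1)            # phi'(f(x))
a, b = np.polyfit(x**2, dphi, 1)
assert a < 0                                     # lambda_1 < 0, as (C1) requires
assert np.max(np.abs(dphi - (a * x**2 + b))) < 1e-10
```

With this parameterization the fitted slope is a = −q/σ², strictly negative for all q > 0, in line with the claim that (C1) holds if and only if λ_1 < 0.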
Example 3. Let us finally consider the arcsine distribution f_X(x) = 1/√(s² − π² x²), defined over X = (−s/π ; s/π), and let us determine the entropic functionals φ so that f_X is the maximum φ-entropy distribution subject to the constraint T_1(x) = x². Condition (C2) is always satisfied and, now, to fulfill condition (C1), we must impose λ_1 > 0. Some algebra, detailed in Appendix A.4.1, leads to the entropic functional, after a reparameterization of the λ_i s,

φ(y) = α/y + β y + γ,  α > 0

(again, to avoid the divergence of the entropy, one can adjust the parameter γ). This entropy is unusual and, due to its form, is potentially finite only for densities defined over a bounded support that diverge at its boundary (integrable divergence).

State-Dependent Entropic Functionals and Minimization Revisited
In order to follow asymmetries of the distribution f and address the limitation raised by conditions (C1) and (C2), we propose to allow the entropic functional to also depend on the state variable x. Indeed, imagine, for instance, that, for two values x_1 ≠ x_2, the probability distribution is such that f(x_1) = f(x_2) but, at the same time, ∑_i λ_i T_i(x_1) ≠ ∑_i λ_i T_i(x_2). In such a situation, one cannot find a function φ so as to satisfy condition (C2). Choosing a functional φ depending both on f(x) and x can allow ∂φ/∂y (x_1, f(x_1)) ≠ ∂φ/∂y (x_2, f(x_2)), so that we expect it could compensate for the fact that, with a usual entropic functional, condition (C2) cannot be satisfied. In the same vein, imposing a particular form for φ(x, f(x)), we also expect to be able to treat the case where condition (C1) cannot be satisfied with a usual entropic functional. Let us first define this extended, state-dependent φ-entropy, before demonstrating that such an extension indeed allows us to reach our goal.
Definition 3 (State-dependent φ-entropy). Let φ : X × Y → R be such that, for any x ∈ X, φ(x, ·) is a convex function defined on a convex set Y ⊆ R+. Then, if f is a probability distribution defined with respect to a general measure µ on the set X and such that f(X) ⊆ Y, the quantity

H_φ[f] = −∫_X φ(x, f(x)) dµ(x),

when it exists, will be called the state-dependent φ-entropy of f. As φ(x, ·) is convex, the entropy functional H_φ[f] is concave. A particular case arises when, for a given partition (X_1, . . . , X_k) of X, the functional φ writes

φ(x, y) = ∑_{l=1}^{k} φ_l(y) 1_{X_l}(x),

where 1_A denotes the indicator function of the set A. This functional can be viewed as a "(X_1, . . . , X_k)-extension" over X × Y of a multiform function defined on Y, with k branches φ_l, and the associated φ-entropy will be called the (X_1, . . . , X_k)-multiform φ-entropy.
As in the previous section, we restrict our study to functionals φ(x, y) that are strictly convex and differentiable with respect to y.
Following the lines of Section 2, a generalized Bregman divergence can be associated with the state-dependent functional φ,

B_φ(f_1, f_2) = ∫_X [φ(x, f_1(x)) − φ(x, f_2(x)) − ∂φ/∂y (x, f_2(x)) (f_1(x) − f_2(x))] dµ(x).

With these extended quantities, the direct problem becomes

max_{f ∈ D_{T,t}} H_φ[f].    (10)

Although the entropic functional is now state-dependent, the approach adopted before can be applied here, leading to

Proposition 2 (Maximum state-dependent φ-entropy solution). Suppose that there exists a probability distribution f satisfying

∂φ/∂y (x, f(x)) = ∑_{i=0}^{n} λ_i T_i(x)    (11)

for some (λ_0, . . . , λ_n) ∈ R^{n+1}; then f is the unique solution of the extended maximum entropy problem (10).
If φ is chosen in the (X_1, . . . , X_k)-multiform φ-entropy class, this sufficient condition writes

φ′_l(f(x)) = ∑_{i=0}^{n} λ_i T_i(x),  x ∈ X_l,  l = 1, . . . , k.    (12)

Proof. The proof follows the steps of Proposition 1, using the generalized functional Bregman divergence instead of the usual one.
Resolving Equation (11) is not possible in all generality. However, the sufficient condition (12) can be rewritten as

φ′_l(y) = ∑_{i=0}^{n} λ_i T_i(x)  for  y = f(x), x ∈ X_l.    (13)

Therefore, if there exists (at least) a set of λ_i s such that condition (C1) is satisfied (but not necessarily (C2)), one can always
• design a partition (X_1, . . . , X_k) so that (C2) is satisfied in each X_l (at least, such that f is either strictly monotonic or constant on X_l), and
• determine φ_l as in Equation (7) in each X_l, that is,

φ′_l(y) = ∑_{i=0}^{n} λ_i T_i(f_l^{−1}(y)),  y ∈ f(X_l),    (14)

where f_l^{−1} is the (possibly multivalued) inverse of f on X_l. Note that, when f is monotonic on X_l, the inverse f_l^{−1} is univalued. In short, in the case where only condition (C1) is satisfied, one can obtain an extended entropic functional of the (X_1, . . . , X_k)-multiform class, so that Equation (13) provides an effective way to solve the inverse problem in the state-dependent entropic functional context. The solution is given by Equation (14).
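The branch-wise construction can be illustrated numerically. The sketch below (our own example, not from the paper) takes the standard Gaussian with the constraint T_1(x) = x, inverts f on the branch X_+ = (0, ∞), where it is strictly monotonic, and checks that the resulting φ′_+ is increasing, i.e., that the branch functional is convex.

```python
import numpy as np

# Hedged sketch of the multiform construction (Eq. (14)): standard Gaussian,
# constraint T_1(x) = x, branch X+ = (0, inf) where f is strictly decreasing.
x = np.linspace(1e-3, 5, 2000)                     # x > 0 branch only
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

lam0, lam_plus = 0.0, -1.0                          # lambda_{+,1} < 0 (f decreasing on X+)
# On X+, f^{-1}(y) = sqrt(-2 log(sqrt(2 pi) y)); Eq. (14) then gives phi_+'.
y = f                                               # values taken by f on X+
dphi_plus = lam0 + lam_plus * np.sqrt(-2.0 * np.log(np.sqrt(2 * np.pi) * y))

# phi_+' must be increasing in y for phi_+ to be convex on this branch.
assert np.all(np.diff(y) < 0)                       # f strictly decreasing along x
assert np.all(np.diff(dphi_plus[np.argsort(y)]) > 0)
```

Here φ′_+(f(x)) = λ_{+,1} x recovers −x, which indeed increases as f increases; the symmetric branch X_− would require the opposite sign λ_{−,1} > 0, which is why partial constraints (one λ per branch) become necessary.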
Note, however, that it may still happen that there is no set of λ_i s allowing condition (C1) to be satisfied. In this harder context, the problem remains solvable when the moments are defined as partial moments, E[T_{l,i}(X) 1_{X_l}(X)] = t_{l,i}, l = 1, . . . , k and i = 1, . . . , n_l, and when there exists on each X_l a set of λ_{l,i} s such that (C1) and (C2) hold. The solution still writes as in Equation (14), but where n, the λ_i s, and the T_i s are now replaced by n_l, the λ_{l,i} s, and the T_{l,i} s, respectively. Let us now come back to the arcsine example f_X(x) = 1/√(s² − π² x²), defined over X = (−s/π ; s/π) (Example 3 of the previous section), now constraining the first-order moment or partial first-order moments.

Example 4.
Let us now consider this arcsine distribution, constrained uniformly by T_1(x) = x. Clearly, neither condition (C1) nor condition (C2) can be satisfied. Note that the arcsine distribution is a one-to-one function on each of the sets X_− = (−s/π ; 0) and X_+ = (0 ; s/π) that partition X. Therefore, considering multiform entropic functionals with this partition allows one to overcome the issue with condition (C2), but that with condition (C1) remains. If we ignore this issue and apply Equation (14), after a reparameterization of the λ_i s, we obtain φ_±(y) = φ_{±,u}(sy), where s is thus an additional parameter of the family. It appears that, whereas these functionals are defined for u > 1, one can extend them continuously, and with a continuous derivative, to any u > 0 by imposing β = 0, which finally leads to the family detailed in Appendix A.4.2. However, these functionals are no longer convex (see Appendix A.4.2 for more details).

Example 5.
If we now impose the partial constraints T_{±,1}(x) = x 1_{X_±}(x) and search for the φ-entropy so that f_X is the maximizer subject to these constraints, condition (C1) can now be satisfied on each X_± by imposing that the ±λ_{±,1} given by Equation (15) be positive. We then obtain the associated multiform entropic functional, after a reparameterization of the λ_i s, as φ_±(y) = φ_{±,u}(sy), where u is thus an additional parameter of the family. In this case, the entropic functionals can be considered for any u > 0 by imposing β = 0, and one can check that the obtained functions are of class C¹. In addition, remarkably, the entropic functional can be made univalued by choosing α_+ = α_− and γ_+ = γ_−. In fact, such a choice is equivalent to considering the constraint T_1(x) = |x|, which respects the symmetries of the distribution and allows one to recover a classical φ-entropy (see Appendix A.4.2 for more details).
At first glance, the solutions of Examples 4 and 5 seem identical. In fact, they drastically differ. Indeed, let us emphasize that the problem has one constraint in the first case, but two in the second case. The consequence is that four parameters (β, γ_± and α) parameterize the first solution, while five parameters (β, γ_± and α_±) parameterize the second one. This difference is not insignificant: the first case cannot be viewed as a special case of the second one, because the α_± must be positive, which is not possible with the single parameter α, since ±α rule the φ_±. For the first example, the solution does not lead to a convex function, because this would contradict the required condition (C1) on the parts X_±. Coming back to the direct problem, the "φ-like entropy" defined with this φ is no longer concave (indeed, it is no longer an entropy in the sense of Definition 1). As such, the maximum φ-entropy problem is no longer concave: one cannot guarantee the uniqueness, or even the existence, of a maximum, so that there is no guarantee that the arcsine distribution would be a maximizer. Indeed, since Equation (6) comes from the Euler–Lagrange equation (see the paragraph preceding Proposition 1), one can only conclude that the arcsine is a critical point (either an extremum or an inflection point) of the identified φ-like entropy.
In Sections 2 and 3, we established general entropies with a given maximizer. In what follows, we will complete the information-theoretic setting by introducing generalized escort distributions, generalized moments, and generalized Fisher information associated to the same entropic functional. We will then explore some of their relationships. Indeed, as mentioned in the introduction, the Cramér–Rao inequality is very important as it gives the ultimate precision, in terms of mean square error, of an estimator of a parameter. As we would like to escape from the usual quadratic loss (which often has a mathematical motivation but not a physical one, and which may not even exist) and/or to stress parts of the distribution of the data, so as to penalize, for instance, large errors depending on the tails of the distribution, it is of high interest to be able to derive Cramér–Rao inequalities in a broader framework, which can find natural applications in the estimation domain.

φ-Escort Distribution, (φ, α)-Moments, (φ, β)-Fisher Information, Generalized Cramér-Rao Inequalities
In this section, we begin by introducing the above-mentioned informational quantities. We will then show that generalizations of the celebrated Cramér–Rao inequalities hold, linking the generalized moments and the generalized Fisher information. Furthermore, the lower bounds of these inequalities are saturated precisely by the maximal φ-entropy distributions. To derive such generalizations, we thus need to precisely define the above-mentioned generalizations of the moments and of the Fisher information that will lower-bound the moment (e.g., of any estimator of a parameter). The proposed generalizations are based on the notion of escort distribution, which we first need to introduce.
Escort distributions have been introduced as an operational tool in the context of multifractals [99,100], with interesting connections with standard thermodynamics [101] and with source coding [26,27]. In our context, we also define (generalized) escort distributions associated with a particular convex function φ, and show how they pop up naturally. It is then possible to define generalized moments with respect to these escort distributions. Such distributions were previously introduced in connection with the Rényi entropies and took the form f^q, as we will see later on. When q > 1, the effect is to stress the head of the distribution, i.e., to penalize more the errors where the data fall in the head of the distribution. Conversely, when q < 1, the tails of the distribution are stressed. As we will see later on, in the proof of the generalized Cramér–Rao inequality, any form of escort distribution can be chosen. However, as for the usual nonparametric Cramér–Rao inequality, one may wish the inequality to be saturated for the maximum entropy distribution, which fixes the form of the escort distribution as follows.
Definition 4 (φ-escort). Let φ : X × Y → R be such that, for any x ∈ X ⊆ R^d, the function φ(x, ·) is strictly convex and twice differentiable on the closed convex set Y ⊆ R+. Then, if f is a probability distribution defined with respect to a general measure µ on a set X such that f(X) ⊆ Y, and such that

∫_X [∂²φ/∂y² (x, f(x))]^{−1} dµ(x) < +∞,    (16)

we define

E_{φ,f}(x) = [∂²φ/∂y² (x, f(x))]^{−1} / ∫_X [∂²φ/∂y² (u, f(u))]^{−1} dµ(u)

as the φ-escort density, with respect to the measure µ, associated to the density f.
Note that, from the strict convexity of φ with respect to its second argument, this probability density is well defined and strictly positive. We can note that, with the above definition, the φ-escort distribution tends to stress the parts of the distribution where φ(x, f(x)) has a small "curvature." Moreover, coming back to the previous examples, one can see the following.

Example 1 (cont.).
In the context of the Shannon entropy, for which the Gaussian is the maximum entropy law under a second-order moment constraint, φ(x, y) = φ(y) = y log y, and the φ-escort density associated with f reduces to the density f itself.
Example 2 (cont.). In the Rényi-Tsallis context, for which the q-Gaussian is the maximum entropy law under a second-order moment constraint, φ(x, y) = φ(y) = (y^q − y)/(q − 1), and E_{φ,f} ∝ f^{2−q}, which recovers the escort distributions used in the Rényi-Tsallis context up to a duality transformation [101].

Example 3 (cont.).
For the entropy that is maximal for the arcsine distribution under a second-order moment constraint, φ(x, y) = φ(y) = 1/y, and E_{φ,f} ∝ f^3, which is nothing more than an escort distribution used in the Rényi-Tsallis context. Indeed, although the arcsine distribution does not fall in the q-Gaussian family, its form is very similar to that of a q-Gaussian distribution (with q = −1) whose "scaling" parameter would not be related to the exponent q. It is thus not surprising to recover an escort distribution associated with this family.
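As a small numerical sketch of Definition 4 (assuming, consistently with the three examples above, that the φ-escort density is proportional to 1/φ″(f), normalized to unit mass; the helper `escort` is ours, not the paper's notation), one can check on a discretized Gaussian that the Shannon choice returns f itself, while the Tsallis functional returns the normalized power f^{2−q}:

```python
import numpy as np

def escort(f, phi_pp, dx):
    """phi-escort density: proportional to 1/phi''(f), normalized to unit mass."""
    w = 1.0 / phi_pp(f)
    return w / (w.sum() * dx)

x = np.linspace(-8, 8, 4001)
dx = x[1] - x[0]
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # standard Gaussian density

# Shannon: phi(y) = y log y, phi''(y) = 1/y  ->  escort equals f itself
e_shannon = escort(f, lambda y: 1.0 / y, dx)

# Tsallis, q = 0.5: phi''(y) = q*y^(q-2)  ->  escort proportional to f^(2-q)
q = 0.5
e_tsallis = escort(f, lambda y: q * y**(q - 2), dx)
g = f**(2 - q)
g /= g.sum() * dx
```

The same routine also reproduces Example 3 with φ″(y) = 2/y³, yielding an escort proportional to f³.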

Definition 5 ((α, φ)-moments).
Under the assumptions of Definition 4, with X equipped with a norm ‖·‖_χ, we define the (α, φ)-moment of a random variable X associated with the distribution f by

M_{α,φ}[f] = ∫_X ‖x‖_χ^α E_{φ,f}(x) dµ(x),

if this quantity exists.
This definition goes further than the usual definition of the variance as a measure of dispersion, both by generalizing the exponent and the norm, and by taking the mean with respect to an escort distribution. Thanks to the escort distribution, one can stress specific parts of the distribution (heads, tails, parts where φ has a small curvature, that is, with a small informational content in a sense). Here again, any escort distribution could have been chosen but, as pointed out previously, that of the definition allows the Cramér-Rao inequality derived below to be saturated by the maximum entropy distribution. Note that, in the particular case of the Euclidean norm and α = 2, the second-order moment statistics are indeed contained in the second-order moment matrix given by the mathematical expectation of XX^t. In such a context, the definition above coincides with the trace of this second-order moment matrix and represents the total power of X.
This said, for our three examples, we have the following.

Example 1 (cont.).
In the context of the Shannon entropy, the (α, φ)-moments are the usual moments of ‖X‖_χ^α.
Example 3 (cont.). For φ(x, y) = φ(y) = 1/y, one also naturally finds generalized moments with the same form as those introduced in [61,102] (see the items related to the escort distributions).
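Since in the Shannon case the escort is f itself, the (α, φ)-moments reduce to the usual moments, which can be checked numerically (the helper `phi_moment` is a hypothetical illustration, not the paper's notation):

```python
import numpy as np

def phi_moment(x, escort_density, alpha, dx):
    """(alpha, phi)-moment: mean of |x|^alpha under the phi-escort density."""
    return (np.abs(x)**alpha * escort_density).sum() * dx

x = np.linspace(-8, 8, 4001)
dx = x[1] - x[0]
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # standard Gaussian, sigma = 1

# Shannon case: the escort is f itself, so the (2, phi)-moment is the variance
m2 = phi_moment(x, f, 2, dx)
# and the (4, phi)-moment is E|X|^4 = 3 for a standard Gaussian
m4 = phi_moment(x, f, 4, dx)
```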
The importance of the Fisher information in estimation theory is well known: the estimation error of a parameter is bounded below by the inverse of the Fisher information associated with the distribution [34,66]. The Fisher information is also used as a method of inference and understanding in statistical physics and biology, as promoted by Frieden [67], and has been generalized in the Rényi-Tsallis context in a series of papers [81,84,[86][87][88][89]103,104]. In the following, we take these definitions a step further in our φ-entropy context.

Definition 6 (Nonparametric (β, φ)-Fisher information). With the same assumptions as in Definition 4, denoting by ‖·‖_{χ*} the dual norm (the norm induced in the dual space, given here by ‖z‖_{χ*} = sup_{‖x‖_χ ≤ 1} z^t x [105,106]), for any differentiable density f, we define the quantity

I_{β,φ}[f] = ∫_X ‖∇_x f(x) / E_{φ,f}(x)‖_{χ*}^β E_{φ,f}(x) dµ(x),

if it exists, as the nonparametric (β, φ)-Fisher information of f.
Note that the Fisher information can be viewed as a local quantity, as it is sensitive to the variations of a distribution rather than to the distribution itself. As for the generalized moments, through the power β, moments of the gradient of f other than the second one can be considered, so that more or less weight can be put on the variations of the distribution. Moreover, as for the generalized moments, any escort distribution could have been chosen but, again, this choice is dictated by our wish to saturate the Cramér-Rao inequality by the maximum entropy distribution.
Note also that when φ is state-independent, φ(x, y) = φ(y), this quantity is shift-invariant, as for the usual Fisher information: for g(x) = f(x − x₀), one has I_{β,φ}[g] = I_{β,φ}[f]. This property is unfortunately lost in the state-dependent context. Furthermore, whereas the Fisher information has the scaling property I[a^{−d} f(·/a)] = I[f]/a², this is lost for I_{β,φ}, except when φ is a power (which corresponds to either the Shannon or the Rényi-Tsallis entropy).

Definition 7 (Parametric (β, φ)-Fisher information).
Let us consider the same assumptions as in Definition 4, and a density f parameterized by θ ∈ Θ ⊆ R^m, where the set Θ is equipped with a norm ‖·‖_Θ with corresponding dual norm denoted ‖·‖_{Θ*}. Assume that f is differentiable with respect to θ. We define

I_{β,φ}[f; θ] = ∫_X ‖∇_θ f(x; θ) / E_{φ,f}(x)‖_{Θ*}^β E_{φ,f}(x) dµ(x),

if it exists, as the parametric (β, φ)-Fisher information of f.
Note that, as for the usual Fisher information, when the norms on X and on Θ are the same, the nonparametric and parametric information coincide when θ is a location parameter.
Note that, in the classical setting, the information on X in the sense of Fisher is given by the so-called Fisher information matrix, which is the mathematical expectation of ∇ log f ∇^t log f. Taking the trace of the Fisher information matrix, one obtains what is often simply called the Fisher information (without the term "matrix"), which is nothing but the expectation of ‖∇ log f‖² [58,67,107]. This is in line with the above definitions. Extending these definitions to obtain a matrix would have been possible by averaging, over the φ-escort distribution, the element-wise power β/2 of the matrix (∇f ∇^t f)/E²_{φ,f}, but the trace of this matrix no longer coincides with the above definition. Moreover, it is not obvious that it would allow a generalization of the matrix form of the Cramér-Rao inequality discussed in the following. Such a matrix-extended Fisher information is left as a perspective.
For our three examples, we have the following.
Example 1 (cont.). In the Shannon entropy context, when the norm is the Euclidean norm and β = 2, the nonparametric and parametric (β, φ)-Fisher information give the usual nonparametric and parametric Fisher information, respectively.
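This classical limit is easy to verify numerically: for a centered Gaussian with standard deviation σ, the usual (β = 2) Fisher information is 1/σ². A sketch with a finite-difference gradient:

```python
import numpy as np

sigma = 2.0
x = np.linspace(-10 * sigma, 10 * sigma, 8001)
dx = x[1] - x[0]
f = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# beta = 2, Shannon escort = f: I[f] = E_f[(f'/f)^2] = 1/sigma^2 for N(0, sigma^2)
df = np.gradient(f, dx)
I = ((df / f)**2 * f).sum() * dx
```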
We now have all the quantities needed to generalize the Cramér-Rao inequalities, as follows.
Proof. The approach follows [89]. Starting from the differentiable probability density f (with derivative denoted ∇_x f), and since x f(x) vanishes on the boundary of X, the divergence theorem gives

∫_X ∇_x^t ( x f(x) ) dµ(x) = 0.

Now, for the first term, we use the facts that ∇_x^t x = d and that f is a density to arrive at

−d = ∫_X x^t ( ∇_x f(x) / g(x) ) g(x) dµ(x)

for any function g non-zero on X. Now, noting that d > 0, we obtain from [89] (Lemma 2)

d ≤ ( ∫_X ‖x‖_χ^α g(x) dµ(x) )^{1/α} ( ∫_X ‖∇_x f(x)/g(x)‖_{χ*}^{α*} g(x) dµ(x) )^{1/α*}.

The proof ends by choosing g = E_{φ,f}, the φ-escort density associated with the density f. Note now that, again from [89] (Lemma 2), the equality is obtained when

∇_x f(x) / g(x) = λ₁ ∇_x ‖x‖_χ^α.

This last equation has precisely the form of Equation (6) of Proposition 1.
A minute analysis of the proof makes it clear that, both in the generalized moments and in the generalized Fisher information, any escort distribution g can be chosen (provided it is the same for both quantities), including the probability distribution itself. The saturation is then achieved for the distribution f satisfying ∇_x f(x)/g(x) = λ₁ ∇_x ‖x‖_χ^α, but the φ-escort distribution of Definition 4 is the only choice that allows the maximal φ-entropy distribution to be recovered as the saturating distribution; of course with the same φ as in the escort distribution, and with a moment constraint similar to that of the inequality but averaged over the distribution itself.
An obvious consequence of the proposition is that the probability density minimizing the (α*, φ)-Fisher information subject to the moment constraint T(x) = ‖x‖_χ^α coincides with the maximal φ-entropy distribution subject to the same moment constraint.
In an estimation problem, the purpose is to determine a function θ̂(x) in order to estimate an unknown parameter θ. In such a context, the Cramér-Rao inequality lower bounds the variance of the estimator by means of the parametric Fisher information. The idea is thus to extend this result to bound any α-order mean error using our generalized Fisher information.
Proposition 4 (Parametric (α, φ)-Cramér-Rao inequality). Let f be a probability density function with respect to a general measure µ defined over a set X, parameterized by θ ∈ Θ ⊆ R^m, and satisfying the conditions of Definition 7. Assume that both µ and X do not depend on θ, that f is a jointly measurable function of x and θ, integrable with respect to x and absolutely continuous with respect to θ, and that the derivatives of f with respect to each component of θ are locally integrable. Then, for any estimator θ̂(X) of θ that does not depend on θ, we have

( ∫_X ‖θ̂(x) − θ‖_Θ^α E_{φ,f}(x) dµ(x) )^{1/α} I_{α*,φ}[f; θ]^{1/α*} ≥ | m + ∇_θ^t b(θ) |,

where b(θ) = ∫_X (θ̂(x) − θ) f(x; θ) dµ(x) is the bias of the estimator, and α and α* are Hölder conjugates. When φ is state-independent, φ(x, y) = φ(y), the equality occurs when f is the maximal φ-entropy distribution subject to the corresponding moment constraint.

Proof. The proof again follows that of [89], and starts by evaluating the divergence of the bias. The regularity conditions in the statement of the theorem enable the interchange of integration with respect to x and differentiation with respect to θ, so that

∇_θ^t b(θ) = ∫_X ∇_θ^t [ (θ̂(x) − θ) f(x; θ) ] dµ(x).

Note then that ∇_θ^t θ = m and that, θ̂ being independent of θ, one has ∇_θ^t θ̂(x) = 0. Thus, f being a probability density, the equality becomes

∇_θ^t b(θ) + m = ∫_X (θ̂(x) − θ)^t ( ∇_θ f(x; θ) / g(x) ) g(x) dµ(x)

for any density g non-zero on X. The proof ends with the very same steps as in Proposition 3, using [89] (Lemma 2).
In the classical multivariate setting (m > 1), the Cramér-Rao inequality takes a matrix form, stating that the difference between the second-order moment matrix of the estimation error of an estimator and the inverse Fisher information matrix is positive semidefinite [34,58,66,67,108,109]. Several scalar forms can be derived, for instance by taking the determinant or the trace [58,66,67,108], or by means of determinant/trace inequalities [110]. Typically, by means of the trace, the scalar equivalents of the above results are recovered. Conversely, extending our result to a matrix context is not immediate and is left as a perspective.

For our three examples, Propositions 3 and 4 lead to what follows.
Example 1 (cont.). The usual parametric and nonparametric Cramér-Rao inequalities are recovered in the usual Shannon context φ(x, y) = y log y, using the Euclidean norm and α = 2. The bound in the nonparametric context is saturated by the maximum entropy law, namely, the Gaussian.
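In dimension d = 1, this saturated case is the classical fact that E[X²] I[f] ≥ 1, with equality exactly for centered Gaussians. A quick numerical sketch:

```python
import numpy as np

# Classical nonparametric Cramer-Rao product E[X^2] * I[f] for centered Gaussians:
# it equals sigma^2 * (1/sigma^2) = 1, i.e., the bound is saturated.
for sigma in (0.5, 1.0, 3.0):
    x = np.linspace(-12 * sigma, 12 * sigma, 8001)
    dx = x[1] - x[0]
    f = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    m2 = (x**2 * f).sum() * dx              # second-order moment, = sigma^2
    df = np.gradient(f, dx)
    fisher = ((df / f)**2 * f).sum() * dx   # Fisher information, = 1/sigma^2
    assert abs(m2 * fisher - 1.0) < 1e-2
```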

Example 2 (cont.).
In the Rényi-Tsallis context, the generalizations proposed in [87][88][89] are recovered and, again, when α = 2, the bound in the nonparametric context is saturated by the q-Gaussian, the maximum entropy law under the second-order moment constraint.
Beyond their mathematical interest, these relations may be of great use to assess an estimator when the usual variance/mean square error does not exist. Moreover, the escort distribution is also a way to emphasize some parts of a distribution. For instance, in the Rényi-Tsallis context, one can see that in f^q either the tails or the head of the distribution are emphasized. Playing with q is a way to penalize either the tails or the head of the distribution in the estimation process.
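The emphasis effect of the escort f^q can be illustrated numerically: for a standard Gaussian, the normalized f^q is again a Gaussian, with variance 1/q, so q < 1 inflates the tail mass and q > 1 deflates it (a sketch, with the tail arbitrarily taken as |x| > 2):

```python
import numpy as np

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # standard Gaussian

def escort_q(f, q, dx):
    """Normalized power escort f^q / int f^q."""
    g = f**q
    return g / (g.sum() * dx)

# q > 1 stresses the head, q < 1 stresses the tails: compare the mass of |x| > 2
tail = np.abs(x) > 2
tails = {q: (escort_q(f, q, dx)[tail].sum() * dx) for q in (0.5, 1.0, 2.0)}
```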

φ-Heat Equation and Extended de Bruijn Identity
An important relation connecting the Shannon entropy H, coming from the "information world," with the Fisher information I, living in the "estimation world," is given by the de Bruijn identity, which is closely linked to the Gaussian distribution. Consider a noisy random variable Y_θ = X + √θ N, where N is a zero-mean d-dimensional standard Gaussian random vector and X a d-dimensional random vector independent of N, with support independent of the parameter θ. Then

d/dθ H[f_{Y_θ}] = (1/2) I[f_{Y_θ}],

where f_{Y_θ} stands for the probability distribution of Y_θ. This identity is a critical ingredient in the proofs of the entropy power and Stam inequalities [34]. The de Bruijn identity has applications in communication, characterizing the behavior of a channel in the presence of noise [34,76,111,112], and in mismatched estimation [113]. It is involved in the entropy power inequality, which is itself involved in an informational proof of the central limit theorem [114][115][116]. Extending the de Bruijn identity is thus of great interest as, for instance, it may allow the characterization of more general communication channels along the same lines as [117], or of non-additive noise, or give rise to generalized central limit theorems [115,116].
The starting point for establishing the de Bruijn identity is the heat equation satisfied by the probability distribution f_{Y_θ}, ∂f/∂θ = (1/2) Δf, where Δ stands for the Laplacian operator [118]. Let us consider probability distributions f parameterized by a parameter θ ∈ Θ ⊆ R^m, satisfying what we will call the generalized φ-heat equation, Equation (24), for some K ∈ R^m, possibly dependent on θ but not on x, and where φ is a convex, twice differentiable function defined over a closed convex subset of R_+. When θ is scalar, this equation is an instance of what are known as quasilinear parabolic equations [119] (§ 8.8) and arises in various physical problems.
Proposition 5 (Extended de Bruijn identity). Let f be a probability distribution with respect to a measure µ, parameterized by a parameter θ ∈ Θ ⊆ R^m, and defined over a set X ⊂ R^d. Assume that both X and µ do not depend on θ, and that f satisfies the nonlinear φ-heat equation, Equation (24), for a twice differentiable convex function φ. Assume that ∇_θ φ(f) is absolutely integrable and locally integrable with respect to θ, and that the boundary term appearing in the proof vanishes. Then, the distribution f satisfies the extended de Bruijn identity, Equation (25), relating the φ-entropy of f and its nonparametric (β, φ)-Fisher information, where C_φ is the normalization constant given by Equation (16).
Proof. From the definition of the φ-entropy, the smoothness assumptions enable the use of Leibniz's rule to differentiate under the integral; the second step then follows from the φ-heat equation, and the third from the product rule of derivation. Now, from the divergence theorem, the first term of the right-hand side reduces to the integral of ‖∇_x φ′(f)‖_{χ*}^{β−2} ∇_x φ′(f) on the boundary of X, which vanishes by the assumption of the proposition, while the second term of the right-hand side gives the right-hand side of Equation (25), from C_φ and the (β, φ)-Fisher information given by Equations (16) and (17) and Definition 6.
As for the Cramér-Rao inequality, in the classical setting there exist matrix variants of the de Bruijn identity, the scalar form being a special case [115,117].
Coming back to the special examples presented throughout the paper: Example 1 (cont.). In the Shannon entropy context, for K = 1/2 and β = 2, the standard heat equation is recovered, and with it the usual de Bruijn identity.
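In this Shannon/Gaussian case the identity can be checked in closed form: Y_θ = X + √θ N is Gaussian with variance σ² + θ, so H(θ) = ½ log(2πe(σ² + θ)) and I(θ) = 1/(σ² + θ). A numerical sketch with a finite difference in θ:

```python
import numpy as np

# X ~ N(0, sigma2), N standard Gaussian: Y_theta = X + sqrt(theta)*N ~ N(0, sigma2 + theta)
sigma2, theta, h = 1.5, 0.7, 1e-6

H = lambda t: 0.5 * np.log(2 * np.pi * np.e * (sigma2 + t))  # Shannon entropy of Y_t
dH = (H(theta + h) - H(theta - h)) / (2 * h)                 # numerical d/dtheta
I = 1.0 / (sigma2 + theta)                                   # Fisher information of Y_theta

# de Bruijn identity: dH/dtheta = I/2
```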
Example 3 (cont.). For φ(x, y) = φ(y) = 1/y, one again finds the same form of the generalized heat equation as in [90,120,121], and therefore the same form of the generalized de Bruijn identity as in [90] (see the items related to the escort distributions).

Concluding Remarks
In this paper, we extended as far as possible the identities and inequalities linking the classical informational quantities (Shannon entropy, Fisher information, moments, etc.) to the framework of the φ-entropies. Our first result concerns the inverse maximum entropy problem: starting from a probability distribution and constraints, we search for the entropy that the distribution maximizes. Although such a study had already been tackled, it is extended here to a much more general context. We used general reference measures, not necessarily discrete or Lebesgue. We also considered the case where the distribution and the constraints do not share the same symmetries, which leads to state-dependent entropic functionals. Our second result is the generalization of the Cramér-Rao inequality in the same setting: to this end, a generalized Fisher information and generalized moments are introduced, both based on a convex function φ (and a so-called φ-escort distribution). The Cramér-Rao inequality is saturated precisely by the maximum φ-entropy distribution with the same moment constraints, linking all the information quantities together. Finally, our third result is the statement of a generalized de Bruijn identity, linking the φ-entropy rate and the φ-Fisher information of a distribution satisfying an extended heat equation, called the φ-heat equation.
As a direct perspective, the matrix extensions of the generalized moments and Fisher information, and the matrix forms of the generalized Cramér-Rao inequalities and de Bruijn identities, remain open problems. Several ways to define matrix moments and Fisher information may be considered, such as the term-wise manner evoked in this paper. However, deriving matrix forms of the inequalities and identities does not seem trivial, and neither does obtaining their scalar forms, for instance through the trace operator. Moreover, as the de Bruijn identity can be closely related to the generalized Price's theorem [123][124][125], studying the connections between the extended de Bruijn identity and this theorem, or generalizing it following the work of [125], is also of great interest.
Furthermore, two important inequalities are still lacking. The first one is the entropy power inequality (EPI), which states that the entropy power (exponential of twice the entropy) of the sum of two continuous independent random variables is no less than the sum of the individual entropy powers (in fact, there exist other equivalent versions, which can be found, e.g., in [34,75,107,[126][127][128]]). The second one is the Stam inequality, which lower bounds the product of the entropy power and the Fisher information. For the former, despite many efforts, the literature on extended versions only considers special cases. For instance, some extensions in the classical settings exist for discrete variables but are somewhat limited [129][130][131]. In the continuous framework, the EPI was also extended to the class of Rényi entropies (logarithm of a φ-entropy with φ(u) = u^α) [132]. Note that variants of the EPI also exist in the context where one of the variables is Gaussian. This is equivalent to the concavity of θ → N(X + √θ Y), with N the entropy power and Y a Gaussian noise independent of X [133][134][135][136][137]; this property was also extended in the context of the Rényi entropy [132,[138][139][140]]. An important property that plays a key role in the inequality is the fact that the Rényi entropy is invariant to an affine transform of unit determinant and monotonic under convolution, a property that seems lost in the very general setting considered here. This fact leaves little room to extend the EPI in our general setting. Concerning the Stam inequality, at first glance, the fact that its proof is based on the EPI seems to close any hope of extending it to the φ-entropy framework. However, it was remarkably extended to the Rényi entropy, based on the Gagliardo-Nirenberg inequality [84,86,87,141].
Nevertheless, a key property is that both the entropy power and the extended Fisher information have scaling properties that are lost in the general setting of the φ-entropies. A possible way to overcome the (apparent) limits just evoked could be to mimic alternative proofs, such as those based on optimal transport [142]. This approach precisely dispenses with any use of Young or Sobolev-like inequalities. As far as we can see, there may thus still be some room for extensions in the settings of the paper. Both the extension of the EPI and of the Stam inequality are left as perspectives.
Another perspective lies in the estimation of the generalized moments from data (or from estimates). Such a possibility would confer an operational role on our Cramér-Rao inequality, i.e., by computing the estimator's generalized moments and comparing them to the bound. A difficulty resides in the presence of the φ-escort distribution, which precludes direct empirical or Monte Carlo approaches: the escort distribution needs to be estimated. This problem is not far from that of estimating entropies from data, and the plug-in approaches used in such problems can thus be considered, such as kernel approaches [143][144][145], nearest-neighbor approaches [145,146], or minimal spanning tree approaches [42]. Of course, this perspective goes far beyond the scope of this paper.

Acknowledgments:
The authors wish to warmly thank the three reviewers who gave a careful reading of this manuscript. Their very valuable remarks and suggestions led to the improvement of the manuscript and opened various perspectives.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Inverse Maximum Entropy Problem and Associated Inequalities: Some Examples
In this appendix, we derive in detail several examples of the maximum entropy inverse problem. In each case, we provide the quantities and inequalities associated with the entropic functional φ, as derived in the text. In the sequel, for the sake of simplicity, we restrict our examples to the univariate context d = 1.

Appendix A.1. Normal Distribution and Second-Order Moment
For a normal distribution and a second-order moment constraint, we begin by computing the inverse of y = f_X(x), which yields T₁(x) = x² = −σ² ln(2πσ² y²). Note that f_X^{-1} is multivalued, but T₁(f_X^{-1}(·)) is univalued. Injecting the expression of T₁(f_X^{-1}(y)) into Equation (7), we obtain

φ′(y) = λ₀ − σ² λ₁ log(2πσ²) − 2σ² λ₁ log y, with λ₁ < 0,

where the requirement λ₁ < 0 is necessary to satisfy condition (C1), condition (C2) being already satisfied because f_X and T₁ share the same symmetries. This gives, after a reparameterization of the λᵢs, the family of entropic functionals φ(y) = α y log y + β y + γ. The judicious choice α = 1, β = γ = 0 leads to the function φ(y) = y log y, which gives nothing but the Shannon entropy, as expected, where X is now the support of f (overall, the obtained family of entropies is the Shannon one up to a scaling and a shift). Now, φ″(y) ∝ 1/y leads, through Definition 4, to the escort E_{φ,f} = f, so that, as expected, the (α, φ)-moments of Definition 5 are the usual moments of order α. When β = 2 and the usual Euclidean norm is considered, the (β, φ)-Fisher information of Definitions 6 and 7 is the usual Fisher information, and the usual Cramér-Rao inequalities of Propositions 3 and 4 are recovered for α = 2. Finally, for β = 2 and the usual Euclidean norm, the φ-heat equation, Equation (24), turns out to be the heat equation, satisfied by the Gaussian, so that the usual de Bruijn identity is naturally recovered from Proposition 5.
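The inversion step above can be checked numerically: for a Gaussian density y = f_X(x), one indeed recovers x² = −σ² ln(2πσ² y²). A sketch:

```python
import numpy as np

sigma = 1.3
x = np.linspace(-3, 3, 7)
y = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)  # y = f_X(x)

# Inverting the Gaussian density: T1(x) = x^2 = -sigma^2 * ln(2*pi*sigma^2*y^2)
T1 = -sigma**2 * np.log(2 * np.pi * sigma**2 * y**2)
```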

Appendix A.2. q-Gaussian Distribution and Second-Order Moment
For the q-Gaussian distribution, also known as the Tsallis, Student-t, or Student-r distribution [97,98], and a second-order moment constraint, f_X takes a q-exponential form, where q > 0, q ≠ 1, x₊ = max(x, 0), and A_q is a normalization coefficient. The support of f_X depends on q. Injecting the expression of T₁(f_X^{-1}(y)) into Equation (7), we again obtain an expression in which the requirement λ₁ < 0 is necessary to satisfy condition (C1), condition (C2) being satisfied since f_X and T₁ share the same symmetries. This gives, after a reparameterization of the λᵢs, a family of entropic functionals. Note that the inverse of f_X is defined over (0 ; A_q] but, without contradiction, the domain of definition of the entropic functional can be extended to R₊. Then, a judicious choice of parameters is α = 1, β = γ = 0, which yields φ(y) = (y^q − y)/(q − 1).
An associated entropy follows, where X is now the support of f. This entropy is nothing but the Havrda-Charvát-Tsallis entropy [12,14,17,97] (overall, we obtain this entropy up to a scaling and a shift). Then, φ″(y) = q y^{q−2}, so that, from Definition 4, and then from Definitions 5-7, respectively, we obtain M_{α,φ}[f] and I_{β,φ}[f] as, respectively, the q-moment of order α and the (q, β)-Fisher information defined previously in [84][85][86][87][88][89] (with the symmetric q-index given here by 2 − q). The extended Cramér-Rao inequality proved in [84,88,89] is then recovered from Propositions 3 and 4, and the generalized de Bruijn identity of [90] is also recovered from Equation (24) and Proposition 5.
Note that when q → 1, f_X tends to the Gaussian distribution. It appears that H_φ tends to the Shannon entropy, I_{2,φ} to the usual Fisher information, and M_{α,φ} to the usual moments (both with the Euclidean norm): all the settings related to the Gaussian distribution are naturally recovered.
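The limit q → 1 of the entropic functional itself can be checked numerically, since (y^q − y)/(q − 1) tends pointwise to y log y. A sketch:

```python
import numpy as np

y = np.linspace(0.05, 3.0, 60)

def phi_q(y, q):
    """Havrda-Charvat-Tsallis entropic functional (y^q - y)/(q - 1)."""
    return (y**q - y) / (q - 1)

# As q -> 1 (from either side), phi_q converges to the Shannon functional y*log(y)
close_above = np.allclose(phi_q(y, 1.001), y * np.log(y), atol=1e-2)
close_below = np.allclose(phi_q(y, 0.999), y * np.log(y), atol=1e-2)
```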

Appendix A.3.1. Second-Order Moment
When the second-order moment T₁(x) = x² is constrained, condition (C2) is satisfied, so that, injecting the expression of T₁(f_X^{-1}(y)) into Equation (7), one immediately obtains an expression in which the requirement λ₁ > 0 is necessary to satisfy condition (C1). After a reparameterization of the λᵢs, the family of entropy functionals follows. Although the inverse of the arcsine distribution does not exist for y ≤ 1/s, the entropy functional can be defined over R*₊. Note that this entropy can be viewed as the Havrda-Charvát-Tsallis entropy for q = −1, so that all the generalizations (escort, moments, Cramér-Rao inequality, de Bruijn identity) set out in Appendix A.2 are recovered in the limit q → −1.

As the distribution does not have the same variations as T₁(x) = x, condition (C1) cannot be satisfied. Therefore, either we turn to considering the arcsine distribution as a critical point (extremum, inflection point) of a non-concave "entropy," or as a maximum entropy distribution when the constraints are of the partial-moment type. Now, dealing, respectively, with the partial-moment constraints T_{±,1} and with the uniform constraint T₁, we obtain from Equations (14) and (15), respectively,

φ′_±(y) = λ₀ + λ_{±,1} √(s²y² − 1)/(πy) and φ′_±(y) = λ₀ ± λ₁ √(s²y² − 1)/(πy),

where the sign is absorbed in the factors λ_{±,1} in the first case. Dealing with the partial moments, one must impose λ_{±,1} > 0 to satisfy condition (C1). Conversely, condition (C1) cannot be satisfied in the second case (one would have to impose ±λ₁ > 0 on X_±). After a reparameterization of the λᵢs, one obtains the branches of the entropic functional in the form φ_±(y) = φ_{±,u}(sy) with α_± > 0, and similarly the branches for the non-convex case. In this case, s appears as an additional parameter of this family of φ-entropies.
In both cases, the entropic functionals are defined for u > 1 because of the domain where f_X is invertible. However, in the first case, one can extend the domain to R₊, ensuring both the continuity of the entropic functional and of its derivative at u = 1 (and thus everywhere), by making the derivative of the entropic functional vanish at u = 1, which imposes β = 0. This is also possible for the functionals φ_{±,u} by setting the condition β = 0. This leads, respectively, to the branches with α_± > 0 and to the branches for the non-convex case. Remarkably, in the first case, a univalued entropic functional can be obtained by imposing both α₊ = α₋ and γ₊ = γ₋. Looking more attentively at this choice, one observes that it corresponds to the one obtained for the moment constraint T₁(x) = |x|, which has the same symmetries as f_X.
Here again, s is an additional parameter for this family of φ-entropies. The entropy functional is defined for u ≤ 1 due to the domain where f_X is invertible. To evaluate the φ-entropy of a given distribution f, one can play with the parameter s so as to restrict, if possible, s f to lie in [0 ; 1]. However, one can also extend the functional to R₊ while remaining of class C¹ by making the derivative vanish at u = 1. This imposes β = 0 and leads to the entropy functional shown in Figure A2a for α = 1, γ = 0.
Appendix A.5.2. (Partial) First-Order Moment(s) Constraint(s)

As f_X and T(x) = x do not share the same symmetries, one cannot interpret the logistic distribution as a maximum entropy distribution constrained by the first-order moment. However, constraining the partial means over X_± = R_± and using multiform entropies allows such an interpretation, while the alternative is to relax the concavity property of the entropy; but, again, one would then only be able to ensure that the distribution from which it comes is a critical point. To be more precise, one chooses the partial-moment constraints and obtains from Equations (14) and (15), respectively, over each set X_±, the branches, where the sign is absorbed in the λ_± in the first case. Dealing with the partial moments, a sign condition on the λ_± must be imposed to satisfy condition (C1). Conversely, condition (C1) cannot be satisfied in the second case (one would have to impose ±λ₁ < 0 on X_±). After a reparameterization of the λᵢs, one obtains the branches of the entropic functional in the form

φ_±(y) = φ_{±,u}(sy) with φ_{±,u}(u) = −α_± [u argtanh(√(1 − u)) − √(1 − u)] 1_{(0 ; 1]}(u) + β u + γ_±,

where α_± > 0, and the branches for the non-convex case

φ_±(y) = φ_{±,u}(sy) with φ_{±,u}(u) = ±α [u argtanh(√(1 − u)) − √(1 − u)] 1_{(0 ; 1]}(u) + β u + γ_±.

Once again, s appears as an additional parameter for these families of entropies.
In both cases, even if the inverse of f_X restricts u to be lower than 1, one can either play with the parameter s to allow the computation of the φ-entropy of any distribution f, or extend the entropic functionals to R₊ by making the derivative vanish at u = 1. This imposes β = 0 and thus the entropic functional, with α_± > 0, and the branches for the non-convex case φ_±(y) = φ_{±,u}(sy) with φ_{±,u}(u) = γ_± ± α [u argtanh(√(1 − u)) − √(1 − u)] 1_{(0 ; 1]}(u). Remarkably, in the first case, a univalued entropic functional can be obtained by imposing both α₊ = α₋ and γ₊ = γ₋. Here also, such a choice is equivalent to considering the constraint T₁(x) = |x|, which respects the symmetries of the distribution and allows a classical φ-entropy to be recovered.
As expected, this equation has two solutions. These solutions can be expressed thanks to the multivalued Lambert-W function W, defined by z = W(z) exp(W(z)), i.e., W is the inverse function of u ↦ u exp(u) [148] (§ 1), leading to the inverse functions, where k denotes the branch of the Lambert-W function: k = 0 gives the principal branch and is related here to the entropy part on X₀, while k = −1 gives the secondary branch, related to X₋₁.
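The two real branches of W on (−1/e, 0) can be computed, e.g., with a few Newton iterations (or with `scipy.special.lambertw`); a sketch (the helper `lambert_w` is ours, not the paper's notation):

```python
import numpy as np

def lambert_w(z, k=0, iters=50):
    """Real branches of Lambert-W on (-1/e, 0) by Newton iteration on w*exp(w) - z.
    k = 0: principal branch (W >= -1); k = -1: secondary branch (W <= -1)."""
    w = -0.5 if k == 0 else -2.5   # starting points on either side of W = -1
    for _ in range(iters):
        ew = np.exp(w)
        w -= (w * ew - z) / (ew * (1 + w))
    return w

z = -0.2
w0 = lambert_w(z, k=0)    # principal branch
wm1 = lambert_w(z, k=-1)  # secondary branch
```

Both values satisfy the defining relation z = W exp(W), with w0 > −1 and wm1 < −1.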
Applying Equation (15) to obtain the branches of the functionals of the multiform entropy, one thus has to integrate an expression involving W_k, where, to ensure the convexity of the φ_k, conditions on the λ_{k,1} are imposed. The same approach allows one to design the corresponding functionals with a unique λ₁ instead of the λ_{k,1}s, and without restriction on λ₁.
First, let us reparameterize the λᵢs so as to absorb the factor r/Γ(q) into λ_{k,1}, so that the integral can be written formally. Obtaining a closed-form expression for the integral term is not an easy task. However, the relation W_k′(z) = W_k(z)/(z(1 + W_k(z))) [148] (Equation (3.2)) suggests searching for the integral in the form of a series. To obtain a recursion on the a_l, we proceed as follows: (i) we differentiate both sides; (ii) we use the relation W_k′(z) = W_k(z)/(z(1 + W_k(z))) given above, applied to the appropriate argument. The a_l can thus be evaluated explicitly, and we recognize in the series the confluent hypergeometric (or Kummer's) function ₁F₁(1; p + q; ·) [149] (Equation (13.1.2)) or [150] (Equation (9.210-1)) (up to a factor and an additive constant). One can check that these functions are indeed the ones we search for. To this end, (i) one differentiates the previous expression; (ii) one uses again W_k′(z) = W_k(z)/(z(1 + W_k(z))) [148] (Equation (3.2)); and (iii) one finally uses the relation (p + q − 1 − z) ₁F₁(1; p + q; z) + z ₁F₁′(1; p + q; z) = (p + q − 1) ₁F₁(0; p + q; z) [149], which can be compensated for by playing with the parameter r (recall that φ_k(y) = φ_{k,u}(r y)). Conversely, noting that W_k(−e⁻¹) = −1, to extend the entropic functionals to C¹ functions on R₊, one would have to impose β + (−1)^k α_k = 0 to make the derivatives vanish at u = e^{1−a}. This is impossible because, from α_k > 0, one cannot impose β = α₋₁ = −α₀. Moreover, even a convex extension relaxing the C¹ condition is impossible, since one would have to impose β + (−1)^k α_k ≤ β to ensure that both φ_k increase on R₊. We can, however, choose the coefficients so as to impose special conditions at the boundary(ies) of the domain of definition. As an example, we may wish the φ_k to vanish at u = 0 (e.g., to ensure the convergence of the integral of φ₋₁(f), X₋₁ being unbounded).
To this end, one can evaluate the values of the φ_k at the boundaries of the domain.