A Generalized Relative (α, β)-Entropy: Geometric Properties and Applications to Robust Statistical Inference

Entropy and relative entropy measures play a crucial role in mathematical information theory. Relative entropies are also widely used in statistics under the name of divergence measures, which link these two fields through the minimum divergence principle. Divergence measures are popular among statisticians as many of the corresponding minimum divergence methods lead to robust inference in the presence of outliers in the observed data; examples include the ϕ-divergence, the density power divergence, the logarithmic density power divergence and the recently developed family of logarithmic super divergences (LSD). In this paper, we present an alternative information theoretic formulation of the LSD measures as a two-parameter generalization of the relative α-entropy, which we refer to as the general (α, β)-entropy. We explore its relation with various other entropies and divergences, which also yields a two-parameter extension of the Renyi entropy measure as a by-product. This paper is primarily focused on the geometric properties of the relative (α, β)-entropy, or the LSD measures; we prove their continuity and convexity in both arguments, along with an extended Pythagorean relation under a power transformation of the domain space. We also derive a set of sufficient conditions under which the forward and the reverse projections of the relative (α, β)-entropy exist and are unique. Finally, we briefly discuss potential applications of the relative (α, β)-entropy, or the LSD measures, in statistical inference, in particular for robust parameter estimation and hypothesis testing. Our results on the reverse projection of the relative (α, β)-entropy establish, for the first time, the existence and uniqueness of the minimum LSD estimators. Numerical illustrations are also provided for the problem of estimating the binomial parameter.


Introduction
Decision making under uncertainty is the backbone of modern information science. The works of C. E. Shannon and the development of his famous entropy measure [1][2][3] represent the early mathematical foundations of information theory. The Shannon entropy and the corresponding relative entropy, commonly known as the Kullback-Leibler divergence (KLD), have helped to link information theory simultaneously with probability [4][5][6][7][8] and statistics [9][10][11][12][13]. If P and Q are two probability measures on a measurable space (Ω, A) with densities p and q, respectively, with respect to a common dominating σ-finite measure µ, then the Shannon entropy of P and the relative entropy of P with respect to Q are defined, respectively, as

E(P) = − ∫ p log(p) dµ, (1)

RE(P, Q) = ∫ p log(p/q) dµ. (2)

In statistics, the minimization of the KLD measure produces the most likely approximation as given by the maximum likelihood principle; the latter, in turn, is directly equivalent to the (Shannon) entropy maximization criterion in information theory. For example, if Ω is finite and µ is the counting measure, it is easy to see that RE(P, U) = log |Ω| − E(P), where U is the uniform measure on Ω. Minimization of this relative entropy, or equivalently maximization of the Shannon entropy, with respect to P within a suitable convex set E generates the most probable distribution for an independent and identically distributed finite source having true marginal probability in E under a non-informative (uniform) prior probability of guessing [14,15]. In general, with a finite source, RE(P, Q) denotes the penalty in expected compressed length when the compressor assumes a mismatched probability Q [16,17]. The corresponding general minimizer of RE(P, Q) given Q, namely its forward projection, and other geometric properties of RE(P, Q) are well studied in the literature; see [18][19][20][21][22][23][24][25][26][27][28][29] among others.
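For a finite Ω with counting measure, definitions (1) and (2) and the identity RE(P, U) = log |Ω| − E(P) are easy to verify numerically; a minimal sketch in Python (natural logarithms throughout; the function names are ours, for illustration only):

```python
import math

def shannon_entropy(p):
    """Shannon entropy E(P) = -sum_i p_i log p_i of a finite pmf (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """Relative entropy RE(P, Q) = sum_i p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A pmf on a 4-point alphabet and the uniform measure on the same alphabet.
p = [0.5, 0.25, 0.125, 0.125]
u = [0.25] * 4

# The identity RE(P, U) = log|Omega| - E(P) for the finite counting-measure case.
lhs = kl_divergence(p, u)
rhs = math.log(4) - shannon_entropy(p)
```

Minimizing RE(P, U) over a convex set of pmfs P is thus the same as maximizing the Shannon entropy over that set.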
Although the maximum entropy or the minimum divergence criterion based on the classical Shannon entropy E(P) and the KLD measure RE(P, Q) is still widely used in major (probabilistic) decision making problems in information science and statistics [30][31][32][33][34][35][36][37][38][39][40][41][42][43], there also exist many useful generalizations of these quantities that address eminent issues in quantum statistical physics, complex codings, statistical robustness and many other topics of interest. For example, if we consider the standardized cumulant of compression length in place of the expected compression length in Shannon's theory, the optimum distribution turns out to be the maximizer of a generalization of the Shannon entropy [44,45], given by

E_α(P) = (1/(1 − α)) log ∫ p^α dµ, α > 0, α ≠ 1, (3)

provided p ∈ L_α(µ), the complete vector space of functions whose absolute values raised to the α-th power are µ-integrable. This general entropy functional is popularly known as the Renyi entropy of order α [46] and covers many important entropy measures, like the Hartley entropy at α → 0 (for a finite source), the Shannon entropy at α → 1, the collision entropy at α = 2 and the min-entropy at α → ∞. The corresponding Renyi divergence measure is given by

D_α(P, Q) = (1/(α − 1)) log ∫ p^α q^{1−α} dµ, (4)

whenever p, q ∈ L_α(µ), and coincides with the classical KLD measure at α → 1. The Renyi entropy and the Renyi divergence are widely used in recent complex physical and statistical problems; see, for example, [47][48][49][50][51][52][53][54][55][56]. Other non-logarithmic extensions of the Shannon entropy include the classical f-entropies [57], the Tsallis entropy [58] as well as the more recent generalized (α, β, γ)-entropy [59,60], among many others; the corresponding divergences and the minimum divergence criteria are widely used in critical information theoretic and statistical problems; see [57,[59][60][61][62][63][64][65][66][67][68][69][70] for details.
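The Renyi entropy and divergence of order α, and their α → 1 limits, can be sketched for a finite alphabet as follows (the function names are ours, for illustration only):

```python
import math

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (alpha > 0, alpha != 1): (1/(1-alpha)) log sum p^alpha."""
    return math.log(sum(pi ** alpha for pi in p if pi > 0)) / (1.0 - alpha)

def renyi_divergence(p, q, alpha):
    """Renyi divergence of order alpha != 1: (1/(alpha-1)) log sum p^alpha q^(1-alpha)."""
    s = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log(s) / (alpha - 1.0)

p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]

# Shannon entropy and KLD, recovered by both quantities as alpha -> 1.
shannon = -sum(pi * math.log(pi) for pi in p)
kld = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

For the uniform pmf on n points, every order α gives the same value log n, consistent with the Hartley/Shannon/collision special cases listed above.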
We have noted that there is a direct information theoretic connection of the KLD to the Shannon entropy under mismatched guessing via minimization of the expected compressed length. However, such a connection does not exist between the Renyi entropy E_α(P) and the Renyi divergence D_α(P, Q), as recently noted by [17,71]. Therein, it has been shown that, for a finite source with marginal distribution P and a (prior) mismatched compressor distribution Q, the penalty in the normalized cumulant of compression length is not D_α(P, Q); rather, it is given by D_{1/α}(P_α, Q_α), where P_α and Q_α are the escort measures of P and Q with µ-densities

p_α = p^α / ∫ p^α dµ, q_α = q^α / ∫ q^α dµ. (5)

The new quantity D_{1/α}(P_α, Q_α) also gives a measure of discrimination (i.e., is a divergence) between the probability distributions P and Q and coincides with the KLD at α → 1. This functional is referred to as the relative α-entropy in the terminology of [72] and has the simpler form

RE_α(P, Q) := D_{1/α}(P_α, Q_α) = (α/(1 − α)) log ∫ p q^{α−1} dµ − (1/(1 − α)) log ∫ p^α dµ + log ∫ q^α dµ. (6)

The geometric properties of this relative α-entropy, along with its forward and reverse projections, have been studied recently [16,73]; see Section 2.1 for some details. This quantity had, however, already been proposed earlier as a statistical divergence, although for α ≥ 1 only, by [74] while developing a robust estimation procedure following the generalized method-of-moments approach of [75]. Later authors referred to the divergence proposed in [74] as the logarithmic density power divergence (LDPD) measure. The advantages of the minimum LDPD estimator in terms of robustness against outliers in data have been studied by, among others, [66,74]. Fujisawa [76] and Fujisawa and Eguchi [77] have also used the same divergence measure with γ = (α − 1) ≥ 0 in different statistical problems and have referred to it as the γ-divergence. Note that the formulation in (6) extends the definition of the divergence over the region 0 < α < 1 as well.
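For a finite alphabet, the identity RE_α(P, Q) = D_{1/α}(P_α, Q_α), with the escort distributions of (5) and the closed form (6), can be checked numerically; a small sketch:

```python
import math

def escort(p, alpha):
    """Escort distribution P_alpha of (5): masses p_i^alpha / sum_j p_j^alpha."""
    s = sum(pi ** alpha for pi in p)
    return [pi ** alpha / s for pi in p]

def renyi_div(p, q, rho):
    """Renyi divergence of order rho != 1."""
    return math.log(sum(pi ** rho * qi ** (1.0 - rho) for pi, qi in zip(p, q))) / (rho - 1.0)

def relative_alpha_entropy(p, q, alpha):
    """Closed form (6): a/(1-a) log sum p q^(a-1) - 1/(1-a) log sum p^a + log sum q^a."""
    a = alpha
    t1 = a / (1.0 - a) * math.log(sum(pi * qi ** (a - 1.0) for pi, qi in zip(p, q)))
    t2 = -1.0 / (1.0 - a) * math.log(sum(pi ** a for pi in p))
    t3 = math.log(sum(qi ** a for qi in q))
    return t1 + t2 + t3

p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]
alpha = 0.7

# The two routes to the relative alpha-entropy agree.
lhs = relative_alpha_entropy(p, q, alpha)
rhs = renyi_div(escort(p, alpha), escort(q, alpha), 1.0 / alpha)
```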
Motivated by the substantial advantages of the minimum LDPD inference in terms of statistical robustness against outlying observations, Maji et al. [78,79] have recently developed a two-parameter generalization of the LDPD family, namely the logarithmic super divergence (LSD) family, given in (7) below. This rich superfamily of divergences contains many important divergence measures, including the LDPD at γ = 0 and the Kullback-Leibler divergence at τ = γ → 0; this family also contains a transformation of the Renyi divergence at τ = 0, which has been referred to as the logarithmic power-divergence family by [80]. As shown in [78,79], statistical inference based on some of the new members of this LSD family, outside the existing ones including the LDPD, provides a much better trade-off between the robustness and the efficiency of the corresponding minimum divergence estimators. The statistical benefits of the LSD family over the LDPD family raise a natural question: is it possible to translate this robustness advantage of the LSD family of divergences to the information theoretic context through the development of a corresponding generalization of the relative α-entropy in (6)? In this paper, we partly answer this question by defining an independent information theoretic generalization of the relative α-entropy measure coinciding with the LSD measure. We refer to this new generalized relative entropy measure as the "relative (α, β)-entropy" and study its properties for different values of α > 0 and β ∈ R. In particular, this new formulation extends the scope of the LSD measure to −1 < τ < 0 as well and generates several interesting new divergence and entropy measures. We also study the geometric properties of all members of the relative (α, β)-entropy family, or equivalently the LSD measures, including their continuity in both arguments and a Pythagorean-type relation.
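To make the two-parameter family concrete, the sketch below uses the explicit candidate form RE_{α,β}(P, Q) = (1/β) log Σ q^α + (1/(α − β)) log Σ p^α − (α/(β(α − β))) log Σ p^β q^{α−β} for a finite alphabet and 0 < β < α. This closed form is our assumption for illustration only (the formal Definition 1 below is authoritative); it is chosen to be consistent with the reductions stated in this paper: β = 1 recovers the relative α-entropy in (6), the symmetry RE_{α,β}(P, Q) = RE_{α,α−β}(Q, P) holds, and non-negativity follows from Hölder's inequality.

```python
import math

def rel_alpha_entropy(p, q, a):
    """Relative alpha-entropy, closed form (6), for cross-checking the beta = 1 case."""
    t1 = a / (1.0 - a) * math.log(sum(pi * qi ** (a - 1.0) for pi, qi in zip(p, q)))
    t2 = -1.0 / (1.0 - a) * math.log(sum(pi ** a for pi in p))
    t3 = math.log(sum(qi ** a for qi in q))
    return t1 + t2 + t3

def rel_ab_entropy(p, q, alpha, beta):
    """Hypothetical closed form of the relative (alpha, beta)-entropy, 0 < beta < alpha.

    Chosen to satisfy the reductions stated in the text: beta = 1 gives (6),
    and RE_{a,b}(P, Q) = RE_{a,a-b}(Q, P).
    """
    a, b = alpha, beta
    t1 = math.log(sum(qi ** a for qi in q)) / b
    t2 = math.log(sum(pi ** a for pi in p)) / (a - b)
    t3 = -a / (b * (a - b)) * math.log(sum(pi ** b * qi ** (a - b) for pi, qi in zip(p, q)))
    return t1 + t2 + t3

p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]
```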
The related forward projection problem, i.e., the minimization of the relative (α, β)-entropy in its first argument, is also studied extensively.
In summary, the main objective of the present paper is to study the geometric properties of the LSD measure through the new information theoretic or entropic formulation (or the relative (α, β)-entropy). Our results indeed generalize the properties of the relative α-entropy from [16,73]. The specific and significant contributions of the paper can be summarized as follows.

1.
We present a two-parameter extension of the relative α-entropy measure in (6), motivated by the logarithmic S-divergence measures. These divergence measures are known to generate more robust statistical inference compared to the LDPD measures related to the relative α-entropy.

2.
In the new formulation of the relative (α, β)-entropy, the LSD measures are linked with several important information theoretic divergences and entropy measures, like the ones named after Renyi. A new divergence family is discovered corresponding to the α → 0 case (properly standardized) for finite measure spaces.

3.
As a by-product of our new formulation, we obtain a new two-parameter generalization of the Renyi entropy measure, which we refer to as the Generalized Renyi entropy (GRE). This opens up a new area of research examining the detailed properties of the GRE and its use in complex problems in statistical physics and information theory. In this paper, we show that this new GRE satisfies the basic entropic characteristics, i.e., it is zero when the argument probability is degenerate and is maximum when the probability is uniform.

4.
Here, we provide a detailed geometric analysis of the robust LSD measures, or equivalently the relative (α, β)-entropies in our new formulation. In particular, we show their continuity or lower semi-continuity with respect to the first argument, depending on the values of the tuning parameters α and β. We also prove their lower semi-continuity with respect to the second argument.

5.
We also study the convexity of the LSD measures (or the relative (α, β)-entropies) with respect to their argument densities. The relative α-entropy (i.e., the relative (α, β)-entropy at β = 1) is known to be quasi-convex [16] only in its first argument. Here, we show that, for general α > 0 and β ≠ 1, the relative (α, β)-entropies are not quasi-convex on the space of densities, but they are always quasi-convex with respect to both arguments on a suitably (power) transformed space of densities. Such convexity results in the second argument were unavailable in the literature even for the relative α-entropy; we introduce them in this paper through a transformation of the space.

6.
Like the relative α-entropy, but unlike the relative entropy in (2), our new relative (α, β)-entropy does not satisfy the data processing inequalities. However, we prove an extended Pythagorean relation for the relative (α, β)-entropies, which makes it reasonable to treat them as "squared distances" and to talk about their projections.

7.
The forward projection of a relative entropy or a suitable divergence, i.e., its minimization with respect to the first argument, is very important in both statistical physics and information theory. It is indeed equivalent to the maximum entropy principle and is also related to the Gibbs conditioning principle. In this paper, we examine the conditions under which such a forward projection of the relative (α, β)-entropy (or, equivalently, of the LSD) exists and is unique.

8.
Finally, for completeness, we briefly present the application of the LSD measure, or the relative (α, β)-entropy measure, in robust statistical inference in the spirit of [78,79], but now with an extended range of tuning parameters. It uses the reverse projection principle; a result on the existence of the minimum LSD functional is first presented under the new formulation of this paper. Numerical illustrations are provided for the binomial model, where we additionally study their properties for the extended tuning parameter range α ∈ (0, 1) as well as for some new divergence families (related to α = 0). Brief indications of the potential use of these divergences in the testing of statistical hypotheses are also provided.
Although we primarily discuss logarithmic entropies like the Renyi entropy and its generalizations in this paper, it is important to point out that non-logarithmic entropies, including the f-entropies and the Tsallis entropy, are also very useful in several applications with real systems. Recently, several complex physical and social systems have been observed to follow the theory developed from such non-logarithmic, non-additive entropies instead of the classical additive Shannon entropy. In particular, the Tsallis entropy has led to the development of nonextensive statistical mechanics [61,64] to solve several critical issues in modern physics. Important areas of application include, but are certainly not limited to, the motion of cold atoms in dissipative optical lattices [81,82], the magnetic field fluctuations in the solar wind and the related q-triplet [83], the distribution of velocity in driven dissipative dusty plasma [84], spin glass relaxation [85], the interaction of a trapped ion with a classical buffer gas [86], different high energy collisional experiments [87][88][89], the derivation of the black hole entropy [90], along with water engineering [63], text mining [65] and many others. Therefore, it is also important to investigate possible generalizations and manipulations of such non-logarithmic entropies from both the mathematical and the application points of view. However, as our primary interest here is in logarithmic entropies, we have, to keep the focus clear, otherwise avoided the description and development of non-logarithmic entropies in this paper.
Although there are many applications of extended and general non-additive entropy and divergence measures, there are also some criticisms of these non-additive measures that should be kept in mind. It is of course possible to employ such quantities simply as new descriptors of the complexity of systems; at the same time, it is known that the minimization of a generalized divergence (or the maximization of the corresponding entropy) under constraints, in order to determine an optimal probability assignment, leads to inconsistencies for information measures other than the Kullback-Leibler divergence; see, for instance, [91][92][93][94][95][96], among others. So, one needs to be very careful in distinguishing applications of the newly introduced entropies and divergence measures for the purpose of inference under given information from those where they are used as measures of complexity. In this respect, we would like to emphasize that the main advantage of our two-parameter extended family of LSD or relative (α, β)-entropy measures in parametric statistical inference is their strong robustness against possible contamination (generally manifested through outliers) in the sample data. The classical additive Shannon entropy and the Kullback-Leibler divergence produce non-robust inference even under a small proportion of data contamination, whereas the extremely high robustness of the LSD has been investigated in detail, with both theoretical and empirical justifications, by [78,79]; we present some related numerical illustrations in Section 5.2. Another important issue is whether to stop at the two-parameter level for information measures or to extend them to three, four or more parameters. It is not an easy question to answer.
However, we have seen that many members of the two-parameter family of LSD measures generate highly robust inference along with a desirable trade-off between efficiency under pure data and robustness under contaminated data. Therefore a two-parameter system appears to work well in practice. Since it is a known principle that one "should not multiply entities beyond necessity", we will, for the sake of parsimony, restrict ourselves to the second level of generalization for robust statistical inference, at least until there is further convincing evidence that the next higher level of generalization can produce a significant improvement.

Definition: An Extension of the Relative α-Entropy
In order to motivate the development of our generalized relative (α, β)-entropy measure, let us first briefly describe an alternative formulation of the relative α-entropy following [16]. Consider the mathematical set-up of Section 1 with α > 0, and assume that the space L_α(µ) is equipped with the norm

|| f ||_α = ( ∫ | f |^α dµ )^{1/α},

and the corresponding metric d_α(g, f) = ||g − f||_α for g, f ∈ L_α(µ). Then, the relative α-entropy between two distributions P and Q is obtained as a function of the Cressie-Read power divergence measure [97], defined below in (11), between the escort measures P_α and Q_α defined in (5). Note that the disparity family, or the φ-divergence family [18,[98][99][100][101][102][103], between P and Q is defined as

D_φ(P, Q) = ∫ q φ(p/q) dµ,

for a continuous convex function φ on [0, ∞) satisfying φ(1) = 0, with the usual convention 0φ(0/0) = 0. We consider the φ-function given by

φ_λ(u) = sign(λ(λ + 1)) (u^{λ+1} − u), u ≥ 0, (10)

with the convention that, for any u > 0, 0φ_λ(u/0) = 0 if λ < 0 and 0φ_λ(u/0) = ∞ if λ > 0. The corresponding φ-divergence has the form

D_λ(P, Q) = sign(λ(λ + 1)) [ ∫ p^{λ+1} q^{−λ} dµ − 1 ], (11)

which is just a positive multiple of the Cressie-Read power divergence, the multiplicative constant being |λ(1 + λ)|; when this constant is present, the case λ = 0 leads to the KLD measure in a limiting sense. Note that our φ-function in (10) differs slightly from the one used by [16] in that we use sign(λ(λ + 1)) in place of sign(λ) there; this makes the divergence in (11) non-negative for all λ ∈ R ([16] considered only λ > −1), which will be needed to define our generalized relative entropy. Then, given α > 0, [16,17] set λ = α^{−1} − 1 (> −1) and show that the relative α-entropy of P with respect to Q can be obtained as

RE_α(P, Q) = (1/λ) log ( 1 + sign(λ(λ + 1)) D_λ(P_α, Q_α) ). (12)

It is straightforward to see that the above formulation (12) coincides with the definition given in (6).
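Numerically, with the φ-function of (10), the resulting divergence D_λ in (11) is non-negative for λ on both sides of −1, and D_λ/|λ(1 + λ)| approaches the KLD as λ → 0. A sketch (the explicit form sign(λ(λ + 1))[Σ p^{λ+1} q^{−λ} − 1] is our reading of (11)):

```python
import math

def d_lambda(p, q, lam):
    """Divergence (11): sign(lam(lam+1)) * (sum p^(lam+1) q^(-lam) - 1)."""
    s = sum(pi ** (lam + 1.0) * qi ** (-lam) for pi, qi in zip(p, q))
    return math.copysign(1.0, lam * (lam + 1.0)) * (s - 1.0)

def kld(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]

# Scaled by |lam(1 + lam)|, D_lambda approaches the KLD as lam -> 0.
lam = 1e-4
cressie_read = d_lambda(p, q, lam) / abs(lam * (lam + 1.0))
```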
We often suppress the superscript µ whenever the underlying measure is clear from the context; in most applications in information theory and statistics it is either counting measure or the Lebesgue measure depending on whether the distribution is discrete or continuous.
We can now change the tuning parameters in the formulation (12) suitably so as to arrive at the more general form of the LSD family in (7). For this purpose, let us fix α > 0 and β ∈ R, and assume that p, q ∈ L_α(µ) are the µ-densities of P and Q, respectively. Instead of the re-parametrization λ = α^{−1} − 1 used above, we now consider the two-parameter re-parametrization λ = βα^{−1} − 1 ∈ R. Note that the feasible range of λ, in order to keep α > 0, now clearly depends on β through α = β/(1 + λ) > 0; whenever β > 0 we have −1 < λ < ∞, and if β < 0 we need −∞ < λ < −1. We have already taken care of this dependence through the modified φ-function defined in (10), which ensures that D_λ(·, ·) is non-negative for all λ ∈ R. So we can again use the relation (12), after suitable standardization due to the additional parameter β, to define a new generalized relative entropy measure as given in the following definition.
A straightforward simplification gives a simpler form of this new relative (α, β)-entropy which coincides with the LSD measure as follows.

Proposition 1.
For any α > 0 and β ∈ R, RE α,β (P, Q) ≥ 0 for all probability measures P and Q, whenever it is defined.
Also, it is important to identify the cases where the relative (α, β)-entropy is not finitely defined; these can be obtained from the definition and the conventions related to the D_λ divergence, and are summarized in the following proposition.
Note that this modified Renyi divergence also coincides with the KLD measure at β = 1. Statistical applications of this divergence family have been studied by [80].
However, not all members of the family of relative (α, β)-entropies are distinct, and some of them coincide after an interchange of the arguments. For example, RE α,0 (P, Q) = RE α,α (Q, P) for any α > 0. The following proposition characterizes all such identities.
Recall that the KLD measure is linked to the Shannon entropy, and the relative α-entropy is linked with the Renyi entropy, when the prior mismatched probability is uniform over the finite space. To derive such a relation for our general relative (α, β)-entropy, let us assume µ(Ω) < ∞ and let U denote the uniform probability measure on Ω. Then, we get

RE α,β (P, U) = (1/β) log µ(Ω) − E α,β (P),

where the functional E α,β (P) is given in Definition 2 below and coincides with the Renyi entropy at β = 1. Thus, it can be used to define a two-parameter generalization of the Renyi entropy as follows.
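For a finite alphabet with counting measure, the basic entropic characteristics of the GRE can be checked numerically. The explicit form used below, E_{α,β}(P) = (α/(β(α − β))) log Σ p^β − (1/(α − β)) log Σ p^α, is our assumed reading of Definition 2 (chosen so that β = 1 recovers the Renyi entropy, since Σ p = 1); under it, the GRE is zero at a degenerate distribution and attains its maximum value (1/β) log |Ω| at the uniform distribution:

```python
import math

def gre(p, alpha, beta):
    """Hypothetical Generalized Renyi entropy (GRE) of a finite pmf; beta not in {0, alpha}."""
    a, b = alpha, beta
    s_b = sum(pi ** b for pi in p if pi > 0)
    s_a = sum(pi ** a for pi in p if pi > 0)
    return (a / (b * (a - b))) * math.log(s_b) - (1.0 / (a - b)) * math.log(s_a)

n = 4
uniform = [1.0 / n] * n
degenerate = [1.0, 0.0, 0.0, 0.0]
skewed = [0.7, 0.1, 0.1, 0.1]
```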
To the best of our knowledge, the GRE is a new entropy measure; it does not belong to the general class of entropy functionals given in [104], which covers many existing entropies (including most, if not all, classical entropies). The following property of the functional E α,β (P) is easy to verify and justifies its use as a new entropy functional. To keep the focus of the present paper on the relative (α, β)-entropy, further properties of the GRE will be explored in our future work.
Example 1 (Normal Distribution). Consider distributions P i from the most common class of multivariate (s-dimensional) normal distributions having mean µ i ∈ R s and variance matrix Σ i for i = 1, 2. It is known that the Shannon and the Renyi entropies of P 1 are, respectively, given by With the new entropy measure, GRE, the entropy of the normal distribution P 1 can be seen to have the form Interestingly, the GRE of a normal distribution is effectively the same as its Shannon entropy or Renyi entropy up to an additive constant. However, a similar characteristic does not hold between the relative entropy (KLD) and the relative (α, β)-entropy. The KLD measure between two normal distributions P 1 and P 2 is given by whereas the general relative (α, β)-entropy, with α > 0 and β ∈ R\{0, α}, has the corresponding general form.
Note that the relative (α, β)-entropy gives a more general divergence measure which utilizes different weights for the variance (or precision) matrix of the two normal distributions.
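For the univariate case (s = 1), the standard closed-form KLD between two normal distributions, KLD = log(σ2/σ1) + (σ1² + (µ1 − µ2)²)/(2σ2²) − 1/2, can be verified against direct numerical integration; a sketch:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def kld_normal_closed(mu1, s1, mu2, s2):
    """Standard closed-form KLD between N(mu1, s1^2) and N(mu2, s2^2)."""
    return math.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * s2 ** 2) - 0.5

def kld_normal_numeric(mu1, s1, mu2, s2, lo=-30.0, hi=30.0, n=200000):
    """Midpoint-rule approximation of the integral p log(p/q) dx."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        p = normal_pdf(x, mu1, s1)
        q = normal_pdf(x, mu2, s2)
        total += p * (math.log(p) - math.log(q)) * h
    return total

closed = kld_normal_closed(0.0, 1.0, 1.0, 2.0)
numeric = kld_normal_numeric(0.0, 1.0, 1.0, 2.0)
```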

Example 2 (Exponential Distribution). Consider the exponential distribution P having density p
This distribution is very useful in lifetime modeling and reliability engineering; it is also the maximum entropy distribution of a non-negative random variable with fixed mean. The Shannon and the Renyi entropies of P are, respectively, given by A simple calculation leads to the following form of our new GRE measure for the exponential distribution P.
Once again, the new GRE is effectively the same as the Shannon entropy or the Renyi entropy, up to an additive constant, for the exponential distribution as well. Further, if P 1 and P 2 are two exponential distributions with parameters θ 1 and θ 2 , respectively, the relative entropy (KLD) and the relative (α, β)-entropy between them are given by the corresponding expressions for α > 0 and β ∈ R\{0, α}. Clearly, the contributions of the two distributions are weighted differently, by β and (α − β), in their relative (α, β)-entropy measure.
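Assuming the rate parameterization p(x) = θ e^{−θx}, the closed-form KLD between Exp(θ1) and Exp(θ2) is log(θ1/θ2) + θ2/θ1 − 1 (a standard result); a quick numerical cross-check:

```python
import math

def kld_exp_closed(t1, t2):
    """KLD between Exp(t1) and Exp(t2) under the rate parameterization p(x) = t*exp(-t*x)."""
    return math.log(t1 / t2) + t2 / t1 - 1.0

def kld_exp_numeric(t1, t2, hi=60.0, n=120000):
    """Midpoint-rule approximation of the integral p log(p/q) dx on (0, hi)."""
    h = hi / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        p = t1 * math.exp(-t1 * x)
        q = t2 * math.exp(-t2 * x)
        total += p * (math.log(p) - math.log(q)) * h
    return total

d_closed = kld_exp_closed(1.0, 2.0)
d_numeric = kld_exp_numeric(1.0, 2.0)
```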
Before concluding this section, we study the nature of our relative (α, β)-entropy as α → 0. For this purpose, we restrict ourselves to the case of finite measure spaces with µ(Ω) < ∞. It is again straightforward to note that lim α→0 RE α,β (P, Q) = 0 for any β ∈ R and any distributions P and Q on Ω.
However, if we take the limit after scaling the relative entropy measure by α we get a non-degenerate divergence measure as follows.
for β ∈ R\{0}, with a corresponding limiting form at β = 0. These interesting relative entropy measures again define a subfamily of valid statistical divergences by construction. The particular member at β = 1 is linked to the LDPD (or the γ-divergence) with tuning parameter −1 and can be thought of as a logarithmic extension of the famous Itakura-Saito divergence [105], given by

d IS (P, Q) = ∫ [ p/q − log(p/q) − 1 ] dµ.

This Itakura-Saito divergence has been successfully applied to non-negative matrix factorization in different applications [106], and those applications can potentially be extended by using the new divergence family RE * β (P, Q) in future work.
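In the discrete case, the Itakura-Saito divergence reduces to d_IS(p, q) = Σ_i [p_i/q_i − log(p_i/q_i) − 1]; the sketch below illustrates its use as an element-wise cost between nonnegative spectra (as in IS-based NMF, with toy values of our own) and checks its characteristic scale invariance:

```python
import math

def itakura_saito(p, q):
    """Itakura-Saito divergence between positive vectors (e.g., power spectra)."""
    return sum(pi / qi - math.log(pi / qi) - 1.0 for pi, qi in zip(p, q))

spectrum = [2.0, 0.5, 1.5, 3.0]   # observed power spectrum (toy values)
model = [1.8, 0.6, 1.5, 2.5]      # model/approximating spectrum (toy values)
d = itakura_saito(spectrum, model)
```

Scale invariance, d_IS(cp, cq) = d_IS(p, q), is what makes this cost attractive for audio power spectra.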

Continuity
We start the exploration of the geometric properties of the relative (α, β)-entropy with its continuity over the functional space L α (µ). In the following, we interchangeably use the notation RE α,β (p, q) and D λ (p, q) to denote RE α,β (P, Q) and D λ (P, Q), respectively. Our results generalize the corresponding properties of the relative α-entropy from [16,73] to our relative (α, β)-entropy or equivalent LSD measure.
This function is lower semi-continuous in L α (µ) for any α > 0, β ∈ R. Additionally, it is continuous in L α (µ) when α > β > 0 and the relative entropy is finitely defined.

Remark 2.
Whenever Ω is finite (discrete) and equipped with the counting measure µ, all integrals in the definition of RE α,β (P, Q) become finite sums, and any limit can be taken inside these finite sums. Thus, whenever it is finitely defined, the function p → RE α,β (p, q) is always continuous in this case.

Remark 3.
For a general infinite space Ω, the function p → RE α,β (p, q) is not necessarily continuous in the cases α < β. This can be seen by using the same counterexample as given in Remark 3 of [16]. However, it is yet to be verified whether this function can be continuous in the cases β < 0.
This function is lower semi-continuous in L α (µ) for any α > 0 and β ∈ R.
The case β = α can be proved in a similar manner and is left as an exercise to the readers.
Then we have the following inequalities: Proof. It follows by using Jensen's inequality and the convexity of the function x^{β/α}.
Next, note, in view of Proposition 4, that RE α,β (p, q) = RE α,α−β (q, p) for any p, q ∈ L α (µ). Using this result along with the above theorem, we also obtain the quasi-convexity of the relative (α, β)-entropy RE α,β (p, q) in q over a different power-transformed space of densities. This leads to the following theorem.

Remark 5.
Note that, at α = β = 1, RE 1,1 (p, q) coincides with the KLD measure (or relative entropy), which is quasi-convex in both the arguments p and q on L 1 (µ).

Extended Pythagorean Relation
Motivated by the quasi-convexity of RE α,β (p, q) on L α (µ) β , we now present a Pythagorean-type result for the general relative (α, β)-entropy over the power-transformed space. It generalizes the corresponding result for relative α-entropy [16]; the proof is similar to that in [16] with necessary modifications due to the transformation of the domain space.
If statement: Now, let us assume that (34), or equivalently (38), holds true. Further, as in the derivation of (38), we can start from the corresponding trivial statement. Now, multiply (38) by τ and (44) by τ̄, and add the two. In view of (37), this implies the claimed identity, which proves the if statement of Part (i), completing the proof.

Proof of Part (ii).
Note that the if statement follows directly from Part (i).
Note that, at β = 1, the above theorem coincides with Theorem 9 of [16]. For general α and β as well, the above extended Pythagorean relation suggests that the relative (α, β)-entropy behaves "like" a squared distance (although under a non-linear transformation of the space). So, one can meaningfully define its projection onto a suitable set, which we explore in the following sections.

The Forward Projection of Relative (α, β)-Entropy
The forward projection, i.e., minimization with respect to the first argument given a fixed second argument, leads to the important maximum entropy principle of information theory; it also relates to the Gibbs conditioning principle from statistical physics [16]. Let us now formally define and study the forward projection of the relative (α, β)-entropy. Let S * denote the set of probability measures on (Ω, A), and let the set of corresponding µ-densities be denoted by S = {p = dP/dµ : P ∈ S * }.

Definition 3 (Forward (α, β)-Projection).
Fix Q ∈ S * having µ-density q ∈ L α (µ). Let E ⊂ S with RE α,β (p, q) < ∞ for some p ∈ E. Then, p * ∈ E is called the forward projection of the relative (α, β)-entropy, or simply the forward (α, β)-projection (or forward LSD projection), of q on E if it satisfies the relation Note that we must assume E ⊂ L α (µ) so that the above relative (α, β)-entropy is finitely defined for p ∈ E. We first prove the uniqueness of the forward (α, β)-projection, whenever it exists, from the Pythagorean property. The following theorem describes the connection of the forward (α, β)-projection with the Pythagorean relation; the proof is the same as that of ([16], Theorem 10) using Theorem 4, and is hence omitted for brevity.

Proof. By (37), we get
Then the lemma follows by an application of the extended Minkowski inequalities (32) and (33) from Lemma 1.
We now present the sufficient conditions for the existence of the forward (α, β)-projection in the following theorem.

Proof.
We prove it separately for the cases βλ > 0 and βλ < 0, extending the arguments of [16]. The case βλ = 0 can be obtained from these two cases by standard limiting arguments and is hence omitted for brevity.
The Case βλ < 0: Note that, in this case, we must have 0 < β < α, since α > 0. Then, using (29), we can see that, since E β is closed, E is also closed; see, e.g., the proof of ([16], Theorem 8). Next, we show that E is also convex. For this, take s 0 , s 1 ∈ [0, 1] and p 0 , p 1 ∈ E, and take any τ ∈ [0, 1]. Note that, by the convexity of E β , p τ ∈ E, and also 0 ≤ s τ ≤ 1 by the extended Minkowski inequality (33).
Finally, since 0 < β < α, L α/β (µ) is a reflexive Banach space, and hence the closed and convex set E ⊂ L α/β (µ) is also closed in the weak topology. The unit ball is then compact in the weak topology by the Banach-Alaoglu theorem, and hence its closed subset E is also weakly compact. Moreover, since g belongs to the dual space of L α/β (µ), the linear functional h → ∫ hg dµ is continuous in the weak topology and increasing in s. Hence its supremum over E is attained at s = 1 and some p * ∈ E, which is the required forward (α, β)-projection.
Before concluding this section, we present one example of the forward (α, β)-projection onto a transformed-linear family of distributions.

Example 3 (An example of the forward (α, β)-projection). Fix α > 0, β ∈ R\{0, α} and q ∈ L α (µ) related to the measure Q. Consider measurable functions f i : Ω → R for i ∈ I, an index set, and the associated family of distributions L * β . Let us denote the corresponding µ-density set by L β = {p = dP/dµ : P ∈ L * β }. We assume that L * β is non-empty, every P ∈ L * β is absolutely continuous with respect to µ, and L β ⊂ L α (µ). Then, p * is the forward (α, β)-projection of q on L β if and only if there exists a function g in the L 1 (Q β )-closure of the linear space spanned by { f i : i ∈ I} and a subset N ⊂ Ω such that the characterizing conditions hold for every P ∈ L * β . The proof follows by extending the arguments of the proof of ([16], Theorem 11) and is hence left as an exercise for the reader.

Remark 6.
Note that, in the special case β = 1, L*_1 is a linear family of distributions and the above example coincides with ([16], Theorem 11) on the forward projection of the relative α-entropy on L*_1. However, deriving the forward (α, β)-projection on L*_1 for general β remains an open question.

The Reverse Projection and Parametric Estimation
As in the case of the forward projection of a relative entropy measure, we can also define the reverse projection by minimizing it with respect to the second argument over a convex set E keeping the first argument fixed. More formally, we use the following definition.
Definition 4 (Reverse (α, β)-Projection). Fix p ∈ L_α(µ) and let E ⊂ S with RE_{α,β}(p, q) < ∞ for some q ∈ E. Then, q* ∈ E is called the reverse projection of the relative (α, β)-entropy, or simply the reverse (α, β)-projection (or reverse LSD projection), of p on E if it attains the minimum of RE_{α,β}(p, q) over q ∈ E. Sufficient conditions for the existence and uniqueness of the reverse (α, β)-projection follow directly from Theorem 6 and the fact that RE_{α,β}(p, q) = RE_{α,α−β}(q, p); this is presented in the following theorem.
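In display form, the relation defining the reverse (α, β)-projection reads:

```latex
q^{*} \;=\; \operatorname*{arg\,min}_{q \in E} \, \mathrm{RE}_{\alpha,\beta}(p, q),
\qquad \text{i.e.,} \qquad
\mathrm{RE}_{\alpha,\beta}(p, q^{*}) \;=\; \min_{q \in E} \mathrm{RE}_{\alpha,\beta}(p, q).
```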
The reverse projection is mostly used in statistical inference, where we fix the first argument of a relative entropy (or divergence) measure at the empirical data distribution and minimize the relative entropy with respect to the model family of distributions in its second argument. The resulting estimator, commonly known as the minimum distance or minimum divergence estimator, yields the reverse projection of the observed data distribution on the family of model distributions with respect to the relative entropy or divergence under consideration. This approach was initially studied in [9–13] to obtain the popular maximum likelihood estimator as the reverse projection with respect to the relative entropy in (2). More recently, this approach has become widely popular with more general relative entropies or divergence measures, which yield robust estimators against possible contamination in the observed data. Let us describe it more rigorously in the following for our relative (α, β)-entropy.
Suppose we have independent and identically distributed data X_1, ..., X_n from a true distribution G having density g with respect to some common dominating measure µ. We model g by a parametric family of µ-densities F = {f_θ : θ ∈ Θ ⊆ R^p}, where it is assumed that g and each f_θ have the same support, independent of θ. Our objective is to infer about the unknown parameter θ. In minimum divergence inference, an estimator of θ is obtained by minimizing the divergence measure between (an estimate of) g and f_θ with respect to θ ∈ Θ. Maji et al. [78] considered the LSD (or, equivalently, the relative (α, β)-entropy) as the divergence under consideration and defined the corresponding minimum divergence functional at G, say T_{α,β}(G), as the minimizer of RE_{α,β}(g, f_θ) over θ ∈ Θ, whenever the minimum exists. We refer to T_{α,β}(G) as the minimum relative (α, β)-entropy (MRE) functional, or the minimum LSD functional in the language of [78,79]. Note that, if g ∈ F, i.e., g = f_{θ_0} for some θ_0 ∈ Θ, then we must have T_{α,β}(G) = θ_0. If g ∉ F, we call T_{α,β}(G) the "best fitting parameter" value, since f_{T_{α,β}(G)} is the closest model element to g in the LSD sense. In fact, for g ∉ F, T_{α,β}(G) is nothing but the reverse (α, β)-projection of the true density g on the model family F, which exists and is unique under the sufficient conditions of Theorem 7. Therefore, under identifiability of the model family F, we get the existence and uniqueness of the MRE functional, as presented in the following corollary. Although this estimator was first introduced in [78] in terms of the LSD, results concerning its existence were not provided there.

Corollary 2 (Existence and Uniqueness of the MRE Functional). Consider the above parametric estimation problem with g ∈ L_α(µ) and F ⊂ L_α(µ). Fix α > 0 and β ∈ R with β ≠ α, and assume that the model family F is identifiable in θ.
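The defining relation of the MRE functional described above, restated in display form:

```latex
T_{\alpha,\beta}(G) \;=\; \operatorname*{arg\,min}_{\theta \in \Theta} \, \mathrm{RE}_{\alpha,\beta}\!\left(g, f_{\theta}\right),
\qquad \text{whenever the minimum exists.}
```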
Further, under standard differentiability assumptions, we can obtain the estimating equation of the MRE functional T_{α,β}(G), in which u_θ(x) = (∂/∂θ) ln f_θ(x) denotes the score function. It is important to note that, at β = α = 1, the MRE functional T_{1,1}(G) coincides with the maximum likelihood functional, since RE_{1,1} = RE, the KLD measure. Based on the estimating Equation (61), Maji et al. [78] extensively studied the theoretical robustness of the MRE functional against gross-error contamination in data through a higher-order influence function analysis. The classical first-order influence function turns out to be inadequate for this purpose; it becomes independent of β at the model, whereas the real-life performance of the MRE functional critically depends on both α and β [78,79], as we will also see in Section 5.2.
In practice, however, the true data-generating density is not known, and so we need to use some empirical estimate in place of g; the resulting value of the MRE functional is called the minimum relative (α, β)-entropy estimator (MREE), or the minimum LSD estimator in the terminology of [78,79]. Note that, when the data are discrete and µ is the counting measure, one can use the simple estimate of g given by the relative frequencies r_n(x) = (1/n) ∑_{i=1}^{n} I(X_i = x), where I(A) is the indicator function of the event A; the corresponding MREE is then obtained by solving (61) with g(x) replaced by r_n(x) and integrals replaced by sums over the discrete support. Asymptotic properties of this MREE under discrete models are well studied in [78,79] for the tuning parameters α ≥ 1 and β ∈ R; the same line of argument extends in a straightforward manner to the case α ∈ (0, 1).
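The discrete-data recipe above — replace g by the relative frequencies r_n and minimize the divergence over the model family — can be sketched as follows. Since the closed form of RE_{α,β} is not reproduced in this section, this sketch takes the divergence as a user-supplied function and illustrates it with the KLD, the α = β = 1 member, which (as noted above) recovers the maximum likelihood estimator; all function names here are ours, not from [78,79].

```python
import math
from collections import Counter

def relative_frequencies(sample, support):
    """Empirical pmf r_n(x) = (1/n) * #{i : X_i = x} over a discrete support."""
    counts = Counter(sample)  # missing keys count as 0
    n = len(sample)
    return {x: counts[x] / n for x in support}

def minimum_divergence_estimate(sample, support, model_pmf, divergence, theta_grid):
    """Plug-in reverse projection: minimize divergence(r_n, f_theta) over a grid of theta."""
    r_n = relative_frequencies(sample, support)
    return min(theta_grid,
               key=lambda th: divergence(r_n, {x: model_pmf(x, th) for x in support}))

def kld(r, f):
    """Kullback-Leibler divergence sum_x r(x) log(r(x)/f(x)), with 0 log 0 = 0."""
    return sum(rx * math.log(rx / f[x]) for x, rx in r.items() if rx > 0)
```

For a Bernoulli model f_θ(1) = θ, minimizing the KLD over a fine grid returns the sample mean, matching the maximum likelihood estimator as expected in the α = β = 1 case.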
However, in the case of continuous data, no such simple estimator is available to use in place of g unless β = 1. When β = 1, the estimating Equation (61) depends on g only through terms of the form ∫ f_θ^{α−1} u_θ dG; so we can simply use the empirical distribution function G_n in place of G and solve the resulting equation to obtain the corresponding MREE. However, for β ≠ 1, we must use a non-parametric kernel estimator g_n of g in (61) to obtain the MREE under continuous models; this leads to complications, including bandwidth selection, while deriving the asymptotics of the resulting MREE. One possible approach to avoid such complications is the smoothed model technique, which has been applied in [108] for the case of minimum φ-divergence estimators. Another alternative approach has been discussed in [109,110]. However, detailed analyses of the MREE under continuous models, in either of these approaches, are yet to be carried out.
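For β ≠ 1, the kernel estimator g_n mentioned above can be as simple as a fixed-bandwidth Gaussian kernel density estimate; a minimal sketch (the bandwidth value used in any application would need to be chosen carefully, which is exactly the complication noted above):

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a fixed-bandwidth Gaussian kernel density estimate g_n of the true density g."""
    n = len(sample)
    norm = n * bandwidth * math.sqrt(2.0 * math.pi)
    def g_n(x):
        # average of Gaussian bumps centered at the observations
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample) / norm
    return g_n
```

The resulting g_n would then replace g in the estimating Equation (61), with the asymptotics of the MREE depending on how the bandwidth shrinks with n.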

Numerical Illustration: Binomial Model
Let us now present numerical illustrations under the common binomial model to study the finite-sample performance of the MREEs. Along with the known properties of the MREE at α ≥ 1 (i.e., the minimum LSD estimators with τ ≥ 0 from [78,79]), here we additionally explore their properties for α ∈ (0, 1) and for the new divergences RE*_β(P, Q) related to α = 0. Suppose X_1, ..., X_n are random observations from a true density g having support χ = {0, 1, 2, ..., m} for some positive integer m. We model g by the Binomial(m, θ) densities f_θ(x) = (m choose x) θ^x (1 − θ)^{m−x} for x ∈ χ and θ ∈ [0, 1]. An estimate of g is given by the relative frequency ĝ(x) = r_n(x). For any α > 0 and β ∈ R, the relative (α, β)-entropy between ĝ and f_θ can be minimized with respect to θ ∈ [0, 1] to obtain the corresponding MREE of θ. Note that it is also the solution of the estimating Equation (61) with g(x) replaced by the relative frequency r_n(x). In this example, u_θ(x) = (x − mθ)/(θ(1 − θ)), and the MREE estimating equation simplifies accordingly. We can numerically solve this estimating equation over θ ∈ [0, 1], or equivalently over the transformed parameter p := θ/(1 − θ) ∈ [0, ∞), to obtain the corresponding MREE (i.e., the minimum LSD estimator).
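As a concrete illustration of solving the estimating equation numerically over θ ∈ [0, 1], consider the α = β = 1 member, for which the equation reduces to the likelihood score equation ∑_x r_n(x) u_θ(x) = 0 with the score u_θ(x) = (x − mθ)/(θ(1 − θ)) given above; a bisection sketch (the solver and the function name are ours):

```python
def mree_score_root(sample, m, tol=1e-10):
    """Solve sum_x r_n(x) * u_theta(x) = 0 for the Binomial(m, theta) model at alpha = beta = 1.

    Since u_theta(x) = (x - m*theta) / (theta*(1 - theta)), the equation is equivalent to
    xbar - m*theta = 0; we still solve it by bisection to mimic the general numerical recipe.
    """
    xbar = sum(sample) / len(sample)
    def h(theta):  # score equation, up to the positive factor 1/(theta*(1 - theta))
        return xbar - m * theta
    lo, hi = 1e-9, 1.0 - 1e-9
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(lo) * h(mid) <= 0:  # root lies in [lo, mid]
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

For general (α, β), the same bisection applies to the full estimating Equation (61) with g replaced by r_n.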
We simulate random samples of size n from a binomial population with true parameter θ_0 = 0.1 and m = 10, and numerically compute the MREE. Repeating this exercise 1000 times, we obtain empirical estimates of the bias and the mean squared error (MSE) of the MREE of 10θ (since θ is very small in magnitude). Tables 1 and 2 present these values for sample sizes n = 20, 50, 100 and different values of the tuning parameters α > 0 and β > 0; the existence of these estimators is guaranteed by Corollary 2. Note that the choice α = 1 = β gives the maximum likelihood estimator, whereas β = 1 yields the minimum LDPD estimator with parameter α. Next, in order to study robustness, we contaminate 10% of each sample by random observations from a distant binomial distribution with parameters θ = 0.9 and m = 10 and repeat the above simulation exercise; the resulting bias and MSE for the contaminated samples are given in Tables 3 and 4. Our observations from these tables can be summarized as follows.

• Under pure data with no contamination, the maximum likelihood estimator (the MREE at α = 1 = β) has the least bias and MSE, as expected, which further decrease as the sample size increases.

• As we move away from α = 1 and β = 1 in either direction, the MSEs of the corresponding MREEs under pure data increase slightly; but as long as the tuning parameters remain within a reasonable window of the (1, 1) point and neither component is very close to zero, this loss in efficiency is not very significant.

• When α or β approaches zero, the MREEs become somewhat unstable, generating comparatively larger MSE values. This is probably due to the presence of inliers under the discrete binomial model. Note that the relative (α, β)-entropy measures with β ≤ 0 are not finitely defined for the binomial model if even a single empty cell is present in the data.

• Under contamination, the bias and MSE of the maximum likelihood estimator increase significantly, but many MREEs remain stable. In particular, the MREEs with β ≥ α and the MREEs with β close to zero are non-robust against data contamination, whereas many of the remaining members of the MREE family provide significantly improved robust estimators.

• In the entire simulation, the combination (α = 1, β = 0.7) appears to provide the most stable results. In Table 4, the best results are obtained along a tubular region which moves from the top left to the bottom right of the table, subject to the conditions that α > β and neither is very close to zero.

• Based on our numerical experiments, the optimum range of tuning parameters providing the most robust minimum relative (α, β)-entropy estimators is α = 0.9, 1 with 0.5 ≤ β ≤ 0.7, and 1 < α ≤ 1.5 with 0.5 ≤ β < 1. Note that this range includes the estimators based on the logarithmic power divergence measure as well as the new LSD measures with α < 1.

• Many of the MREEs which belong to the optimum range mentioned in the last item and are close to the combination α = 1 = β generally also provide the best trade-off between efficiency under pure data and robustness under contaminated data.
In summary, many MREEs are highly robust under data contamination at the cost of only a very small loss in efficiency under pure data. These numerical findings on the finite-sample behavior of the MREEs under the binomial model, and the corresponding optimum range of tuning parameters for the subclass with α ≥ 1, are consistent with the findings of [78,79], who used a Poisson model. Additionally, our illustrations shed light on the properties of the MREEs at α < 1 and show that some MREEs in this range, e.g., at α = 0.9 and β = 0.5, also yield optimum estimators in terms of the dual goal of high robustness and high efficiency.
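The simulation design of this section — repeated samples, 10% contamination from a distant Binomial(10, 0.9), and empirical bias/MSE of the estimator of 10θ — can be mimicked as below. The estimator plugged in here is the α = β = 1 member (the MLE, via the sample mean) purely as a stand-in; replacing it with an MREE routine for other (α, β) would reproduce the set-up of the tables.

```python
import random

def simulate_bias_mse(theta0=0.1, m=10, n=20, reps=200, contam=0.0, seed=1):
    """Empirical bias and MSE of the estimator of 10*theta under epsilon-contamination."""
    rng = random.Random(seed)
    draw = lambda th: sum(rng.random() < th for _ in range(m))  # one Binomial(m, th) draw
    errs = []
    for _ in range(reps):
        sample = [draw(0.9) if rng.random() < contam else draw(theta0) for _ in range(n)]
        est = sum(sample) / (n * m)          # MLE of theta; stand-in for a general MREE
        errs.append(10 * (est - theta0))     # error on the 10*theta scale
    bias = sum(errs) / reps
    mse = sum(e * e for e in errs) / reps
    return bias, mse
```

Under contamination (e.g., contam=0.1), the MLE's bias and MSE inflate sharply, matching the non-robust behavior reported for α = β = 1 in Tables 3 and 4.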

Application to Testing Statistical Hypothesis
We end the paper with a brief indication of the potential of the relative (α, β)-entropy, or the LSD measure, in statistical hypothesis testing problems. The minimum possible value of the relative entropy or divergence measure between the data and the null distribution quantifies the amount of departure from the null and hence can be used to develop a statistical testing procedure.
Consider the parametric estimation set-up of Section 5.1 with g ∈ F and fix a parameter value θ_0 ∈ Θ. Suppose we want to test the simple null hypothesis, in the one-sample case, given by H_0 : θ = θ_0 against H_1 : θ ≠ θ_0.
Maji et al. [78] developed the LSD-based test statistic for the above testing problem, in which θ̂_{α,β} is the MREE with parameters α and β. They [78,79] also developed an LSD-based test for the simple two-sample problem, where two independent samples of sizes n_1 and n_2 are given from true densities f_{θ_1}, f_{θ_2} ∈ F, respectively, and we want to test the homogeneity of the two samples through the hypothesis H_0 : θ_1 = θ_2 against H_1 : θ_1 ≠ θ_2.
The proposed test statistic for this two-sample problem, T^{(2)}_{n,α,β}, involves the factor 2n_1 n_2 and the MREEs θ̂^{(1)}_{α,β} and θ̂^{(2)}_{α,β} of θ_1 and θ_2, respectively, obtained separately from the two samples. Note that, at α = β = 1, both the test statistics in (63) and (64) become asymptotically equivalent to the corresponding likelihood ratio tests under the respective null hypotheses. Maji et al. [78,79] studied the asymptotic properties of these two tests, whose asymptotic null distributions are linear combinations of chi-square distributions. They also numerically illustrated the benefits of these LSD or relative (α, β)-entropy-based tests, although with tuning parameters α ≥ 1 only, in achieving robust inference against possible contamination in the sample data.
The same approach can also be used to develop robust tests for more complex hypothesis testing problems based on the relative (α, β)-entropy or the LSD measures, now with parameters α > 0, and also using the new divergences RE*_β(·, ·). For example, consider the above one-sample set-up and a subset Θ_0 ⊂ Θ, and suppose we are interested in testing the composite hypothesis H_0 : θ ∈ Θ_0 against H_1 : θ ∉ Θ_0.
With similar motivation from (63) and (64), we can construct a relative entropy or LSD-based test statistic T_{n,α,β} for the above composite hypothesis, where θ̃_{α,β} is the restricted MREE with parameters α and β obtained by minimizing the relative entropy over θ ∈ Θ_0, and θ̂_{α,β} is the corresponding unrestricted MREE obtained by minimizing over θ ∈ Θ. It will surely be of significant interest to study the asymptotic and robustness properties of this relative entropy-based test for the above composite hypothesis, in the one-sample case or under even more general hypotheses with two or more samples. However, considering the length of the present paper, which is primarily focused on the geometric properties of entropies and relative entropies, we defer the detailed analysis of such MREE-based hypothesis testing procedures to a future report.

Conclusions
We have explored the geometric properties of the LSD measures through a new information theoretic formulation, developing this divergence measure as a natural extension of the relative α-entropy; we refer to it as the two-parameter relative (α, β)-entropy. It is shown to be always lower semicontinuous in both arguments, but continuous in its first argument only if α > β > 0. We have also proved that the relative (α, β)-entropy is quasi-convex in both its arguments after suitable (different) transformations of the domain space, and derived an extended Pythagorean relation under these transformations. Along with the study of its forward and reverse projections, statistical applications have also been discussed.
It is worthwhile to note that information theoretic divergences can also be used to define new measures of robustness and efficiency of a parameter estimate; one can then obtain the optimum robust estimator, following Hampel's infinitesimal principle, achieving the best trade-off between these divergence-based summary measures [111–113]. In particular, the LDPD measure, a prominent member of our LSD or relative (α, β)-entropy family, has been used in [113], where important theoretical properties, including different types of equivariance of the resulting optimum estimators, were illustrated besides their strong robustness properties. A similar approach can also be taken with our general relative (α, β)-entropies to develop estimators with enhanced optimality properties, establishing a better robustness–efficiency trade-off.
The present work opens up several interesting problems for future research, as already noted throughout the paper. In particular, we recall that the relative α-entropy has an interpretation in terms of the problem of guessing under source uncertainty [17,71]. As an extension of the relative α-entropy, a similar information theoretic interpretation of the relative (α, β)-entropy (i.e., the LSD) is expected, and its proper interpretation will be a useful development. Additionally, we have obtained a new extension of the Renyi entropy as a by-product; a detailed study of this new entropy measure and its potential applications may open a new direction in mathematical information theory. Statistical applications of these measures also need to be studied thoroughly, especially for continuous models, where the complications of a kernel density estimator are unavoidable, and for testing complex composite hypotheses from one or more samples. We hope to pursue some of these interesting extensions in the future.