Article

A Generalized Relative (α, β)-Entropy: Geometric Properties and Applications to Robust Statistical Inference

Indian Statistical Institute, Kolkata 700108, India
*
Author to whom correspondence should be addressed.
Entropy 2018, 20(5), 347; https://doi.org/10.3390/e20050347
Submission received: 30 March 2018 / Revised: 22 April 2018 / Accepted: 1 May 2018 / Published: 6 May 2018

Abstract

Entropy and relative entropy measures play a crucial role in mathematical information theory. The relative entropies are also widely used in statistics under the name of divergence measures which link these two fields of science through the minimum divergence principle. Divergence measures are popular among statisticians as many of the corresponding minimum divergence methods lead to robust inference in the presence of outliers in the observed data; examples include the ϕ -divergence, the density power divergence, the logarithmic density power divergence and the recently developed family of logarithmic super divergence (LSD). In this paper, we will present an alternative information theoretic formulation of the LSD measures as a two-parameter generalization of the relative α -entropy, which we refer to as the general ( α , β ) -entropy. We explore its relation with various other entropies and divergences, which also generates a two-parameter extension of Renyi entropy measure as a by-product. This paper is primarily focused on the geometric properties of the relative ( α , β ) -entropy or the LSD measures; we prove their continuity and convexity in both the arguments along with an extended Pythagorean relation under a power-transformation of the domain space. We also derive a set of sufficient conditions under which the forward and the reverse projections of the relative ( α , β ) -entropy exist and are unique. Finally, we briefly discuss the potential applications of the relative ( α , β ) -entropy or the LSD measures in statistical inference, in particular, for robust parameter estimation and hypothesis testing. Our results on the reverse projection of the relative ( α , β ) -entropy establish, for the first time, the existence and uniqueness of the minimum LSD estimators. Numerical illustrations are also provided for the problem of estimating the binomial parameter.

1. Introduction

Decision making under uncertainty is the backbone of modern information science. The works of C. E. Shannon and the development of his famous entropy measure [1,2,3] represent the early mathematical foundations of information theory. The Shannon entropy and the corresponding relative entropy, commonly known as the Kullback-Leibler divergence (KLD), has helped to link information theory simultaneously with probability [4,5,6,7,8] and statistics [9,10,11,12,13]. If P and Q are two probability measures on a measurable space ( Ω , A ) and have absolutely continuous densities p and q, respectively, with respect to a common dominating σ -finite measure μ , then the Shannon entropy of P is defined as
$$E(P) = -\int p \log(p)\, d\mu,$$
and the KLD measure between P and Q is given by
$$\mathrm{RE}(P,Q) = \int p \log\frac{p}{q}\, d\mu.$$
In statistics, the minimization of the KLD measure produces the most likely approximation as given by the maximum likelihood principle; the latter, in turn, has a direct equivalence to the (Shannon) entropy maximization criterion in information theory. For example, if Ω is finite and μ is the counting measure, it is easy to see that $\mathrm{RE}(P,U) = \log|\Omega| - E(P)$, where U is the uniform measure on Ω. Minimization of this relative entropy, or equivalently maximization of the Shannon entropy, with respect to P within a suitable convex set E, generates the most probable distribution for an independent identically distributed finite source having true marginal probability in E with non-informative (uniform) prior probability of guessing [14,15]. In general, with a finite source, RE(P, Q) denotes the penalty in expected compressed length if the compressor assumes a mismatched probability Q [16,17]. The corresponding general minimizer of RE(P, Q) given Q, namely its forward projection, and other geometric properties of RE(P, Q) are well studied in the literature; see [18,19,20,21,22,23,24,25,26,27,28,29] among others.
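To make the finite-alphabet identity above concrete, here is a minimal numerical check (our own illustration in Python/NumPy with an arbitrary example pmf, not part of the original text) that RE(P, U) = log|Ω| − E(P) when μ is the counting measure.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy E(P) = -sum p log p (natural log, counting measure)."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def kld(p, q):
    """Kullback-Leibler divergence RE(P,Q) = sum p log(p/q)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.2, 0.2, 0.1])   # an arbitrary pmf on a 4-point alphabet
u = np.full(4, 1 / 4)                # uniform measure on the same alphabet

# RE(P,U) = log|Omega| - E(P): the two printed numbers agree
print(kld(p, u), np.log(4) - shannon_entropy(p))
```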
Although the maximum entropy or the minimum divergence criterion based on the classical Shannon entropy E ( P ) and the KLD measure RE ( P , Q ) is still widely used in major (probabilistic) decision making problems in information science and statistics [30,31,32,33,34,35,36,37,38,39,40,41,42,43], there also exist many different useful generalizations of these quantities to address eminent issues in quantum statistical physics, complex codings, statistical robustness and many other topics of interest. For example, if we consider the standardized cumulant of compression length in place of the expected compression length in Shannon’s theory, the optimum distribution turns out to be the maximizer of a generalization of the Shannon entropy [44,45] which is given by
$$E_{\alpha}(P) = \frac{1}{1-\alpha}\log\int p^{\alpha}\, d\mu, \quad \alpha>0, \ \alpha\neq 1,$$
provided $p \in L_{\alpha}(\mu)$, the complete vector space of functions for which the α-th power of their absolute values are μ-integrable. This general entropy functional is popular by the name Renyi entropy of order α [46] and covers many important entropy measures like the Hartley entropy at α → 0 (for a finite source), the Shannon entropy at α → 1, the collision entropy at α = 2 and the min-entropy at α → ∞. The corresponding Renyi divergence measure is given by
$$D_{\alpha}(P,Q) = \frac{1}{\alpha-1}\log\int p^{\alpha} q^{1-\alpha}\, d\mu, \quad \alpha>0, \ \alpha\neq 1,$$
whenever p , q L α ( μ ) and coincides with the classical KLD measure at α 1 . The Renyi entropy and the Renyi divergence are widely used in recent complex physical and statistical problems; see, for example, [47,48,49,50,51,52,53,54,55,56]. Other non-logarithmic extensions of Shannon entropy include the classical f-entropies [57], the Tsallis entropy [58] as well as the more recent generalized ( α , β , γ ) -entropy [59,60] among many others; the corresponding divergences and the minimum divergence criteria are widely used in critical information theoretic and statistical problems; see [57,59,60,61,62,63,64,65,66,67,68,69,70] for details.
We have noted that there is a direct information theoretic connection of KLD to the Shannon entropy under mismatched guessing by minimizing the expected compressed length. However, such a connection does not exist between the Renyi entropy E α ( P ) and the Renyi divergence D α ( P , Q ) as recently noted by [17,71]. Herein, it has been shown that, for a finite source with marginal distribution P and a (prior) mismatched compressor distribution Q, the penalty in the normalized cumulant of compression length is not D α ( P , Q ) ; rather it is given by D 1 / α ( P α , Q α ) where P α and Q α are defined by
$$\frac{dP_{\alpha}}{d\mu} = p_{\alpha} = \frac{p^{\alpha}}{\int p^{\alpha}\, d\mu}, \qquad \frac{dQ_{\alpha}}{d\mu} = q_{\alpha} = \frac{q^{\alpha}}{\int q^{\alpha}\, d\mu}.$$
The new quantity D 1 / α ( P α , Q α ) also gives a measure of discrimination (i.e., is a divergence) between the probability distributions P and Q and coincides with the KLD at α 1 . This functional is referred to as the relative α -entropy in the terminology of [72] and has the simpler form
$$\mathrm{RE}_{\alpha}(P,Q) := D_{1/\alpha}(P_{\alpha},Q_{\alpha}) = \frac{\alpha}{1-\alpha}\log\int p\, q^{\alpha-1}\, d\mu - \frac{1}{1-\alpha}\log\int p^{\alpha}\, d\mu + \log\int q^{\alpha}\, d\mu, \quad \alpha>0, \ \alpha\neq 1.$$
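As a sanity check on (6), the following sketch (our own illustration, assuming the counting measure on a finite alphabet) computes the relative α-entropy both as the Rényi divergence $D_{1/\alpha}(P_{\alpha},Q_{\alpha})$ between the escort distributions of (5) and via the closed form (6); the two routes return the same value.

```python
import numpy as np

def escort(p, a):
    """Escort distribution p_a = p^a / sum(p^a), as in Equation (5)."""
    pa = np.asarray(p, dtype=float) ** a
    return pa / pa.sum()

def renyi_div(p, q, a):
    """Renyi divergence D_a(P,Q) = (1/(a-1)) log sum p^a q^(1-a), a != 1."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.log(np.sum(p ** a * q ** (1 - a))) / (a - 1)

def relative_alpha_entropy(p, q, a):
    """Closed form (6) of the relative a-entropy."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return (a / (1 - a)) * np.log(np.sum(p * q ** (a - 1))) \
           - np.log(np.sum(p ** a)) / (1 - a) + np.log(np.sum(q ** a))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])
for a in (0.5, 2.0, 3.0):
    lhs = renyi_div(escort(p, a), escort(q, a), 1 / a)   # D_{1/a}(P_a, Q_a)
    rhs = relative_alpha_entropy(p, q, a)                # Equation (6)
    print(a, lhs, rhs)   # lhs and rhs coincide for each a
```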
The geometric properties of this relative α-entropy along with its forward and reverse projections have been studied recently [16,73]; see Section 2.1 for some details. This quantity had, however, already been proposed earlier as a statistical divergence, although for α ≥ 1 only, by [74] while developing a robust estimation procedure following the generalized method-of-moments approach of [75]. Later authors referred to the divergence proposed in [74] as the logarithmic density power divergence (LDPD) measure. The advantages of the minimum LDPD estimator in terms of robustness against outliers in data have been studied by, among others, [66,74]. Fujisawa [76] and Fujisawa and Eguchi [77] have also used the same divergence measure, with γ = (α − 1) ≥ 0, in different statistical problems and have referred to it as the γ-divergence. Note that the formulation in (6) extends the definition of the divergence over the 0 < α < 1 region as well.
Motivated by the substantial advantages of the minimum LDPD inference in terms of statistical robustness against outlying observations, Maji et al. [78,79] have recently developed a two-parameter generalization of the LDPD family, namely the logarithmic super divergence (LSD) family, given by
$$\mathrm{LSD}_{\tau,\gamma}(P,Q) = \frac{1}{B}\log\int p^{1+\tau}\, d\mu - \frac{1+\tau}{AB}\log\int p^{A} q^{B}\, d\mu + \frac{1}{A}\log\int q^{1+\tau}\, d\mu, \quad \text{with } A = 1+\gamma(1-\tau), \ B = 1+\tau-A, \ \tau\geq 0, \ \gamma\in\mathbb{R}.$$
This rich superfamily of divergences contains many important divergence measures including the LDPD at γ = 0 and the Kullback-Leibler divergence at τ = γ → 0; this family also contains a transformation of the Renyi divergence at τ = 0, which has been referred to as the logarithmic power-divergence family by [80]. As shown in [78,79], statistical inference based on some of the new members of this LSD family, outside the existing ones including the LDPD, provides a much better trade-off between the robustness and efficiency of the corresponding minimum divergence estimators.
The statistical benefits of the LSD family over the LDPD family raise a natural question: is it possible to translate this robustness advantage of the LSD family of divergences to the information theoretic context, through the development of a corresponding generalization of the relative α-entropy in (6)? In this paper, we partly answer this question by defining an independent information theoretic generalization of the relative α-entropy measure coinciding with the LSD measure. We will refer to this new generalized relative entropy measure as the “Relative (α, β)-entropy” and study its properties for different values of α > 0 and β ∈ ℝ. In particular, this new formulation will extend the scope of the LSD measure for −1 < τ < 0 as well and generate several interesting new divergence and entropy measures. We also study the geometric properties of all members of the relative (α, β)-entropy family, or equivalently the LSD measures, including their continuity in both the arguments and a Pythagorean-type relation. The related forward projection problem, i.e., the minimization of the relative (α, β)-entropy in its first argument, is also studied extensively.
In summary, the main objective of the present paper is to study the geometric properties of the LSD measure through the new information theoretic or entropic formulation (or the relative ( α , β ) -entropy). Our results indeed generalize the properties of the relative α -entropy from [16,73]. The specific and significant contributions of the paper can be summarized as follows.
  • We present a two parameter extension of the relative α -entropy measure in (6) motivated by the logarithmic S-divergence measures. These divergence measures are known to generate more robust statistical inference compared to the LDPD measures related to the relative α -entropy.
  • In the new formulation of the relative ( α , β ) -entropy, the LSD measures are linked with several important information theoretic divergences and entropy measures like the ones named after Renyi. A new divergence family is discovered corresponding to α 0 case (properly standardized) for the finite measure cases.
  • As a by-product of our new formulation, we get a new two-parameter generalization of the Renyi entropy measure, which we refer to as the Generalized Renyi entropy (GRE). This opens up a new area of research to examine the detailed properties of GRE and its use in complex problems in statistical physics and information theory. In this paper, we show that this new GRE satisfies the basic entropic characteristics, i.e., it is zero when the argument probability is degenerate and is maximum when the probability is uniform.
  • Here we provide a detailed geometric analysis of the robust LSD measure, or equivalently the relative ( α , β ) -entropy in our new formulation. In particular, we show their continuity or lower semi-continuity with respect to the first argument depending on the values of the tuning parameters α and β . Also, its lower semi-continuity with respect to the second argument is proved.
  • We also study the convexity of the LSD measures (or the relative ( α , β ) -entropies) with respect to its argument densities. The relative α -entropy (i.e, the relative ( α , β ) -entropy at β = 1 ) is known to be quasi-convex [16] only in its first argument. Here, we will show that, for general α > 0 and β 1 , the relative ( α , β ) -entropies are not quasi-convex on the space of densities, but they are always quasi-convex with respect to both the arguments on a suitably (power) transformed space of densities. Such convexity results in the second argument were unavailable in the literature even for the relative α -entropy, which we will introduce in this paper through a transformation of space.
  • Like the relative α -entropy, but unlike the relative entropy in (2), our new relative ( α , β ) -entropy also does not satisfy the data processing inequalities. However, we prove an extended Pythagorean relation for the relative ( α , β ) -entropy which makes it reasonable to treat them as “squared distances” and talk about their projections.
  • The forward projection of a relative entropy or a suitable divergence, i.e., their minimization with respect to the first argument, is very important for both statistical physics and information theory. This is indeed equivalent to the maximum entropy principle and is also related to the Gibbs conditioning principle. In this paper, we will examine the conditions under which such a forward projection of the relative ( α , β ) -entropy (or, LSD) exists and is unique.
  • Finally, for completeness, we briefly present the application of the LSD measure or the relative ( α , β ) -entropy measure in robust statistical inference in the spirit of [78,79] but now with extended range of tuning parameters. It uses the reverse projection principle; a result on the existence of the minimum LSD functional is first presented with the new formulation of this paper. Numerical illustrations are provided for the binomial model, where we additionally study their properties for the extended tuning parameter range α ( 0 , 1 ) as well as for some new divergence families (related to α = 0 ). Brief indications of the potential use of these divergences in testing of statistical hypotheses are also provided.
Although we are primarily discussing the logarithmic entropies like the Renyi entropy and its generalizations in this paper, it is important to point out that non-logarithmic entropies including the f-entropy and the Tsallis entropy are also very useful in several applications with real systems. Recently, several complex physical and social systems have been observed to follow the theory developed from such non-logarithmic, non-additive entropies instead of the classical additive Shannon entropy. In particular, the Tsallis entropy has led to the development of the nonextensive statistical mechanics [61,64] to solve several critical issues in modern physics. Important areas of application include, but certainly are not limited to, the motion of cold atoms in dissipative optical lattices [81,82], the magnetic field fluctuations in the solar wind and related q-triplet [83], the distribution of velocity in driven dissipative dusty plasma [84], spin glass relaxation [85], the interaction of trapped ion with a classical buffer gas [86], different high energy collisional experiments [87,88,89], derivation of the black hole entropy [90], along with water engineering [63], text mining [65] and many others. Therefore, it is also important to investigate the possible generalizations and manipulations of such non-logarithmic entropies both from mathematical and application point of view. However, as our primary interest here is in logarithmic entropies, we have, to keep the focus clear, otherwise avoided the description and development of non-logarithmic entropies in this paper.
Although there are many applications of extended and general non-additive entropy and divergence measures, there are also some criticisms of these non-additive measures that should be kept in mind. It is of course possible to employ such quantities simply as new descriptors of the complexity of systems, but at the same time, it is known that the minimization of a generalized divergence (or maximization of the corresponding entropy) under constraints in order to determine an optimal probability assignment leads to inconsistencies for information measures other than the Kullback-Leibler divergence. See, for instance [91,92,93,94,95,96], among others. So, one needs to be very careful in discriminating the application of the newly introduced entropies and divergence measures for the purposes of inference under given information, from the ones where it is used as a measure of complexity. In this respect, we would like to emphasize that, the main advantage of our two-parameter extended family of LSD or relative ( α , β ) -entropy measures in parametric statistical inference is in their strong robustness property against possible contamination (generally manifested through outliers) in the sample data. The classical additive Shannon entropy and Kullback-Leibler divergence produce non-robust inference even under a small proportion of data contamination, but the extremely high robustness of the LSD has been investigated in detail, with both theoretical and empirical justifications, by [78,79]; in this respect, we will present some numerical illustrations in Section 5.2. Another important issue could be to decide whether to stop at the two-parameter level for information measures or to extend it to three-parameters, four-parameters, etc. It is not an easy question to answer. However, we have seen that many members of the two-parameter family of LSD measures generate highly robust inference along with a desirable trade-off between efficiency under pure data and robustness under contaminated data. Therefore a two-parameter system appears to work well in practice. Since it is a known principle that one “should not multiply entities beyond necessity”, we will, for the sake of parsimony, restrict ourselves to the second level of generalization for robust statistical inference, at least until there is further convincing evidence that the next higher level of generalization can produce a significant improvement.

2. The Relative ( α , β ) -Entropy Measure

2.1. Definition: An Extension of the Relative α -Entropy

In order to motivate the development of our generalized relative ( α , β ) -entropy measure, let us first briefly describe an alternative formulation of the relative α -entropy following [16]. Consider the mathematical set-up of Section 1 with α > 0 and assume that the space L α ( μ ) is equipped with the norm
$$\|f\|_{\alpha} = \left(\int |f|^{\alpha}\, d\mu\right)^{1/\alpha} \ \text{ if } \alpha \geq 1, \ f\in L_{\alpha}(\mu), \qquad \|f\|_{\alpha} = \int |f|^{\alpha}\, d\mu \ \text{ if } 0<\alpha<1, \ f\in L_{\alpha}(\mu),$$
and the corresponding metric $d_{\alpha}(g,f) = \|g-f\|_{\alpha}$ for $g, f\in L_{\alpha}(\mu)$. Then, the relative α-entropy between two distributions P and Q is obtained as a function of the Cressie-Read power divergence measure [97], defined below in (11), between the escort measures $P_{\alpha}$ and $Q_{\alpha}$ defined in (5). Note that the disparity family or the ϕ-divergence family [18,98,99,100,101,102,103] between P and Q is defined as
$$D_{\phi}(P,Q) = \int q\, \phi\!\left(\frac{p}{q}\right) d\mu,$$
for a continuous convex function ϕ on [ 0 , ) satisfying ϕ ( 0 ) = 0 and with the usual convention 0 ϕ ( 0 / 0 ) = 0 . We consider the ϕ -function given by
$$\phi(u) = \phi_{\lambda}(u) = \mathrm{sign}(\lambda(\lambda+1))\left[u^{\lambda+1} - 1\right], \quad \lambda\in\mathbb{R}, \ u\geq 0,$$
with the convention that, for any u > 0, $0\cdot\phi_{\lambda}(u/0) = 0$ if λ < 0 and $0\cdot\phi_{\lambda}(u/0) = \infty$ if λ > 0. The corresponding ϕ-divergence has the form
$$D_{\lambda}(P,Q) = D_{\phi_{\lambda}}(P,Q) = \mathrm{sign}(\lambda(\lambda+1))\int q\left[\left(\frac{p}{q}\right)^{\lambda+1} - 1\right] d\mu,$$
which is just a positive multiple of the Cressie-Read power divergence with the multiplicative constant being |λ(1+λ)|; when this constant is present, the case λ = 0 leads to the KLD measure in a limiting sense. Note that our ϕ-function in (10) differs slightly from the one used by [16] in that we use sign(λ(λ+1)) in place of sign(λ) there; this is to make the divergence in (11) non-negative for all λ ∈ ℝ ([16] considered only λ > −1), which will be needed to define our generalized relative entropy. Then, given an α > 0, [16,17] set $\lambda = \alpha^{-1} - 1 \ (> -1)$ and show that the relative α-entropy of P with respect to Q can be obtained as
$$\mathrm{RE}_{\alpha}(P,Q) = \mathrm{RE}_{\alpha}^{\mu}(P,Q) = \frac{1}{\lambda}\log\left[\mathrm{sign}(\lambda)\, D_{\lambda}(P_{\alpha},Q_{\alpha}) + 1\right].$$
It is straightforward to see that the above formulation (12) coincides with the definition given in (6). We often suppress the superscript μ whenever the underlying measure is clear from the context; in most applications in information theory and statistics it is either counting measure or the Lebesgue measure depending on whether the distribution is discrete or continuous.
We can now change the tuning parameters in the formulation given by (12) suitably so as to arrive at the more general form of the LSD family in (7). For this purpose, let us fix α > 0, β ∈ ℝ and assume that $p, q\in L_{\alpha}(\mu)$ are the μ-densities of P and Q, respectively. Instead of considering the re-parametrization $\lambda = \alpha^{-1} - 1$ as above, we now consider the two-parameter re-parametrization $\lambda = \beta\alpha^{-1} - 1 \in \mathbb{R}$. Note that the feasible range of λ, in order to make α > 0, now clearly depends on β through $\alpha = \frac{\beta}{1+\lambda} > 0$; whenever β > 0 we have $-1 < \lambda < \infty$, and if β < 0 we need $-\infty < \lambda < -1$. We have already taken care of this dependence through the modified ϕ function defined in (10), which ensures that $D_{\lambda}(\cdot,\cdot)$ is non-negative for all λ ∈ ℝ. So we can again use the relation as in (12), after suitable standardization due to the additional parameter β, to define a new generalized relative entropy measure as given in the following definition.
Definition 1 (Relative ( α , β ) -entropy).
Given any α > 0 and β ∈ ℝ, put $\lambda = \frac{\beta}{\alpha} - 1$ (i.e., $\alpha = \frac{\beta}{1+\lambda}$). Then, the relative (α, β)-entropy of P with respect to Q is defined as
$$\mathrm{RE}_{\alpha,\beta}(P,Q) = \mathrm{RE}_{\alpha,\beta}^{\mu}(P,Q) = \frac{1}{\beta\lambda}\log\left[\mathrm{sign}(\beta\lambda)\, D_{\lambda}(P_{\alpha},Q_{\alpha}) + 1\right].$$
The cases β = 0 and λ = 0 (i.e, β = α ) are defined in limiting sense; see Equations (15) and (16) below.
A straightforward simplification gives a simpler form of this new relative ( α , β ) -entropy which coincides with the LSD measure as follows.
$$\mathrm{RE}_{\alpha,\beta}(P,Q) = \frac{1}{\alpha-\beta}\log\int p^{\alpha}\, d\mu - \frac{\alpha}{\beta(\alpha-\beta)}\log\int p^{\beta} q^{\alpha-\beta}\, d\mu + \frac{1}{\beta}\log\int q^{\alpha}\, d\mu \ = \ \mathrm{LSD}_{\alpha-1,\, \frac{\beta-1}{2-\alpha}}(P,Q).$$
Note that, it coincides with the relative α -entropy RE α ( P , Q ) at the choice β = 1 . For the limiting cases, it leads to the forms
$$\mathrm{RE}_{\alpha,0}(P,Q) = \frac{\int \log(q/p)\, q^{\alpha}\, d\mu}{\int q^{\alpha}\, d\mu} + \frac{1}{\alpha}\log\frac{\int p^{\alpha}\, d\mu}{\int q^{\alpha}\, d\mu},$$
$$\mathrm{RE}_{\alpha,\alpha}(P,Q) = \frac{\int \log(p/q)\, p^{\alpha}\, d\mu}{\int p^{\alpha}\, d\mu} + \frac{1}{\alpha}\log\frac{\int q^{\alpha}\, d\mu}{\int p^{\alpha}\, d\mu}.$$
By the divergence property of $D_{\lambda}(\cdot,\cdot)$, all the relative (α, β)-entropies are non-negative and valid statistical divergences. Note that, in view of (14), the formulation (13) extends the scope of the LSD measure, defined in (7), to τ ∈ (−1, 0).
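The simplified form (14) is straightforward to evaluate. The sketch below (our own illustration for discrete distributions under the counting measure; the function names are ours) computes RE_{α,β} and checks two consequences of (14): at β = 1 it reduces to the relative α-entropy of (6), and in general it coincides with the LSD measure of (7) under the reparametrization τ = α − 1, γ = (β − 1)/(2 − α).

```python
import numpy as np

def rel_ab_entropy(p, q, a, b):
    """Relative (a,b)-entropy via Equation (14), for a > 0 and b not in {0, a}."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return (np.log(np.sum(p ** a)) / (a - b)
            - a / (b * (a - b)) * np.log(np.sum(p ** b * q ** (a - b)))
            + np.log(np.sum(q ** a)) / b)

def lsd(p, q, tau, gam):
    """Logarithmic super divergence (7): A = 1 + gam*(1 - tau), B = 1 + tau - A."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    A = 1 + gam * (1 - tau)
    B = 1 + tau - A
    return (np.log(np.sum(p ** (1 + tau))) / B
            - (1 + tau) / (A * B) * np.log(np.sum(p ** A * q ** B))
            + np.log(np.sum(q ** (1 + tau))) / A)

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])
a, b = 1.5, 0.7
print(rel_ab_entropy(p, q, a, b))                      # non-negative, 0 iff p == q
print(lsd(p, q, tau=a - 1, gam=(b - 1) / (2 - a)))     # same value via (7)
print(rel_ab_entropy(p, q, a, 1.0))                    # b = 1: relative a-entropy of (6)
```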
Proposition 1.
For any α > 0 and β ∈ ℝ, $\mathrm{RE}_{\alpha,\beta}(P,Q) \geq 0$ for all probability measures P and Q, whenever it is defined. Further, $\mathrm{RE}_{\alpha,\beta}(P,Q) = 0$ if and only if P = Q a.s. [μ].
Also, it is important to identify the cases where the relative ( α , β ) -entropy is not finitely defined, which can be obtained from the definition and convention related to D λ divergence; these are summarized in the following proposition.
Proposition 2.
For any α > 0 , β R and distributions P , Q having μ-densities in L α ( μ ) , the relative ( α , β ) -entropy RE α , β ( P , Q ) is a finite positive number except for the following three cases:
1. 
P is not absolutely continuous with respect to Q and α < β , in which case RE α , β ( P , Q ) = + .
2. 
P is mutually singular to Q and α > β , in which case also RE α , β ( P , Q ) = + .
3. 
0 < β < α and $D_{\lambda}(P_{\alpha},Q_{\alpha}) \geq 1$, in which case $\mathrm{RE}_{\alpha,\beta}(P,Q)$ is undefined.
The above two propositions completely characterize the values and existence of our new relative ( α , β ) -entropy measure. In the next subsection, we will now explore its relation with other existing entropies and divergence measures; along the way we will get some new ones as by-products of our generalized relative entropy formulation.

2.2. Relations with Different Existing or New Entropies and Divergences

The relative ( α , β ) -entropy measures form a large family containing several existing relative entropies and divergences. Its relation with some popular ones are summarized in the following proposition; the proof is straightforward from definitions and hence omitted.
Proposition 3.
For α > 0 , β R and distributions P, Q, the following results hold (whenever the relevant integrals and divergences are defined finitely, even in limiting sense).
1. 
RE 1 , 1 ( P , Q ) = RE ( P , Q ) , the KLD measure.
2. 
RE α , 1 ( P , Q ) = RE α ( P , Q ) , the relative α-entropy.
3. 
RE 1 , β ( P , Q ) = 1 β D β ( P , Q ) , a scaled Renyi divergence, which also coincides with the logarithmic power divergence measure of [80].
4. 
RE α , β ( P , Q ) = 1 β D β / α ( P α , Q α ) , where P α and Q α are as defined in (5).
Remark 1.
Note that, items 3 and 4 in Proposition 3 indicate a possible extension of the Renyi divergence measure over negative values of the tuning parameter β as follows:
$$D^{*}_{\beta}(P,Q) = \frac{1}{\beta} D_{\beta}(P,Q), \quad \beta\in\mathbb{R}\setminus\{0\}, \qquad D^{*}_{0}(P,Q) = \int q \log\frac{q}{p}\, d\mu.$$
Note that this modified Renyi divergence also coincides with the KLD measure at β = 1 . Statistical applications of this divergence family have been studied by [80].
However, not all the members of the family of relative ( α , β ) -entropies are distinct or symmetric. For example, RE α , 0 ( P , Q ) = RE α , α ( Q , P ) for any α > 0 . The following proposition characterizes all such identities.
Proposition 4.
For α > 0, β ∈ ℝ and distributions P, Q, the relative (α, β)-entropy $\mathrm{RE}_{\alpha,\beta}(P,Q)$ is symmetric if and only if β = α/2. In general, we have $\mathrm{RE}_{\alpha,\frac{\alpha}{2}-\gamma}(P,Q) = \mathrm{RE}_{\alpha,\frac{\alpha}{2}+\gamma}(Q,P)$ for any α > 0, γ ∈ ℝ.
Recall that the KLD measure is linked to the Shannon entropy and the relative α -entropy is linked with the Renyi entropy when the prior mismatched probability is uniform over the finite space. To derive such a relation for our general relative ( α , β ) -entropy, let us assume μ ( Ω ) < and let U denote the uniform probability measure on Ω . Then, we get
$$\mathrm{RE}_{\alpha,\beta}(P,U) = \frac{1}{\beta}\left[\log\mu(\Omega) - E_{\alpha,\beta}(P)\right], \quad \beta\neq 0,$$
where the functional E α , β ( P ) is given in Definition 2 below and coincides with the Renyi entropy at β = 1 . Thus, it can be used to define a two-parameter generalization of the Renyi entropy as follows.
Definition 2 (Generalized Renyi   Entropy).
For any probability measure P over a measurable space Ω, we define the generalized Renyi entropy (GRE) of order ( α , β ) as
$$E_{\alpha,\beta}(P) = \frac{1}{\beta-\alpha}\log\frac{\left(\int p^{\alpha}\, d\mu\right)^{\beta}}{\left(\int p^{\beta}\, d\mu\right)^{\alpha}}, \quad \alpha>0, \ \beta\in\mathbb{R}, \ \beta\neq 0,\alpha;$$
$$E_{\alpha,\alpha}(P) = -\alpha\,\frac{\int \log(p)\, p^{\alpha}\, d\mu}{\int p^{\alpha}\, d\mu} + \log\int p^{\alpha}\, d\mu, \quad \alpha>0.$$
Note that, at β = 1 , we have E α , 1 ( P ) = E α ( P ) , the usual Renyi entropy measure of order α.
The GRE is a new entropy to the best of our knowledge, and does not belong to the general class of entropy functionals as given in [104] which covers many existing entropies (including most, if not all, classical entropies). The following property of the functional E α , β ( P ) is easy to verify and justifies its use as a new entropy functional. To keep the focus of the present paper clear on the relative ( α , β ) -entropy, further properties of the GRE will be explored in our future work.
Theorem 1 (Entropic characteristics of GRE).
For any probability measure P over a finite measure space Ω, we have $0 \leq E_{\alpha,\beta}(P) \leq \log\mu(\Omega)$ for all α > 0 and β ∈ ℝ∖{0}. The two extremes are attained as follows.
1 
E α , β ( P ) = 0 if P is degenerate at a point in Ω (no uncertainty).
2 
E α , β ( P ) = log μ ( Ω ) if P is uniform over Ω (maximum uncertainty).
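Definition 2 and Theorem 1 can be illustrated numerically; the following snippet (our own check, for pmfs under the counting measure) confirms that the GRE vanishes for a degenerate distribution, equals log μ(Ω) for the uniform distribution, and reduces to the Rényi entropy at β = 1.

```python
import numpy as np

def gre(p, a, b):
    """Generalized Renyi entropy (Definition 2), counting measure, b not in {0, a}."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                              # 0 log 0 = 0 convention
    return (b * np.log(np.sum(p ** a)) - a * np.log(np.sum(p ** b))) / (b - a)

def renyi(p, a):
    """Renyi entropy of order a (a != 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return np.log(np.sum(p ** a)) / (1 - a)

n = 5
degenerate = np.eye(n)[0]                     # point mass: no uncertainty
uniform = np.full(n, 1 / n)                   # maximum uncertainty
skewed = np.array([0.4, 0.3, 0.15, 0.1, 0.05])

a, b = 2.0, 0.5
print(gre(degenerate, a, b))                  # 0
print(gre(uniform, a, b), np.log(n))          # both equal log(5)
print(0 <= gre(skewed, a, b) <= np.log(n))    # True: bounded as in Theorem 1
print(gre(skewed, a, 1.0), renyi(skewed, a))  # b = 1 recovers the Renyi entropy
```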
Example 1 (Normal Distribution).
Consider distributions P i from the most common class of multivariate (s-dimensional) normal distributions having mean μ i R s and variance matrix Σ i for i = 1 , 2 . It is known that the Shannon and the Renyi entropies of P 1 are, respectively, given by
$$E(P_1) = \frac{s}{2} + \frac{s}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma_1|, \qquad E_{\alpha}(P_1) = \frac{s}{2}\,\frac{\log\alpha}{\alpha-1} + \frac{s}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma_1|, \quad \alpha>0, \ \alpha\neq 1.$$
With the new entropy measure, GRE, the entropy of the normal distribution P 1 can be seen to have the form
$$E_{\alpha,\beta}(P_1) = \frac{s}{2}\,\frac{(\alpha\log\beta - \beta\log\alpha)}{(\beta-\alpha)} + \frac{s}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma_1|, \quad \alpha>0, \ \beta\in\mathbb{R}\setminus\{0,\alpha\}, \qquad E_{\alpha,\alpha}(P_1) = \frac{s}{2}(1-\log\alpha) + \frac{s}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma_1|, \quad \alpha>0.$$
Interestingly, the GRE of a normal distribution is effectively the same as its Shannon entropy or Renyi entropy up to an additive constant. However, similar characteristic does not hold between the relative entropy (KLD) and relative ( α , β ) -entropy. The KLD measure between two normal distributions P 1 and P 2 is given by
$$\mathrm{RE}(P_1,P_2) = \frac{1}{2}\mathrm{Trace}(\Sigma_2^{-1}\Sigma_1) + \frac{1}{2}(\mu_2-\mu_1)^{T}\Sigma_2^{-1}(\mu_2-\mu_1) + \frac{1}{2}\log\frac{|\Sigma_2|}{|\Sigma_1|} - \frac{s}{2},$$
whereas the general relative ( α , β ) -entropy, with α > 0 and β R { 0 , α } , has the form
$$\mathrm{RE}_{\alpha,\beta}(P_1,P_2) = \frac{\alpha}{2}(\mu_2-\mu_1)^{T}\left[\beta\Sigma_2+(\alpha-\beta)\Sigma_1\right]^{-1}(\mu_2-\mu_1) + \frac{1}{2\beta(\beta-\alpha)}\log\frac{|\Sigma_2|^{\beta}\,|\Sigma_1|^{\alpha-\beta}}{|\beta\Sigma_2+(\alpha-\beta)\Sigma_1|^{\alpha}} - \frac{s\,\alpha\log\alpha}{2\beta(\alpha-\beta)}.$$
Note that the relative ( α , β ) -entropy gives a more general divergence measure which utilizes different weights for the variance (or precision) matrix of the two normal distributions.
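The univariate case (s = 1) of these normal-distribution formulas can be spot-checked by numerical integration; the sketch below (our own verification using scipy quadrature, not part of the paper) evaluates RE_{α,β}(P_1, P_2) directly from the integral form (14) and compares it with the closed form displayed above.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def re_ab_integral(p1, p2, a, b, lo=-40, hi=40):
    """RE_{a,b} via the integral form (14) for densities on the real line."""
    I_p = quad(lambda x: p1(x) ** a, lo, hi)[0]
    I_q = quad(lambda x: p2(x) ** a, lo, hi)[0]
    I_pq = quad(lambda x: p1(x) ** b * p2(x) ** (a - b), lo, hi)[0]
    return np.log(I_p) / (a - b) - a / (b * (a - b)) * np.log(I_pq) + np.log(I_q) / b

def re_ab_normal(m1, s1, m2, s2, a, b):
    """Closed form of Example 1 for s = 1 (variances v_i = s_i^2)."""
    v1, v2 = s1 ** 2, s2 ** 2
    w = b * v2 + (a - b) * v1
    return (a / 2 * (m2 - m1) ** 2 / w
            + np.log(v2 ** b * v1 ** (a - b) / w ** a) / (2 * b * (b - a))
            - a * np.log(a) / (2 * b * (a - b)))

m1, s1, m2, s2 = 0.0, 1.0, 1.0, 2.0
a, b = 1.5, 0.7
p1, p2 = norm(m1, s1).pdf, norm(m2, s2).pdf
print(re_ab_integral(p1, p2, a, b), re_ab_normal(m1, s1, m2, s2, a, b))  # agree
```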
Example 2 (Exponential Distribution).
Consider the exponential distribution P having density p θ ( x ) = θ e θ x I ( x 0 ) with θ > 0 . This distribution is very useful in lifetime modeling and reliability engineering; it is also the maximum entropy distribution of a non-negative random variable with fixed mean. The Shannon and the Renyi entropies of P are, respectively, given by
$$E(P) = 1 - \log\theta, \qquad \text{and} \qquad E_{\alpha}(P) = \frac{\log\alpha}{\alpha-1} - \log\theta, \quad \alpha>0, \ \alpha\neq 1.$$
A simple calculation leads to the following form of our new GRE measure for the exponential distribution P.
$$E_{\alpha,\beta}(P) = \frac{(\alpha\log\beta - \beta\log\alpha)}{(\beta-\alpha)} - \log\theta, \quad \alpha>0, \ \beta\in\mathbb{R}\setminus\{0,\alpha\}, \qquad E_{\alpha,\alpha}(P) = (1-\log\alpha) - \log\theta, \quad \alpha>0.$$
Once again, the new GRE is effectively the same as the Shannon entropy or the Renyi entropy, up to an additive constant, for the exponential distribution as well.
Further, if P 1 and P 2 are two exponential distributions with parameters θ 1 and θ 2 , respectively, the relative entropy (KLD) and the relative ( α , β ) -entropy between them are given by
$$\mathrm{RE}(P_1,P_2) = \frac{\theta_2}{\theta_1} + \log\theta_1 - \log\theta_2 - 1, \qquad \mathrm{RE}_{\alpha,\beta}(P_1,P_2) = \frac{\alpha}{\beta(\alpha-\beta)}\log\left[\beta\theta_1+(\alpha-\beta)\theta_2\right] - \frac{1}{\alpha-\beta}\log\theta_1 - \frac{1}{\beta}\log\theta_2 - \frac{\alpha\log\alpha}{\beta(\alpha-\beta)},$$
for α > 0 and β ∈ ℝ∖{0, α}. Clearly, the contributions of the two distributions are weighted differently, by β and (α − β), in their relative (α, β)-entropy measure.
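These exponential-distribution expressions are likewise easy to verify numerically; the following check (ours, using scipy quadrature on [0, ∞)) recomputes the GRE and the relative (α, β)-entropy of exponential densities from their defining integrals and compares them with the closed forms above.

```python
import numpy as np
from scipy.integrate import quad

def exp_pdf(theta):
    return lambda x: theta * np.exp(-theta * x)

def gre_integral(p, a, b):
    """GRE from Definition 2 via numerical integration on [0, inf)."""
    I_a = quad(lambda x: p(x) ** a, 0, np.inf)[0]
    I_b = quad(lambda x: p(x) ** b, 0, np.inf)[0]
    return (b * np.log(I_a) - a * np.log(I_b)) / (b - a)

theta, a, b = 2.0, 1.5, 0.7
closed = (a * np.log(b) - b * np.log(a)) / (b - a) - np.log(theta)
print(gre_integral(exp_pdf(theta), a, b), closed)            # agree

theta1, theta2 = 2.0, 0.5
re_closed = (a / (b * (a - b)) * np.log(b * theta1 + (a - b) * theta2)
             - np.log(theta1) / (a - b) - np.log(theta2) / b
             - a * np.log(a) / (b * (a - b)))
p1, p2 = exp_pdf(theta1), exp_pdf(theta2)
I_p = quad(lambda x: p1(x) ** a, 0, np.inf)[0]
I_q = quad(lambda x: p2(x) ** a, 0, np.inf)[0]
I_pq = quad(lambda x: p1(x) ** b * p2(x) ** (a - b), 0, np.inf)[0]
re_integral = np.log(I_p) / (a - b) - a / (b * (a - b)) * np.log(I_pq) + np.log(I_q) / b
print(re_integral, re_closed)                                 # agree
```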
Before concluding this section, we study the nature of our relative ( α , β ) -entropy as α 0 . For this purpose, we restrict ourselves to the case of finite measure spaces with μ ( Ω ) < . It is again straightforward to note that lim α 0 RE α , β ( P , Q ) = 0 for any β R and any distributions P and Q on Ω . However, if we take the limit after scaling the relative entropy measure by α we get a non-degenerate divergence measure as follows.
$$\mathrm{RE}_{\beta}(P,Q) = \lim_{\alpha\to 0}\frac{1}{\alpha}\mathrm{RE}_{\alpha,\beta}(P,Q) = \frac{1}{\beta^{2}}\left[\log\int\left(\frac{p}{q}\right)^{\beta} d\mu - \frac{\beta}{\mu(\Omega)}\int\log\frac{p}{q}\, d\mu - \log\mu(\Omega)\right],$$
for β R { 0 } , and
$$\mathrm{RE}_{0}(P,Q) = \lim_{\alpha\to 0}\frac{1}{\alpha}\mathrm{RE}_{\alpha,0}(P,Q) = \frac{1}{2}\left[\frac{1}{\mu(\Omega)}\int\left(\log\frac{p}{q}\right)^{2} d\mu - \left(\frac{1}{\mu(\Omega)}\int\log\frac{p}{q}\, d\mu\right)^{2}\right].$$
These interesting relative entropy measures again define a subfamily of valid statistical divergences, from its construction. The particular member at β = 1 is linked to the LDPD (or the γ -divergence) with tuning parameter 1 and can be thought of as a logarithmic extension of the famous Itakura–Saito divergence [105] given by
$$D_{IS}(P,Q) = \int\frac{p}{q}\, d\mu - \int\log\frac{p}{q}\, d\mu - \mu(\Omega).$$
This Itakura–Saito-divergence has been successfully applied to non-negative matrix factorization in different applications [106] which can be extended by using the new divergence family RE β ( P , Q ) in future works.

3. Geometry of the Relative ( α , β ) -Entropy

3.1. Continuity

We start the exploration of the geometric properties of the relative ( α , β ) -entropy with its continuity over the functional space L α ( μ ) . In the following, we interchangeably use the notation RE α , β ( p , q ) and D λ ( p , q ) to denote RE α , β ( P , Q ) and D λ ( P , Q ) , respectively. Our results generalize the corresponding properties of the relative α -entropy from [16,73] to our relative ( α , β ) -entropy or equivalent LSD measure.
Proposition 5.
For a given $q\in L_{\alpha}(\mu)$, consider the function $p \mapsto \mathrm{RE}_{\alpha,\beta}(p,q)$ from $L_{\alpha}(\mu)$ to $[0,\infty]$. This function is lower semi-continuous in $L_{\alpha}(\mu)$ for any α > 0, β ∈ ℝ. Additionally, it is continuous in $L_{\alpha}(\mu)$ when α > β > 0 and the relative entropy is finitely defined.
Proof. 
First let us consider any α > 0 and take $p_n \to p$ in $L_{\alpha}(\mu)$. Then $\|p_n\|_{\alpha} \to \|p\|_{\alpha}$. Also, $|p_n^{\alpha} - p^{\alpha}| \leq |p_n|^{\alpha} + |p|^{\alpha}$ and hence a general version of the dominated convergence theorem yields $p_n^{\alpha} \to p^{\alpha}$ in $L_1(\mu)$. Thus, we get
$$p_{n,\alpha} := \frac{p_n^{\alpha}}{\int p_n^{\alpha}\, d\mu} \ \to\ p_{\alpha} \quad \text{in } L_1(\mu).$$
Further, following ([107], Lemma 1), we know that the function $h \mapsto \int\phi_{\lambda}(h)\, d\nu$ is lower semi-continuous in $L_1(\nu)$ for any λ ∈ ℝ and any probability measure ν on (Ω, 𝒜). Taking $\nu = Q_{\alpha}$, we get from (21) that $p_{n,\alpha}/q_{\alpha} \to p_{\alpha}/q_{\alpha}$ in $L_1(\nu)$. Therefore, the above lower semi-continuity result along with (9) implies that
$$\liminf_{n\to\infty} D_{\lambda}(p_{n,\alpha}, q_{\alpha}) \ \geq\ D_{\lambda}(p_{\alpha}, q_{\alpha}) \ \geq\ 0, \quad \forall\, \lambda\in\mathbb{R}.$$
Now, note that the function ψ ( u ) = 1 ρ log ( s i g n ( ρ ) u + 1 ) is continuous and increasing on [ 0 , ) for ρ > 0 and on [ 0 , 1 ) for ρ < 0 . Thus, combining (22) with the definition of the relative ( α , β ) -entropy in (13), we get that
$$\liminf_{n\to\infty} \mathrm{RE}_{\alpha,\beta}(p_n,q) \ \geq\ \mathrm{RE}_{\alpha,\beta}(p,q),$$
i.e., the function p RE α , β ( p , q ) is lower semi-continuous.
Finally, consider the case α > β > 0. Note that the dual space of $L_{\alpha/\beta}(\mu)$ is $L_{\alpha/(\alpha-\beta)}(\mu)$ since α > β > 0. Also, for $q\in L_{\alpha}(\mu)$, we have $\left(q/\|q\|_{\alpha}\right)^{\alpha-\beta} \in L_{\alpha/(\alpha-\beta)}(\mu)$, the dual space of the Banach space $L_{\alpha/\beta}(\mu)$. Therefore, the function $T: L_{\alpha/\beta}(\mu) \to \mathbb{R}$ defined by
$$T(h) = \int h\left(\frac{q}{\|q\|_{\alpha}}\right)^{\alpha-\beta} d\mu, \quad h\in L_{\alpha/\beta}(\mu),$$
is a bounded linear functional and hence continuous. Now, take $p_n \to p$ in $L_{\alpha}(\mu)$ so that $\|p_n\|_{\alpha} \to \|p\|_{\alpha}$ as n → ∞. Therefore, $\frac{p_n}{\|p_n\|_{\alpha}} \to \frac{p}{\|p\|_{\alpha}}$ in $L_{\alpha}(\mu)$, implying $\left(\frac{p_n}{\|p_n\|_{\alpha}}\right)^{\beta} \to \left(\frac{p}{\|p\|_{\alpha}}\right)^{\beta}$ in $L_{\alpha/\beta}(\mu)$. Hence, by the continuity of T on $L_{\alpha/\beta}(\mu)$, we get
$$T\!\left(\left(\frac{p_n}{\|p_n\|_{\alpha}}\right)^{\beta}\right) \ \to\ T\!\left(\left(\frac{p}{\|p\|_{\alpha}}\right)^{\beta}\right), \quad \text{as } n\to\infty.$$
However, from (14), we get
$$\mathrm{RE}_{\alpha,\beta}(p_n,q) = \frac{\alpha}{\beta(\beta-\alpha)}\log T\!\left(\left(\frac{p_n}{\|p_n\|_{\alpha}}\right)^{\beta}\right) \ \to\ \frac{\alpha}{\beta(\beta-\alpha)}\log T\!\left(\left(\frac{p}{\|p\|_{\alpha}}\right)^{\beta}\right) = \mathrm{RE}_{\alpha,\beta}(p,q).$$
This proves the continuity of RE α , β ( p , q ) in its first argument when α > β > 0 . ☐
Remark 2.
Whenever Ω is finite (discrete) equipped with the counting measure μ, all integrals in the definition of RE α , β ( P , Q ) become finite sums and any limit can be taken inside these finite sums. Thus, whenever defined finitely, the function p RE α , β ( p , q ) is always continuous in this case.
Remark 3.
For a general infinite space Ω, the function p RE α , β ( p , q ) is not necessarily continuous for the cases α < β . This can be seen by using the same counterexample as given in Remark 3 of [16]. However, it is yet to be verified if this function can be continuous for β < 0 cases.
Proposition 6.
For a given $p\in L_{\alpha}(\mu)$, consider the function $q \mapsto \mathrm{RE}_{\alpha,\beta}(p,q)$ from $L_{\alpha}(\mu)$ to $[0,\infty]$. This function is lower semi-continuous in $L_{\alpha}(\mu)$ for any α > 0 and β ∈ ℝ.
Proof. 
Fix an α > 0 and β R , which in turn fixes a λ R . Note that, the relative ( α , β ) -entropy measure can be re-expressed from (13) as
$$\mathrm{RE}_{\alpha,\beta}(p,q) = \frac{1}{\beta\lambda}\log\left[\mathrm{sign}(\beta\lambda)\, D_{-(\lambda+1)}(q_{\alpha}, p_{\alpha}) + 1\right].$$
Now, consider a sequence q n q in L α ( μ ) and proceed as in the proof of Proposition 5 using ([107], Lemma 1) to obtain
$$\liminf_{n\to\infty} D_{-(\lambda+1)}(q_{n,\alpha}, p_{\alpha}) \ \geq\ D_{-(\lambda+1)}(q_{\alpha}, p_{\alpha}) \ \geq\ 0, \quad \forall\, \lambda\in\mathbb{R}.$$
Now, whenever $D_{-(\lambda+1)}(q_{\alpha},p_{\alpha}) = 1$ with βλ < 0, or $D_{-(\lambda+1)}(q_{\alpha},p_{\alpha}) = \infty$ with βλ > 0, we get from (25) and (26) that
$$\liminf_{n\to\infty}\mathrm{RE}_{\alpha,\beta}(p,q_n) = \mathrm{RE}_{\alpha,\beta}(p,q) = +\infty.$$
In all other cases, we consider the function ψ ( u ) = 1 ρ log ( s i g n ( ρ ) u + 1 ) as in the proof of Proposition 5. This function is continuous and increasing whenever the corresponding relative entropy is finitely defined for all tuning parameter values; on [ 0 , ) for ρ > 0 and on [ 0 , 1 ) for ρ < 0 . Hence, again combining (26) with (25) through the function ψ , we conclude that
$$\liminf_{n\to\infty}\mathrm{RE}_{\alpha,\beta}(p,q_n) \ \geq\ \mathrm{RE}_{\alpha,\beta}(p,q).$$
Therefore, the function q RE α , β ( p , q ) is also lower semi-continuous. ☐
Remark 4.
As in Remark 2, whenever Ω is finite (discrete) and is equipped with the counting measure μ, the function q RE α , β ( p , q ) is continuous in L α ( μ ) for any fixed p L α ( μ ) , α > 0 and β R .

3.2. Convexity

It has been shown in [16] that the relative α-entropy (i.e., $\mathrm{RE}_{\alpha,1}(p,q)$) is neither convex nor bi-convex, but it is quasi-convex in p. For general β ≠ 1, however, the relative (α, β)-entropy $\mathrm{RE}_{\alpha,\beta}(p,q)$ is not even quasi-convex in $p\in L_{\alpha}(\mu)$; rather it is quasi-convex on the β-power transformed space of densities, $L_{\alpha}(\mu)^{\beta} = \{p^{\beta} : p\in L_{\alpha}(\mu)\}$, as described in the following theorem. Note that, for α, β > 0, $L_{\alpha}(\mu)^{\beta} = L_{\alpha/\beta}(\mu)$. Here we define the lower level set $B_{\alpha,\beta}(q,r) = \{p : \mathrm{RE}_{\alpha,\beta}(p,q) \leq r\}$ and its power-transformed set $B_{\alpha,\beta}(q,r)^{\beta} = \{p^{\beta} : p\in B_{\alpha,\beta}(q,r)\}$, for any $q\in L_{\alpha}(\mu)$ and r > 0.
Theorem 2.
For any given α > 0 , β R and q L α ( μ ) , the sets B α , β ( q , r ) β are convex for all r > 0 . Therefore, the function p β RE α , β ( p , q ) is quasi-convex on L α ( μ ) β .
Proof. 
Note that, at β = 1, our theorem coincides with Proposition 5 of [16]; so we will prove the result for the case β ≠ 1. Fix α, r > 0, a real β ∉ {1, α}, $q\in L_{\alpha}(\mu)$, and $p_0, p_1 \in B_{\alpha,\beta}(q,r)$. Then $p_0^{\beta}, p_1^{\beta} \in B_{\alpha,\beta}(q,r)^{\beta}$. For τ ∈ [0, 1], we consider $p_{\tau}^{\beta} = \tau p_1^{\beta} + \bar\tau p_0^{\beta}$ with $\bar\tau = 1-\tau$. We need to show that $p_{\tau}^{\beta}\in B_{\alpha,\beta}(q,r)^{\beta}$, i.e., $\mathrm{RE}_{\alpha,\beta}(p_{\tau},q) \leq r$.
Now, from (14), we have
$$\mathrm{RE}_{\alpha,\beta}(p,q) = \frac{1}{\beta\lambda}\log\int\left(\frac{p}{\|p\|_{\alpha}}\right)^{\beta}\left(\frac{q}{\|q\|_{\alpha}}\right)^{\alpha-\beta} d\mu = \frac{1}{\beta\lambda}\log\int\left(\frac{p_{\alpha}}{q_{\alpha}}\right)^{\beta/\alpha} dQ_{\alpha}.$$
Since p 0 β , p 1 β B α , β ( q , r ) β , we have
$$\mathrm{sign}(\beta\lambda)\int\left(\frac{p_{\tau}}{\|p_{\tau}\|_{\alpha}}\right)^{\beta}\left(\frac{q}{\|q\|_{\alpha}}\right)^{\alpha-\beta} d\mu \ \leq\ \mathrm{sign}(\beta\lambda)\, e^{r\beta\lambda}, \quad \text{for } \tau = 0, 1.$$
For any τ ( 0 , 1 ) , we get
$$\mathrm{sign}(\beta\lambda)\int\left(\frac{p_{\tau}}{\|p_{\tau}\|_{\alpha}}\right)^{\beta}\left(\frac{q}{\|q\|_{\alpha}}\right)^{\alpha-\beta} d\mu = \mathrm{sign}(\beta\lambda)\int\frac{\tau p_1^{\beta} + \bar\tau p_0^{\beta}}{\|p_{\tau}\|_{\alpha}^{\beta}}\left(\frac{q}{\|q\|_{\alpha}}\right)^{\alpha-\beta} d\mu \quad [\text{by definition of } p_{\tau}]$$
$$\leq\ \mathrm{sign}(\beta\lambda)\, e^{r\beta\lambda}\,\frac{\tau\|p_1\|_{\alpha}^{\beta} + \bar\tau\|p_0\|_{\alpha}^{\beta}}{\|p_{\tau}\|_{\alpha}^{\beta}}, \quad [\text{by } (30)].$$
Now, using the extended Minkowski's inequalities from Lemma 1, given below, along with (31), and noting that $\beta\lambda = \beta(\beta-\alpha)/\alpha$, we get that
$$\mathrm{sign}(\beta\lambda)\int\left(\frac{p_{\tau}}{\|p_{\tau}\|_{\alpha}}\right)^{\beta}\left(\frac{q}{\|q\|_{\alpha}}\right)^{\alpha-\beta} d\mu \ \leq\ \mathrm{sign}(\beta\lambda)\, e^{r\beta\lambda}.$$
Therefore, by (29) and the fact that 1 ρ log ( s i g n ( ρ ) u ) is increasing in u, we finally get RE α , β ( p τ , q ) r . This proves the result for α β .
The case β = α can be proved in a similar manner and is left as an exercise to the readers. ☐
Lemma 1 (Extended Minkowski’s inequality).
Fix α > 0, a real β ∉ {1, α}, $p_0, p_1 \in L_{\alpha}(\mu)$, and τ ∈ [0, 1]. Define $p_{\tau}^{\beta} = \tau p_1^{\beta} + \bar\tau p_0^{\beta}$ with $\bar\tau = 1-\tau$. Then we have the following inequalities:
$$\|p_{\tau}\|_{\alpha}^{\beta} \ \geq\ \tau\|p_1\|_{\alpha}^{\beta} + \bar\tau\|p_0\|_{\alpha}^{\beta}, \quad \text{if } \beta(\beta-\alpha) > 0,$$
$$\|p_{\tau}\|_{\alpha}^{\beta} \ \leq\ \tau\|p_1\|_{\alpha}^{\beta} + \bar\tau\|p_0\|_{\alpha}^{\beta}, \quad \text{if } \beta(\beta-\alpha) < 0.$$
Proof. 
It follows by using the Jensen’s inequality and the convexity of the function x β / α . ☐
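Because the direction of the inequality in Lemma 1 switches with the sign of β(β − α), a quick randomized spot-check is reassuring. The snippet below (our own check on positive vectors under the counting measure, taking ‖·‖_α = (Σ(·)^α)^{1/α}, the convention used in the escort identities) verifies both regimes empirically.

```python
import numpy as np

rng = np.random.default_rng(0)

def a_norm(f, a):
    """(sum f^a)^(1/a): the alpha-norm convention used in the escort identities."""
    return np.sum(f ** a) ** (1.0 / a)

def check(a, b, trials=1000):
    """Compare ||p_tau||_a^b with tau*||p1||_a^b + (1-tau)*||p0||_a^b (Lemma 1)."""
    sign = np.sign(b * (b - a))
    ok = True
    for _ in range(trials):
        p0, p1 = rng.random(6) + 0.1, rng.random(6) + 0.1   # positive "densities"
        t = rng.random()
        p_tau = (t * p1 ** b + (1 - t) * p0 ** b) ** (1.0 / b)
        lhs = a_norm(p_tau, a) ** b
        rhs = t * a_norm(p1, a) ** b + (1 - t) * a_norm(p0, a) ** b
        # lhs >= rhs when b(b-a) > 0, and lhs <= rhs when b(b-a) < 0
        ok &= (sign * (lhs - rhs) >= -1e-10)
    return ok

print(check(a=1.5, b=0.7))   # b(b-a) < 0 regime: True
print(check(a=1.5, b=2.5))   # b(b-a) > 0 regime: True
print(check(a=1.5, b=-1.0))  # negative b, b(b-a) > 0: True
```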
Next, note in view of Proposition 4 that, for any p , q L α ( μ ) , RE α , β ( p , q ) = RE α , α β ( q , p ) . Using this result along with the above theorem, we also get the quasi-convexity of the relative ( α , β ) -entropy RE α , β ( p , q ) in q over a different power transformed space of densities. This leads to the following theorem.
Theorem 3.
For any given α > 0, β ∈ ℝ and $p\in L_{\alpha}(\mu)$, the function $q^{\alpha-\beta} \mapsto \mathrm{RE}_{\alpha,\beta}(p,q)$ is quasi-convex on $L_{\alpha}(\mu)^{\alpha-\beta}$. In particular, for the choice β = α − 1, the function $q \mapsto \mathrm{RE}_{\alpha,\beta}(p,q)$ is quasi-convex on $L_{\alpha}(\mu)$.
Remark 5.
Note that, at α = β = 1 , the RE 1 , 1 ( p , q ) coincides with the KLD measure (or relative entropy) which is quasi-convex in both the arguments p and q on L α ( μ ) .

3.3. Extended Pythagorean Relation

Motivated by the quasi-convexity of RE α , β ( p , q ) on L α ( μ ) β , we now present a Pythagorean-type result for the general relative ( α , β ) -entropy over the power-transformed space. It generalizes the corresponding result for relative α -entropy [16]; the proof is similar to that in [16] with necessary modifications due to the transformation of the domain space.
Theorem 4 (Pythagorean Property).
Fix an α > 0 , β R with β α and p 0 , p 1 , q L α ( μ ) . Define p τ L α ( μ ) by p τ β = τ p 1 β + τ ¯ p 0 β for τ [ 0 , 1 ] and τ ¯ = 1 τ .
(i) 
Suppose $\mathrm{RE}_{\alpha,\beta}(p_0,q)$ and $\mathrm{RE}_{\alpha,\beta}(p_1,q)$ are finite. Then, $\mathrm{RE}_{\alpha,\beta}(p_{\tau},q) \geq \mathrm{RE}_{\alpha,\beta}(p_0,q)$ for all τ ∈ [0, 1], i.e., the back-transformation to $L_{\alpha}(\mu)$ of the line segment joining $p_1^{\beta}$ and $p_0^{\beta}$ on $L_{\alpha}(\mu)^{\beta}$ does not intersect $B_{\alpha,\beta}(q, \mathrm{RE}_{\alpha,\beta}(p_0,q))$, if and only if
$$\mathrm{RE}_{\alpha,\beta}(p_1,q) \ \geq\ \mathrm{RE}_{\alpha,\beta}(p_1,p_0) + \mathrm{RE}_{\alpha,\beta}(p_0,q).$$
(ii) 
Suppose RE α , β ( p τ , q ) is finite for some fixed τ ( 0 , 1 ) . Then, the back-transformation of line segment joining p 1 β and p 0 β on L α ( μ ) β to L α ( μ ) does not intersect B α , β ( q , RE α , β ( p τ , q ) ) if and only if
$$\mathrm{RE}_{\alpha,\beta}(p_1,q) = \mathrm{RE}_{\alpha,\beta}(p_1,p_{\tau}) + \mathrm{RE}_{\alpha,\beta}(p_{\tau},q),$$
$$\text{and} \quad \mathrm{RE}_{\alpha,\beta}(p_0,q) = \mathrm{RE}_{\alpha,\beta}(p_0,p_{\tau}) + \mathrm{RE}_{\alpha,\beta}(p_{\tau},q).$$
Proof of Part (i).
Let $P_{\tau,\alpha}$ be the probability measure having μ-density $p_{\tau,\alpha} = p_{\tau}^{\alpha}/\int p_{\tau}^{\alpha}\, d\mu$ for τ ∈ [0, 1]. Also note that, with $\lambda = \beta/\alpha - 1$, we have
$$D_{\lambda}(P_{\alpha},Q_{\alpha}) = \mathrm{sign}(\beta\lambda)\left[\int\left(\frac{p}{\|p\|_{\alpha}}\right)^{\beta} q_{\alpha}^{-\lambda}\, d\mu - 1\right], \quad \text{for } p, q\in L_{\alpha}(\mu).$$
Thus, (34) is equivalent to the statement
$$\mathrm{sign}(\beta\lambda)\,\|p_0\|_{\alpha}^{\beta}\int p_1^{\beta}\, q_{\alpha}^{-\lambda}\, d\mu \ \geq\ \mathrm{sign}(\beta\lambda)\int p_1^{\beta}\, p_{0,\alpha}^{-\lambda}\, d\mu \cdot \int p_0^{\beta}\, q_{\alpha}^{-\lambda}\, d\mu.$$
and we have
$$D_{\lambda}(P_{\tau,\alpha},Q_{\alpha}) = \mathrm{sign}(\beta\lambda)\left[\int\left(\frac{p_{\tau}}{\|p_{\tau}\|_{\alpha}}\right)^{\beta} q_{\alpha}^{-\lambda}\, d\mu - 1\right] = \mathrm{sign}(\beta\lambda)\left[\frac{s(\tau)}{t(\tau)} - 1\right],$$
where $s(\tau) = \int p_{\tau}^{\beta}\, q_{\alpha}^{-\lambda}\, d\mu$ and $t(\tau) = \|p_{\tau}\|_{\alpha}^{\beta}$. Now consider the two implications separately.
Only if statement: Now, let us assume that $\mathrm{RE}_{\alpha,\beta}(p_{\tau},q) \geq \mathrm{RE}_{\alpha,\beta}(p_0,q)$ for all τ ∈ (0, 1). Then, we get $\frac{1}{\tau}\left[D_{\lambda}(P_{\tau,\alpha},Q_{\alpha}) - D_{\lambda}(P_{0,\alpha},Q_{\alpha})\right] \geq 0$ for all τ ∈ (0, 1). Letting τ ↓ 0, we get that
$$\left.\frac{\partial}{\partial\tau} D_{\lambda}(P_{\tau,\alpha},Q_{\alpha})\right|_{\tau=0} \ \geq\ 0.$$
In order to find the derivative of D λ ( P τ , α , Q α ) , we first note that
$$\frac{s(\tau)-s(0)}{\tau} = \frac{1}{\tau}\left[\int p_{\tau}^{\beta}\, q_{\alpha}^{-\lambda}\, d\mu - \int p_0^{\beta}\, q_{\alpha}^{-\lambda}\, d\mu\right] = \int (p_1^{\beta}-p_0^{\beta})\, q_{\alpha}^{-\lambda}\, d\mu,$$
and hence
$$s'(0) = \lim_{\tau\downarrow 0}\frac{s(\tau)-s(0)}{\tau} = \int (p_1^{\beta}-p_0^{\beta})\, q_{\alpha}^{-\lambda}\, d\mu.$$
Further, using a simple modification of the techniques in the proof of ([16], Theorem 9), it is easy to verify that the derivative of t ( τ ) with respect to τ exists and is given by
$$t'(\tau) = \left(\int p_{\tau}^{\alpha}\, d\mu\right)^{\frac{\beta-\alpha}{\alpha}}\int p_{\tau}^{\alpha-\beta}\,(p_1^{\beta}-p_0^{\beta})\, d\mu.$$
Hence we get
$$t'(0) = \left(\int p_0^{\alpha}\, d\mu\right)^{\frac{\beta-\alpha}{\alpha}}\int p_0^{\alpha-\beta}\,(p_1^{\beta}-p_0^{\beta})\, d\mu = \int p_1^{\beta}\, p_{0,\alpha}^{-\lambda}\, d\mu - \|p_0\|_{\alpha}^{\beta}.$$
Therefore, the derivative of $D_{\lambda}(P_{\tau,\alpha},Q_{\alpha}) = \mathrm{sign}(\beta\lambda)\left[s(\tau)/t(\tau) - 1\right]$ with respect to τ exists at τ = 0 and is given by $\mathrm{sign}(\beta\lambda)\left[t(0)\,s'(0) - t'(0)\,s(0)\right]/t(0)^{2}$. Therefore, using (40), we get that
$$\mathrm{sign}(\beta\lambda)\, t(0)\, s'(0) \ \geq\ \mathrm{sign}(\beta\lambda)\, t'(0)\, s(0),$$
which implies (38) after substituting the values from (41) and (42).
If statement: Now, let us assume that (34)—or equivalently (38)—holds true. Further, as in the derivation of (38), we can start from the trivial statement
RE α , β ( p 0 , q ) = RE α , β ( p 0 , p 0 ) + RE α , β ( p 0 , q ) ,
to deduce
$$\mathrm{sign}(\beta\lambda)\,\|p_0\|_{\alpha}^{\beta}\int p_0^{\beta}\, q_{\alpha}^{-\lambda}\, d\mu = \mathrm{sign}(\beta\lambda)\int p_0^{\beta}\, p_{0,\alpha}^{-\lambda}\, d\mu \cdot \int p_0^{\beta}\, q_{\alpha}^{-\lambda}\, d\mu.$$
Now, multiply (38) by τ and (44) by τ ¯ , and add to get
$$\mathrm{sign}(\beta\lambda)\,\|p_0\|_{\alpha}^{\beta}\int p_{\tau}^{\beta}\, q_{\alpha}^{-\lambda}\, d\mu \ \geq\ \mathrm{sign}(\beta\lambda)\int p_{\tau}^{\beta}\, p_{0,\alpha}^{-\lambda}\, d\mu \cdot \int p_0^{\beta}\, q_{\alpha}^{-\lambda}\, d\mu.$$
In view of (37), this implies that
$$\mathrm{RE}_{\alpha,\beta}(p_{\tau},q) \ \geq\ \mathrm{RE}_{\alpha,\beta}(p_{\tau},p_0) + \mathrm{RE}_{\alpha,\beta}(p_0,q) \ \geq\ \mathrm{RE}_{\alpha,\beta}(p_0,q).$$
This proves the if statement of Part (i) completing the proof. ☐
Proof of Part (ii).
Note that the if statement follows directly from Part (i).
To prove the only if statement, we first show that $\mathrm{RE}_{\alpha,\beta}(p_1,q)$ and $\mathrm{RE}_{\alpha,\beta}(p_0,q)$ are finite since $\mathrm{RE}_{\alpha,\beta}(p_{\tau},q)$ is finite. For this purpose, we note that $p_1^{\beta} \leq \tau^{-1} p_{\tau}^{\beta}$ by the definition of $p_{\tau}$ and hence $(p_1/q)^{\beta} \leq \tau^{-1}(p_{\tau}/q)^{\beta}$. Therefore, we get
$$\left(\frac{p_{1,\alpha}}{q_{\alpha}}\right)^{\beta/\alpha} = \left(\frac{p_1}{q}\right)^{\beta}\left(\frac{\|q\|_{\alpha}}{\|p_1\|_{\alpha}}\right)^{\beta} \ \leq\ \frac{1}{\tau}\left(\frac{p_{\tau}}{q}\right)^{\beta}\left(\frac{\|q\|_{\alpha}}{\|p_1\|_{\alpha}}\right)^{\beta} = \frac{1}{\tau}\left(\frac{p_{\tau,\alpha}}{q_{\alpha}}\right)^{\beta/\alpha}\left(\frac{\|p_{\tau}\|_{\alpha}}{\|p_1\|_{\alpha}}\right)^{\beta}.$$
Integrating with respect to $Q_{\alpha}$ and using (29), we get $\mathrm{RE}_{\alpha,\beta}(p_1,q) \leq \mathrm{RE}_{\alpha,\beta}(p_{\tau},q) + c < \infty$, where c is a constant. Similarly, one can also show that $\mathrm{RE}_{\alpha,\beta}(p_0,q) < \infty$.
Therefore, we can apply Part (i) to conclude that
$$\mathrm{RE}_{\alpha,\beta}(p_1,q) \geq \mathrm{RE}_{\alpha,\beta}(p_1,p_{\tau}) + \mathrm{RE}_{\alpha,\beta}(p_{\tau},q), \quad \text{and} \quad \mathrm{RE}_{\alpha,\beta}(p_0,q) \geq \mathrm{RE}_{\alpha,\beta}(p_0,p_{\tau}) + \mathrm{RE}_{\alpha,\beta}(p_{\tau},q).$$
These relations imply that
$$\mathrm{sign}(\beta\lambda)\,\|p_{\tau}\|_{\alpha}^{\beta}\int p_1^{\beta}\, q_{\alpha}^{-\lambda}\, d\mu \ \geq\ \mathrm{sign}(\beta\lambda)\int p_1^{\beta}\, p_{\tau,\alpha}^{-\lambda}\, d\mu \cdot \int p_{\tau}^{\beta}\, q_{\alpha}^{-\lambda}\, d\mu,$$
$$\text{and} \quad \mathrm{sign}(\beta\lambda)\,\|p_{\tau}\|_{\alpha}^{\beta}\int p_0^{\beta}\, q_{\alpha}^{-\lambda}\, d\mu \ \geq\ \mathrm{sign}(\beta\lambda)\int p_0^{\beta}\, p_{\tau,\alpha}^{-\lambda}\, d\mu \cdot \int p_{\tau}^{\beta}\, q_{\alpha}^{-\lambda}\, d\mu.$$
The proof of the above results proceed in a manner analogous to the proof of (38). Now, if either of the inequalities in (46) is strict, the corresponding inequality in (47) or (48) will also be strict. Then, multiplying (47) and (48) by τ and τ ¯ , respectively, and adding them we get (44) with a strict inequality (in place of an equality), which is a contradiction. Hence, both inequalities in (46) must be equalities implying (35) and (36). This completes the proof. ☐
Note that, at β = 1 , the above theorem coincides with Theorem 9 of [16]. However, for general α , β as well, the above extended Pythagorean relation for the relative ( α , β ) -entropy suggests that it behaves “like" a squared distance (although with a non-linear space transformation). So, one can meaningfully define its projection on to a suitable set which we will explore in the following sections.

4. The Forward Projection of Relative ( α , β ) -Entropy

The forward projection, i.e., minimization with respect to the first argument given a fixed second argument, leads to the important maximum entropy principle of information theory; it also relates to the Gibbs conditioning principle from statistical physics [16]. Let us now formally define and study the forward projection of the relative (α, β)-entropy. Let 𝒮 denote the set of probability measures on (Ω, 𝒜) and let the set of corresponding μ-densities be denoted by $S = \{p = dP/d\mu : P\in\mathcal{S}\}$.
Definition 3 (Forward ( α , β ) -Projection).
Fix Q ∈ 𝒮 having μ-density $q\in L_{\alpha}(\mu)$. Let $E\subseteq S$ with $\mathrm{RE}_{\alpha,\beta}(p,q) < \infty$ for some $p\in E$. Then, $p_{*}\in E$ is called the forward projection of the relative (α, β)-entropy, or simply the forward (α, β)-projection (or forward LSD projection), of q on E if it satisfies the relation
$$\mathrm{RE}_{\alpha,\beta}(p_{*}, q) = \inf_{p\in E}\mathrm{RE}_{\alpha,\beta}(p,q).$$
Note that we must assume that $E \subseteq L_{\alpha}(\mu)$ so that the above relative (α, β)-entropy is finitely defined for $p\in E$.
We first prove the uniqueness of the forward (α, β)-projection from the Pythagorean property, whenever it exists. The following theorem describes the connection of the forward (α, β)-projection with the Pythagorean relation; the proof is the same as that of ([16], Theorem 10) using Theorem 4 and hence omitted for brevity.
Theorem 5.
Consider the set $E\subseteq S$ such that $E^{\beta}$ is convex, and fix $q\in L_{\alpha}(\mu)$. Then, $p_{*}\in E\cap B_{\alpha,\beta}(q,\infty)$ is a forward (α, β)-projection of q on E if and only if every $p\in E\cap B_{\alpha,\beta}(q,\infty)$ satisfies
$$\mathrm{RE}_{\alpha,\beta}(p,q) \ \geq\ \mathrm{RE}_{\alpha,\beta}(p,p_{*}) + \mathrm{RE}_{\alpha,\beta}(p_{*},q).$$
Further, if $(p_{*})^{\beta}$ is an algebraic inner point of $E^{\beta}$, i.e., for every $p\in E$ there exist $p'\in E$ and τ ∈ (0, 1) such that $(p_{*})^{\beta} = \tau p^{\beta} + (1-\tau)(p')^{\beta}$, then every $p\in E$ satisfies $\mathrm{RE}_{\alpha,\beta}(p,q) < \infty$ and
$$\mathrm{RE}_{\alpha,\beta}(p,q) = \mathrm{RE}_{\alpha,\beta}(p,p_{*}) + \mathrm{RE}_{\alpha,\beta}(p_{*},q), \quad \text{and} \quad \mathrm{RE}_{\alpha,\beta}(p',q) = \mathrm{RE}_{\alpha,\beta}(p',p_{*}) + \mathrm{RE}_{\alpha,\beta}(p_{*},q).$$
Corollary 1 (Uniqueness of Forward ( α , β ) -Projection).
Consider the set E S such that E β is convex and fix q L α ( μ ) . If a forward ( α , β ) -projection of q on E exists, it must be unique a.s. [ μ ] .
Proof. 
Suppose $p_1$ and $p_2$ are two forward (α, β)-projections of q on E. Then, by definition, $\mathrm{RE}_{\alpha,\beta}(p_1,q) = \mathrm{RE}_{\alpha,\beta}(p_2,q) < \infty$. Applying Theorem 5 with $p_{*} = p_1$ and $p = p_2$, we get
$$\mathrm{RE}_{\alpha,\beta}(p_2,q) \ \geq\ \mathrm{RE}_{\alpha,\beta}(p_2,p_1) + \mathrm{RE}_{\alpha,\beta}(p_1,q).$$
Hence $\mathrm{RE}_{\alpha,\beta}(p_2,p_1) \leq 0$, i.e., $\mathrm{RE}_{\alpha,\beta}(p_2,p_1) = 0$ by the non-negativity of the relative entropy, which further implies that $p_1 = p_2$ a.s. [μ] by Proposition 1. ☐
Next we will show the existence of the forward ( α , β ) -projection under suitable conditions. We need to use an extended Apollonius Theorem for the ϕ -divergence measure D λ used in the definition (13) of the relative ( α , β ) -entropy. Such a result is proved in [16] for the special case α ( 1 + λ ) = 1 ; the following lemma extends it for the general case α ( 1 + λ ) = β R .
Lemma 2.
Fix p 0 , p 1 , q L α ( μ ) , τ [ 0 , 1 ] and α ( 1 + λ ) = β R with α > 0 and define r satisfying
$$r^{\beta} = \frac{\dfrac{\tau}{\|p_1\|_{\alpha}^{\beta}}\, p_1^{\beta} + \dfrac{1-\tau}{\|p_0\|_{\alpha}^{\beta}}\, p_0^{\beta}}{\dfrac{\tau}{\|p_1\|_{\alpha}^{\beta}} + \dfrac{1-\tau}{\|p_0\|_{\alpha}^{\beta}}}.$$
Let p j , α = p j α / p j α d μ for j = 0 , 1 , and similarly q α and r α . Then, if β ( β α ) > 0 we have
$$\tau D_{\lambda}(p_{1,\alpha},q_{\alpha}) + (1-\tau) D_{\lambda}(p_{0,\alpha},q_{\alpha}) \ \geq\ \tau D_{\lambda}(p_{1,\alpha},r_{\alpha}) + (1-\tau) D_{\lambda}(p_{0,\alpha},r_{\alpha}) + D_{\lambda}(r_{\alpha},q_{\alpha}),$$
but the inequality gets reversed if β(β − α) < 0.
Proof. 
By (37), we get
$$\tau D_{\lambda}(p_{1,\alpha},q_{\alpha}) + (1-\tau) D_{\lambda}(p_{0,\alpha},q_{\alpha}) - \tau D_{\lambda}(p_{1,\alpha},r_{\alpha}) - (1-\tau) D_{\lambda}(p_{0,\alpha},r_{\alpha})$$
$$= \mathrm{sign}(\beta\lambda)\,\tau\int\left(\frac{p_1}{\|p_1\|_{\alpha}}\right)^{\beta}\left(q_{\alpha}^{-\lambda} - r_{\alpha}^{-\lambda}\right) d\mu + \mathrm{sign}(\beta\lambda)\,(1-\tau)\int\left(\frac{p_0}{\|p_0\|_{\alpha}}\right)^{\beta}\left(q_{\alpha}^{-\lambda} - r_{\alpha}^{-\lambda}\right) d\mu$$
$$= \|r\|_{\alpha}^{\beta}\left[\frac{\tau}{\|p_1\|_{\alpha}^{\beta}} + \frac{1-\tau}{\|p_0\|_{\alpha}^{\beta}}\right] D_{\lambda}(R_{\alpha},Q_{\alpha}).$$
Then the Lemma follows by an application of the extended Minkowski’s inequalities (32) and (33) from Lemma 1. ☐
We now present the sufficient conditions for the existence of the forward ( α , β ) -projection in the following theorem.
Theorem 6 (Existence of Forward ( α , β ) -Projection).
Fix α > 0 and β R with β α and q L α ( μ ) . Given any set E S for which E β is convex and closed and RE α , β ( p , q ) < for some p E , a forward ( α , β ) -projection of q on E always exists (and it is unique by Corollary 1).
Proof. 
We prove it separately for the cases β λ > 0 and β λ < 0 , extending the arguments from [16]. The case β λ = 0 can be obtained from these two cases by standard limiting arguments and hence omitted for brevity.
The Case β λ > 0 :
Consider a sequence $\{p_n\}\subseteq E$ such that $D_{\lambda}(p_{n,\alpha},q_{\alpha}) < \infty$ for each n and $D_{\lambda}(p_{n,\alpha},q_{\alpha}) \to \inf_{p\in E} D_{\lambda}(p_{\alpha},q_{\alpha})$ as n → ∞. Then, by Lemma 2 applied to $p_m$ and $p_n$ with τ = 1/2, we get
$$\frac{1}{2} D_{\lambda}(p_{m,\alpha},q_{\alpha}) + \frac{1}{2} D_{\lambda}(p_{n,\alpha},q_{\alpha}) \ \geq\ \frac{1}{2} D_{\lambda}(p_{m,\alpha},r_{m,n,\alpha}) + \frac{1}{2} D_{\lambda}(p_{n,\alpha},r_{m,n,\alpha}) + D_{\lambda}(r_{m,n,\alpha},q_{\alpha}),$$
where r m , n is defined by
$$r_{m,n}^{\beta} = \frac{\dfrac{\tau}{\|p_m\|_{\alpha}^{\beta}}\, p_m^{\beta} + \dfrac{1-\tau}{\|p_n\|_{\alpha}^{\beta}}\, p_n^{\beta}}{\dfrac{\tau}{\|p_m\|_{\alpha}^{\beta}} + \dfrac{1-\tau}{\|p_n\|_{\alpha}^{\beta}}}.$$
Note that, since E β is convex, r m , n E β and so r m , n E . Also, using the non-negativity of divergence, (53) leads to
$$0 \ \leq\ \frac{1}{2} D_{\lambda}(p_{m,\alpha},r_{m,n,\alpha}) + \frac{1}{2} D_{\lambda}(p_{n,\alpha},r_{m,n,\alpha}) \ \leq\ \frac{1}{2} D_{\lambda}(p_{m,\alpha},q_{\alpha}) + \frac{1}{2} D_{\lambda}(p_{n,\alpha},q_{\alpha}) - D_{\lambda}(r_{m,n,\alpha},q_{\alpha}).$$
Taking the limit as m, n → ∞, one can see that $\frac{1}{2} D_{\lambda}(p_{m,\alpha},q_{\alpha}) + \frac{1}{2} D_{\lambda}(p_{n,\alpha},q_{\alpha}) - D_{\lambda}(r_{m,n,\alpha},q_{\alpha}) \to 0$ and hence $D_{\lambda}(p_{m,\alpha},r_{m,n,\alpha}) + D_{\lambda}(p_{n,\alpha},r_{m,n,\alpha}) \to 0$. Thus, $D_{\lambda}(p_{m,\alpha},r_{m,n,\alpha}) \to 0$ as m, n → ∞ by non-negativity. This, along with a generalization of Pinsker's inequality for ϕ-divergences ([100], Theorem 1), gives
$$\lim_{m,n\to\infty}\|p_{m,\alpha} - r_{m,n,\alpha}\|_{T} = 0,$$
whenever λ ( 1 + λ ) > 0 (which is true since β λ > 0 ); here | | · | | T denotes the total variation norm. Now, by triangle inequality
$$\|p_{m,\alpha} - p_{n,\alpha}\|_{T} \ \leq\ \|p_{m,\alpha} - r_{m,n,\alpha}\|_{T} + \|p_{n,\alpha} - r_{m,n,\alpha}\|_{T} \ \to\ 0, \quad \text{as } m, n\to\infty.$$
Thus, { p n , α } is Cauchy in L 1 ( μ ) and hence converges to some g L 1 ( μ ) , i.e.,
$$\lim_{n\to\infty}\int |p_{n,\alpha} - g|\, d\mu = 0,$$
and g is a probability density with respect to μ since each $p_{n,\alpha}$ is so. Also, (57) implies that $p_{n,\alpha} \to g$ in [μ]-measure and hence $p_{n,\alpha}^{1/\alpha} \to g^{1/\alpha}$ in $L_{\alpha}(\mu)$ by an application of the generalized dominated convergence theorem.
Next, as in the proof of ([16], Theorem 8), we can show that $\|p_n\|_{\alpha}$ is bounded and hence $\|p_n\|_{\alpha} \to c$ for some c > 0, possibly working with a subsequence if needed. Thus we have $p_n = \|p_n\|_{\alpha}\, p_{n,\alpha}^{1/\alpha} \to c\, g^{1/\alpha}$ in $L_{\alpha}(\mu)$. However, since $E^{\beta}$ is closed, E is closed and hence $c\, g^{1/\alpha} = p_{*}$ for some $p_{*}\in E$. Further, since $\int g\, d\mu = 1$, we must have $c = \|p_{*}\|_{\alpha}$ and hence $g = p_{*,\alpha}$. Since $p_n \to p_{*}$ and $p_{*}\in E$, Proposition 5 implies that
$$\mathrm{RE}_{\alpha,\beta}(p_{*},q) \ \leq\ \liminf_{n\to\infty}\mathrm{RE}_{\alpha,\beta}(p_n,q) = \inf_{p\in E}\mathrm{RE}_{\alpha,\beta}(p,q) \ \leq\ \mathrm{RE}_{\alpha,\beta}(p_{*},q),$$
where the second equality follows by the continuity of the function $f(u) = (\beta\lambda)^{-1}\log(\mathrm{sign}(\beta\lambda)\, u + 1)$, the definition of the sequence $p_n$ and (13). Hence, we must have $\mathrm{RE}_{\alpha,\beta}(p_{*},q) = \inf_{p\in E}\mathrm{RE}_{\alpha,\beta}(p,q)$, i.e., $p_{*}$ is a forward (α, β)-projection of q on E.
The Case β λ < 0 :
Note that, in this case, we must have 0 < β < α , since α > 0 . Then, using (29), we can see that
$$\inf_{p\in E}\mathrm{RE}_{\alpha,\beta}(p,q) = \frac{1}{\beta\lambda}\log\,\sup_{p\in E}\int\left(\frac{p}{\|p\|_{\alpha}}\right)^{\beta}\left(\frac{q}{\|q\|_{\alpha}}\right)^{\alpha-\beta} d\mu = \frac{1}{\beta\lambda}\log\,\sup_{h\in\tilde{E}}\int h\, g\, d\mu,$$
where $g = \left(\dfrac{q}{\|q\|_{\alpha}}\right)^{\alpha-\beta} \in L_{\alpha/(\alpha-\beta)}(\mu)$ and
$$\tilde{E} = \left\{ s\left(\frac{p}{\|p\|_{\alpha}}\right)^{\beta} : p\in E, \ s\in[0,1] \right\} \subseteq L_{\alpha/\beta}(\mu).$$
Now, since $E^{\beta}$ and hence E is closed, one can show that $\tilde{E}$ is also closed; see, e.g., the proof of ([16], Theorem 8). Next, we will show that $\tilde{E}$ is also convex. To see this, take $s_1\left(\frac{p_1}{\|p_1\|_{\alpha}}\right)^{\beta}\in\tilde{E}$ and $s_0\left(\frac{p_0}{\|p_0\|_{\alpha}}\right)^{\beta}\in\tilde{E}$ for some $s_0, s_1\in[0,1]$ and $p_0, p_1\in E$, and take any τ ∈ [0, 1]. Note that
$$\tau s_1\left(\frac{p_1}{\|p_1\|_{\alpha}}\right)^{\beta} + (1-\tau) s_0\left(\frac{p_0}{\|p_0\|_{\alpha}}\right)^{\beta} = s_{\tau}\left(\frac{p_{\tau}}{\|p_{\tau}\|_{\alpha}}\right)^{\beta},$$
where
$$p_{\tau}^{\beta} = \frac{\tau s_1\left(\frac{p_1}{\|p_1\|_{\alpha}}\right)^{\beta} + (1-\tau) s_0\left(\frac{p_0}{\|p_0\|_{\alpha}}\right)^{\beta}}{\dfrac{\tau s_1}{\|p_1\|_{\alpha}^{\beta}} + \dfrac{(1-\tau) s_0}{\|p_0\|_{\alpha}^{\beta}}}, \qquad \text{and} \qquad s_{\tau} = \left[\frac{\tau s_1}{\|p_1\|_{\alpha}^{\beta}} + \frac{(1-\tau) s_0}{\|p_0\|_{\alpha}^{\beta}}\right]\|p_{\tau}\|_{\alpha}^{\beta}.$$
However, by the convexity of $E^{\beta}$, $p_{\tau}\in E$, and also $0 \leq s_{\tau} \leq 1$ by the extended Minkowski inequality (33). Therefore, $s_{\tau}\left(\frac{p_{\tau}}{\|p_{\tau}\|_{\alpha}}\right)^{\beta}\in\tilde{E}$ and hence $\tilde{E}$ is convex.
Finally, since 0 < β < α , L α / β ( μ ) is a reflexive Banach space and hence the closed and convex E ˜ L α / β ( μ ) is also closed in the weak topology. So, the unit ball is compact in the weak topology by the Banach-Alaoglu theorem and hence its closed subset E ˜ is also weakly compact. However, since g belongs to the dual space of L α / β ( μ ) , the linear functional h h g d μ is continuous in weak topology and also increasing in s. Hence its supremum over E ˜ is attained at s = 1 and some p E , which is the required forward ( α , β ) -projection. ☐
Before concluding this section, we will present one example of the forward ( α , β ) -projection onto a transformed-linear family of distributions.
Example 3 (An example of the forward ( α , β ) -projection).
Fix α > 0 , β R { 0 , α } and q L α ( μ ) related to the measure Q. Consider measurable functions f i : Ω R for i I , an index set, and the family of distributions
$$\mathbb{L}_{\beta} = \left\{ P\in\mathcal{S} : \int f_i\, dP_{\beta} = 0 \ \text{ for all } i\in I \right\} \subseteq \mathcal{S}.$$
Let us denote the corresponding μ-density set by $L_{\beta} = \{p = \frac{dP}{d\mu} : P\in\mathbb{L}_{\beta}\}$. We assume that $L_{\beta}$ is non-empty, every $P\in\mathbb{L}_{\beta}$ is absolutely continuous with respect to μ, and $L_{\beta}\subseteq L_{\alpha}(\mu)$.
Then, $p^{*}$ is the forward $(\alpha, \beta)$-projection of $q$ on $\mathcal{L}_\beta$ if and only if there exists a function $g$ in the $L_1(Q_\beta)$-closure of the linear space spanned by $\{f_i : i \in I\}$ and a subset $N \subseteq \Omega$ such that, for every $P \in \mathbb{L}_\beta$,
$$P(N) = 0 \ \text{ if } \alpha < \beta, \qquad c \int_{N} q^{\alpha-\beta}\, dP_\beta \le \int_{\Omega \setminus N} g\, dP_\beta \ \text{ if } \alpha > \beta,$$
with $c = \dfrac{\int (p^{*})^{\alpha}\, d\mu}{\int (p^{*})^{\beta}\, q^{\alpha-\beta}\, d\mu}$ and $p^{*}$ satisfies
$$p^{*}(x)^{\alpha-\beta} = c\, q(x)^{\alpha-\beta} + g(x) \ \text{ if } x \notin N, \qquad p^{*}(x) = 0 \ \text{ if } x \in N.$$
The proof follows by extending the arguments of the proof of ([16], Theorem 11) and hence is left as an exercise for the reader.
Remark 6.
Note that, in the special case $\beta = 1$, $\mathcal{L}_1$ is a linear family of distributions and the above example coincides with ([16], Theorem 11) on the forward projection of the relative $\alpha$-entropy on $\mathcal{L}_1$. However, deriving the forward $(\alpha, \beta)$-projection, for general $\beta \neq 1$, on the linear family $\mathcal{L}_1$ is still an open question.

5. Statistical Applications: The Minimum Relative Entropy Inference

5.1. The Reverse Projection and Parametric Estimation

As in the case of the forward projection of a relative entropy measure, we can also define the reverse projection by minimizing it with respect to the second argument over a convex set E keeping the first argument fixed. More formally, we use the following definition.
Definition 4 (Reverse ( α , β ) -Projection).
Fix $p \in L_\alpha(\mu)$ and let $E \subseteq S$ with $\mathrm{RE}_{\alpha,\beta}(p, q) < \infty$ for some $q \in E$. Then, $q^{*} \in E$ is called the reverse projection of the relative $(\alpha, \beta)$-entropy, or simply the reverse $(\alpha, \beta)$-projection (or reverse LSD projection), of $p$ on $E$ if it satisfies the relation
$$\mathrm{RE}_{\alpha,\beta}(p, q^{*}) = \inf_{q \in E} \mathrm{RE}_{\alpha,\beta}(p, q).$$
We can get sufficient conditions for the existence and uniqueness of the reverse $(\alpha, \beta)$-projection directly from Theorem 6 and the fact that $\mathrm{RE}_{\alpha,\beta}(p, q) = \mathrm{RE}_{\alpha,\alpha-\beta}(q, p)$; this is presented in the following theorem.
Theorem 7 (Existence and Uniqueness of Reverse ( α , β ) -Projection).
Fix $\alpha > 0$ and $\beta \in \mathbb{R}$ with $\beta \neq \alpha$, and $p \in L_\alpha(\mu)$. Given any set $E \subseteq S$ for which $E^{\alpha-\beta}$ is convex and closed and $\mathrm{RE}_{\alpha,\beta}(p, q) < \infty$ for some $q \in E$, a reverse $(\alpha, \beta)$-projection of $p$ on $E$ exists and is unique.
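The duality used above can be verified directly if the relative $(\alpha, \beta)$-entropy is written in an explicit log-sum form consistent with the binomial expression of Section 5.2 below; we stress that this explicit form is our reading of that expression and is not quoted from the original development. Under this assumed form,
$$\mathrm{RE}_{\alpha,\beta}(p, q) = \frac{1}{\beta}\log\int q^{\alpha}\, d\mu + \frac{1}{\alpha-\beta}\log\int p^{\alpha}\, d\mu - \frac{\alpha}{\beta(\alpha-\beta)}\log\int p^{\beta} q^{\alpha-\beta}\, d\mu,$$
replacing $(p, q, \beta)$ by $(q, p, \alpha-\beta)$ merely interchanges the first two terms while leaving the third unchanged, so that $\mathrm{RE}_{\alpha,\alpha-\beta}(q, p) = \mathrm{RE}_{\alpha,\beta}(p, q)$.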
The reverse projection is mostly used in statistical inference where we fix the first argument of a relative entropy measure (or divergence measure) at the empirical data distribution and minimize the relative entropy with respect to the model family of distributions in its second argument. The resulting estimator, commonly known as the minimum distance or minimum divergence estimator, yields the reverse projection of the observed data distribution on the family of model distributions with respect to the relative entropy or divergence under consideration. This approach was initially studied by [9,10,11,12,13] to obtain the popular maximum likelihood estimator as the reverse projection with respect to the relative entropy in (2). More recently, this approach has become widely popular, but with more general relative entropies or divergence measures, to obtain robust estimators against possible contamination in the observed data. Let us describe it more rigorously in the following for our relative ( α , β ) -entropy.
Suppose we have independent and identically distributed data $X_1, \ldots, X_n$ from a true distribution $G$ having density $g$ with respect to some common dominating measure $\mu$. We model $g$ by a parametric family of $\mu$-densities $\mathcal{F} = \{ f_\theta : \theta \in \Theta \subseteq \mathbb{R}^p \}$, where it is assumed that both $g$ and $f_\theta$ have the same support, independent of $\theta$. Our objective is to draw inference about the unknown parameter $\theta$. In minimum divergence inference, an estimator of $\theta$ is obtained by minimizing the divergence measure between (an estimate of) $g$ and $f_\theta$ with respect to $\theta \in \Theta$. Maji et al. [78] have considered the LSD (or equivalently the relative $(\alpha, \beta)$-entropy) as the divergence under consideration and defined the corresponding minimum divergence functional at $G$, say $T_{\alpha,\beta}(G)$, through the relation
$$\mathrm{RE}_{\alpha,\beta}\big( g, f_{T_{\alpha,\beta}(G)} \big) = \min_{\theta \in \Theta} \mathrm{RE}_{\alpha,\beta}(g, f_\theta),$$
whenever the minimum exists. We will refer to $T_{\alpha,\beta}(G)$ as the minimum relative $(\alpha, \beta)$-entropy (MRE) functional, or the minimum LSD functional in the language of [78,79]. Note that, if $g \in \mathcal{F}$, i.e., $g = f_{\theta_0}$ for some $\theta_0 \in \Theta$, then we must have $T_{\alpha,\beta}(G) = \theta_0$. If $g \notin \mathcal{F}$, we call $T_{\alpha,\beta}(G)$ the "best fitting parameter" value, since $f_{T_{\alpha,\beta}(G)}$ is the closest model element to $g$ in the LSD sense. In fact, for $g \notin \mathcal{F}$, $T_{\alpha,\beta}(G)$ is nothing but the reverse $(\alpha, \beta)$-projection of the true density $g$ on the model family $\mathcal{F}$, which exists and is unique under the sufficient conditions of Theorem 7. Therefore, under identifiability of the model family $\mathcal{F}$, we get the existence and uniqueness of the MRE functional, which is presented in the following corollary. Although this estimator was first introduced by [78] in terms of the LSD, results concerning the existence of the estimator were not provided there.
Corollary 2 (Existence and Uniqueness of the MRE Functional).
Consider the above parametric estimation problem with $g \in L_\alpha(\mu)$ and $\mathcal{F} \subseteq L_\alpha(\mu)$. Fix $\alpha > 0$ and $\beta \in \mathbb{R}$ with $\beta \neq \alpha$, and assume that the model family $\mathcal{F}$ is identifiable in $\theta$.
(1) 
Suppose $g = f_{\theta_0}$ for some $\theta_0 \in \Theta$. Then the unique MRE functional is given by $T_{\alpha,\beta}(G) = \theta_0$.
(2) 
Suppose $g \notin \mathcal{F}$. If $\mathcal{F}^{\alpha-\beta}$ is convex and closed and $\mathrm{RE}_{\alpha,\beta}(g, f_\theta) < \infty$ for some $\theta \in \Theta$, the MRE functional $T_{\alpha,\beta}(G)$ exists and is unique.
Further, under standard differentiability assumptions, we can obtain the estimating equation of the MRE functional $T_{\alpha,\beta}(G)$ as given by
$$\int f_\theta^{\alpha} u_\theta\, d\mu \int f_\theta^{\alpha-\beta} g^{\beta}\, d\mu = \int f_\theta^{\alpha-\beta} g^{\beta} u_\theta\, d\mu \int f_\theta^{\alpha}\, d\mu,$$
where $u_\theta(x) = \nabla_\theta \ln f_\theta(x)$ is the likelihood score function. It is important to note that, at $\beta = \alpha = 1$, the MRE functional $T_{1,1}(G)$ coincides with the maximum likelihood functional, since $\mathrm{RE}_{1,1} = \mathrm{RE}$, the KLD measure. Based on the estimating Equation (61), Maji et al. [78] extensively studied the theoretical robustness properties of the MRE functional against gross-error contamination in the data through a higher order influence function analysis. The classical first order influence function was seen to be inadequate for this purpose; it becomes independent of $\beta$ at the model, whereas the real-life performance of the MRE functional depends critically on both $\alpha$ and $\beta$ [78,79], as we will also see in Section 5.2.
In practice, however, the true data generating density is not known, and so we need to use some empirical estimate in place of $g$; the resulting value of the MRE functional is called the minimum relative $(\alpha, \beta)$-entropy estimator (MREE), or the minimum LSD estimator in the terminology of [78,79]. Note that, when the data are discrete and $\mu$ is the counting measure, one can use a simple estimate of $g$ given by the relative frequencies $r_n(x) = \frac{1}{n}\sum_{i=1}^{n} I(X_i = x)$, where $I(A)$ is the indicator function of the event $A$; the corresponding MREE is then obtained by solving (61) with $g(x)$ replaced by $r_n(x)$ and the integrals replaced by sums over the discrete support. Asymptotic properties of this MREE under discrete models have been studied in detail by [78,79] for the tuning parameters $\alpha \geq 1$ and $\beta \in \mathbb{R}$; the same line of argument can be used to extend them to the case $\alpha \in (0, 1)$ in a straightforward manner.
However, in the case of continuous data, there is no such simple estimator available to use in place of $g$ unless $\beta = 1$. When $\beta = 1$, the estimating Equation (61) depends on $g$ only through the terms $\int f_\theta^{\alpha-1} g\, d\mu = \int f_\theta^{\alpha-1}\, dG$ and $\int f_\theta^{\alpha-1} u_\theta\, g\, d\mu = \int f_\theta^{\alpha-1} u_\theta\, dG$; so we can simply use the empirical distribution function $G_n$ in place of $G$ and solve the resulting equation to obtain the corresponding MREE. However, for $\beta \neq 1$, we must use a non-parametric kernel estimator $g_n$ of $g$ in (61) to obtain the MREE under continuous models; this leads to complications, including bandwidth selection, while deriving the asymptotics of the resulting MREE. One possible approach to avoid such complications is to use the smoothed model technique, which has been applied in [108] for the case of minimum $\phi$-divergence estimators. Another alternative approach has been discussed in [109,110]. However, detailed analyses of the MREE under continuous models, in either of the above approaches, are yet to be carried out.
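To make the $\beta = 1$ recipe above concrete, the following Python sketch computes such an estimator (of LDPD type) for a normal location model with known scale: for $\beta = 1$ the $\theta$-dependent part of the objective involves the data only through a sample mean of $f_\theta^{\alpha-1}$, so the empirical distribution suffices and no kernel density estimate is needed. The explicit objective used below is our reading of the relative $(\alpha, 1)$-entropy (dropping $\theta$-free terms); the normal model, the use of scipy and all function names are illustrative assumptions rather than part of the original development.

```python
# A minimal sketch (not the authors' code): minimum relative (alpha, 1)-entropy
# (LDPD-type) estimation of a normal location parameter.  For beta = 1 the
# objective needs the data only through sample means of f_theta^(alpha-1).
# The normal model with known sigma and all names here are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def objective_beta1(theta, x, alpha, sigma=1.0):
    # log ∫ f_theta^alpha dmu  -  (alpha/(alpha-1)) * log( (1/n) sum f_theta(X_i)^(alpha-1) );
    # terms depending only on the unknown true density g are constants and dropped.
    log_int_f_alpha = 0.5 * (1.0 - alpha) * np.log(2.0 * np.pi * sigma**2) - 0.5 * np.log(alpha)
    log_emp_mean = np.log(np.mean(norm.pdf(x, loc=theta, scale=sigma) ** (alpha - 1.0)))
    return log_int_f_alpha - (alpha / (alpha - 1.0)) * log_emp_mean

def mree_normal_location(x, alpha, sigma=1.0):
    """Minimize the beta = 1 objective over the location parameter theta."""
    res = minimize_scalar(objective_beta1, bounds=(x.min(), x.max()),
                          args=(x, alpha, sigma), method="bounded")
    return res.x

# Toy usage: 5% of the sample is shifted far away; the estimate with alpha = 1.3
# is expected to stay closer to 0 than the sample mean (the alpha = 1 limit).
rng = np.random.default_rng(12345)
x = np.concatenate([rng.normal(0.0, 1.0, 95), rng.normal(8.0, 1.0, 5)])
print(np.mean(x), mree_normal_location(x, alpha=1.3))
```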

5.2. Numerical Illustration: Binomial Model

Let us now present numerical illustrations under the common binomial model to study the finite sample performance of the MREEs. Along with the known properties of the MREE at $\alpha \geq 1$ (i.e., the minimum LSD estimators with $\tau \geq 0$ from [78,79]), here we will additionally explore their properties in the case $\alpha \in (0, 1)$ and for the new divergences $\mathrm{RE}_\beta(P, Q)$ related to $\alpha = 0$.
Suppose $X_1, \ldots, X_n$ are random observations from a true density $g$ having support $\chi = \{0, 1, 2, \ldots, m\}$ for some positive integer $m$. We model $g$ by the Binomial$(m, \theta)$ densities $f_\theta(x) = \binom{m}{x} \theta^{x} (1-\theta)^{m-x}$ for $x \in \chi$ and $\theta \in [0, 1]$. Here an estimate $\hat g$ of $g$ is given by the relative frequency $\hat g(x) = r_n(x)$. For any $\alpha > 0$ and $\beta \in \mathbb{R}$, the relative $(\alpha, \beta)$-entropy between $\hat g$ and $f_\theta$ is given by
$$\mathrm{RE}_{\alpha,\beta}(\hat g, f_\theta) = \frac{1}{\beta} \log\left[ \sum_{x=0}^{m} \binom{m}{x}^{\alpha} \left(\frac{\theta}{1-\theta}\right)^{\alpha x} (1-\theta)^{m\alpha} \right] + \frac{1}{\alpha-\beta} \log\left[ \sum_{x=0}^{m} r_n(x)^{\alpha} \right] - \frac{\alpha}{\beta(\alpha-\beta)} \log\left[ \sum_{x=0}^{m} \binom{m}{x}^{\alpha-\beta} \left(\frac{\theta}{1-\theta}\right)^{(\alpha-\beta) x} (1-\theta)^{m(\alpha-\beta)} r_n(x)^{\beta} \right],$$
which can be minimized with respect to $\theta \in [0, 1]$ to obtain the corresponding MREE of $\theta$. Note that it is also the solution of the estimating Equation (61) with $g(x)$ replaced by the relative frequency $r_n(x)$. However, in this example, $u_\theta(x) = \frac{x - m\theta}{\theta(1-\theta)}$ and hence the MREE estimating equation simplifies to
$$\frac{\sum_{x=0}^{m} \binom{m}{x}^{\alpha} (x - m\theta) \left(\frac{\theta}{1-\theta}\right)^{\alpha x}}{\sum_{x=0}^{m} \binom{m}{x}^{\alpha} \left(\frac{\theta}{1-\theta}\right)^{\alpha x}} = \frac{\sum_{x=0}^{m} (x - m\theta) \binom{m}{x}^{\alpha-\beta} \left(\frac{\theta}{1-\theta}\right)^{(\alpha-\beta) x} r_n(x)^{\beta}}{\sum_{x=0}^{m} \binom{m}{x}^{\alpha-\beta} \left(\frac{\theta}{1-\theta}\right)^{(\alpha-\beta) x} r_n(x)^{\beta}}.$$
We can numerically solve the above estimating equation over $\theta \in [0, 1]$, or equivalently over the transformed parameter $p := \frac{\theta}{1-\theta} \in [0, \infty]$, to obtain the corresponding MREE (i.e., the minimum LSD estimator).
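A minimal Python sketch of this numerical computation is given below: it evaluates the relative $(\alpha, \beta)$-entropy between the relative frequencies and the Binomial$(m, \theta)$ probabilities, as in the displayed expression above, and minimizes it over $\theta$ with a bounded scalar optimizer. The routine requires $\beta > 0$ and $\beta \neq \alpha$ (the maximum likelihood case $\alpha = \beta = 1$ can be approached by taking $\beta$ close to $1$); the use of scipy and all function names are our own illustrative choices.

```python
# A minimal sketch (not the authors' code): the MREE of the binomial parameter
# theta obtained by directly minimizing the relative (alpha, beta)-entropy
# objective displayed above over theta in (0, 1).  Requires beta > 0, beta != alpha.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import binom

def re_binom(theta, rel_freq, m, alpha, beta):
    """Relative (alpha, beta)-entropy between the relative frequencies r_n and Binomial(m, theta)."""
    f = binom.pmf(np.arange(m + 1), m, theta)                      # f_theta(x), x = 0, ..., m
    t1 = np.log(np.sum(f ** alpha)) / beta
    t2 = np.log(np.sum(rel_freq ** alpha)) / (alpha - beta)
    t3 = np.log(np.sum(rel_freq ** beta * f ** (alpha - beta))) * alpha / (beta * (alpha - beta))
    return t1 + t2 - t3

def mree_binom(x, m, alpha, beta):
    """MREE of theta from observations x taking values in {0, ..., m}."""
    rel_freq = np.bincount(x, minlength=m + 1) / len(x)            # r_n(x)
    res = minimize_scalar(re_binom, bounds=(1e-6, 1.0 - 1e-6),
                          args=(rel_freq, m, alpha, beta), method="bounded")
    return res.x
```

Equivalently, one could obtain the same estimate by root-finding on the simplified estimating equation displayed above, either in $\theta$ or in the transformed parameter $p = \theta/(1-\theta)$; direct minimization is used here only because it gives the shorter sketch.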
We simulate random samples of size $n$ from a binomial population with true parameter $\theta_0 = 0.1$ and $m = 10$, and numerically compute the MREE. Repeating this exercise 1000 times, we obtain an empirical estimate of the bias and the mean squared error (MSE) of the MREE of $10\theta$ (since $\theta$ is very small in magnitude). Table 1 and Table 2 present these values for sample sizes $n = 20, 50, 100$ and different values of the tuning parameters $\alpha > 0$ and $\beta > 0$; their existence is guaranteed by Corollary 2. Note that the choice $\alpha = 1 = \beta$ gives the maximum likelihood estimator, whereas $\beta = 1$ (with general $\alpha$) yields the minimum LDPD estimator with parameter $\alpha$. Next, in order to study the robustness, we contaminate 10% of each sample by random observations from a distant binomial distribution with parameters $\theta = 0.9$ and $m = 10$ and repeat the above simulation exercise; the resulting bias and MSE for the contaminated samples are given in Table 3 and Table 4 (a code sketch of this Monte Carlo exercise is given after the summary list below). Our observations from these tables can be summarized as follows.
  • Under pure data with no contamination, the maximum likelihood estimator (the MREE at α = 1 = β ) has the least bias and MSE as expected, which further decrease as sample size increases.
  • As we move away from α = 1 and β = 1 in either direction, the MSEs of the corresponding MREEs under pure data increase slightly; but as long as the tuning parameters remain within a reasonable window of the ( 1 , 1 ) point and neither component is very close to zero, this loss in efficiency is not very significant.
  • When $\alpha$ or $\beta$ approaches zero, the MREEs become somewhat unstable, generating comparatively larger MSE values. This is probably due to the presence of inliers under the discrete binomial model. Note that the relative $(\alpha, \beta)$-entropy measures with $\beta \leq 0$ are not finitely defined for the binomial model if even a single empty cell is present in the data.
  • Under contamination, the bias and MSE of the maximum likelihood estimator increase significantly, but many MREEs remain stable. In particular, the MREEs with $\beta \geq \alpha$ and the MREEs with $\beta$ close to zero are non-robust against data contamination. Many of the remaining members of the MREE family provide significantly improved robust estimators.
  • In the entire simulation, the combination ($\alpha = 1$, $\beta = 0.7$) appears to provide the most stable results. In Table 4, the best results are obtained along a tubular region running from the top left-hand corner to the bottom right-hand corner of the table, subject to the conditions that $\alpha > \beta$ and neither of them is very close to zero.
  • Based on our numerical experiments, the optimum range of values of $(\alpha, \beta)$ providing the most robust minimum relative $(\alpha, \beta)$-entropy estimators is $\alpha = 0.9, 1$ with $0.5 \leq \beta \leq 0.7$, and $1 < \alpha \leq 1.5$ with $0.5 \leq \beta < 1$. Note that this range includes the estimators based on the logarithmic power divergence measure as well as the new LSD measures with $\alpha < 1$.
  • Many of the MREEs, which belong to the optimum range mentioned in the last item and are close to the combination α = 1 = β , generally also provide the best trade-off between efficiency under pure data and robustness under contaminated data.
In summary, many MREEs provide highly robust estimators under data contamination along with only a very small loss in efficiency under pure data. These numerical findings about the finite sample behavior of the MREEs under the binomial model, and the corresponding optimum range of tuning parameters for the subclass with $\alpha \geq 1$, are consistent with the findings of [78,79], who used a Poisson model. Additionally, our illustrations shed light on the properties of the MREEs at $\alpha < 1$ as well, and show that some MREEs in this range, e.g., at $\alpha = 0.9$ and $\beta = 0.5$, also yield optimum estimators in terms of the dual goal of high robustness and high efficiency.
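For concreteness, the Monte Carlo exercise described at the beginning of this subsection (samples from Binomial$(10, 0.1)$, 10% contamination from Binomial$(10, 0.9)$, and the empirical bias and MSE of the MREE of $10\theta$) could be organized as in the following sketch. It reuses the hypothetical mree_binom() routine from the earlier sketch; the replication count is deliberately kept small here, and the numbers it produces are illustrative rather than those of Tables 1–4.

```python
# A minimal sketch (not the authors' code) of the contamination experiment
# described above: repeated samples of size n from Binomial(10, 0.1) with 10%
# of the observations replaced by draws from Binomial(10, 0.9), and the
# empirical bias and MSE of the MREE of 10*theta.  Reuses the hypothetical
# mree_binom() routine from the earlier sketch.
import numpy as np

def bias_mse(alpha, beta, n=50, m=10, theta0=0.1, reps=200, seed=1):
    rng = np.random.default_rng(seed)
    n_cont = round(0.10 * n)                                   # 10% contamination
    est = np.empty(reps)
    for r in range(reps):
        clean = rng.binomial(m, theta0, size=n - n_cont)       # observations from the true model
        noise = rng.binomial(m, 0.9, size=n_cont)              # contaminating observations
        est[r] = mree_binom(np.concatenate([clean, noise]), m, alpha, beta)
    est10 = 10.0 * est                                         # report on the 10*theta scale
    return est10.mean() - 10.0 * theta0, np.mean((est10 - 10.0 * theta0) ** 2)

# e.g., compare a near-MLE choice with one of the combinations reported as stable above
print(bias_mse(alpha=1.0, beta=0.99))   # beta close to 1 with alpha = 1 approximates the MLE
print(bias_mse(alpha=1.0, beta=0.70))   # a combination reported as stable in the text
```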

5.3. Application to Testing Statistical Hypothesis

We end the paper with a very brief indication of the potential of the relative $(\alpha, \beta)$-entropy or the LSD measure in statistical hypothesis testing problems. The minimum possible value of the relative entropy or divergence measure between the data and the null distribution indicates the amount of departure from the null hypothesis and hence can be used to develop a statistical testing procedure.
Consider the parametric estimation set-up as in Section 5.1 with $g \in \mathcal{F}$ and fix a parameter value $\theta_0 \in \Theta$. Suppose we want to test the simple null hypothesis in the one sample case given by
$$H_0 : \theta = \theta_0 \quad \text{against} \quad H_1 : \theta \neq \theta_0.$$
Maji et al. [78] have developed the LSD-based test statistic for the above testing problem as given by
$$T^{(1)}_{n,\alpha,\beta} = 2n\, \mathrm{RE}_{\alpha,\beta}\big( f_{\hat\theta_{\alpha,\beta}},\, f_{\theta_0} \big),$$
where $\hat\theta_{\alpha,\beta}$ is the MREE with parameters $\alpha$ and $\beta$. The authors of [78,79] have also developed the LSD-based test for a simple two-sample problem, where two independent samples of sizes $n_1$ and $n_2$ are given from true densities $f_{\theta_1}, f_{\theta_2} \in \mathcal{F}$, respectively, and we want to test for the homogeneity of the two samples through the hypothesis
$$H_0 : \theta_1 = \theta_2 \quad \text{against} \quad H_1 : \theta_1 \neq \theta_2.$$
The proposed test statistic for this two-sample problem has the form
$$T^{(2)}_{n,\alpha,\beta} = \frac{2 n_1 n_2}{n_1 + n_2}\, \mathrm{RE}_{\alpha,\beta}\big( f_{{}^{(1)}\hat\theta_{\alpha,\beta}},\, f_{{}^{(2)}\hat\theta_{\alpha,\beta}} \big),$$
where ${}^{(1)}\hat\theta_{\alpha,\beta}$ and ${}^{(2)}\hat\theta_{\alpha,\beta}$ are the MREEs of $\theta_1$ and $\theta_2$, respectively, obtained from the two samples separately. Note that, at $\alpha = \beta = 1$, both the test statistics in (63) and (64) become asymptotically equivalent to the corresponding likelihood ratio tests under the respective null hypotheses. Maji et al. [78,79] have studied the asymptotic properties of these two tests, whose asymptotic null distributions are linear combinations of chi-square distributions. They have also numerically illustrated the benefits of these LSD or relative $(\alpha, \beta)$-entropy based tests, although with tuning parameters $\alpha \geq 1$ only, in achieving robust inference against possible contamination in the sample data.
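As an illustration, the one-sample statistic above can be computed for the binomial model of Section 5.2 by combining the hypothetical routines from the earlier sketches; the relative $(\alpha, \beta)$-entropy between the two fitted binomial probability vectors again uses our reading of its log-sum form, and the chi-square-mixture critical values mentioned above are not reproduced here.

```python
# A minimal sketch (not the authors' code): the one-sample LSD/relative
# (alpha, beta)-entropy test statistic for H0: theta = theta0 in the binomial
# model, built from the hypothetical mree_binom() routine defined earlier.
import numpy as np
from scipy.stats import binom

def re_pmf(p, q, alpha, beta):
    """Relative (alpha, beta)-entropy between two pmfs p and q on a common finite support."""
    t1 = np.log(np.sum(q ** alpha)) / beta
    t2 = np.log(np.sum(p ** alpha)) / (alpha - beta)
    t3 = np.log(np.sum(p ** beta * q ** (alpha - beta))) * alpha / (beta * (alpha - beta))
    return t1 + t2 - t3

def lsd_test_stat(x, m, theta0, alpha, beta):
    theta_hat = mree_binom(x, m, alpha, beta)          # unrestricted MREE
    support = np.arange(m + 1)
    f_hat = binom.pmf(support, m, theta_hat)           # fitted model pmf
    f_null = binom.pmf(support, m, theta0)             # null model pmf
    return 2.0 * len(x) * re_pmf(f_hat, f_null, alpha, beta)
```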
The same approach can also be used to develop robust tests for more complex hypothesis testing problems based on the relative $(\alpha, \beta)$-entropy or the LSD measures, now with parameters $\alpha > 0$, and also using the new divergences $\mathrm{RE}_\beta(\cdot, \cdot)$. For example, consider the above one-sample set-up and a subset $\Theta_0 \subset \Theta$, and suppose we are interested in testing the composite hypothesis
$$H_0 : \theta \in \Theta_0 \quad \text{against} \quad H_1 : \theta \notin \Theta_0.$$
With similar motivation from (63) and (64), we can construct a relative entropy or LSD-based test statistic for testing the above composite hypothesis as given by
$$\widetilde{T}^{(1)}_{n,\alpha,\beta} = 2n\, \mathrm{RE}_{\alpha,\beta}\big( f_{\hat\theta_{\alpha,\beta}},\, f_{\tilde\theta_{\alpha,\beta}} \big),$$
where $\tilde\theta_{\alpha,\beta}$ is the restricted MREE with parameters $\alpha$ and $\beta$ obtained by minimizing the relative entropy over $\theta \in \Theta_0$, and $\hat\theta_{\alpha,\beta}$ is the corresponding unrestricted MREE obtained by minimizing over $\theta \in \Theta$. It will surely be of significant interest to study the asymptotic and robustness properties of this relative entropy based test for the above composite hypothesis under the one-sample set-up, or for even more general hypotheses with two or more samples. However, considering the length of the present paper, which is primarily focused on the geometric properties of entropies and relative entropies, we have deferred the detailed analyses of such MREE-based hypothesis testing procedures to a future report.

6. Conclusions

We have explored the geometric properties of the LSD measures through a new information theoretic formulation, developing this divergence measure as a natural extension of the relative $\alpha$-entropy; we refer to it as the two-parameter relative $(\alpha, \beta)$-entropy. It is shown to be always lower semicontinuous in both arguments, but continuous in its first argument only if $\alpha > \beta > 0$. We have also proved that the relative $(\alpha, \beta)$-entropy is quasi-convex in both its arguments after suitable (different) transformations of the domain space, and derived an extended Pythagorean relation under these transformations. Along with the study of its forward and reverse projections, statistical applications have also been discussed.
It is worthwhile to note that information theoretic divergences can also be used to define new measures of robustness and efficiency of a parameter estimate; one can then obtain the optimum robust estimator, along Hampel's infinitesimal principle, to achieve the best trade-off between these divergence-based summary measures [111,112,113]. In particular, the LDPD measure, a prominent member of our LSD or relative $(\alpha, \beta)$-entropy family, has been used by [113], who illustrated important theoretical properties, including different types of equivariance, of the resulting optimum estimators besides their strong robustness properties. A similar approach can also be used with our general relative $(\alpha, \beta)$-entropies to develop estimators with enhanced optimality properties, establishing a better robustness-efficiency trade-off. The present work opens up several interesting problems to be solved in future research, as already noted throughout the paper. In particular, we recall that the relative $\alpha$-entropy has an interpretation in terms of the problem of guessing under source uncertainty [17,71]. As an extension of the relative $\alpha$-entropy, a similar information theoretic interpretation of the relative $(\alpha, \beta)$-entropy (i.e., the LSD) is expected, and its proper development will be useful. Additionally, we have obtained a new extension of the Renyi entropy as a by-product; a detailed study of this new entropy measure and its potential applications may open up a new aspect of mathematical information theory. Statistical applications of these measures also need to be studied thoroughly, especially for continuous models, where the complications of a kernel density estimator are unavoidable, and for testing complex composite hypotheses from one or more samples. We hope to pursue some of these interesting extensions in the future.

Author Contributions

Conceptualization, A.B. and A.G.; Methodology, A.B. and A.G.; Coding and Numerical Work, A.G.; Validation, A.G.; Formal Analysis, A.G. and A.B.; Investigation, A.G. and A.B.

Funding

The research of the first author is funded by the INSPIRE Faculty Research Grant from the Department of Science and Technology, Government of India.

Acknowledgments

The authors wish to thank four anonymous referees whose comments have led to a significantly improved version of the paper.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
KLD	Kullback-Leibler Divergence
LDPD	Logarithmic Density Power Divergence
LSD	Logarithmic Super Divergence
GRE	Generalized Renyi Entropy
MRE	Minimum Relative (α, β)-entropy
MREE	Minimum Relative (α, β)-entropy Estimator
MSE	Mean Squared Error

References

  1. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
  2. Shannon, C.E. Communication in the presence of noise. Proc. IRE 1949, 37, 10–21. [Google Scholar] [CrossRef]
  3. Shannon, C.E.; Weaver, W. The Mathematical Theory of Communication; University of Illinois Press: Urbana, IL, USA, 1949. [Google Scholar]
  4. Khinchin, A.I. The entropy concept in probability theory. Uspekhi Matematicheskikh Nauk 1953, 8, 3–20. [Google Scholar]
  5. Khinchin, A.I. On the fundamental theorems of information theory. Uspekhi Matematicheskikh Nauk 1956, 11, 17–75. [Google Scholar]
  6. Khinchin, A.I. Mathematical Foundations of Information Theory; Dover Publications: New York, NY, USA, 1957. [Google Scholar]
  7. Kolmogorov, A.N. Foundations of the Theory of Probability; Chelsea Publishing Co.: New York, NY, USA, 1950. [Google Scholar]
  8. Kolmogorov, A.N. On the Shannon theory of information transmission in the case of continuous signals. IRE Trans. Inf. Theory 1956, IT-2, 102–108. [Google Scholar] [CrossRef]
  9. Kullback, S. An application of information theory to multivariate analysis. Ann. Math. Stat. 1952, 23, 88–102. [Google Scholar] [CrossRef]
  10. Kullback, S. A note on information theory. J. Appl. Phys. 1953, 24, 106–107. [Google Scholar] [CrossRef]
  11. Kullback, S. Certain inequalities in information theory and the Cramer-Rao inequality. Ann. Math. Stat. 1954, 25, 745–751. [Google Scholar] [CrossRef]
  12. Kullback, S. An application of information theory to multivariate analysis II. Ann. Math. Stat. 1956, 27, 122–145. [Google Scholar] [CrossRef]
  13. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  14. Rosenkrantz, R.D. E T Jaynes: Papers on Probability, Statistics and Statistical Physics; Springer Science and Business Media: New York, NY, USA, 1983. [Google Scholar]
  15. Van Campenhout, J.M.; Cover, T.M. Maximum entropy and conditional probability. IEEE Trans. Inf. Theory 1981, 27, 483–489. [Google Scholar] [CrossRef]
  16. Kumar, M.A.; Sundaresan, R. Minimization Problems Based on Relative α-Entropy I: Forward Projection. IEEE Trans. Inf. Theory 2015, 61, 5063–5080. [Google Scholar] [CrossRef]
  17. Sundaresan, R. Guessing under source uncertainty. Proc. IEEE Trans. Inf. Theory 2007, 53, 269–287. [Google Scholar] [CrossRef]
  18. Csiszár, I. I-divergence geometry of probability distributions and minimization problems. Ann. Probab. 1975, 3, 146–158. [Google Scholar] [CrossRef]
  19. Csiszár, I. Sanov property, generalized I -projection, and a conditional limit theorem. Ann. Probab. 1984, 12, 768–793. [Google Scholar] [CrossRef]
  20. Csiszár, I.; Shields, P. Information Theory and Statistics: A Tutorial; NOW Publishers: Hanover, NH, USA, 2004. [Google Scholar]
  21. Csiszár, I.; Tusnady, G. Information geometry and alternating minimization procedures. Stat. Decis. 1984, 1, 205–237. [Google Scholar]
  22. Amari, S.I.; Karakida, R.; Oizumi, M. Information Geometry Connecting Wasserstein Distance and Kullback-Leibler Divergence via the Entropy-Relaxed Transportation Problem. arXiv, 2017; arXiv:1709.10219. [Google Scholar]
  23. Costa, S.I.; Santos, S.A.; Strapasson, J.E. Fisher information distance: A geometrical reading. Discret. Appl. Math. 2015, 197, 59–69. [Google Scholar] [CrossRef]
  24. Nielsen, F.; Sun, K. Guaranteed bounds on the Kullback-Leibler divergence of univariate mixtures. IEEE Signal Process. Lett. 2016, 23, 1543–1546. [Google Scholar] [CrossRef]
  25. Amari, S.I.; Cichocki, A. Information geometry of divergence functions. Bull. Pol. Acad. Sci. Tech. Sci. 2010, 58, 183–195. [Google Scholar] [CrossRef]
  26. Contreras-Reyes, J.E.; Arellano-Valle, R.B. Kullback-Leibler divergence measure for multivariate skew-normal distributions. Entropy 2012, 14, 1606–1626. [Google Scholar] [CrossRef]
  27. Nielsen, F.; Boltz, S. The Burbea-Rao and Bhattacharyya Centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466. [Google Scholar] [CrossRef]
  28. Pinski, F.J.; Simpson, G.; Stuart, A.M.; Weber, H. Kullback–Leibler approximation for probability measures on infinite dimensional spaces. SIAM J. Math. Anal. 2015, 47, 4091–4122. [Google Scholar] [CrossRef]
  29. Attouch, H.; Bolte, J.; Redont, P.; Soubeyran, A. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Lojasiewicz inequality. Math. Oper. Res. 2010, 35, 438–457. [Google Scholar] [CrossRef] [Green Version]
  30. Eliazar, I.; Sokolov, I.M. Maximization of statistical heterogeneity: From Shannon’s entropy to Gini’s index. Phys. A Stat. Mech. Appl. 2010, 389, 3023–3038. [Google Scholar] [CrossRef]
  31. Monthus, C. Non-equilibrium steady states: Maximization of the Shannon entropy associated with the distribution of dynamical trajectories in the presence of constraints. J. Stat. Mech. Theory Exp. 2011, 2011, P03008. [Google Scholar] [CrossRef]
  32. Bafroui, H.H.; Ohadi, A. Application of wavelet energy and Shannon entropy for feature extraction in gearbox fault detection under varying speed conditions. Neurocomputing 2014, 133, 437–445. [Google Scholar] [CrossRef]
  33. Batty, M. Space, Scale, and Scaling in Entropy Maximizing. Geogr. Anal. 2010, 42, 395–421. [Google Scholar] [CrossRef]
  34. Oikonomou, T.; Bagci, G.B. Entropy Maximization with Linear Constraints: The Uniqueness of the Shannon Entropy. arXiv, 2018; arXiv:1803.02556. [Google Scholar]
  35. Hoang, D.T.; Song, J.; Periwal, V.; Jo, J. Maximizing weighted Shannon entropy for network inference with little data. arXiv, 2017; arXiv:1705.06384. [Google Scholar]
  36. Sriraman, T.; Chakrabarti, B.; Trombettoni, A.; Muruganandam, P. Characteristic features of the Shannon information entropy of dipolar Bose-Einstein condensates. J. Chem. Phys. 2017, 147, 044304. [Google Scholar] [CrossRef] [PubMed]
  37. Sun, M.; Li, Y.; Gemmeke, J.F.; Zhang, X. Speech enhancement under low SNR conditions via noise estimation using sparse and low-rank NMF with Kullback-Leibler divergence. IEEE Trans. Audio Speech Lang. Process. 2015, 23, 1233–1242. [Google Scholar] [CrossRef]
  38. Garcia-Fernandez, A.F.; Vo, B.N. Derivation of the PHD and CPHD Filters Based on Direct Kullback-Leibler Divergence Minimization. IEEE Trans. Signal Process. 2015, 63, 5812–5820. [Google Scholar] [CrossRef]
  39. Giantomassi, A.; Ferracuti, F.; Iarlori, S.; Ippoliti, G.; Longhi, S. Electric motor fault detection and diagnosis by kernel density estimation and Kullback-Leibler divergence based on stator current measurements. IEEE Trans. Ind. Electron. 2015, 62, 1770–1780. [Google Scholar] [CrossRef]
  40. Harmouche, J.; Delpha, C.; Diallo, D.; Le Bihan, Y. Statistical approach for nondestructive incipient crack detection and characterization using Kullback-Leibler divergence. IEEE Trans. Reliab. 2016, 65, 1360–1368. [Google Scholar] [CrossRef]
  41. Hua, X.; Cheng, Y.; Wang, H.; Qin, Y.; Li, Y.; Zhang, W. Matrix CFAR detectors based on symmetrized Kullback-Leibler and total Kullback-Leibler divergences. Digit. Signal Process. 2017, 69, 106–116. [Google Scholar] [CrossRef]
  42. Ferracuti, F.; Giantomassi, A.; Iarlori, S.; Ippoliti, G.; Longhi, S. Electric motor defects diagnosis based on kernel density estimation and Kullback-Leibler divergence in quality control scenario. Eng. Appl. Artif. Intell. 2015, 44, 25–32. [Google Scholar] [CrossRef]
  43. Matthews, A.G.D.G.; Hensman, J.; Turner, R.; Ghahramani, Z. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. J. Mach. Learn. Res. 2016, 51, 231–239. [Google Scholar]
  44. Arikan, E. An inequality on guessing and its application to sequential decoding. IEEE Trans. Inf. Theory 1996, 42, 99–105. [Google Scholar] [CrossRef] [Green Version]
  45. Campbell, L.L. A coding theorem and Renyi’s entropy. Inf. Control 1965, 8, 423–429. [Google Scholar] [CrossRef]
  46. Renyi, A. On measures of entropy and information. In Proceedings of 4th Berkeley Symposium on Mathematical Statistics and Probability I; University of California: Berkeley, CA, USA, 1961; pp. 547–561. [Google Scholar]
  47. Wei, B.B. Relations between heat exchange and Rényi divergences. Phys. Rev. E 2018, 97, 042107. [Google Scholar] [CrossRef]
  48. Kumar, M.A.; Sason, I. On projections of the Rényi divergence on generalized convex sets. In Proceedings of the 2016 IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, 10–15 July 2016. [Google Scholar]
  49. Sadeghpour, M.; Baratpour, S.; Habibirad, A. Exponentiality test based on Renyi distance between equilibrium distributions. Commun. Stat.-Simul. Comput. 2017. [Google Scholar] [CrossRef]
  50. Markel, D.; El Naqa, I.I. PD-0351: Development of a novel regmentation framework using the Jensen Renyi divergence for adaptive radiotherapy. Radiother. Oncol. 2014, 111, S134. [Google Scholar] [CrossRef]
  51. Bai, S.; Lepoint, T.; Roux-Langlois, A.; Sakzad, A.; Stehlé, D.; Steinfeld, R. Improved security proofs in lattice-based cryptography: Using the Rényi divergence rather than the statistical distance. J. Cryptol. 2018, 31, 610–640. [Google Scholar] [CrossRef]
  52. Dong, X. The gravity dual of Rényi entropy. Nat. Commun. 2016, 7, 12472. [Google Scholar] [CrossRef] [PubMed]
  53. Kusuki, Y.; Takayanagi, T. Renyi entropy for local quenches in 2D CFT from numerical conformal blocks. J. High Energy Phys. 2018, 2018, 115. [Google Scholar] [CrossRef]
  54. Kumbhakar, M.; Ghoshal, K. One-Dimensional velocity distribution in open channels using Renyi entropy. Stoch. Environ. Res. Risk Assess. 2017, 31, 949–959. [Google Scholar] [CrossRef]
  55. Xing, H.J.; Wang, X.Z. Selective ensemble of SVDDs with Renyi entropy based diversity measure. Pattern Recog. 2017, 61, 185–196. [Google Scholar] [CrossRef]
  56. Nie, F.; Zhang, P.; Li, J.; Tu, T. An Image Segmentation Method Based on Renyi Relative Entropy and Gaussian Distribution. Recent Patents Comput. Sci. 2017, 10, 122–130. [Google Scholar] [CrossRef]
  57. Ben Bassat, M. f-entropies, probability of error, and feature selection. Inf. Control 1978, 39, 277–292. [Google Scholar] [CrossRef]
  58. Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys 1988, 52, 479–487. [Google Scholar] [CrossRef]
  59. Kumar, S.; Ram, G.; Gupta, V. Axioms for (α, β, γ)-entropy of a generalized probability scheme. J. Appl. Math. Stat. Inf. 2013, 9, 95–106. [Google Scholar] [CrossRef]
  60. Kumar, S.; Ram, G. A generalization of the Havrda-Charvat and Tsallis entropy and its axiomatic characterization. Abstr. Appl. Anal. 2014, 2014, 505184. [Google Scholar] [CrossRef]
  61. Tsallis, C.; Brigatti, E. Nonextensive statistical mechanics: A brief introduction. Contin. Mech. Thermodyn. 2004, 16, 223–235. [Google Scholar] [CrossRef]
  62. Rajesh, G.; Sunoj, S.M. Some properties of cumulative Tsallis entropy of order α. Stat. Pap. 2016. [Google Scholar] [CrossRef]
  63. Singh, V.P. Introduction to Tsallis Entropy Theory in Water Engineering; CRC Press: Boca Raton, FL, USA, 2016. [Google Scholar]
  64. Pavlos, G.P.; Karakatsanis, L.P.; Iliopoulos, A.C.; Pavlos, E.G.; Tsonis, A.A. Nonextensive Statistical Mechanics: Overview of Theory and Applications in Seismogenesis, Climate, and Space Plasma. In Advances in Nonlinear Geosciences; Tsonis, A., Ed.; Springer: Cham, Switzerland, 2018; pp. 465–495. [Google Scholar]
  65. Jamaati, M.; Mehri, A. Text mining by Tsallis entropy. Phys. A Stat. Mech. Appl. 2018, 490, 1368–1376. [Google Scholar] [CrossRef]
  66. Basu, A.; Shioya, H.; Park, C. Statistical Inference: The Minimum Distance Approach; Chapman & Hall/CRC: Boca Raton, FL, USA, 2011. [Google Scholar]
  67. Leise, F.; Vajda, I. On divergence and information in statistics and information theory. IEEE Trans. Inf. Theory 2006, 52, 4394–4412. [Google Scholar] [CrossRef]
  68. Pardo, L. Statistical Inference Based on Divergences; CRC/Chapman-Hall: London, UK, 2006. [Google Scholar]
  69. Vajda, I. Theory of Statistical Inference and Information; Kluwer: Boston, MA, USA, 1989. [Google Scholar]
  70. Stummer, W.; Vajda, I. On divergences of finite measures and their applicability in statistics and information theory. Statistics 2010, 44, 169–187. [Google Scholar] [CrossRef]
  71. Sundaresan, R. A measure of discrimination and its geometric properties. In Proceedings of the IEEE International Symposium on Information Theory, Lausanne, Switzerland, 30 June–5 July 2002. [Google Scholar]
  72. Lutwak, E.; Yang, D.; Zhang, G. Cramear-Rao and moment-entropy inequalities for Renyi entropy and generalized Fisher information. IEEE Trans. Inf. Theory 2005, 51, 473–478. [Google Scholar] [CrossRef]
  73. Kumar, M.A.; Sundaresan, R. Minimization Problems Based on Relative α-Entropy II: Reverse Projection. IEEE Trans. Infor. Theory 2015, 61, 5081–5095. [Google Scholar] [CrossRef]
  74. Jones, M.C.; Hjort, N.L.; Harris, I.R.; Basu, A. A comparison of related density-based minimum divergence estimators. Biometrika 2001, 88, 865–873. [Google Scholar] [CrossRef]
  75. Windham, M. Robustifying model fitting. J. R. Stat. Soc. Ser. B 1995, 57, 599–609. [Google Scholar]
  76. Fujisawa, H. Normalized estimating equation for robust parameter estimation. Elect. J. Stat. 2013, 7, 1587–1606. [Google Scholar] [CrossRef]
  77. Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 2008, 99, 2053–2081. [Google Scholar] [CrossRef]
  78. Maji, A.; Ghosh, A.; Basu, A. The Logarithmic Super Divergence and its use in Statistical Inference. arXiv, 2014; arXiv:1407.3961. [Google Scholar]
  79. Maji, A.; Ghosh, A.; Basu, A. The Logarithmic Super Divergence and Asymptotic Inference Properties. AStA Adv. Stat. Anal. 2016, 100, 99–131. [Google Scholar] [CrossRef]
  80. Maji, A.; Chakraborty, S.; Basu, A. Statistical Inference Based on the Logarithmic Power Divergence. Rashi 2017, 2, 39–51. [Google Scholar]
  81. Lutz, E. Anomalous diffusion and Tsallis statistics in an optical lattice. Phys. Rev. A 2003, 67, 051402. [Google Scholar] [CrossRef]
  82. Douglas, P.; Bergamini, S.; Renzoni, F. Tunable Tsallis Distributions in Dissipative Optical Lattices. Phys. Rev. Lett. 2006, 96, 110601. [Google Scholar] [CrossRef] [PubMed]
  83. Burlaga, L.F.; Viñas, A.F. Triangle for the entropic index q of non-extensive statistical mechanics observed by Voyager 1 in the distant heliosphere. Phys. A Stat. Mech. Appl. 2005, 356, 375. [Google Scholar] [CrossRef]
  84. Liu, B.; Goree, J. Superdiffusion and Non-Gaussian Statistics in a Driven-Dissipative 2D Dusty Plasma. Phys. Rev. Lett. 2008, 100, 055003. [Google Scholar] [CrossRef] [PubMed]
  85. Pickup, R.; Cywinski, R.; Pappas, C.; Farago, B.; Fouquet, P. Generalized Spin-Glass Relaxation. Phys. Rev. Lett. 2009, 102, 097202. [Google Scholar] [CrossRef] [PubMed]
  86. Devoe, R. Power-Law Distributions for a Trapped Ion Interacting with a Classical Buffer Gas. Phys. Rev. Lett. 2009, 102, 063001. [Google Scholar] [CrossRef] [PubMed]
  87. Khachatryan, V.; Sirunyan, A.; Tumasyan, A.; Adam, W.; Bergauer, T.; Dragicevic, M.; Erö, J.; Fabjan, C.; Friedl, M.; Frühwirth, R.; et al. Transverse-Momentum and Pseudorapidity Distributions of Charged Hadrons in pp Collisions at s = 7 TeV. Phys. Rev. Lett. 2010, 105, 022002. [Google Scholar] [CrossRef] [PubMed]
  88. Chatrchyan, S.; Khachatryan, V.; Sirunyan, A.M.; Tumasyan, A.; Adam, W.; Bergauer, T.; Dragicevic, M.; Erö, J.; Fabjan, C.; Friedl, M.; et al. Charged particle transverse momentum spectra in pp collisions at s = 0.9 and 7 TeV. J. High Energy Phys. 2011, 2011, 86. [Google Scholar] [CrossRef]
  89. Adare, A.; Afanasiev, S.; Aidala, C.; Ajitanand, N.; Akiba, Y.; Al-Bataineh, H.; Alexander, J.; Aoki, K.; Aphecetche, L.; Armendariz, R.; et al. Measurement of neutral mesons in p + p collisions at s = 200 GeV and scaling properties of hadron production. Phys. Rev. D 2011, 83, 052004. [Google Scholar] [CrossRef]
  90. Majhi, A. Non-extensive statistical mechanics and black hole entropy from quantum geometry. Phys. Lett. B 2017, 775, 32–36. [Google Scholar] [CrossRef]
  91. Shore, J.E.; Johnson, R.W. Axiomatic Derivation of the Principle of Maximum Entropy and the Principle of Minimum Cross-Entropy. IEEE Trans. Inf. Theory 1980, 26, 26–37. [Google Scholar] [CrossRef]
  92. Caticha, A.; Giffin, A. Updating Probabilities. AIP Conf. Proc. 2006, 872, 31–42. [Google Scholar]
  93. Presse, S.; Ghosh, K.; Lee, J.; Dill, K.A. Nonadditive Entropies Yield Probability Distributions with Biases not Warranted by the Data. Phys. Rev. Lett. 2013, 111, 180604. [Google Scholar] [CrossRef] [PubMed]
  94. Presse, S. Nonadditive entropy maximization is inconsistent with Bayesian updating. Phys. Rev. E 2014, 90, 052149. [Google Scholar] [CrossRef] [PubMed]
  95. Presse, S.; Ghosh, K.; Lee, J.; Dill, K.A. Reply to C. Tsallis’ “Conceptual Inadequacy of the Shore and Johnson Axioms for Wide Classes of Complex Systems”. Entropy 2015, 17, 5043–5046. [Google Scholar] [CrossRef] [Green Version]
  96. Vanslette, K. Entropic Updating of Probabilities and Density Matrices. Entropy 2017, 19, 664. [Google Scholar] [CrossRef]
  97. Cressie, N.; Read, T.R.C. Multinomial goodness-of-fit tests. J. R. Stat. Soc. B 1984, 46, 440–464. [Google Scholar]
  98. Csiszár, I. Eine informations theoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizitat von Markoffschen Ketten. Publ. Math. Inst. Hung. Acad. Sci. 1963, 3, 85–107. (In German) [Google Scholar]
  99. Csiszár, I. Information-type measures of difference of probability distributions and indirect observations. Stud. Scientiarum Math. Hung. 1967, 2, 299–318. [Google Scholar]
  100. Csiszár, I. On topological properties of f-divergences. Stud. Scientiarum Math. Hung. 1967, 2, 329–339. [Google Scholar]
  101. Csiszár, I. A class of measures of informativity of observation channels. Priodica Math. Hung. 1972, 2, 191–213. [Google Scholar] [CrossRef]
  102. Csiszár, I. Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. Ann. Stat. 1991, 19, 2032–2066. [Google Scholar] [CrossRef]
  103. Lindsay, B.G. Efficiency versus robustness: The case for minimum Hellinger distance and related methods. Ann. Stat. 1994, 22, 1081–1114. [Google Scholar] [CrossRef]
  104. Esteban, M.D.; Morales, D. A summary of entropy statistics. Kybernetica 1995, 31, 337–346. [Google Scholar]
  105. Itakura, F.; Saito, S. Analysis synthesis telephony based on the maximum likelihood method. In Proceedings of the 6th International Congress on Acoustics, Tokyo, Japan, 21–28 August 1968. [Google Scholar]
  106. Fevotte, C.; Bertin, N.; Durrieu, J.L. Nonnegative Matrix Factorization with the Itakura–Saito Divergence: With application to music analysis. Neural Comput. 2009, 21, 793–830. [Google Scholar] [CrossRef] [PubMed]
  107. Teboulle, M.; Vajda, I. Convergence of best ϕ-entropy estimates. IEEE Trans. Inf. Theory 1993, 39, 297–301. [Google Scholar] [CrossRef]
  108. Basu, A.; Lindsay, B.G. Minimum disparity estimation for continuous models: Efficiency, distributions and robustness. Ann. Inst. Stat. Math. 1994, 46, 683–705. [Google Scholar] [CrossRef]
  109. Broniatowski, M.; Keziou, A. Parametric estimation and tests through divergences and the duality technique. J. Multivar. Anal. 2009, 100, 16–36. [Google Scholar] [CrossRef]
  110. Broniatowski, M.; Vajda, I. Several applications of divergence criteria in continuous families. Kybernetika 2012, 48, 600–636. [Google Scholar]
  111. Toma, A. Optimal robust M-estimators using divergences. Stat. Probab. Lett. 2009, 79, 1–5. [Google Scholar] [CrossRef]
  112. Marazzi, A.; Yohai, V. Optimal robust estimates using the Hellinger distance. Adv. Data Anal. Classif. 2010, 4, 169–179. [Google Scholar] [CrossRef]
  113. Toma, A.; Leoni-Aubin, S. Optimal robust M-estimators using Renyi pseudodistances. J. Multivar. Anal. 2010, 115, 359–373. [Google Scholar] [CrossRef]
Table 1. Bias of the MREE for different α , β and sample sizes n under pure data.
β \ α: 0.3, 0.5, 0.7, 0.9, 1, 1.1, 1.3, 1.5, 1.7, 2
n = 20
0.1−0.210−0.416−0.397−0.311−0.277−0.227−0.1300.0210.0240.122
0.32.218−0.273−0.229−0.160−0.141−0.115−0.096−0.068−0.0360.034
0.5−0.1270.001−0.125−0.088−0.082−0.069−0.058−0.042−0.032−0.019
0.7−0.093−0.110−0.010−0.046−0.044−0.029−0.023−0.031−0.023−0.020
0.9−0.066−0.056−0.028−0.001−0.015−0.0020.0080.000−0.006−0.013
1−0.041−0.045−0.0170.005−0.0020.0110.0140.0120.008−0.003
1.3−0.035−0.0130.0230.0360.0300.0390.0880.0390.0350.021
1.5−0.0030.0120.0480.0530.0470.0580.0530.1700.0480.035
1.70.0120.0280.0580.0670.0610.0700.0700.0580.2690.045
20.0080.0490.0780.0840.0780.0860.0870.0780.0690.444
n = 50
0.1−0.085−0.301−0.254−0.183−0.156−0.106−0.0020.1140.2920.245
0.31.829−0.176−0.150−0.078−0.066−0.042−0.045−0.0140.0050.030
0.5−0.0560.099−0.054−0.037−0.033−0.026−0.019−0.009−0.007−0.005
0.7−0.009−0.0590.035−0.012−0.013−0.005−0.002−0.009−0.0020.006
0.9−0.031−0.031−0.0090.0120.0020.0130.0210.0150.0080.004
10.014−0.0230.0000.0110.0090.0190.0220.0200.0180.004
1.30.002−0.0040.0220.0340.0270.0300.0840.0340.0350.028
1.50.0090.0230.0380.0440.0370.0420.0340.1740.0400.032
1.70.0280.0290.0490.0540.0470.0500.0470.0360.2770.039
20.0400.0510.0650.0680.0590.0630.0600.0510.0410.464
n = 100
0.1−0.028−0.216−0.175−0.113−0.103−0.0630.0360.1690.4520.349
0.31.874−0.135−0.125−0.052−0.044−0.022−0.038−0.0230.0090.024
0.5−0.0020.146−0.034−0.026−0.025−0.021−0.019-0.001−0.008−0.009
0.70.000−0.0420.045−0.009−0.013−0.0090.000−0.009−0.008−0.001
0.90.007−0.025−0.0150.001−0.0040.0050.0090.013−0.001−0.003
10.014−0.010−0.007−0.001−0.0010.0050.0090.0140.0100.009
1.30.0360.0100.0060.0150.0100.0100.0650.0120.0190.014
1.50.0410.0230.0180.0220.0170.0180.0060.1580.0160.015
1.70.0520.0270.0280.0320.0240.0250.0160.0090.2670.019
20.0560.0430.0420.0430.0330.0340.0230.0200.0130.454
Table 2. MSE of the MREE for different α , β and sample sizes n under pure data.
β \ α: 0.3, 0.5, 0.7, 0.9, 1, 1.1, 1.3, 1.5, 1.7, 2
n = 20
0.10.3470.2510.2220.1450.1220.1060.0980.2420.2060.240
0.37.5060.1470.1000.0690.0630.0590.0590.0620.0980.169
0.50.2380.0760.0670.0510.0490.0470.0500.0550.0640.101
0.70.1770.0910.0560.0450.0440.0430.0450.0550.0560.071
0.90.1630.0850.0610.0450.0420.0430.0470.0530.0580.064
10.1710.0850.0640.0450.0420.0450.0480.0530.0580.063
1.30.1480.0820.0650.0520.0460.0460.0610.0550.0580.065
1.50.1460.0850.0690.0560.0500.0500.0510.0870.0610.065
1.70.1500.0850.0700.0600.0530.0550.0550.0560.1340.066
20.1320.0910.0760.0650.0590.0600.0600.0600.0610.265
n = 50
0.10.3340.1700.1180.0660.0440.0370.0670.1950.4010.275
0.35.0500.0930.0510.0260.0210.0200.0240.0270.0350.050
0.50.1960.0590.0300.0180.0170.0180.0210.0260.0300.037
0.70.1910.0530.0310.0180.0160.0170.0230.0250.0280.035
0.90.1310.0500.0290.0190.0160.0180.0220.0250.0280.029
10.1540.0440.0310.0180.0170.0200.0220.0240.0270.031
1.30.1120.0460.0290.0230.0180.0180.0330.0280.0290.031
1.50.1080.0490.0330.0240.0200.0220.0220.0590.0310.031
1.70.1190.0490.0360.0260.0220.0230.0250.0250.1080.033
20.1080.0530.0400.0300.0250.0260.0280.0290.0280.249
n = 100
0.10.2950.1390.0850.0380.0220.0220.0680.2010.5830.403
0.34.7700.0750.0390.0160.0110.0110.0170.0190.0230.035
0.50.1890.0610.0220.0110.0090.0120.0160.0170.0220.023
0.70.1410.0380.0240.0100.0090.0100.0140.0170.0180.021
0.90.1230.0350.0210.0110.0090.0110.0120.0150.0190.021
10.1220.0360.0190.0100.0090.0110.0130.0160.0170.020
1.30.1140.0350.0190.0120.0090.0100.0210.0160.0170.019
1.50.1050.0370.0190.0120.0100.0110.0120.0450.0170.020
1.70.0970.0340.0210.0140.0110.0120.0140.0140.0920.020
20.0880.0390.0230.0160.0120.0130.0130.0160.0160.227
Table 3. Bias of the MREE for different α , β and sample sizes n under contaminated data.
β \ α: 0.3, 0.5, 0.7, 0.9, 1, 1.1, 1.3, 1.5, 1.7, 2
n = 20
0.1−0.104−0.382−0.340−0.243−0.131−0.0710.0900.1880.2950.379
0.33.287−0.157−0.187−0.135−0.113−0.091−0.0450.0130.1070.237
0.52.6911.483−0.024−0.067−0.069−0.043−0.031−0.010−0.0030.051
0.73.0042.5461.1680.036−0.017−0.0080.0030.0060.0050.010
0.93.1332.8892.3190.9170.2220.0580.0190.0230.0170.022
13.1832.9862.5581.6190.8050.2140.0390.0300.0310.019
1.33.2393.1212.9022.5502.2621.8720.6130.0770.0490.040
1.53.2553.1703.0122.7752.6062.3961.6760.5710.0690.051
1.73.2713.1943.0712.9032.7902.6612.2561.4890.5780.057
23.2893.2163.1223.0122.9422.8652.6492.3051.6900.682
n = 50
0.10.384−0.170−0.189−0.132−0.0540.0240.1040.1710.2610.382
0.33.5490.000−0.122−0.086−0.077−0.053−0.0230.0290.0540.118
0.52.8751.7710.040−0.048−0.048−0.029−0.013−0.015−0.0170.003
0.73.0912.6981.2940.048−0.010−0.014−0.0010.0040.001−0.005
0.93.2052.9452.3790.9390.2260.0450.0090.0130.0120.013
13.2403.0112.6121.6090.7930.1960.0180.0140.0210.012
1.33.3163.1712.9252.5482.2391.8190.5540.0340.0200.020
1.53.3463.2233.0342.7802.5962.3631.5890.5020.0350.022
1.73.3623.2543.1002.9162.7912.6432.1991.3830.5180.025
23.3733.2813.1623.0352.9552.8652.6222.2361.5750.650
n = 100
0.10.610−0.138−0.105−0.0310.0020.0400.1170.1840.2700.381
0.33.9060.136−0.071−0.050−0.052−0.028−0.028−0.0080.0230.066
0.52.9271.9340.101−0.034−0.027−0.0160.0060.000−0.003−0.008
0.73.1222.7611.3480.0660.004−0.0070.0070.0110.0120.000
0.93.2412.9552.4060.9580.2380.0470.0040.0140.0220.017
13.2893.0452.6511.6220.7980.2020.0100.0110.0160.023
1.33.3623.2042.9442.5672.2451.8120.5330.0280.0150.022
1.53.3843.2693.0582.8022.6102.3691.5670.4850.0270.018
1.73.4053.3053.1332.9402.8112.6582.1961.3570.5040.018
23.4213.3273.2043.0652.9802.8862.6332.2341.5410.637
Table 4. MSE of the MREE for different α , β and sample sizes n under contaminated data.
β \ α: 0.3, 0.5, 0.7, 0.9, 1, 1.1, 1.3, 1.5, 1.7, 2
n = 20
0.10.4030.2480.4650.5761.0251.0931.6131.5651.6261.591
0.312.5950.1420.1030.0750.1920.1880.3620.5901.0161.537
0.57.4432.2680.0880.0620.0580.0590.0650.1890.2410.527
0.79.2096.6451.4100.0690.0560.0580.0630.0680.1190.208
0.99.9828.4935.5120.8820.1190.0680.0650.0690.0750.090
110.2929.0726.6722.6920.6930.1170.0680.0700.0760.087
1.310.6649.9168.5746.6415.2403.6100.4300.0790.0790.089
1.510.77810.2299.2387.8506.9405.8832.9170.3890.0790.087
1.710.88410.3799.5998.5827.9427.2345.2352.3260.4030.087
211.00410.5159.9159.2338.8148.3697.1775.4722.9980.547
n = 50
0.11.5520.8150.7410.7030.9661.1901.1291.2241.1651.210
0.314.9690.1050.0470.0300.0780.0750.2800.5590.5660.881
0.58.3453.1900.0490.0250.0210.0220.0250.0290.0350.184
0.79.6347.3351.6940.0310.0200.0220.0270.0290.0330.039
0.910.3538.7235.7120.8980.0770.0270.0280.0300.0320.039
110.5789.1266.8712.6190.6450.0670.0270.0300.0330.039
1.311.06910.1298.6086.5485.0643.3590.3290.0330.0340.038
1.511.26310.4579.2687.7876.8015.6482.5760.2790.0320.038
1.711.37110.6559.6768.5677.8547.0514.9081.9680.2980.037
211.44910.83310.0609.2758.7938.2766.9475.0792.5600.461
n = 100
0.12.1020.3990.8080.9450.9240.9290.8911.0121.2331.120
0.317.1850.1410.0330.0180.0130.0140.0180.1420.2580.453
0.58.6243.7680.0560.0150.0110.0150.0170.0180.0220.028
0.79.8097.6461.8280.0240.0110.0130.0180.0190.0200.023
0.910.5598.7645.8120.9270.0700.0180.0170.0200.0210.023
110.8709.3127.0582.6480.6450.0570.0170.0190.0210.023
1.311.34210.3068.6916.6195.0683.3120.2970.0200.0200.023
1.511.49410.7279.3797.8806.8455.6462.4840.2510.0210.021
1.711.63210.9609.8488.6757.9327.1014.8661.8730.2720.022
211.73911.10210.2979.4228.9108.3636.9735.0402.4200.430
