An Extended Result on the Optimal Estimation under Minimum Error Entropy Criterion

The minimum error entropy (MEE) criterion has been successfully used in fields such as parameter estimation, system identification and supervised machine learning. There is in general no explicit expression for the optimal MEE estimate unless some constraints on the conditional distribution are imposed. A recent paper has proved that if the conditional density is conditionally symmetric and unimodal (CSUM), then the optimal MEE estimate (with Shannon entropy) equals the conditional median. In this study, we extend this result to the generalized MEE estimation where the optimality criterion is the Renyi entropy or, equivalently, the $\alpha$-order information potential (IP).

Using the MSE as the risk, the optimal estimate of $X$ is simply the conditional mean, $\mathrm{mean}(p(\cdot\,|\,y)) = \int_{\mathbb{R}} x\,p(x|y)\,dx$. The popularity of the MSE is due to its simplicity and its optimality in the linear Gaussian case [8,9,11]. However, the MSE is not always a good risk function, especially in non-linear and non-Gaussian situations, since it takes only the second-order statistics into account. Many alternative Bayes risks have therefore been studied, such as the mean $p$-power error [12], Huber's M-estimation cost [15], and the risk-sensitive cost [1]. For the general Bayes risk (3), there is no explicit expression for the optimal estimate unless some conditions on the loss function $l(x)$ and/or the conditional density $p(x|y)$ are imposed. As shown in [6], if $l(x)$ is even and convex, and the conditional density $p(x|y)$ is symmetric in $x$, the optimal estimate will be the conditional mean (or, equivalently, the conditional median).
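The correspondence between loss functions and optimal estimates can be illustrated numerically. The following sketch (an illustration, not part of the paper) uses a hypothetical skewed conditional density $p(x|y) = e^{-x}$, $x \ge 0$, for one fixed $y$, so that the conditional mean and median differ; scanning candidate estimates confirms that the squared-error risk is minimized near the mean and the absolute-error risk near the median.

```python
import numpy as np

# Sketch: under the MSE risk the Bayes-optimal estimate is the conditional
# mean; under the absolute-error loss it is the conditional median.
# Conditional density (assumed for illustration): p(x|y) = exp(-x), x >= 0.
x = np.linspace(0.0, 15.0, 15001)
dx = x[1] - x[0]
p = np.exp(-x)
p /= p.sum() * dx                      # normalize on the grid

mean = (x * p).sum() * dx              # conditional mean (= 1 for Exp(1))
cdf = np.cumsum(p) * dx
median = x[np.searchsorted(cdf, 0.5)]  # conditional median (= ln 2)

cands = np.linspace(0.0, 2.0, 1001)    # candidate estimates c
mse = [((x - c) ** 2 * p).sum() * dx for c in cands]   # E[(X - c)^2 | y]
mae = [(np.abs(x - c) * p).sum() * dx for c in cands]  # E[|X - c|  | y]

print(abs(cands[np.argmin(mse)] - mean) < 0.01)    # minimizer is the mean
print(abs(cands[np.argmin(mae)] - median) < 0.01)  # minimizer is the median
```

For a symmetric conditional density the two minimizers coincide, which is why the result of [6] can state the optimum as "the conditional mean (or equivalently, the conditional median)".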
Besides the traditional Bayes risk functions, the error entropy (EE) can also be used as a risk function in estimation problems. Using Shannon's definition of entropy [3], the EE risk function is

$H(e) = -\int_{\mathbb{R}} p_e(x) \log p_e(x)\,dx, \quad (4)$

where $p_e$ denotes the PDF of the estimation error. As the entropy measures the average dispersion or uncertainty of a random variable, its minimization makes the error concentrated. Different from conventional Bayes risks, the "loss function" of the EE risk (4) is $-\log p_e(\cdot)$, which is directly related to the error's PDF. Therefore, when using the EE risk, we are nonlinearly transforming the error by its own PDF. In 1970, Weidemann and Stear published a paper entitled "Entropy Analysis of Estimating Systems" [18], in which they studied the parameter estimation problem using the error entropy as a criterion functional. They proved that minimizing the error entropy is equivalent to minimizing the mutual information between the error and the observation, and also that the reduction in error entropy is upper-bounded by the amount of information obtained from the observation. Later, Tomita et al. [17] and Kalata and Priemer [10] studied estimation and filtering problems from the viewpoint of information theory and derived the Kalman filter as a minimum-error-entropy (MEE) linear estimator. Like most Bayes risks, the EE risk (4) has no explicit expression for the optimal estimate unless some constraints on the conditional density $p(x|y)$ are imposed. In a recent paper [2], Chen and Geman proved that, if $p(x|y)$ is conditionally symmetric and unimodal (CSUM), the MEE estimate (the optimal estimate under the EE risk) will be the conditional median (or, equivalently, the conditional mean or mode). Table 1 gives a summary of the optimal estimates for several risk functions.

* In this paper, $\delta(\cdot)$ is the Dirac delta function.
§ The mode of a continuous probability distribution is the value at which its PDF attains its maximum.
** The MAP estimate is a limit of Bayes estimators (under the 0-1 loss function), but is generally not itself a Bayes estimator.
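The CSUM result of [2] can be checked numerically in a simple setting. The sketch below (an illustration under assumed data, not the paper's proof) takes a hypothetical CSUM model $p(x|y) = N(\mu(y), 1)$ with two equally likely observations: choosing $g(y)$ equal to the conditional median (here also the mean) makes the error PDF a single centered Gaussian, whereas any misaligned estimator spreads the error into a mixture with strictly larger Shannon entropy.

```python
import numpy as np

# Sketch: with a CSUM conditional density p(x|y) = N(mu(y), 1), the Shannon
# error entropy (4) is smaller for g(y) = conditional median than for a
# misaligned estimator.  Two equally likely observations are assumed.
x = np.linspace(-12.0, 12.0, 24001)
dx = x[1] - x[0]

def gauss(m):
    # N(m, 1) evaluated on the grid
    return np.exp(-(x - m) ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

def error_entropy(offsets, weights):
    # Error PDF p_e(x) = sum_y P(y) p(x + g(y) | y); offsets are mu(y) - g(y)
    pe = sum(w * gauss(o) for w, o in zip(weights, offsets))
    pe = np.maximum(pe, 1e-300)            # guard the logarithm
    return -(pe * np.log(pe)).sum() * dx   # differential entropy

w = [0.5, 0.5]
H_opt = error_entropy([0.0, 0.0], w)   # g(y) = mu(y): error ~ N(0, 1)
H_bad = error_entropy([-2.0, 2.0], w)  # misaligned estimator: mixture error

print(H_opt < H_bad)   # the conditional median yields the smaller entropy
```

The optimal value here approaches $\tfrac{1}{2}\log(2\pi e) \approx 1.4189$, the differential entropy of a unit Gaussian.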
In statistical information theory, there are many extensions to Shannon's original definition of entropy.
Renyi's entropy is one of the parametrically extended entropies. Given a random variable $X$ with PDF $p(x)$, the $\alpha$-order Renyi entropy is defined by [14]

$H_\alpha(X) = \frac{1}{1-\alpha} \log \int_{\mathbb{R}} p^\alpha(x)\,dx, \quad \alpha > 0,\ \alpha \neq 1. \quad (5)$

Replacing the Shannon entropy in (4) with the Renyi entropy of the error yields the generalized EE risk

$H_\alpha(e) = \frac{1}{1-\alpha} \log \int_{\mathbb{R}} p_e^\alpha(x)\,dx. \quad (6)$

In recent years, the EE risk (6) has been successfully used as an adaptation cost in information theoretic learning (ITL) [4,5,13]. It has been shown that the nonparametric kernel (Parzen window) estimator of Renyi entropy (especially when $\alpha = 2$) is more computationally efficient than that of Shannon entropy [13].
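The computational convenience of the $\alpha = 2$ case comes from a closed form: plugging a Gaussian Parzen estimate of $p_e$ into $V_2(e) = \int p_e^2(x)\,dx$ and using the fact that two Gaussian kernels convolve into one gives $\hat{V}_2 = \frac{1}{N^2}\sum_{i,j} G_{\sigma\sqrt{2}}(e_i - e_j)$, with no numerical integration. A minimal sketch (kernel width and sample sizes are arbitrary choices for illustration):

```python
import numpy as np

# Parzen (kernel) estimator of the quadratic information potential
# V_2(e) = int p_e^2(x) dx ~= (1/N^2) sum_{i,j} G_{sigma*sqrt(2)}(e_i - e_j),
# where G is a Gaussian kernel; the kernel widths convolve to 2*sigma^2.
def ip2(errors, sigma=0.5):
    e = np.asarray(errors, dtype=float)
    d = e[:, None] - e[None, :]            # all pairwise differences
    s2 = 2.0 * sigma ** 2                  # variance of the convolved kernel
    return np.mean(np.exp(-d ** 2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2))

rng = np.random.default_rng(0)
tight = ip2(rng.normal(scale=0.1, size=500))   # concentrated errors
loose = ip2(rng.normal(scale=2.0, size=500))   # dispersed errors
print(tight > loose)   # concentrated errors -> larger IP, smaller H_2
```

Concentrated error samples produce a larger information potential, i.e. a smaller quadratic Renyi entropy, which is the sense in which minimizing (6) concentrates the error.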
The argument of the logarithm in Renyi entropy, denoted by $V_\alpha(X) = \int_{\mathbb{R}} p^\alpha(x)\,dx$, is called the $\alpha$-order information potential (IP)†† [13]. As the logarithm is a monotonic function, the minimization of Renyi entropy is equivalent to the minimization (when $\alpha < 1$) or maximization (when $\alpha > 1$) of the information potential. In practical applications, the information potential has frequently been used as an alternative to Renyi entropy [13].
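This equivalence is easy to verify on concrete densities. The sketch below (an illustration, not from the paper) compares a narrow and a wide Gaussian: the narrow one has the smaller Renyi entropy for both $\alpha = 2$ and $\alpha = 0.5$, which corresponds to a larger IP when $\alpha > 1$ but a smaller IP when $\alpha < 1$.

```python
import numpy as np

# Sketch: H_alpha = log(V_alpha) / (1 - alpha), so minimizing the Renyi
# entropy maximizes the IP when alpha > 1 and minimizes it when alpha < 1.
x = np.linspace(-20.0, 20.0, 40001)
dx = x[1] - x[0]

def pdf(s):
    # N(0, s^2) evaluated on the grid
    return np.exp(-x ** 2 / (2.0 * s ** 2)) / (s * np.sqrt(2.0 * np.pi))

def V(p, a):   # alpha-order information potential
    return (p ** a).sum() * dx

def H(p, a):   # alpha-order Renyi entropy
    return np.log(V(p, a)) / (1.0 - a)

narrow, wide = pdf(0.5), pdf(2.0)      # narrow density = less uncertainty
print(H(narrow, 2.0) < H(wide, 2.0), V(narrow, 2.0) > V(wide, 2.0))
print(H(narrow, 0.5) < H(wide, 0.5), V(narrow, 0.5) < V(wide, 0.5))
```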
A natural and important question now arises: what is the optimal estimate under the generalized EE risk (6)? We do not know the answer to this question in the general case. In this work, however, we extend the result of Chen and Geman [2] to a more general case and show that, if the conditional density $p(x|y)$ is CSUM, the generalized MEE estimate will also be the conditional median (or, equivalently, the conditional mean or mode).

†† This quantity is called the information potential since each term in its kernel estimator can be interpreted as a potential between two particles (see [13] for the physical interpretation of the kernel estimator of the information potential).

II. MAIN THEOREM AND THE PROOF
In this section, the discussion is focused on the $\alpha$-order information potential (IP), but the conclusions drawn can be immediately transferred to Renyi entropy. The main theorem of the paper is as follows.

Theorem 1: Suppose the conditional density $p(x|y)$ is CSUM. Then, provided the $\alpha$-order information potential of the error exists, for any measurable estimator $g$,

$V_\alpha(X - \theta(Y)) \ge V_\alpha(X - g(Y))$ if $\alpha > 1$,
$V_\alpha(X - \theta(Y)) \le V_\alpha(X - g(Y))$ if $0 < \alpha < 1$, $\quad (7)$

where $\theta(y)$ denotes the conditional median of $p(\cdot|y)$.

Remark: As $p(x|y)$ is CSUM, the conditional median $\theta(y)$ in Theorem 1 is the same as the conditional mean and the conditional mode. According to the relationship between the information potential and Renyi entropy, the inequalities in (7) are equivalent to $H_\alpha(X - \theta(Y)) \le H_\alpha(X - g(Y))$.

Proof of the Theorem: In this work, we give a proof for the univariate case ($n = 1$). A similar proof can be easily extended to the multivariate case ($n > 1$). In the proof we assume, without loss of generality, that for every $y$, $p(x|y)$ has its median at $x = 0$, since otherwise we could replace $p(x|y)$ by $p(x + \theta(y)\,|\,y)$ and work instead with conditional densities centered at $x = 0$. The road map of the proof is similar to that contained in [2]. First, we prove the following proposition.

Proposition 1: Suppose that, for every $y$, $p(\cdot|y)$ is symmetric and unimodal about $x = 0$. Then, for all measurable $g: \mathbb{R}^m \to \mathbb{R}$ for which the information potential of the error exists,

$\int_{\mathbb{R}} p_0^\alpha(x)\,dx \ge \int_{\mathbb{R}} p_g^\alpha(x)\,dx$ if $\alpha > 1$ (with the inequality reversed if $0 < \alpha < 1$),

where $p_g(x) = \int p(x + g(y)\,|\,y)\,p(y)\,dy$ is the PDF of the error $X - g(Y)$, and $p_0$ is the error PDF obtained with $g \equiv 0$.

Remark: It is easy to observe that Theorem 1 follows immediately from Proposition 1.

Proof of the Proposition: The proof is based on the following three lemmas.
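Before turning to the lemmas, the inequalities in (7) can be checked numerically. The sketch below (a check under assumed data, not part of the proof) uses a hypothetical CSUM model $p(x|y) = N(\mu(y), 1)$ with two equally likely observations: with $g = \theta$ the error PDF is a single $N(0,1)$, while a misaligned $g$ produces a two-component mixture, and the IP ordering flips with $\alpha$ exactly as the theorem states.

```python
import numpy as np

# Numerical check of (7) for p(x|y) = N(mu(y), 1), two equally likely y.
# offsets are mu(y) - g(y): all-zero offsets correspond to g = theta.
x = np.linspace(-15.0, 15.0, 30001)
dx = x[1] - x[0]

def gauss(m):
    return np.exp(-(x - m) ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

def V(offsets, a):
    # alpha-order IP of the error PDF p_e(x) = sum_y P(y) p(x + g(y) | y)
    pe = 0.5 * gauss(offsets[0]) + 0.5 * gauss(offsets[1])
    return (pe ** a).sum() * dx

opt, bad = (0.0, 0.0), (-1.5, 1.5)     # g = theta vs. a misaligned g
print(V(opt, 2.0) >= V(bad, 2.0))      # alpha > 1: theta maximizes the IP
print(V(opt, 0.5) <= V(bad, 0.5))      # alpha < 1: theta minimizes the IP
```

The $\alpha < 1$ direction also follows here from the concavity of $t \mapsto t^{1/2}$: averaging shifted densities can only increase $\int \sqrt{p_e}$.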
c) For any …

Proof of Lemma 1: See the proof of Lemma 1 in [2].

Remark:
The transformation $h \mapsto h^m$ in Lemma 1 is also called the "rearrangement" of $h$ [7]. By Lemma …. Therefore, to prove Proposition 1, it suffices to prove