Highly Efficient Robust and Stable M-Estimates of Location

This article is partially a review and partially a contribution. The two classical approaches to robustness, Huber's minimax approach and Hampel's approach based on influence functions, are reviewed with an emphasis on distribution classes of a non-neighborhood nature. Attention is mainly paid to the minimax Huber M-estimates of location designed for the classes with bounded quantiles and to the Meshalkin-Shurygin stable M-estimates. The contribution focuses on a comparative performance evaluation of these estimates, together with the classical robust M-estimates, under the normal, double-exponential (Laplace), Cauchy, and contaminated normal (Tukey gross error) distributions. The obtained results are as follows: (i) under the normal, double-exponential, Cauchy, and heavily contaminated normal distributions, the proposed robust minimax M-estimates outperform the classical Huber and Hampel M-estimates in asymptotic efficiency; (ii) in the case of the heavy-tailed double-exponential and Cauchy distributions, the Meshalkin-Shurygin radical stable M-estimate also outperforms the classical robust M-estimates; (iii) for the moderately contaminated normal distribution, the classical robust estimates slightly outperform the proposed minimax M-estimates. Several directions of future work are listed.

The reasons for research in this field of statistics are of a general mathematical nature: the concepts of "optimality" and "stability" are mutually complementary in the performance evaluation of almost all mathematical procedures, and a trade-off between them is often the sought goal.
It is not rare that the performance of optimal solutions is rather sensitive to small violations of the assumed conditions of optimality. In statistics, the classical example of such an unstable optimal procedure is given by the least squares estimates, whose performance under small deviations from normality can become disastrous [5].
Since the term "stability" is overloaded in mathematics, its synonym "robustness" is at present conventionally used in statistics and in optimal control theory: in general, it means the stability of statistical inference under uncontrolled violations of the accepted distribution models.
At present, there are two main approaches to robustness: historically the first, the global minimax approach of Huber (quantitative robustness) [5], and the local approach of Hampel based on influence functions (qualitative robustness) [6]. Within the first approach, the least informative (favorable) distribution minimizing Fisher information over a certain distribution class is obtained, with the subsequent use of the asymptotically optimal maximum likelihood parameter estimate for this distribution. In this case, the minimax approach gives a guaranteed accuracy of robust estimates; that is, the asymptotic variance of the optimal parameter estimate is upper-bounded for distributions from the aforementioned class.
Within the second approach, a parameter estimate is defined by its desired influence function, which determines the qualitative robustness properties of an estimate, such as its low sensitivity to the presence of gross outliers in the data, to data rounding-off, to missing data, etc.
In what follows, we consider these methodologies in detail focusing on the optimization and variational calculus tools used in both aforementioned approaches. Within Huber's minimax approach, we review the conventional least informative (favorable) distributions obtained for the neighborhoods of a Gaussian [5], as well as those designed for a variety of the non-standard distribution classes of a non-neighborhood nature [7]. Within Hampel's local approach [6], we mostly emphasize its recently developed stable estimation branch with the originally posed variational calculus problems and rather prospective results on their application to robust statistics [8].
While this paper focuses on particular topics in the field of robust statistics, it is worth noting a few comprehensive reviews also covering the present state of the art in this field, namely References [9-14].
An outline of the remainder of the article is as follows. In Section 2, a general problem setting for the design of minimax variance M-estimates of location is recalled. In Section 3, the globally stable Meshalkin-Shurygin M-estimates are described. In Section 4, a comparative performance evaluation of the conventional robust M-estimates of location and several novel proposed M-estimates is carried out (the univariate setting is considered throughout the paper), yielding several unexpected results. In Section 5, some conclusions are given.

Preliminaries
The minimax principle aims at the worst case, suggesting the best solution for it [2]; thus, this approach provides a guaranteed result [5]. Applied to the problem of estimation of location, it yields a robust version of the maximum likelihood principle [2]. Usually, estimation of location is of primary interest, and, in this study, we focus on it.
Let x 1 , . . . , x n be a sample from a distribution with density p(x − θ) from a convex class P, where θ is a location parameter. Further, we assume that p is a symmetric distribution density; hence, θ is the center of symmetry to be estimated.
An M-estimate T_n of θ is a solution to the following minimization problem:

T_n = arg min_θ Σ_{i=1}^n ρ(x_i − θ) , (1)

where ρ(u) is called the function of contrast [15]: ρ(x_i − θ) is a measure of the difference between the observation x_i and the estimated center of symmetry. The following particular cases of (1) are of special interest: (i) for ρ(u) = u², we have the least squares method with the sample mean x̄ as the estimate of location; (ii) for ρ(u) = |u|, we arrive at the least absolute values method with the sample median estimate med x; and (iii), most importantly, for a given density p, the choice ρ(u) = −log p(u) yields the maximum likelihood (ML) estimate of location.

It is more convenient to formulate the properties of M-estimates in terms of the derivative of the function of contrast ψ(u) = ρ′(u), called a score function. In this case, an M-estimate is defined as a solution to the following implicit estimating equation

Σ_{i=1}^n ψ(x_i − T_n) = 0 . (2)

Under rather general regularity conditions imposed on the class Ψ of score functions ψ and on the class P of densities p (their various forms can be found in References [2,5,6]), M-estimates are consistent and asymptotically normal with the asymptotic variance

AV(ψ, p) = A(ψ, p) / B²(ψ, p) , (3)

where A(ψ, p) = ∫ ψ²(x) p(x) dx and B(ψ, p) = ∫ ψ′(x) p(x) dx. For M-estimates (2), the following result holds [5].
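As an illustration, the estimating equation (2) can be solved numerically for any monotone score function. The following sketch (assuming a Python environment with NumPy and SciPy; the helper name m_estimate is ours) recovers the sample mean, the sample median, and a Huber-type estimate from the same equation:

```python
import numpy as np
from scipy.optimize import brentq

def m_estimate(x, psi):
    """Solve the estimating equation sum(psi(x_i - theta)) = 0 for theta."""
    g = lambda theta: np.sum(psi(x - theta))
    # For a nondecreasing psi, g is nonincreasing in theta and changes
    # sign on the sample range, so a bracketing root-finder applies.
    return brentq(g, x.min() - 1.0, x.max() + 1.0)

rng = np.random.default_rng(0)
x = rng.standard_normal(200)

mean_est   = m_estimate(x, lambda u: u)        # rho(u) = u^2  -> sample mean
median_est = m_estimate(x, np.sign)            # rho(u) = |u|  -> sample median
huber_est  = m_estimate(x, lambda u: np.clip(u, -1.345, 1.345))  # Huber score

print(mean_est, median_est, huber_est)
```

For the identity score, the root coincides with the arithmetic mean; for the sign score, it lands in the interval of sample medians.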
Theorem 1. (Huber, 1964) Under regularity conditions, M-estimates satisfy the minimax property

AV(ψ*, p) ≤ AV(ψ*, p*) = 1/I(p*) ≤ AV(ψ, p*) , (4)

where p*(x) is the least informative (favorable) density minimizing Fisher information for location I(p) over the class P:

p* = arg min_{p ∈ P} I(p), I(p) = ∫ (p′(x)/p(x))² p(x) dx . (5)

From (4) and (5), it follows that the minimax function of contrast and score function are given by the maximum likelihood method for the least informative density p*:

ρ*(u) = −log p*(u), ψ*(u) = −p*′(u)/p*(u) . (6)

Thus, the pair (ψ*, p*) is the saddle point of the functional AV(ψ, p). The right-hand part of inequality (4) is the Rao-Cramér inequality, whereas its left-hand part guarantees the asymptotic accuracy of robust minimax estimation with the following upper bound upon the asymptotic variance of the minimax variance robust M-estimate of location: AV(ψ*, p) ≤ 1/I(p*).
The key point of the minimax approach is the solution of the variational problem (5); further, various classes P with the corresponding least informative densities p* and minimax estimates are listed. Now, we recall Huber's classical solution for ε-contaminated normal distributions (Tukey's gross-error model):

P_ε = { p : p(x) = (1 − ε) ϕ(x) + ε h(x) } , (7)

where ϕ(x) = (2π)^{−1/2} exp(−x²/2) is the standard normal distribution density, h(x) is an arbitrary contaminating density, and 0 ≤ ε < 1 is the contamination parameter.
Theorem 2. (Huber, 1964) In the class P_ε, the least informative density p* and the optimal score function have the following form [2]:

p*(x) = C exp(−x²/2) for |x| ≤ k, p*(x) = D exp(−k|x|) for |x| > k,

where the parameters C, D, and k satisfy the conditions of norming, continuity, and continuous differentiability of the solution at x = k: in particular, C = (1 − ε)(2π)^{−1/2} and D = C exp(k²/2). Finally, we get the linear bounded score ψ_Huber(x) = max{−k, min{x, k}}, where k depends on the value of the contamination parameter ε: the values of the parameter k = k(ε) are tabulated in Reference [5].
The particular cases of this solution for ε = 0 and ε → 1 are given by k → ∞ (the sample mean) and k → 0 (the sample median), respectively. Huber's score function ψ_Huber(x) is a robust version of ML estimation: in the center |x_i − θ| ≤ k, the data are processed by the ML method, and they are trimmed within the exponential distribution tails |x_i − θ| > k. In the limiting case of a completely unknown density, as ε → 1, the minimax variance M-estimate of location tends to the sample median.
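Numerically, the norming condition of Theorem 2 reduces to the known transcendental equation 2ϕ(k)/k + 2Φ(k) − 1 = 1/(1 − ε) for k = k(ε), which reproduces the values tabulated in Reference [5]. A minimal sketch (assuming SciPy; the function name huber_k is ours):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def huber_k(eps):
    """Solve Huber's norming condition 2*phi(k)/k + 2*Phi(k) - 1 = 1/(1 - eps)."""
    g = lambda k: 2 * norm.pdf(k) / k + 2 * norm.cdf(k) - 1 - 1 / (1 - eps)
    # g decreases from +inf (k -> 0) to -eps/(1 - eps) < 0 (k -> inf)
    return brentq(g, 1e-6, 10.0)

for eps in (0.01, 0.05, 0.10, 0.25):
    print(eps, round(huber_k(eps), 3))   # e.g., eps = 0.05 gives k ≈ 1.40
```

As ε grows, k(ε) shrinks toward 0, moving the estimate from the sample mean toward the sample median, in line with the limiting cases above.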

Free Extremals of the Basic Variational Problem
Consider the problem of minimization of Fisher information for location (5) under the two basic side conditions of non-negativeness and norming: p(x) ≥ 0 and ∫ p(x) dx = 1. Substituting q(x) = √p(x), for which I(p) = 4 ∫ q′(x)² dx, we rewrite this minimization problem as

minimize 4 ∫ q′(x)² dx subject to ∫ q²(x) dx = 1 .

Introducing the Lagrange multiplier λ related to the norming condition, we obtain the following differential equation for the function q(x):

q″(x) + (λ/4) q(x) = 0 .

The general solutions of this harmonic oscillator equation have the well-known forms depending on the sign of λ = ±4k²: (i) the exponential q(x) = C₁e^{kx} + C₂e^{−kx} for λ < 0; (ii) the cosine q(x) = C₁ sin kx + C₂ cos kx for λ > 0; and (iii) the linear q(x) = C₁ + C₂x for λ = 0, where k = √|λ|/2. Further, we show that all these forms appear in the structures of least informative distribution densities.
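The substitution q = √p rests on the identity I(p) = 4 ∫ q′(x)² dx. A quick numerical check of this identity for the standard normal density, whose Fisher information for location equals 1 (a sketch, assuming SciPy):

```python
import numpy as np
from scipy.integrate import quad

phi  = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # standard normal density
dphi = lambda x: -x * phi(x)                              # its derivative

# I(p) = ∫ (p'/p)^2 p dx  — the direct definition (5)
I_direct, _ = quad(lambda x: (dphi(x) / phi(x))**2 * phi(x), -10, 10)

# I(p) = 4 ∫ (q')^2 dx with q = sqrt(p), so q' = p' / (2 sqrt(p))
I_sub, _ = quad(lambda x: 4 * (dphi(x) / (2 * np.sqrt(phi(x))))**2, -10, 10)

print(I_direct, I_sub)   # both ≈ 1 for the standard normal
```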

Least Informative Distributions
The neighborhoods of the normal distribution are, generally, not the only models of interest. In real-life applications, information about the distribution's central part, tails, moments, and/or subranges is rather often available. The empirical distribution and quantile functions, histograms, and kernel estimates, together with their tolerance limits, provide other examples. To enhance the efficiency of robust estimates, this information can be used in minimax settings.
Further, we deal with symmetric distribution densities p(−x) = p(x). Evidently, distribution densities must also satisfy the non-negativeness and norming conditions common to all classes. For brevity, we do not write out these conditions each time we define a distribution density class. Now, we list several examples of distribution classes convenient for the description of prior knowledge about data distributions [7].
(2) The class of distribution densities with bounded variances: P₂ = { p : ∫ x² p(x) dx ≤ σ² }. All distribution densities with bounded variances are members of this class; evidently, the Cauchy-type distributions do not belong to it. The least informative density in this class is normal [17]: p*(x) = (2πσ²)^{−1/2} exp(−x²/(2σ²)).

(3) The class P₃ of approximately normal distribution densities is defined by Equation (7).

(4) The class of finite distributions: P₄ = { p : ∫_{−l}^{l} p(x) dx = 1 }. This class defines the boundaries of the data (i.e., the inequality |X| ≤ l holds with probability one), and there is no more information about the distribution besides the boundary conditions of smoothness: p(±l) = p′(±l) = 0. The least informative density in this class has the cosine-squared form [15]: p*(x) = l^{−1} cos²(πx/(2l)) for |x| ≤ l.

(5) The class of distributions with a bounded interquantile distribution mass: P₅. The parameters l and β are assumed given. The restriction upon the interquantile mass means that P(|X| ≤ l) ≥ 1 − β. We can redefine this class in a different way as the class with an upper-bounded interquantile range IQR_β = P^{−1}((1 + β)/2) − P^{−1}((1 − β)/2) ≤ 2l, where P(x) is the cumulative distribution function. The least informative density in this class has both cosine and exponential parts, working in the center and tail areas, respectively [7]:

p*(x) = A₁ cos²(B₁x) for |x| ≤ l, p*(x) = A₂ exp(−B₂|x|) for |x| > l,

where the constants A₁, A₂, B₁, and B₂ are determined from the simultaneous equations (restrictions) of the class P₅, namely the conditions of normalization and upon the distribution interquantile range, and the conditions of continuity and continuous differentiability at x = l (for details, see Reference [7]). It is worth noting that the classes P₁ and P₄ are particular cases of the class P₅ when β → 0 and β = 1, respectively.
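For the finite class, the classical least informative density is the cosine-squared one, p*(x) = l^{−1} cos²(πx/(2l)) on [−l, l], whose Fisher information for location equals π²/l². This can be confirmed by numerical quadrature (a sketch assuming SciPy; the choice l = 2 is arbitrary):

```python
import numpy as np
from scipy.integrate import quad

l = 2.0
p  = lambda x: np.cos(np.pi * x / (2 * l))**2 / l    # cosine-squared density on [-l, l]
dp = lambda x: -np.pi * np.sin(np.pi * x / (2 * l)) * np.cos(np.pi * x / (2 * l)) / l**2

# Fisher information for location: I(p) = ∫ (p')^2 / p dx over the support
I, _ = quad(lambda x: dp(x)**2 / p(x), -l, l)

print(I, np.pi**2 / l**2)   # both ≈ pi^2 / l^2
```

The integrand (p′)²/p stays bounded at the endpoints, since the squared sine factor cancels the vanishing cosine in the denominator.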

Hampel's Robust and Shurygin's Stable Estimates of Location
Robust methods have lower sensitivity to possible departures from the accepted distribution models as compared to conventional statistical methods. To analyze the sensitivity of estimation, it is natural to have specific indicators of it. In what follows, we introduce these indicators, namely the influence function and related measures.

Hampel's Local Approach to Robustness
Let P be a distribution function corresponding to p ∈ P, the class of distribution densities, and let T(P) be a functional defined in a subset of all distribution functions. The natural estimate defined by T is T n = T(P n ), i.e., the functional computed at the sample distribution function P n .
The influence function IF(x; T, p) of this functional is defined as

IF(x; T, p) = lim_{t→0} [ T((1 − t)P + t∆_x) − T(P) ] / t ,

where ∆_x is the degenerate distribution taking mass 1 at x [6]. The influence function measures the impact of an infinitesimal contamination at x on the value of an estimate, formally being the Gâteaux derivative of the functional T(P). For an M-estimate with a score function ψ, the influence function is proportional to it: IF(x; ψ, p) = ψ(x)/B(ψ, p), where the term B(ψ, p) stands in Equation (3) for the asymptotic variance of M-estimates.
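The proportionality IF(x; ψ, p) = ψ(x)/B(ψ, p) can be checked against the limit definition by contaminating the normal model with a small point mass and differencing the M-functional numerically (a sketch assuming SciPy; the helper T is ours):

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.stats import norm

psi = lambda u: np.clip(u, -1.345, 1.345)          # Huber score
p, dp = norm.pdf, lambda x: -x * norm.pdf(x)       # normal density and derivative

def T(t, x):
    """M-functional at the contaminated distribution (1-t)*P + t*Delta_x."""
    g = lambda th: (1 - t) * quad(lambda y: psi(y - th) * p(y),
                                  -np.inf, np.inf)[0] + t * psi(x - th)
    return brentq(g, -5.0, 5.0)

x0, t = 3.0, 1e-5
if_numeric = (T(t, x0) - T(0.0, x0)) / t           # finite-difference Gateaux derivative

B = quad(lambda y: -psi(y) * dp(y), -np.inf, np.inf)[0]   # ∫ psi'(y) p(y) dy by parts
if_formula = psi(x0) / B                                  # IF(x; psi, p) = psi(x)/B

print(if_numeric, if_formula)
```

Both numbers agree to several digits, and the bounded Huber score keeps the influence of the outlier at x = 3 finite.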
Based on the influence function, several local measures of robustness are defined [6], including the gross-error sensitivity of T at p:

γ*(T, p) = sup_x |IF(x; T, p)| .

Hampel's optimality problem of maximizing the efficiency eff(ψ, p) = 1/(AV(ψ, p)I(p)) under an upper bound on the gross-error sensitivity formally leads to Huber's minimax linear bounded score ψ_Huber(x) = max{−k, min{x, k}} in the class of contaminated normal distributions [6]. This particular result confirms the following general observation: the best estimates within both approaches, Huber's minimax and Hampel's local, are rather close in their performance.

Meshalkin-Shurygin's Stable Estimates of Location
This topic is partially the reverse of the conventional setting: the maximum of some measure of sensitivity is minimized under a guaranteed value of the estimate variance or efficiency.
The conventional point-wise local measures of sensitivity, such as the influence and change-of-variance functions [6], are not appropriate here; a global indicator of sensitivity is desirable. We show that a novel global indicator of robustness proposed by Shurygin [18], the estimate stability, is closely related to the classical variation of the functional of the estimate asymptotic variance. Although the related theory has been developed for stable estimation of an arbitrary parameter of the underlying distribution, here we focus on stable estimation of location.
A measure of the M-estimate sensitivity called the variance sensitivity (VS) is introduced in Reference [19]. Formally, it is defined as the Lagrange functional derivative of the asymptotic variance (3):

VS(ψ, p) = ∫ ψ²(x) dx / B²(ψ, p) . (11)

Equation (11) gives a global measure of the stability of an M-estimate in a model where outliers occur uniformly anywhere on the real line. The boundedness of the Lagrange derivative (11) holds under the condition of square integrability of ψ, with the corresponding redescending scores satisfying ψ(x) → 0 as |x| → ∞.
In Reference [7], it is shown that the principal part of the variation δAV(ψ, p) of the asymptotic variance AV(ψ, p) with respect to ||δp|| is proportional to the variance sensitivity (11), that is, to the Lagrange derivative of the asymptotic variance: δAV(ψ, p) ∝ ∂AV(ψ, p)/∂p. Further, consider the following optimization problem: what is the minimum variance sensitive score function for a given distribution density p? The solution is given by the minimum variance sensitive (MVS) score function:

ψ_MVS(x) = −p′(x) . (12)

The estimate with this optimal score function (12) is called the estimate of minimum variance sensitivity, with VS_min = VS(ψ_MVS, p). We define a global measure of the stability of any M-estimate by comparing its variance sensitivity with the minimum one: 0 ≤ stb(ψ, p) = VS_min(p)/VS(ψ, p) ≤ 1.
A number of optimization criteria with different weights for efficiency and stability of M-estimates of location have been proposed and solved; in other settings, efficiency is maximized under guaranteed stability, and vice versa [18]. Practically, in all these cases, we deal with score functions ψ_opt(x) with the following limiting forms: the maximum likelihood case ψ_opt(x) → ψ_ML(x) = −p′(x)/p(x) when the requirement of high efficiency matters most, and the opposite redescending case ψ_opt(x) → ψ_MVS(x) = −p′(x) when the requirement of high stability is important.
A compromise case is given by a stable estimate called radical, with equal efficiency and stability, eff(ψ, p) = stb(ψ, p), desirably both high: its score function is given by

ψ_rad(x) = −p′(x)/√(p(x)) .

Finally, it should be noted that the minimum variance sensitivity and radical estimates belong to the class of M-estimates with the exponentially weighted maximum likelihood score functions previously proposed by Meshalkin [20].

Asymptotic Efficiency of M-Estimates: A Comparative Study
Here, we compare the asymptotic efficiency of various robust and stable estimates of location at the distributions conventional in robustness studies; some particular results can be found in References [19,21]. We mainly focus on the minimax variance estimates for distributions with bounded interquantile ranges since, until now, their performance has not been thoroughly studied. Notably, some of the obtained results are unexpected and surprising.

Asymptotic Efficiency
The asymptotic efficiency of M-estimates is numerically computed as follows:

eff(ψ, p) = 1/(AV(ψ, p) I(p)) ,

where AV(ψ, p) is the asymptotic variance (3) and I(p) is the Fisher information (5).
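The efficiency functional eff(ψ, p) = 1/(AV(ψ, p)I(p)) can be evaluated by numerical integration of the terms A and B of the asymptotic variance. A sketch for the sample median and Huber's estimate at the standard normal, where I(p) = 1 (assuming SciPy; the function name efficiency is ours, and k = 1.345 is the common tuning for roughly 95% normal efficiency):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def efficiency(psi, p, dp, fisher):
    """eff(psi, p) = 1 / (AV(psi, p) * I(p)), with AV = A / B^2."""
    A = quad(lambda x: psi(x)**2 * p(x), -np.inf, np.inf)[0]
    B = quad(lambda x: -psi(x) * dp(x), -np.inf, np.inf)[0]  # ∫ psi'(x) p(x) dx by parts
    return B**2 / (A * fisher)

phi  = norm.pdf                                  # standard normal density, I(p) = 1
dphi = lambda x: -x * norm.pdf(x)

eff_median = efficiency(np.sign, phi, dphi, 1.0)                      # 2/pi ≈ 0.64
eff_huber  = efficiency(lambda x: np.clip(x, -1.345, 1.345), phi, dphi, 1.0)

print(eff_median, eff_huber)
```

The same routine applies at the Laplace or contaminated normal models once their densities, derivatives, and Fisher information values are substituted.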

Conclusions
From Table 1, the following conclusions can be drawn. (1) As usual, the performance of the sample mean under the heavy-tailed Cauchy and contaminated normal distributions is awful. Designed for these models, Huber's and Hampel's M-estimates perform well, except in the Laplace distribution case. This distribution, with moderately heavy tails and a sharp peak at the center, is a rather tough test for the asymptotic performance of M-estimates of location, especially as compared to the Cauchy distribution case. Recall that the Laplace distribution and distributions close to it are the least informative ones in wide classes of distributions, for instance, in the class P_β with the parameter β close to unity (the corresponding minimax variance M-estimates perform quite well in these cases). For a statistical user, the version with a bounded IQR (interquartile range, β = 1/2) seems a reasonable choice. (2) Surprisingly, the proposed minimax variance M-estimates with the scores ψ_β outperform the classical robust Huber's and Hampel's M-estimates at the normal distribution, although the shape of the least informative distribution is not at all normal: note that the Taylor expansion of the cosine²-bell shape is close to the exponential one for small values of β. Moreover, these M-estimates are better than the classical robust estimates in heavy-tailed distribution models. We explain this effect by the nature of the distribution class P_β: it is one of the widest possible distribution classes. Finally, these M-estimates and their statistical properties can be obtained in a closed analytical form.
Finally, we outline prospective future work: (i) an extension of the proposed minimax variance M-estimates to the multivariate case and to the classes with simultaneously bounded subranges for different values of the parameter β; in the latter case, we may expect a uniformly better performance; (ii) a generalization of the Meshalkin-Shurygin stable estimates to the multivariate case; and (iii) a thorough comparative study of the small-sample performance of M-estimates of location; Monte Carlo experiments show a slightly better performance of the classical Huber's and Hampel's estimates in this case.