An Entropy-Like Estimator for Robust Parameter Identification

This paper describes the basic ideas behind a novel prediction error parameter identification algorithm exhibiting high robustness with respect to outlying data. Given the low sensitivity to outliers, these can be more easily identified by analysing the residuals of the fit. The devised cost function is inspired by the definition of entropy, although the method in itself does not exploit the stochastic meaning of entropy in its usual sense. After describing the most common alternative approaches for robust identification, the novel method is presented together with numerical examples for validation.


Introduction
In spite of their practical importance in many fields of science and technology, apparently there is no universal and well-accepted definition of outlying data.This may have resulted from the wide range of fields (statistics, data mining, signal processing, time-series analysis, system identification, telecommunications, etc.) dealing with outliers from different perspectives.Broadly speaking, we can think of outliers as data values that appear to be inconsistent with the rest of the set.In model identification for automatic control applications as well as in the other fields of science and technology, outlying data may have a dramatic impact on parameter estimation.This paper describes the basic ideas behind a novel prediction error parameter identification algorithm for static models exhibiting high robustness with respect to outlying data.Given the low sensitivity to outliers, these can be more easily identified by analysing the residuals of the fit.The proposed algorithm is based on the minimization of a particular cost function of the model prediction error residuals.The devised cost function is inspired by the definition of Gibbs' entropy and shares the same mathematical properties of the entropy associated to a set of probability values p i , although the method in itself does not exploit the stochastic or information theoretic meaning of entropy in its usual sense (hence the name entropy-like).Contrary to alternative robust parameter identification methods (as the Least Median of Squares (LMS)), if the model is sufficiently regular with respect to the parameter vector θ, the derivatives of the proposed penalty function with respect to θ can be analytically computed allowing to exploit gradient and Hessian matrix information in the numerical minimization routine.Robustness to outliers is obtained as a consequence of the fact that the used cost function rewards unevenly distributed residuals rather than some kind of weighted mean square error (MSE).In particular, the minimization of the devised entropy-like function rewards the presence of a majority of low relative errors and a minority of large ones.
After reporting on the state of the art on robust parameter identification in Section 2., the proposed method is outlined and discussed in Section 3. The basic algorithm properties and associated problems are addressed in Section 4. and numerical results are provided in Section 5. Conclusions and future research directions are finally addressed in Section 6.

Robust Parameter Identification: A Brief Summary of the State of the Art
Consider a static system y i = f (x i1 , x i2 , . . ., x im , θ r ) + ε i : i = 1, 2, . . ., N being θ r ∈ R m×1 the unknown parameter vector, y i ∈ R the response variable, x i1 , x i2 , . . ., x im the explanatory variables and ε i the error term.Index i runs on the number of observations N that is assumed to be strictly larger than m.The error term ε i is assumed to be a normally distributed random variable with zero mean.Denoting with Z N = {(y i , x i1 , x i2 , . . ., x im ) : i = 1, 2, . . ., N } the set of the available observations, a regression estimator T is an algorithm associating to Z N an estimate θ of θ r , namely T (Z N ) = θ.Prediction error estimators T are designed based on the properties of the regression residuals where ŷi are the predicted responses ŷi = f (x i1 , x i2 , . . ., x im , θ).The most popular prediction error estimators are the Least Squares (LS) and weighted LS estimators defined respectively as where r ∈ R N ×1 is the residual vector r = (r 1 , r 2 , . . ., r N ) T and Γ ∈ R N ×N a symmetric positive definite (or eventually semidefinite) matrix of weights.Given the symmetric nature of Γ, there is no substantial loss of generality in assuming it to be diagonal, i.e., there exists an orthogonal N × N matrix Ω such that Ω Γ Ω T = diag(γ 1 , γ 2 , . . ., γ N ), so that the WLS estimator results in θWLS = arg min θ where r = Ω r.If the weight matrix Γ in the WLS estimator is chosen to be the covariance of the zero mean normally distributed error term ε = (ε 1 , ε 2 , . . ., ε N ) T , namely Γ = E[ε ε T ], the corresponding estimator (some times called the Markov Estimator) coincides with the Maximum Likelihood (ML) estimator defined as θML = arg max where p(Z N | θ ) is the probability of the data set Z N given θ.
If, moreover, the error terms ε i are assumed to be independent and identically distributed (i.i.d.) the above ML estimator minimizes the sample version of the entropy associated to p(•), namely θML = arg min θ The properties of the celebrated Maximum Likelihood, Least Squares and Weighted Least Squares estimators are very well known and will not be discussed here.Yet it should be stressed that these methods are also very well known to be highly sensitive to outliers [1,2], i.e., to elements of Z N that do not comply with the model (1).Intuitively the lack of robustness of all ML related methods should not surprise: if Z N is contaminated by wrong data (i.e., data that should not be in Z N ), maximizing the likelihood of Z N conditioned to θ means maximizing the probability that the data set does contain wrong data.The issue of designing robust (with respect to outliers) parameter estimators has been explored in many fields of science following often different approaches.In order to motivate and compare the solution proposed in this paper with the state of the art, some of the alternative known approaches are briefly summarized: based on the observation that the ML estimator minimizes the sample version of the entropy associated to the i.i.d.model errors ε i , in [3,4] a minimum entropy (ME) estimator is proposed.The main idea consists in estimating the probability density function of residuals p(r i (θ)) with a kernel based method (using radial basis functions, by example) and then computing the parameter vector estimate θME as the argument minimizing the entropy associated to p.The idea behind this approach is that minimizing the estimated entropy of the residuals will force them to have a minimum dispersion.This approach, although computationally demanding, is indeed appealing and has also been pursued within system identification in [5].
A different perspective to the problem can be found in the rich statistics literature on the subject.A milestone reference is the book by Peter J. Huber [2] describing (among the rest) the use of M-estimators (where the M is reminiscent of Maximum Likelihood).M-estimators can be thought of as generalizations of the LS estimator (3) where the square of residuals is replaced by an alternative penalty function ρ(•) with a unique zero in the origin and such that ρ(z) = ρ(−z), namely Indeed Laplace or L 1 regression is a special case of equation ( 8) corresponding to ρ(•) = | • |, i.e., the absolute value of residuals.Unfortunately, L 1 regression (as LS, i.e., L 2 regression) turns out to be quite sensitive to outliers [1] and is computationally complex even for models that are linear in θ.M-estimators (as the other parameter estimators) are more easily analysed in the special case that the model ( 1) is linear in the parameters, namely: or in vector notation With reference to the model (9) and to the M-estimator (8), if ρ(•) is differentiable, the computation of θM can be computed solving a system of m equations as Denoting with ψ the derivative of ρ with respect to the generic component of θ, the above m equations are usually reported as where the index j is omitted for the sake of brevity.Most often the design of M-estimators is performed by selecting the ψ(•) function in equation ( 12) rather than ρ(•) itself.In this case the corresponding Mestimator is classified as Ψ-type.Robustness to outliers is then sought for by selecting the ψ(•) function so that it saturates to a constant positive value when its argument (i.e., the residual) is larger than a certain threshold.Eventually the ψ(•) function can be selected so that it even goes to zero for sufficiently large residuals, in this case ψ(•) is said to be redescending.Examples of popular redescending ψ(•) functions include Hampel's three-part function for parameters 0 < a ≤ b < c, Tukey's biweight ψ(t) function for some positive k: or Andrew's sine wave (for some positive ω): The most popular non redescending design for ψ(t) is perhaps Huber's ψ(t) function for some positive k: corresponding to Winsorized Least Squares.Indeed the ψ(•) function associated to the LS estimator is simply ψ(t) = t and the one associated to L 1 regression is ψ(t) = sign(t).
A possible measure of the robustness of an estimator is given by the finite sample breakdown point [1]: assume that an estimator T associates the estimate θT to a given data set and it measures the least fraction of contamination that can arbitrarily bias the estimate.The asymptotic breakdown point is obtained as the limit of bdp(T, N ) for N tending to infinity and it is usually expressed as a percentage.It is well known that in LS regression problems even a single data point can arbitrarily affect the estimate.Denoting with T LS the LS-estimator, its finite sample breakdown point results in bdp(T LS , N ) = 1/N and its asymptotic breakdown point is 0%.Unfortunately M-estimators are also reported [1] to have 0% asymptotic breakdown point in spite of exhibiting good finite sample performance in practical applications.Some Ψ-type M -estimators may not be scale equivariant.Scale equivariance is a property according to which if the data values y j should be all scaled by a constant c, i.e., y j −→ c y j ∀ j, the corresponding estimate θ would scale the same way, i.e., θ −→ c θ.Of course this is a very important property for linear in the parameters models.To ensure proper scaling behavior of M -estimates that should not be scale equivariant, one can normalize the residuals with an estimate σ of the standard deviation of the data.Namely, the normalized Ψ-type M -estimate equations would take the form: where the σ needs to be estimated as well.Of course this increases the complexity of the overall problem.
A possible robust estimate of the standard deviation [2] can be taken to be for some positive constant k being MAD the Median Absolute Deviation defined as: M AD := med{|r j − µ|}.
where med{r h } is the median of the set {r h : h = 1, 2, . . ., N }.The constant k can be chosen to achieve consistency with the standard deviation of a known probability distribution: by example, should the r j values be distributed normally with standard deviation σ, the constant k would need to be approximately equal to 1.4826 for σ to be a consistent estimate of σ.For a more detailed discussion about estimating the scale of residuals refer to [2] (Chapter 5).
A common feature to all M -estimators is the structure described by equation ( 8) of the objective function to be minimized: notably, according to this structure, each residual contributes to the objective function independently from the others.Of course residuals are related to each other through the common "generating" model, yet according to the very definition (8) of M -estimators, the contribution to the objective function of the i-th residual does not depend on the other residuals.In a very loose sense, one could say that the above presented M -estimators are somehow "local" in nature as they strive for robustness trying to give a finite (or zero, for redescending M -estimators) weight to single residuals that exceed a threshold.Each residual contributes to the objective function based on its only value regardless the overall distribution.
An alternative approach is to minimize a "global" measure of the scatter of all residuals.Indeed the novel robust (or resistant) estimators presented in [1] have a different structure with respect to "local" (in the above sense) M -estimators: these are the Least Median of Squares (LMS) and Least Trimmed Squares (LTS) estimators.Both estimators appear to have excellent robustness properties: by their very definition they are computed minimizing objective functions that measure the overall distribution of residuals.In particular the LMS (not be confused with the Least Mean Square value) is defined [1] as: This estimator is shown [1] to achieve the maximum breakdown point possible (i.e., 50%) although its computation is numerically nontrivial and its performance in terms of asymptotic efficiency is poor.The LMS estimator has also an appealing geometrical interpretation that can be more easily described in the scalar case p = 1, i.e., when θ is a scalar and the data model is the line y = θ x: in this case, the LMS estimate θLMS of θ corresponds to the center line ( θLMS x) of the thinnest (measured in the vertical direction, i.e., y-axis) stripe containing [N/2] + 1 of the data points being [N/2] the integer part of N/2.This geometrical interpretation can be extended to the general case of linear in the parameters models as in equation ( 9).The LMS estimator can be also interpreted as a special case of a Least Quantile of Squares (LQS) estimator [1] (pp. 124-125): this is a class of estimators including as a special case also the L ∞ or minmax estimator defined as: A second robust estimator (having the same breakdown point of the LMS) presented and analysed in [1] is the Least Trimmed Squares (LTS) estimator defined as: being (r 2 ) i:N the ordered sequence of the squared residuals (first squared, then ordered), namely where (r 2 ) j:N = r 2 j for all j = 1, 2, . . ., N .Notice that in spite of the similarity with traditional Least Squares, the computation of the LTS estimate is not obvious as the dependance of the ordered sequence of the squared residuals from the parameter vector is by no means trivial.For a (not up-to-date, but still valid) discussion about numerical issues related to the computation of LTS and LMS estimates, see [1].Results relative to the complexity and the issues related to the computation of LMS estimates are described in [6] and [7].
Other robust estimators used mostly in statistics include L-estimators, R-estimators and S-estimators: the first are computed as linear combinations of order statistics of the residuals.They are mostly used in location (p = 1) problems and are usually simple to compute (example, in location problems, θL = N i=1 a i y i:N for proper constants a i ), although they have been shown [1] (p. 150) to achieve poor results when compared with alternative robust solutions.R-estimators are based on ranks of the residuals: such estimators have been studied from the early 1960s and, under certain conditions, have been shown to have the same asymptotic properties of M -estimators [1] (p. 150).S-estimators have been suggested in [1] and are based on M -estimates of the scale of the residuals.As reported in [1] (p. 208), S-estimates are rather complex to be computed and simulation results suggest that they do not perform better than the LMS.
One of the limits of the LMS estimator is its slow (N −1/3 ) asymptotic convergence rate (notice that LTS is shown to converge at the "usual" rate of N −1/2 ) [1].In the attempt to improve convergence of the LMS estimate, it was suggested in [1] to use a so called "Reweighted" Least Squares (RLS) approach: the basic idea is to use a first robust estimate s 0 of the scale of residuals (by example based on the MAD (21)) to compute binary weights for each residual as: where c is an arbitrary threshold (usually equal to 2.5).Weights equal to zero will correspond to data points that will be completely ignored, while weights equal to one will correspond to data points used for the next step of the algorithm.Once that weights w * have been computed according to equation (26), a second estimate of the scale is computed as ( [1] pp. 44-45): Then a new set of weights w † i is computed (hence the name "reweighted") on the basis of the new scale estimate σ * , namely: and the final estimate of θ is computed as a (Re-) Weighted Least Squares estimate according to: Accordingly, the final scale estimate is computed as in equation ( 27) but with the w † i weights in place of the w * i ones.Numerical simulations reported in [1] show that the above described Reweighted Least Squares (RLS) solution has very nice finite sample properties, although the hope that this solution could also improve the rate of asymptotic efficiency has been shown to be false in [8].Of course many variants to this reweighted schema are possible: by example weights can be computed with a smooth function ( [1] p. 129) rather than a binary one, or they can be computed adaptively [9], or even based on Pearson residuals [10] giving rise to a one step robust estimator.The detailed discussion of these (and other) variants to the RLS approach goes beyond the scope of this paper and will be omitted for the sake of brevity.
To conclude this very brief overview of robust estimation methods, it should be noted that besides the statistics research community, other branches of science have been addressing similar problems exploiting different methods.For example, within the machine learning community, popular approaches include Neural Networks based or Support Vector Machine (SVM) estimators [11].For pattern recognition and computer vision classification problems, voting algorithms are also widely employed.One of the most popular voting algorithm is the Hugh transform or, more generally, the Radon transform [12].This is often used in computer vision applications: it consists in performing a transformation between the image space (pixels) and a parameter space relative to specific curves.In its most common and simple formulation, the Hugh transform is used to detect straight lines in 2D images: a (simplistic) implementation of the method could be summarized as follows.First a set of candidate pixels C p is selected based on a given criteria (by example color, or some other image property).Then, each pixel in C p with coordinates (x, y) "votes" for sampled parameters (a j , b h ) in sets S a = {a 1 , a 2 , . . ., a n : Once that all pixels in C p have been processed, the straight lines in C p are determined by selecting the parameter pairs (a * , b * ) that have been assigned the highest number of votes.This kind of voting algorithm has the advantage of being computationally simple and relatively fast: this approach is popular in image processing applications where real time performance is essential.Notice that the spirit of voting schemas is to sample the parameter space such that the majority of candidate data points agree on a specific point of the parameter space.The selection of the parameter points is performed "globally" after all candidate data points have expressed their vote.Robustness to outliers is naturally obtained through the voting criteria itself.The Hugh transform method can be extended to identify more complex curves than straight lines.Of course the computation time and the memory requirements of the method increases rapidly with the number of data points to process and with the dimension of the parameter space to be sampled.The computational effort associated with the number of data points is due to the fact that each of them is processed before the estimate can be computed.An alternative algorithm that remedies this problem is RANSAC, i.e., Random Sample Consensus [13].
The RANSAC is an iterative algorithm based on random sampling of the data: in short, a subset (candidate inlier data set C inliers ) of the data points is randomly sampled.In its most simple implementation, the size N 0 of this randomly sampled subset is fixed and is one of the design parameters of the algorithm.The elements in C inliers are then fitted to the model through standard methods as, by example, Least Squares.The rest of the data points (not used for estimating the model parameter vector) is tested against the model: data points with residuals below a given threshold, hence that have reached a consensus with the candidate parameter vector of the model, are added to the candidate inlier set C inliers .If the size of C inliers built in this manner is sufficiently large (namely larger than a design parameter threshold N c ), all the data in this set are fitted to the model giving rise to the RANSAC estimate of the parameter vector.Otherwise the whole procedure is repeated for a maximum number of times N max .An estimate of how large N max should be can be obtained on the basis of an estimate of the percentage of outliers, of the size N 0 of the randomly sampled subset and of the desired probability that the dimension of C inliers after testing all the data is at least N c (refer to [13] for details).
Another approach for robust parameter estimation has been developed in the last 15 years within the information and entropy econometrics research community [14]: exploiting (in essence) Laplaces principle of insufficient reason and the information theoretic definition of entropy, a method known as Generalized Maximum Entropy (GME) has been developed [15] for the parameter identification problem.With reference to linear in the parameters models as in equation (10), contrary to all the other discussed approaches, the GME method aims at estimating both the parameter vector θ and the error term ε.From a technical point of view, this goal is pursued by re-parametrizing the model in equation (10) so that θ and ε are expressed as expected values.A basic scenario is the following: assuming for all i = 1, 2, . . ., m, the linearly spaced support vectors z i ∈ R l×1 and v i ∈ R l×1 are defined on the sets [−c i , c i ] and [−d i , d i ] for some l.By example, if l = 5 one would have Such support vectors are then used to define block-diagonal matrices Z and where p i ∈ R l×1 and w i ∈ R l×1 are the (discrete) probability density functions (on supports z i and v i ) of θ i and ε i respectively.Having introduced such discrete probability density functions, the values of p and w are estimated based on the principal of maximum entropy as subject to the consistency (10) and normalization constraints having denoted with (x) k the k-th component of vector x.Solving equation (35) subject to constraints (36), (37) and ( 38) is in general by no means trivial and needs to be accomplished numerically.On the other hand, the entropy function in equation ( 35) to be maximized is strictly concave on the interior of the constraints (37) and (38) implying that a unique solution to the constrained optimization problem exists if the intersection between the constraints is non-empty.Of course the GME estimate θGME of θ will be given by θGME = Z pGME .
The GME method is extremely interesting for its many noteworthy properties.In particular the method also converges when the model matrix G in equations (10) or ( 35) is singular and there is no need for specific assumptions on the distribution of the error term ε.Notice, for example, that in the extreme case where G = 0 and E[y] = 0, the Least Squares or Weighted Least Squares estimators would be ill-defined whereas the GME method would lead to uniform probability density functions p i and w i implying θGME = 0 (if the support vectors z i and v i are symmetrical with respect to zero as in equations (30-31)).Of course this desirable behavior is possible thanks to the prior information on θ and ε (unnecessary within LS and related approaches) that is embedded in the definition of the support vectors z i and v i .
It should be noticed that the GME method is widely used in econometrics research (see [16] for a recent application in this field), but still poorly exploited in other application domains that could greatly benefit from it (consider, for example, system identification and control engineering where robustness is a must [17,18]).Notice that the GME approach guarantees high robustness with respect to singular regression models, but in its standard formulation described above it does not offer specific benefits with respect to the presence of outliers.

An Entropy-Like Estimator
The proposed method, similarly to M-estimators, LTS and LMS-estimators, builds on the minimization of a properly defined penalty (or cost) function.The novelty of the method is related to the definition of the cost function: the aim is to find a cost function able to give a "global" measure of the scatter of the residuals.As explained in the following, such function will be built on the basis of the concept of (Gibbs) entropy.
Given the residual r i as in equation ( 2), define: namely the LS estimation cost.Then define the relative squared residuals q i as if and finally The function H enjoys all the mathematical properties of a normalized entropy [19] function associated to the sequence of "probability", like q i : i = 1, 2, . . ., N .In particular: Notice that the hypothesis D = 0 in equation ( 41) is needed just to prevent the singular situation occurring when the LS fit is perfect.This is not a practical limit, as prior to computing H one can always check if the LS fit is perfect.In such case there is of course no need to compute any other estimate of the parameters.Also notice that for null values of q i the terms 0 log 0 in equation ( 42) are zero (recall that x log x = log x x and that 0 0 = 1).
When the relative squared residuals q i are properly defined (i.e., D = 0), the H function is a measure of their spread.When they are not properly defined, it is simply because the residuals are all identically null which corresponds to a null value of H exactly as in the case when all the residuals are zero except one.In Physics the entropy of a system admitting N discrete states with probabilities p 1 , p 2 , . . ., p N is computed as − N i=1 p i log p i .It is well known that such function is a sensitive measure of the dispersion of the probabilities.Configurations with only a small fraction of highly probable states have a lower entropy of configurations where most states are approximately equally probable.Motivated by this fact, the function H is defined with the aim of measuring the dispersion of the relative squared residuals.In particular given that the entropy-like function H as defined by equation (42) depends on θ through the residuals r i (equation ( 2)), the following estimator is proposed: where LEL stands for Least Entropy-Like.Such name was chosen with the twofold objective (i) of underlining that the H function is not properly an entropy and (ii) of avoiding confusion with the Minimum Entropy estimation approach described in Section 2..The idea behind the θLEL estimator defined in ( 46) is that such estimate will correspond either to making all the residuals null, or to making the relative squared residuals as little equally distributed as possible according to the H function minimization criteria (46).Notice that due to the normalization of the relative squared residuals q i (41), forcing them to be "as little equally distributed as possible" means that "most" residual r i will need to be "small" (with respect to the normalization constant D, i.e. the Least Squares cost) and "a few" of the residuals r i will need to be "large".Data points corresponding to these "large" residuals are outlier candidates.Stated differently, the reason for robustness with respect to outliers is that the devised penalty function does not directly measure the (weighted) mean square error (that tends to level out or "low pass" residuals), but rather the dispersion or variability of the relative squared errors.Before discussing in greater detail the properties of the proposed estimator, it should be noticed that, in general, there is no guarantee for the H function to have a unique minimum with respect to θ.The entropy-like penalty function H is nonlinear and may have multiple local minima.The minimization of H needs to be carried out numerically with particular attention to the initialization of θ: indeed the proposed estimator should be regarded as local in nature.

Non uniqueness of the LEL-estimator
One of the most relevant properties of the proposed estimator is its non-uniqueness due to the nonlinear nature of H.In particular, as is well known [19], entropy is invariant under translation of the probability density function.Indeed this property of entropy needs to be explicitly taken into account in all entropy related parameter estimation approaches including, i.e., the ME (Section 2.) [3].As for the LELestimator, besides the possible invariance of H under translation of the relative squared residuals q i (41), H is also invariant under scaling of the residuals r i (2).Indeed, indicating with r := (r 1 , r 2 , . . ., r N ) T the residual vector, given the definitions (40, 41) and (42), the relative squared residuals q i and the H function are invariant under scaling of r, namely H(r) = H(λr) for any scalar λ = 0.Such invariance property may impact on the computation of the estimator θLEL (46): suppose that two distinct values θ 1 and θ 2 of the parameter vector exist such that for some constant λ = 0, then H(θ 1 ) = H(θ 2 ) potentially jeopardizing any minimization routine of H. Yet, interestingly, the potential singularity associated with the scaling situation (47) is absent if the underlying model is linear in θ.Consider a linear in the parameters model as in equation (10), then equation (47) would be: If λ = 1 and θ 1 = θ 2 , equation (49) implies that matrix G is not full rank and the non-uniqueness of the LEL-estimator in this case would be actually inherited by the non-uniqueness of the Least Squares estimate.If, on the contrary, λ = 1 and G has full rank, equation (49) implies that y belongs to the range of G: in this case the Least Squares estimate θLS = (G T G) −1 G T y yields a perfect fit making any other estimator useless.
The above simple analysis reveals that for linear in the parameters models, the non-uniqueness of the LEL estimator due to residual scaling is not an issue, as such situation may only occur if the standard Least Squares fit is perfect.Indeed, prior to implementing any estimator, one should always analyze the quality of the Least Squares estimator, in particular for linear in the parameters models .
The above does not mean that linear in the parameters models admit a unique LEL-estimate when the LS fit is not perfect: it only shows that the eventual non-uniqueness is not due to residual scaling phenomena (47), but rather to the nonlinear structure of H or its invariance under translation of the relative squared residuals q i .

Computational Issues
Given the local nature of the LEL estimate, how should one compute the minimum in equation ( 46)?According to the experience so far acquired with simulated [20] and real data (work is in progress with detecting planes in 3D range-image camera data), the computation of θLEL can be successfully performed locally and numerically from an initialization value sufficiently close to the real value of θ.Of course this is by no means trivial being the real value of θ unknown: for models linear in the parameters experience has shown that good results may be achieved by running any numerical minimization routine m times (where m is the dimension of θ, i.e. θ ∈ R m×1 ) from m initial values θ i chosen as: (50) where θ0 is either an initial guess based on prior information, or another estimate as, by example, the Least Squares one θ0 = θLS .The θ i : i = 2, . . ., m values can be computed based on a Gram-Schmidt algorithm.Should the m LEL estimates thus computed not coincide, the obvious best (local) solution among them will be the one corresponding to the least value of H. Numerical examples relative to the above described heuristics are provided in Section 5..As for the minimization algorithm to be used in computing θLEL , any state of the art optimization routine for nonlinear equations can be a suitable candidate.One can eventually exploit gradient and Hessian information knowing the structure of the penalty function H. On the basis of equations (40, 41) and (42), by direct calculation it follows that the gradient, by example, has components:

LEL-estimator breakdown point
Assessing the breakdown point for the LEL estimator is not an easy task.To determine the finite sample breakdown point according to the definition given in equation ( 17), the least fraction of outliers possibly causing a diverging bias in the estimate should be found.Given the H function property (44) it can be stated that a single outlier in a data set is not able to move H( θLEL ) from its absolute minimum, i.e., zero.This means that a worst case estimate of the finite sample breakdown point over N data points is 2/N , i.e., it is double with respect to the Least Squares finite sample breakdown point 1/N .Yet this means that the the LEL asymptotic breakdown point would be 0%, i.e., not any better than the Least Squares estimate.Nevertheless, extensive simulations have shown that the LEL estimator has an excellent finite sample behavior and, in particular, that it may be effectively used to spot outliers by analyzing the residuals.As known, the Least Squares method may perform poorly when used in this way because of its intrinsic tendency to low pass outliers.

Why should one use the LEL estimator?
In the previous sections it was shown that the proposed LEL estimator is local in nature and, in general, cannot be computed analytically as the Least Squares one.Moreover its asymptotic breakdown point is 0% whereas alternative estimators as the Least Median of Squares (LMS) or the Least Trimmed Squares (LTS) can achieve 50% asymptotic breakdown point.It is thus natural to ask why should one consider it: the answer is to be found in its most interesting numerical properties when compared to alternatives as the LMS or LTS.Indeed the computation of the LMS or LTS estimates is extremely complex: standard algorithms [1] may be very time consuming as they are not far from performing an exhaustive search in the parameter space.Moreover, the LMS cost function med i {r 2 i } (22) when computed on the finite sample of available data can be extremely erratic as a function of the parameter vector θ (examples are reported in the following of the paper).To the contrary, the proposed H function (besides enjoying properties 43, 44 and 45) is as smooth as the square residuals r 2 i as a function of θ.This allows the chosen minimization algorithm to compute its minima with much less effort.Of course the choice of using the LEL estimator or an alternative with a more favorable breakdown point will strongly depend on the application.Further considerations on application scenarios that could benefit from the proposed LEL estimator are reported in the closing section of the paper.

Numerical Results
As a first toy example to investigate the properties of the proposed cost function H as compared to the Least Squares (LS) and Least Median of Squares alternatives, consider the data set depicted in Figure (1): there are 100 points depicted as dots having x−coordinates equally spaced in the range [−10, 10].The y−coordinates are computed as y = x + being a Gaussian random noise of zero mean and unit variance.Moreover there are other 25 data points depicted with a diamond shape: these have x−coordinates normally distributed with mean 5 and unit variance, while their y−coordinates are normally distributed with mean 10 and unit variance.The total data set is made of the these 2 subsets, namely it contains 125 points that can be thought of as 100 values satisfying a linear model y = x (with some noise), plus 25 outliers.
The reference model for the above described data set is y = θ x being y and x ∈ R 125×1 known while θ ∈ R is the unknown scalar to be estimated.Given the presence of the outliers, the LS estimate is expected to be biased with respect to the "real" value 1. Indeed its value results in θLS = Given that the unknown parameter is a scalar, the LS, LMS and LEL cost functions can be easily plotted as functions of θ.The LMS, LS (normalized) and LEL cost functions are computed as where med i {•} is the sample median over the set in argument and N = 125.The LS cost f LS (θ) in equation ( 55) is normalized such that f LS ( θLS ) = 1 whereas the LEL cost function f LEL (θ) in ( 56) is the H function of equation (42) except for checking if N j=1 (y j − θx j ) 2 is null or not (unnecessary in practice).These cost functions are sampled in the range θ ∈ [−10, 10] with 10 4 equally spaced values of the slope θ: the resulting plots are depicted in Figure (2) at different zoom levels.As expected, the LS cost has a (unique) minimum in 1.16.The LMS cost has a discontinuous and rather erratic behavior making it difficult to accurately determine where its minima are.The LEL cost function has a regular plot (the function is actually smooth in this case) and it appears to have a sharp local minimum in θ = 1.Notice that the LEL function has also a local minimum close to θ = 2 confirming the local nature of the proposed estimator.To explore the behavior of the proposed approach on a multidimensional problem, consider the following model: or, in vector notation, where ω 1 and ω 2 are known, but θ r is not.Assume that a data set (y, x) : x, y ∈ R N ×1 is available and that ε ∈ R N ×1 is a vector of zero mean, normally distributed noise (eventually with known covariance).The following numerical experiment (Case 1) is performed: the "real" value of the parameter vector θ r is randomly generated (each component is the rounded value of a uniformly distributed number in the range [−100, 100]) yielding θ r = (63, 2, 11, −29) T .The values of ω 1 and ω 2 are chosen to be ω 1 = 1 and ω 2 = 2 and the noise term ε is normally distributed (i.i.d.) with zero mean and variance equal to 10.The independent variable x is generated as a uniform ramp of N = 10 4 values in the range [−10, 10], whereas y is computed according to equation (58).Given these numerical values, the LS estimate of θ r may be computed as θLS = (G T G) −1 G T y resulting in θLS = (62.92,2.03, 11.02, −29.03)T .The corresponding LEL estimate (Case 1) is computed according to its definition (46).In particular the minimization of the H function is performed 4 times starting from 4 different initialization values computed as in equations (50-51) with θ 0 = θLS .The minimization routine is the FMINSEARCH multidimensional unconstrained nonlinear minimization (Nelder-Mead) of Matlab (Version 7.8.0.347 (R2009a)).The results of these minimizations are summarized in Table 1: the first column refers to the values of θ used to initialize the minimization routine.The second column refers to the local minimum that was found and the third column refers to the value of H in such local minimum.Notice that the top element of the first column (case 1A) is θLS .Also notice that cases 1C and 1D lead to the same local minimum and that the least value of H is obtained in case 1A.Nevertheless in all four cases the value of H is relatively high (recall that H ∈ [0, 1]) and the differences among the four cases (in particular 1A, 1C and 1D) are extremely small, i.e., poorly significant.The plot of the (x, y) data together with the LS and LEL fits (1A, 1B, 1C/D) are reported in Figure (3).The LEL-1A estimate is very close to the LS one (that is very close to the real parameter vector θ r ) and in Figure (3) the fitted data G θLEL−1A and G θLS appear to be almost perfectly overlapping with the original data y (first row in Figure ( 3)).To the contrary, the fitting behavior of the other two LEL estimates is clearly less accurate.A quantitative criteria to determine unambiguously which of the four (LS, LEL-1A, LEL-1B, LEL-1C/D) estimates is the "best" can be the value of the median of the fitting errors.More precisely, the median of the absolute fitting errors or of the squared fitting errors.These values are reported in Table 2.The results summarized in Table 2 suggest that the LS estimate, in this case, should be preferred to the LEL one.To cross check this result, it may be useful to graphically inspect the plot of the sorted absolute values of the fitting errors as depicted in Figure (5).
These results should not be surprising as in the given setting (no outliers and additive zero mean normally distributed noise) the LS is guaranteed to be the optimal estimator.Yet things change considerably if the data set (x, y) is corrupted so that some of the data (a minority) will not satisfy the above hypothesis.Assume, for example, that a fraction of the available y values (say 10%) are multiplied by a random gain in the range [0, 10] (due to some data recording or communication problem, it does not really matter here).In particular this kind of corruption (Case 2) is generated taking exactly the same y vector of Case 1 and multiplying 10% randomly selected components of y (i.e., 1000 randomly selected y values) each by a different random number uniformly distributed in the range [0, 10].The resulting data is plotted in Figure (4).
The LS estimate of θ based on this corrupted data (Case 2) results in θLS = (87.44,2.68, 15.95, −39.96)T that appears to be significantly distant from the real value θ r (and from the Case 1 LS estimate).The LEL estimate is computed exactly as described for Case 1, but using the new (Case 2) LS estimate as initialization value θ0 .
The results are summarized in Table 3.Notice that the minimization routine always converges to the same value that appears to be very close to the real one.Moreover, the least value of H is significantly smaller than in Case 1 suggesting that the distribution of the relative squared residuals has a smaller dispersion than in Case 1.The plots of the LS fit, the LEL fit and the case 2 data is depicted in Figure (6).As for Case 1, based on the only plots of the fitted data, it is not obvious which model is performing better.Yet in terms of the median of the absolute values of the residuals (Table 4), the LEL estimate is certainly to be preferred.Indeed the plot of the sorted absolute residuals in Figures (7)(8) reveals that the great majority (about 90%) of the data is significantly closer to the LEL fit rather than the LS fit.
The Case 2 experiment has been repeated 100 times with different values of θ r , namely each time its components were rounded values of uniformly distributed numbers in the range [−100, 100].In each of the 100 iterations all the random variables used were different realizations.Each of the 100 iterations gave similar results to the ones described, i.e., a LEL estimate was computed that had lower median of absolute residuals with respect to the LS estimate and was closer to the real parameter vector.As for computational effort, the minimization of the H function was performed with the FMINSEARCH multidimensional unconstrained nonlinear minimization (Nelder-Mead) of Matlab (Version 7.8.0.347where the error was computed as the sample standard deviation of all the iterations.Recalling that the number of data points was always N = 10 4 this result is rather interesting as it suggests that the proposed method can be eventually employed for on line applications, at least for models of comparable size.The A and B estimates coincide and correspond to the least value of H among the three.Hence the best (local) LEL estimate of θ is to be considered θLEL−A/B .Nevertheless, interestingly this estimate does not correspond to the least value of the median of squares cost.The θLEL−C estimate performs better in terms of the median of squares cost criteria.For a graphical interpretation of these results, refer to Figure (9) where the residuals, scaled by their median absolute deviation MAD (21) scale estimate, are depicted.Comparing the bottom plot in Figure (9) with the equivalent plot for the LMS estimate in [1] (p. 95), one can arguably conclude that the θLEL−C estimate is (very) close to the LMS one.This shows that the LEL and LMS criteria differ and should not be considered equivalent, although in spirit both are defined so that the residual scatter is somehow minimized.The Hawkins-Bradu-Kass example also shows that the LEL estimate can be affected by the presence of bad leverage points (outliers): notice that the central plot in Figure (9) reveals how the (best) LEL estimate (A/B) accommodates the 10 bad leverage points within the fit and excludes the four x i -space outliers 11, 12, 13 and 14.Although from the LEL criteria perspective one could also argue that the first 10 points are not outliers (or bad leverage points), whereas the following 4 are.Indeed the very definition of outlying data should be given according to a fitting criteria.The interpretation of similar results without an a priori agreement on the definition of outlier will always be debatable.A novel prediction error method for robust parameter identification in the presence of outlying data has been presented.The approach builds on minimizing a cost function inspired by the concept of (Gibbs) entropy although the probabilistic or information theoretic meaning of entropy is not explicitly involved.The function to be minimized inherits the smoothness properties of the data model, hence if the model is sufficiently well behaved, any off-the-shelf unconstraint numerical minimization routine can be exploited.Although the asymptotical breakdown point of the algorithm is not any better than standard Least Squares, numerical examples were provided showing an excellent finite sample behavior.The function to be minimized may have multiple minima, hence the proposed approach is structurally local in nature.A simple heuristic method to select a family of initialization values for the local minimization has been suggested and tested on several examples.Potential outlying data can be identified by analyzing the (sorted) plot of the absolute (or squared) residuals.
One of the significant properties of the proposed method is relative to the low computational effort required to compute the parameter estimate; such property may be exploited in online applications.As an example, consider sensor signal processing in robotics or automatic controls.Assume that range or imaging data are acquired by a robot or a vehicle and that a set of features needs to be extracted from the data online.Typical examples may include sonar data acquired by marine or areal vehicles or images acquired by any moving robot equipped with a computer vision system.If a reliable estimate θk of the features is known at time k, it is often a reasonable assumption that they will be "approximately close" to θk at time k + 1.In this case, the local nature of the LEL estimator may not be a serious issue, in the sense that at time k + 1 one may estimate θk+1 through the proposed LEL method using θk as the initialization value.Notice that the standard implementation of many alternative robust methods as RANSAC, the Hough transform, LMS or M-estimators would most probably be much more demanding from a computational point of view.Ongoing work is in to test the use of the LEL approach with experimental data, in particular for the identification of planes from 3D data acquired by a robot using a range camera (Figure (10)).
Z N , namely θT = T (Z N ).Denote with Z c N the set obtained by replacing c data points in Z N with arbitrary values and with b(c, T, N ) = sup Z c n T (Z N ) − T (Z c N ) ( • denotes an arbitrary norm in R m ) the maximum bias that can be induced in the estimate by such contamination of the data set.The finite sample breakdown point bdp(T, N ) of T over N is defined as bdp(T, N ) = min c c N : b(c, T, N ) −→ +∞

Figure 1 .
Figure1.Linearly distributed data y = x with zero mean unit variance noise (100 round dots) plus 25 outliers normally distributed around the point(5,10).Refer to the text for details.

Figure 2 .
Figure 2. LMS, LS and LEL costs as function of the line slope.Refer to the text for details.

Figure 4 .
Figure 4. Case 2 corrupted data (red dots) and original data (solid blue line).

Figure 5 .
Figure 5. Case 1 sorted LS and LEL fitting errors in absolute value.

Figure 9 .
Figure 9. LS and LEL residuals of the Hawkins -Bradu -Kass data set fitting.The dashed lines indicate the ±2.5 values.

Figure 10 .
Figure 10.3D range camera image of an office with LEL based detected (in red) wall.

Table 3 .
LEL estimates: Case 2 (refer to text for details).

Table 4 .
[1]ian of fitting errors: Case 2.Last 1000 sorted LS (dashed line) and LEL (solid line) absolute fitting errors (i.e., residuals). ) from the University of Cologne Statistical Resources http://www.unikoeln.de/themen/statistik/index.e.html (follow the links DATA and then Cologne Data Sets).The first 10 values of this artificially generated data set correspond to bad leverage points, i.e. outliers that can significantly affect the LS estimate (refer to[1]for more details).Points 11, 12, 13 and 14 are outliers in x i , namely they lay far from the bulk of the rest of the data in x i space, but their y values agree with the model.A LEL estimate of θ is computed by minimizing the corresponding H function from three different initialization values computed as in equations (50-51) using the Least Squares estimate θLS as a θ0 .The three so computed LEL estimates are labelled as A, B and C. Their values are listed in [1]ng y ∈ R 75×1 , θ ∈ R 3×1 and G = [x 1 x 2 x 3 ] ∈ R 75×3 .The y, x 1 , x 2 and x 3 values are given by the Hawkins -Bradu -Kass data set[1](Chapter 3, section 3) available in electronic format (together with Figure 8.