Improving the Efficiency of Robust Estimators for the Generalized Linear Model

Abstract: The distance constrained maximum likelihood procedure (DCML) optimally combines a robust estimator with the maximum likelihood estimator with the purpose of improving its small sample efficiency while preserving a good robustness level. It has been published for the linear model and is now extended to the GLM. Monte Carlo experiments are used to explore the performance of this extension in the Poisson regression case. Several published robust candidates for the DCML are compared; the modified conditional maximum likelihood estimator starting with a very robust minimum density power divergence estimator is selected as the best candidate. It is shown empirically that the DCML remarkably improves its small sample efficiency without loss of robustness. An example using real hospital length of stay data fitted by the negative binomial regression model is discussed.


Introduction
For a long time, early robust parametric procedures were unable to combine a high level of outlier resistance with a high level of efficiency under the model. Ref. [1] suggested one of the first tools to join robustness and efficiency: trimming outliers identified with the aid of least absolute deviations and then computing the least squares estimator from the remaining data. For the linear model, several proposals with the same purpose followed. An adaptive combination of the least squares and the least absolute deviations estimators was introduced by [2]. Their approach could be adapted in order to achieve the minimum asymptotic variance over a scale family of densities. Ref. [3] proposed MM estimators, which have a 50% breakdown point and asymptotic efficiency as close to one as desired. Ref. [4] proposed τ estimators, which combine the same two properties as MM estimators. Ref. [5] introduced adaptive cutoff values based on an initial highly robust and possibly inefficient estimator (such as an S estimator) to define a region Z such that observations outside Z are considered as outliers. The maximum likelihood (ML) estimator is then computed after removal of the outliers. Since, under the model, Z tends to the whole space as the sample size increases, the final estimator is asymptotically fully efficient. A similar idea was used by [6] to compute a highly robust and fully efficient truncated ML estimator for regression with asymmetric errors. Ref. [7] used the adaptive cutoff values to define adaptive weights for a least weighted squares estimator, which is asymptotically efficient and highly robust.
Several robust methods for the generalized linear model (GLM) have also been proposed. Ref. [8] made use of diagnostic tools based on ML residuals. These tools may, however, be affected by the well-known "masking effect" of ML residuals described in [9] (Section 4.3). Other authors introduced different types of M estimators ([10][11][12][13][14][15], among others), where the trade-off between robustness and efficiency is controlled by means of one or more tuning constants. A fully efficient and highly robust estimator was proposed in [16] using a conditional ML principle, after removal of outliers not belonging to an adaptive region which tends to the whole space.
Unfortunately, when n is not very large, the finite sample efficiency of most estimators, including the fully efficient ones, may be much smaller than the asymptotic one. To overcome this shortcoming, ref. [17] proposed a regression estimator with the maximum breakdown point and high finite sample efficiency. However, the method suffers from a serious loss of robustness. Ref. [18] introduced a general method to improve the finite sample efficiency of a robust estimator. Details were given only for the cases of the linear model and of the multivariate location and scatter of a random vector. The method assumes that an initial robust estimator, not necessarily with high finite sample efficiency, is available. The improved estimator, called the distance constrained maximum likelihood (DCML) estimator, then maximizes the likelihood function subject to the estimate being sufficiently close to the initial one. As the distance, the Kullback-Leibler divergence from the maximum likelihood estimated model was chosen. For the linear model, it turns out that the final estimator is a convex combination of the robust and the maximum likelihood estimator. In principle, the procedure may be applied to any parametric model; however, for the GLM, the solution of the constrained optimization problem is more involved.
In this paper, we consider a simple modification of the DCML principle for the GLM. The modification retains only the most crucial elements of the original proposal: it directly defines the DCML as a convex combination of the robust and the ML estimator and uses a quadratic distance between the coefficients. We explore the performance of the DCML method by means of Monte Carlo experiments, using some well-known robust estimators, for which public domain R software is available, as initial estimators. To limit our presentation, we focus on the Poisson model. We first compare the efficiency and robustness of the initial estimators in a simple regression case. We then select the best starting estimators and assess the performance of the DCML in critical multiple regression scenarios for p = 5, 10, 20 and n = 5p, when the covariates follow a spherical normal distribution or a spherical t distribution with four degrees of freedom. We finally describe a bootstrap experiment with real hospital length-of-stay data fitted by a negative binomial regression model.
For short definitions of the CUBIF, RQL, and MT estimators and their tuning parameters, we refer to [9]. CUBIF is computed with the help of the R package "robcbi"; a function to compute the RQL estimators is provided in "robustbase", and functions to compute the MT estimators can be found in the R package "poissonMT". There is no public domain software to compute MD estimators, but the loss function defined in [14] can easily be minimized with the help of the R function "optim". However, since the loss function of MD estimators may have multiple local minima, we use a specialized subsampling algorithm to find the absolute minimum with a high probability.
Particular attention is given to the MCML estimator. The MCML starts with the computation of a highly robust, but possibly inefficient initial estimator. Using randomized quantile residuals [19] based on this estimator, outliers are detected with the help of adaptive cutoff values as in [5,6]. Finally, a conditional ML estimator is computed, where observations are constrained to belong to a region free of detected outliers. Since in the absence of outliers, this region tends to the whole space when the sample size increases, the asymptotic efficiency of the MCML estimator is 100%. For moderate sample sizes, the MCML markedly improves the efficiency of the initial estimator; unfortunately, for small sample sizes, the MCML efficiency is disappointingly low; nevertheless, the MCML maintains a similar degree of robustness as the initial estimator. It turns out, however, that the MCML is a good candidate for the DCML method. Computational tools to compute the MCML estimator are made available in the R package "robustnegbin". In the original paper, as well as in the package, the initial step is based on a combination of the maximum rank correlation estimator [20] and CUBIF. Here, we propose a simpler version, where the initial step is replaced by an MD estimator.
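The randomized quantile residuals [19] on which the MCML outlier detection is based can be computed for a Poisson response with standard-library tools alone. The following is a minimal sketch of the Dunn-Smyth construction (function names are ours, not from the paper's software):

```python
import math
import random
from statistics import NormalDist

def poisson_cdf(y, mu):
    """P(Y <= y) for Y ~ Poisson(mu), by direct summation of the pmf."""
    if y < 0:
        return 0.0
    term = total = math.exp(-mu)
    for k in range(1, int(y) + 1):
        term *= mu / k
        total += term
    return min(total, 1.0)

def randomized_quantile_residual(y, mu, rng=random):
    """Dunn-Smyth residual: draw u ~ U(F(y-1), F(y)), return Phi^{-1}(u).
    Under the model, these residuals are approximately standard normal."""
    lo, hi = poisson_cdf(y - 1, mu), poisson_cdf(y, mu)
    u = rng.uniform(lo, hi)
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # guard against 0/1
    return NormalDist().inv_cdf(u)
```

Because the residual is randomized, repeated calls for the same (y, µ) give slightly different values; this is what makes the residuals exactly continuous for a discrete response.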

The DCML for GLM
We consider a GLM relating a response y to a p-component explanatory covariate x. The conditional expected response E(y|x) = µ_x depends on a linear predictor β_0^T x through a link function g, i.e.,

β_0^T x = g(µ_x),   (1)

where β_0 is a p-component vector of coefficients. In addition, σ is a dispersion parameter. In this paper, we focus on the Poisson regression model with the log-link, i.e., β_0^T x = log(µ_x) and σ = 1, which may be extended to the negative binomial (NB) model, where σ is not fixed at 1. We also write µ_x = h(β_0^T x), where h is the inverse of g; in the Poisson case, µ_x = exp(β_0^T x). If an intercept term is present, we sometimes use the notations x^T = (1, x*^T) and β_0^T = (α, β_0*^T), where α is the intercept. Suppose that β̂_K is a robust consistent estimator of β_0, not necessarily efficient, and that β̂_H is the ML estimator of β_0. We measure the distance between β̂_K and β̂_H by

∆(β̂_K, β̂_H) = n (β̂_K − β̂_H)^T Ĉ^{-1} (β̂_K − β̂_H),   (2)

where Ĉ is a robust estimate of the covariance matrix of β̂_K − β̂_H. We define the DCML estimator β̂_D as the convex combination

β̂_D = t_0 β̂_H + (1 − t_0) β̂_K,

such that t_0 minimizes ∆(t β̂_H + (1 − t) β̂_K, β̂_H) under the constraint

∆(t β̂_H + (1 − t) β̂_K, β̂_K) ≤ δ,   (3)

where δ is an adequately chosen constant. Since ∆(t β̂_H + (1 − t) β̂_K, β̂_K) = t² ∆(β̂_K, β̂_H), it immediately follows that the optimal t is

t_0 = min{1, (δ/∆(β̂_K, β̂_H))^{1/2}},

as in Formula (8) in [18]. Under the model and for large n, the distribution of ∆(β̂_K, β̂_H) is approximately χ²_p. Therefore, in order to control the efficiency of β̂_D, it seems reasonable to take δ as a quantile of the χ²_p distribution such that the efficiency attains a satisfactory level. Monte Carlo simulation shows that a large quantile is usually required because the finite sample distribution of ∆(β̂_K, β̂_H) may be much more skewed than χ²_p. We compare a few candidate estimators β̂_K for which R software is easily available. These candidates belong to the M estimator family (see, e.g., [9]). Asymptotically, an M estimator β̂_K of regression is a functional of the cdf F of (x, y) satisfying an estimating equation

E_F[ψ(x, y, β)] = 0,

where ψ(x, y, β) is a given function that may depend on the model parameters.
The influence function of β̂_K is

IF_K(x, y, F) = M_K^{-1} ψ(x, y, β_0),

where

M_K = −E_F[ψ'(x, y, β_0)]

and ψ'(x, y, β) = (∂/∂β^T) ψ(x, y, β). The asymptotic covariance matrix of β̂_K is given by

V_K = M_K^{-1} Q_K M_K^{-T},

where

Q_K = E_F[ψ(x, y, β_0) ψ(x, y, β_0)^T].

The ψ functions and the influence functions of the candidate estimators of Poisson regression are provided in Appendix A. For the ML estimator, we have

ψ_H(x, y, β) = (y − µ_x) x,   M_H = Q_H = E_F(µ_x x x^T),   V_H = M_H^{-1}.

The covariance matrix C of β̂_K − β̂_H is obtained with the help of the influence functions IF_H(x, y, F) and IF_K(x, y, F):

C = V_K + V_H − K_HK − K_HK^T,   (10)

where

K_HK = E_F[IF_H(x, y, F) IF_K(x, y, F)^T].

The distribution F is of the form F(y|x)G(x). To estimate the cdf F(y|x), we use the cdf F_{µ̂_x,σ̂_K}(y|x), where µ̂_x = h(β̂_K^T x) and σ̂_K is a robust estimate of σ (σ̂_K = 1 in the Poisson case). Using the empirical cdf G_n(x) to estimate G, we obtain

M̂_K = (1/n) Σ_i E_{F_{µ̂_{x_i},σ̂_K}}[ψ'(x_i, y, β̂_K)],

etc. In general, these estimates are not robust. Therefore, as often proposed, we remove all the observations such that Md(x*_i)² > q_{p−1,0.95}, where Md(x*_i) is the Mahalanobis distance of the point x*_i from the center of the covariate distribution based on a very robust (with a maximum 50% breakdown point) covariance estimate, such as the minimum volume ellipsoid [21] enclosing 80% of the sample, and q_{p−1,0.95} is the 95% quantile of a χ²_{p−1} distribution. In the simple regression case, we remove observations such that |x_i − median(x)| > 2 × MAD(x). We then obtain an estimate Ĉ of C by replacing M_K, V_K, M_H, V_H, and K_HK by their robust estimates in (10). Note that, in order to obtain a positive definite Ĉ, the same robust estimates β̂_K and µ̂_x = h(β̂_K^T x) must be used in the evaluation of M̂_K and V̂_K, as well as of M̂_H, V̂_H, and K̂_HK. It is easy to show that, for any fixed weight t, the total variance of β̂_D is smaller than the total variance of β̂_K; unfortunately, β̂_D is not robust in this case. Nevertheless, under the model, ∆(β̂_K, β̂_H) is very often smaller than δ, so that t_0 equals one; therefore, we may expect β̂_D to be more efficient than β̂_K. This is confirmed by the simulations in Section 6.
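The simple-regression leverage screen, which keeps only points within two MADs of the median, can be sketched as follows (a minimal numpy version; note that scaling conventions for the MAD differ across software, and we use the raw median absolute deviation here):

```python
import numpy as np

def keep_nonleverage(x):
    """Return a boolean mask keeping points with |x - median| <= 2 * MAD;
    points outside are treated as leverage points and dropped."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return np.abs(x - med) <= 2.0 * mad
```

For the multiple regression case, the same idea would use robust Mahalanobis distances in place of the coordinatewise rule.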
The constraint (3) makes the optimal t_0 a data dependent weight. Since x-outliers are removed, the eigenvalues of Ĉ^{-1} are bounded away from zero. Let λ_0 > 0 be the smallest eigenvalue. Then, ∆(β̂_D, β̂_K) ≤ δ implies n λ_0 ||β̂_D − β̂_K||² ≤ δ, so that β̂_D cannot be unbounded when β̂_K is bounded; in other words, β̂_D has the same breakdown point as β̂_K. A formal proof for the linear model case when β̂_K is an MM estimator was provided by [18]. The empirical results in Section 6 show that, for the best candidates β̂_K, the DCML β̂_D is highly resistant to heavy point contamination.
A method to compute an approximate distribution of the DCML estimator that can also be used in the GLM case was proposed in [18]. We note that the DCML estimator is a nonlinear function of β̂_H and β̂_K, and its distribution is not necessarily normal.
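The DCML combination itself is a few lines of linear algebra. The following numpy sketch implements the rule described above, with the weight t_0 = min{1, (δ/∆)^{1/2}} taken from Formula (8) in [18]; in practice, δ would be a large χ²_p quantile and Ĉ the robustly estimated covariance of β̂_K − β̂_H:

```python
import numpy as np

def dcml(beta_K, beta_H, C_hat, delta, n):
    """Simplified DCML: convex combination of a robust estimate beta_K
    and the ML estimate beta_H, constrained to stay within quadratic
    distance delta of beta_K (a sketch of the paper's rule)."""
    d = beta_K - beta_H
    Delta = n * d @ np.linalg.solve(C_hat, d)  # distance (2)
    t0 = 1.0 if Delta <= delta else float(np.sqrt(delta / Delta))
    return t0 * beta_H + (1.0 - t0) * beta_K, t0
```

When ∆ ≤ δ, the weight is one and the DCML coincides with the ML estimator, which is what produces the efficiency gain under the model.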

Some Invariance Properties
Unlike the linear model, Poisson regression is not regression invariant, and the Monte Carlo results depend on the chosen coefficients as well as on the distribution of the predictors. Unfortunately, this makes it impossible to draw very general conclusions. We restrict our attention to spherical and elliptical distributions of x. For these models, the following results may be used to extend the conclusions, to a certain extent, beyond those provided by the reported experiments.
As usual, we measure the error of β̂_K by the mean squared estimation error

MSEE(β̂_K) = E(||β̂_K − β_0||²)

or by the mean squared prediction error

MSPE(β̂_K) = E((β̂_K − β_0)^T E(x x^T) (β̂_K − β_0)).

The efficiency of β̂_K with respect to the ML estimator β̂_H is measured by

Eff(β̂_K) = MSEE(β̂_H)/MSEE(β̂_K).

We consider two models with no intercept and slope vectors β_1, β_2. We denote by F_kx the cdf of (x, y) when β_0 is replaced by β_k (k = 1, 2) in (1). Let T(F) be the functional form of a consistent estimator and let V(F_kx) be its asymptotic covariance under F_kx.

Theorem 1.
Suppose that x has a spherical distribution with center 0 and that β_1 and β_2 are two slope vectors with the same Euclidean norm. Then, trace(V(F_1x)) = trace(V(F_2x)).
Proof. Let P be the orthogonal matrix such that β_2 = Pβ_1 and let z = P^T x. Since x is spherically distributed, z has the same distribution as x; moreover, β_2^T x = β_1^T z, so that (z, y) has cdf F_1x when (x, y) has cdf F_2x. By equivariance, T(F_2x) = P T(F_1x) and V(F_2x) = P V(F_1x) P^T; since the trace is invariant under orthogonal transformations, trace(V(F_2x)) = trace(V(F_1x)). If an intercept term is present, a similar argument applies to a spherically distributed x* and the slope vector β*. It follows that, if x* has a spherical distribution, the asymptotic efficiency just depends on the intercept and on the norm of the slope vector. This allows us to restrict attention to coefficient vectors of the form β^T = (α, β*^T) for a varying intercept α and β*^T = (γ, 0, . . . , 0) for a varying γ. In the Monte Carlo simulations of Section 6, we choose values of α and γ that make the empirical efficiency of the estimators under study very low. We explore the efficiency improvement provided by the DCML procedure in such unfavorable cases.
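The orthogonal matrix P used in the proof can be constructed explicitly, for example as a Householder reflection, and the trace invariance checked numerically. A small numpy illustration (the helper name is ours):

```python
import numpy as np

def orthogonal_map(b1, b2):
    """Orthogonal matrix P with P @ b1 = b2, for vectors of equal norm,
    built as the Householder reflection through (b1 - b2)."""
    b1, b2 = np.asarray(b1, float), np.asarray(b2, float)
    v = b1 - b2
    if np.allclose(v, 0.0):
        return np.eye(len(b1))
    v = v / np.linalg.norm(v)
    return np.eye(len(b1)) - 2.0 * np.outer(v, v)
```

For any covariance matrix V, trace(P V Pᵀ) = trace(V), which is the step that makes the MSEE of equal-norm slope vectors identical under a spherical covariate distribution.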
The case where x* is not centered at 0 can be reduced to the centered case by changing the intercept term.
In the case where x has an elliptical distribution such that x = Az, where A is a regular p × p matrix and z has a spherical distribution with E(z) = 0 and E(zz^T) = I_p, we define the transformed coefficient vector β̃ = A^T β, so that β^T x = β̃^T z. Thus, using the MSPE, the elliptical case can be reduced to the spherically symmetric case.

Monte Carlo Scenarios for Poisson Regression
We examine the following specific candidate estimators of Poisson regression: CUBIF11, CUBIF18, RQL01, RQL14, MTI, MT23, MD04, MD07, MD10, MCML04, MCML07, and MCML10. We consider both the case where leverage points are not removed and the case where they are removed. If necessary, to avoid confusion, a suffix "r" is added to the estimator name in the second case (e.g., MD04r, MD07r, etc.). The estimators CUBIF11, RQL01, MTI, and MD10 are (close to) the most robust, but inefficient, representatives of their families. In principle, they can be used as starting values to compute two-stage estimators such as the MCML estimators. The estimators CUBIF18, RQL14, MT23, MD04, MD07, MCML04, MCML07, and MCML10 are considered as robust and efficient estimators.
Any estimator in this list can act as a β̂_K candidate in a DCML procedure. However, to illustrate the performance of the DCML, we first reduce the number of candidates on the grounds of their efficiency and robustness levels, evaluated with the help of a quick Monte Carlo experiment. For this purpose, we use a simple Poisson regression model

log(µ_x) = α + γx,   (11)

where x is a scalar covariate. In a second set of experiments, we explore the DCML procedure starting at the selected candidates. We use Poisson regression models with p coefficients

log(µ_x) = α + β*^T x*,   β*^T = (γ, 0, . . . , 0),   (12)

where x* is a (p − 1)-dimensional covariate vector. The intercept α is considered as a nuisance parameter. Simulations are performed for α = 1, two values of γ (γ = 0.25 and 0.50), three values of p (p = 5, 10, 20), and two distributions of x*: x* ∼ N(0, I_{p−1}) and x* ∼ t_4(0, I_{p−1}).
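The data generation for these scenarios is straightforward to sketch; the following numpy version draws from the multiple regression setting with intercept α and slope (γ, 0, ..., 0) (the function name and interface are illustrative, not from the paper's software):

```python
import numpy as np

def simulate_model12(n, p, alpha=1.0, gamma=0.5, dist="normal", rng=None):
    """Draw (X, y) with log-mean alpha + gamma * x1; covariates are
    spherical normal or spherical t with 4 degrees of freedom."""
    rng = np.random.default_rng(rng)
    if dist == "normal":
        X = rng.standard_normal((n, p - 1))
    else:  # spherical t4 covariates
        X = rng.standard_t(4, size=(n, p - 1))
    mu = np.exp(alpha + gamma * X[:, 0])
    y = rng.poisson(mu)
    return X, y
```

The heavy-tailed t_4 design is what produces the occasional extreme values of exp(β̂^T x) mentioned below, so a practical implementation would also guard against overflow in the mean.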
To measure the quality of an estimator β̂, we use the empirical MSEE

MSEE_n = ave_i ||β̂*_i − β*_0||²,   (13)

where β̂*_i is the estimate of β*_0 based on the ith replication. It turns out that, for most robust estimators, a very small fraction of the errors ||β̂*_i − β*_0||² is extremely large. This is a well-known feature of robust estimators and, as [22] noted, "despite its seductive simplicity, the ordinary variance is not an adequate measure of performance of a robust estimate: it is much too sensitive to the irrelevant extreme tail behavior of the estimate". We therefore use the upper 1% trimmed mean (in place of the usual mean) to evaluate the MSEE (13).
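The upper-trimmed MSEE can be computed in a couple of lines; a minimal numpy sketch (helper name is ours):

```python
import numpy as np

def trimmed_msee(estimates, beta0, trim=0.01):
    """Empirical MSEE with the largest `trim` fraction of squared errors
    removed (the upper trimmed mean used in the simulations);
    `estimates` holds one replication per row."""
    sq = np.sum((np.asarray(estimates) - beta0) ** 2, axis=1)
    k = int(np.floor(len(sq) * (1.0 - trim)))
    return float(np.mean(np.sort(sq)[:k]))
```

With 1000 replications and trim = 0.01, the ten largest squared errors are discarded before averaging.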
The following "simple" DCML estimators are compared:

Simulations for the Nominal Simple Regression Model
In a first experiment, we drew 1000 samples of size 100 from Model (11). To compare the candidate estimators, we first computed the finite sample efficiencies Eff_1n and the asymptotic efficiencies reported in Table 1. The results in the first column were obtained using the complete (unweighted) data. To compute the second column, leverage points such that |x_i − median(x)| > 2 × MAD(x) were removed, except for the MCML estimators. We observe that the efficiencies of CUBIF, RQL, MT, and MD depend, as expected, on the choice of the tuning parameters; with the full data, they are slightly lower than their asymptotic values and markedly lower when leverage points are removed. By definition, the MCML estimators remove data points only when they are discordant with the fitted model and hence are asymptotically fully efficient. In our experiment with n = 100, their finite sample efficiencies are almost independent of the starting estimator MD04, MD07, or MD10. Further experiments reported below show, however, that for smaller n, the efficiency of the MCML is directly related to the efficiency of the starting estimator. In any case, the MCML efficiencies are disappointingly lower than their asymptotic value of one.

Simulations for the Contaminated Simple Regression Model
In a second experiment, samples from Model (11) were corrupted with point contaminations of the form ((1, x out ) T , y out ) where x out varied in the set {−3, −2, −1, 0, 1, 2, 3, 4, 5} and y out varied in the set {0, 1, 2, 10, 20, 30, 40, 60}. This kind of contamination is generally the least favorable one and allows the evaluation of the maximal bias an estimator can incur. We generated 100 samples of size 90; then, for each pair (x out , y out ), we added 10 observations with identical outliers of the form ((1, x out ) T , y out ); thus, the contamination level was 10%.
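The contamination scheme above can be sketched as follows (a minimal numpy version for the simple regression case; the helper name is ours):

```python
import numpy as np

def contaminate(x, y, x_out, y_out, m):
    """Append m identical outliers (x_out, y_out) to a clean sample,
    as in the 10% point-contamination experiment."""
    xc = np.concatenate([x, np.full(m, float(x_out))])
    yc = np.concatenate([y, np.full(m, float(y_out))])
    return xc, yc
```

Placing all m outliers at a single point is the least favorable configuration, which is why it is used to trace out the maximal bias over the grid of (x_out, y_out) values.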
We first compared the four least efficient estimators: CUBIF11r, RQL01r, MTIr, and MD10r. The MSEEs of these estimators are displayed in Figure 1 as functions of y_out. For x_out = −3, 3, 4, 5, the MSEEs of all estimators are very small because leverage points are removed; the graphs for x_out ≥ 3 are therefore not shown. Important differences are observed for x_out between −2 and 2, and MD10r clearly provides the overall most robust behavior. For this reason, MD10r is considered a good starting value for the MCML estimator MCML10. Then, we compared the more efficient estimators RQL14, MT23, MD04, and MCML10. As Figure 2 shows, MCML10 generally provides the lowest MSEE. We conclude that MCML10 has a good robustness level and full asymptotic efficiency, but a poor finite sample efficiency. It is therefore a natural candidate to study the performance of the DCML procedure.

Simulations for the Nominal Multiple Regression Model
To compare the DCML estimators, we first used 1000 samples of size n = 5p from Model (12) for p = 5, 10, 20 and computed the empirical efficiencies Eff_1n. It turns out that the efficiency of the MD estimators is a decreasing function of |γ|; however, its dependence on α is quite weak (e.g., Figure 3). In addition, experiments with x* ∼ t_4(0, I_{p−1}) and γ a bit larger than 0.5 or α > 1 gave rise to computational crashes due to the occurrence of extremely high values of exp(β̂_K^T x). This suggests that α = 1, γ = 0.5, x* ∼ t_4(0, I_{p−1}) is already a quite unfavorable scenario. We report the results for this case in Table 2. Results for the more favorable choice α = 1, γ = 0.25, x* ∼ t_4(0, I_{p−1}) are shown in Table 3.
As expected, higher efficiencies were obtained with the same values of γ and α but with x* ∼ N(0, I_{p−1}); they are not reported. The DCML procedure was computed for several values of δ, but only the results for δ = χ²_{p,0} = 0, δ = χ²_{p,0.5}, δ = χ²_{p,0.95}, and δ = χ²_{p,0.99} are shown. Note that the results corresponding to δ = 0 are the efficiencies of the starting estimators. We observe that, for p = 5 (n = 25) and p = 10 (n = 50), MD07r and MD10r are extremely inefficient. MCML07 and MCML10 improve them somewhat, but are still very inefficient; they exceed a 90% efficiency level only for n = 100. Good efficiency levels for all values of p (and n) are attained when the DCML procedure is started with MCML+ and MCML*, i.e., with MCML++ and MCML*+.

Bootstrapping Real Data and NB Fits
We consider 92 hospital stays belonging to the Major Diagnostic Category "polytrauma" of the SwissDRG system. These data were described in [16], where the relationship between "length of stay" (y, in days) and four covariates, "number of diagnoses" (x_1), "number of treatments" (x_2), "age of the patient" (x_3, in years), and "sex of the patient" (x_4 = 0 for males and x_4 = 1 for females), was analyzed. Overdispersion is a usual feature of this kind of data, and the NB regression model has often been proposed to take it into account. In [16], an NB model with predictor exp(β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3 + β_4 x_4 + β_5 x_3 x_4) was estimated using the MCML procedure. The contamination fraction based on the MCML procedure was estimated at about 10%.
In a first experiment, we drew 1000 samples of size 92 with replacement, and for each sample, we estimated the six coefficients and the dispersion parameter by means of the ML, MD10r, MCML10, and MCML10 + D estimators of NB regression. We then computed the bootstrap coefficient means and standard errors, after 1% trimming of the most extreme cases. In addition, we also computed the (1% trimmed) total variance of the estimates. The results are reported in Table 5. We observe that the coefficient means are a bit lower than those reported in [16]; the most important difference between ML and MCML10 is the estimated intercept. However, the standard errors are somewhat larger than those based on asymptotic approximations. The standard errors and total variance of MD10r are the largest; those of MCML10 and ML are smaller and very similar. As expected, the DCML procedure (with δ = χ²_{p,0.99}) makes MCML10 closer to ML; however, it does not improve the variance estimates and does not seem to be very valuable in this case. In a second experiment, we drew 1000 bootstrap samples of size 50 taken from a subset of the original data. Unfortunately, our estimation procedures for the six parameter model experienced several numerical malfunctions because many subsamples had ill-conditioned model matrices. We therefore used the four parameter predictor exp(β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3). The results are shown in Table 6. We observe that the robust intercept means are still lower than the ML means. We also notice the awfully large standard errors of the MCML10 coefficient estimates due to their extreme tail behavior shown in the boxplots of Figure 7. Nevertheless, the DCML succeeds in getting rid of these tails and substantially reduces the standard errors and total variance.

Table 6. Bootstrap coefficient means and standard errors of the ML, MD10r, MCML10, and MCML10 + D estimators of the NB regression for the length of stay data (n = 50); var is the total variance; rej is the number of rejected samples.

Remark: In our bootstrap experiments, several numerical breakdowns occurred. Both the function "glm.nb" (from the R package "MASS") and our subsampling algorithm to compute MD10r stopped working in the presence of ill-conditioned model matrices. Other computations failed due to the occurrence of extreme predicted values, and remedies had to be implemented. For n = 50, a subsampling algorithm based on the Poisson model was used to compute the initial coefficient estimates, followed by the solution of the MD10r equation for the NB dispersion parameter. As noted in Marazzi et al. (2019, Remark 5), although this procedure does not provide consistent initial coefficients under the NB model, the final MCML estimates are consistent. In addition, both for n = 50 and n = 92, samples associated with extreme predictions (larger than 25,000) were rejected to avoid program breakdowns. In the subsampling procedures, these samples were simply replaced with harmless ones, but we did not touch the "glm.nb" and "optim" functions. For these reasons, some of the results in Tables 5 and 6 are based on fewer than 1000 samples; the number of rejections is reported.
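The bootstrap-with-trimming summaries can be sketched as follows. This numpy version takes any user-supplied estimator `fit` (a hypothetical interface, standing in for the R routines used in the paper) and, mimicking the rejections described in the remark, simply skips replications where the fit fails:

```python
import numpy as np

def boot_summary(data, fit, B=200, trim=0.01, rng=None):
    """Bootstrap coefficient means and standard errors, trimming the
    `trim` fraction of replications farthest from the coordinatewise
    median (a sketch; `fit` returns a coefficient vector)."""
    rng = np.random.default_rng(rng)
    n, reps = len(data), []
    for _ in range(B):
        sample = data[rng.integers(0, n, n)]  # resample with replacement
        try:
            reps.append(np.asarray(fit(sample), float))
        except Exception:
            continue  # rejected sample (e.g., ill-conditioned model matrix)
    reps = np.array(reps)
    d = np.sum((reps - np.median(reps, axis=0)) ** 2, axis=1)
    kept = reps[d <= np.quantile(d, 1.0 - trim)]
    return kept.mean(axis=0), kept.std(axis=0, ddof=1)
```

The trimming criterion here (distance from the coordinatewise median) is one plausible reading of "trimming of the most extreme cases"; other orderings would give similar summaries.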

Conclusions
Robust estimators may be very inefficient when the sample size is not large; this shortcoming is also frequent for asymptotically fully efficient estimators. As a remedy, Ref. [18] introduced the DCML method in the linear model case. In this paper, we introduced a simple modification of the original DCML that can be used for the GLM. The new DCML is simply defined as the convex combination of the robust and the ML estimator that minimizes a quadratic distance from the ML estimator under the constraint that the distance from the robust estimator is smaller than a given bound δ.
With the help of Monte Carlo experiments, we explored the performance of the DCML in the particular case of Poisson regression. For this purpose, we introduced a simplified version of the fully efficient MCML estimator where a very robust MD estimator is used as the initial step. A small Monte Carlo comparison of several published competitors indicated that this MCML was a very good candidate for the DCML procedure. We also introduced two iterated versions of the DCML procedure.
Unfortunately, Poisson regression is not regression invariant. This makes it difficult to infer general conclusions from the Monte Carlo experiments, in particular to find a general rule to set the bound δ. It seems, however, that a large quantile of the χ²_p distribution (δ ≥ χ²_{p,0.95}) is a convenient choice. We restricted our experiments to the case of spherical covariate distributions, showing that, in this case, the efficiency essentially depends on the length of the slope vector. The Monte Carlo experiments were based on a very unfavorable design. We observed, however, that the small sample efficiency of our best starting estimator, MCML10, could be remarkably improved, without loss of robustness, with the help of the DCML method. It therefore seems reasonable to assume that the procedure works in practice, as illustrated by an example with hospital length of stay data fitted with the help of an NB regression model.

Data Availability Statement:
The data used in Section 6 are available in the R package "robustnegbin".

Appendix A

For an M estimator with scalar ψ-function ψ(x, y, β), Q_K = E_F[ψ(x, y, β)² x x^T].
The RQL estimator: The estimating equation of the RQL estimator with tuning parameter c is

E_F[(ϕ_c(r) − a(β)) µ_x^{1/2} x] = 0,   r = (y − µ_x)/µ_x^{1/2},

where ϕ_c is Huber's function and

a(β) = E(ϕ_c(r) | x).

Thus, the ψ-function of the RQL estimator is

ψ(x, y, β) = (ϕ_c(r) − a(β)) µ_x^{1/2} x.

The formulae to compute M_K and Q_K were given in [10]. The function "glmrob" in "robustbase" implements these formulae.
The MD estimator: The estimating equation of the MD estimator with tuning parameter τ is

E_F[ψ(x, y, β)] = 0,   ψ(x, y, β) = ( f_{µ_x}(y)^τ (y − µ_x) − c_τ(µ_x) ) x,

where f_µ denotes the Poisson density with mean µ and c_τ(µ) = Σ_y f_µ(y)^{1+τ}(y − µ) ensures Fisher consistency. The corresponding M_K and Q_K are obtained by substituting this ψ-function into the general formulae above. The MCML estimator: The MCML estimator is based on the conditional density of y given x and z ∈ [a, b], where z ∼ U[0, 1]. This distribution can be written as

f_{µ_x}(y | z ∈ [a, b]) = f_{µ_x}(y) w_{a,b}(y, µ_x)/W_{a,b}(µ_x),

where W_{a,b}(µ_x) is defined in the following. For any c, let

y*_x(c) = max{y : F_{µ_x}(y) ≤ c}

and

t_{c,x} = (F_{µ_x}(y*_x(c) + 1) − c)/ f_{µ_x}(y*_x(c) + 1),

where F_µ denotes the cdf of f_µ. The conditional probability of each y is proportional to the length of the overlap of [F_{µ_x}(y − 1), F_{µ_x}(y)] with [a, b], with the partial overlaps at the boundary values determined by t_{a,x} and t_{b,x}, and W_{a,b}(µ_x) is the corresponding normalizing constant. We approximately compute the influence function assuming that a and b are fixed. The estimating equation is the conditional score equation based on this density, and the influence function follows from the general formulae, where ψ_µ(y, µ) = (∂/∂µ)ψ(y, µ).