Robust Statistical Inference in Generalized Linear Models Based on Minimum Renyi’s Pseudodistance Estimators

Minimum Renyi’s pseudodistance estimators (MRPEs) enjoy good robustness properties without a significant loss of efficiency in general statistical models and, in particular, in linear regression models (LRMs). In this line, Castilla et al. considered robust Wald-type test statistics in LRMs based on these MRPEs. In this paper, we extend the theory of MRPEs to generalized linear models (GLMs) with independent but nonidentically distributed observations (INIDO). We derive asymptotic properties of the proposed estimators and analyze their influence function to assess their robustness properties. Additionally, we define robust Wald-type test statistics for testing linear hypotheses and theoretically study their asymptotic distribution, as well as their influence function. The performance of the proposed MRPEs and Wald-type test statistics is empirically examined for the Poisson regression model through a simulation study, focusing on their robustness properties. We finally apply the proposed methods to a real dataset related to the treatment of epilepsy, illustrating the superior performance of the robust MRPEs as well as of the Wald-type tests.


Introduction
Generalized linear models (GLMs) were first introduced by Nelder and Wedderburn [1] and later expanded upon by McCullagh and Nelder [2]. GLMs are a natural extension of the standard linear regression model that accommodates a wide variety of response distributions, including count, binary, and positive-valued responses. Let Y_1, . . . , Y_n be independent response variables. The classical GLM assumes that the density function of each random variable Y_i belongs to the exponential family, having the form

f_i(y, θ_i, φ) = exp{ (y θ_i − b(θ_i)) / a(φ) + c(y, φ) },  i = 1, . . . , n,   (1)

where the functions a(φ), b(θ_i), and c(y, φ) are known. Therefore, the observations are independent but not identically distributed, depending on a location parameter θ_i, i = 1, . . . , n, and a nuisance parameter φ. Further, we denote by μ_i the expectation of the random variable Y_i, and we assume that there exists a monotone differentiable function g, the so-called link function, verifying g(μ_i) = x_i^T β, with β = (β_1, . . . , β_k)^T ∈ R^k (k < n) the regression parameter vector. The k × 1 vector of explanatory variables, x_i, is assumed to be nonrandom, i.e., the design matrix is fixed. Correspondingly, the location parameter depends on the explanatory variables, θ_i = θ(x_i^T β). For each of the response variables Y_i, the RP between the theoretical density function belonging to the exponential family, f_i(y, γ), and the true density underlying the data, g_i, can be defined, for α > 0, as

R_α(f_i(·, γ), g_i) = (1/(α+1)) log ∫ f_i(y, γ)^{α+1} dy − (1/α) log ∫ f_i(y, γ)^α g_i(y) dy + k,   (2)

where k = (1/(α(α+1))) log ∫ g_i(y)^{α+1} dy does not depend on γ = (β^T, φ)^T. We consider (y_1, . . . , y_n) a sample of independent but nonhomogeneous observations of the response variables with fixed predictors (x_1, . . . , x_n). Since only one observation of each variable Y_i is available, a natural estimate of its true density g_i is the distribution degenerate at the observation y_i. Consequently, in the following we denote by g_i the density function of the variable degenerate at the point y_i.
Then, substituting the theoretical and empirical densities in (2) yields the loss

R_α(f_i(·, γ), g_i) = (1/(α+1)) log ∫ f_i(y, γ)^{α+1} dy − log f_i(y_i, γ) + k.

If we consider the limit as α tends to zero, we obtain −log f_i(y_i, γ) + k. This last expression coincides with the Kullback-Leibler divergence, except for the constant k.
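The limiting step can be written out explicitly. The following short derivation uses the degenerate-density substitution described in the text; the notation is a reconstruction consistent with the surrounding definitions, not a verbatim copy of the paper's numbered equations:

```latex
% With g_i degenerate at y_i, \int f_i(y,\gamma)^\alpha g_i(y)\,dy = f_i(y_i,\gamma)^\alpha, so
R_\alpha\bigl(f_i(\cdot,\gamma), g_i\bigr)
  = \frac{1}{\alpha+1}\log \int f_i(y,\gamma)^{\alpha+1}\,dy
    - \log f_i(y_i,\gamma) + k .
% As \alpha \to 0, the first term tends to \log \int f_i(y,\gamma)\,dy = \log 1 = 0, hence
\lim_{\alpha\to 0} R_\alpha\bigl(f_i(\cdot,\gamma), g_i\bigr)
  = -\log f_i(y_i,\gamma) + k ,
% i.e., the negative log-likelihood contribution of y_i, which matches the
% Kullback--Leibler divergence up to the constant k.
```

This makes explicit why minimizing the RP loss at α = 0 recovers maximum likelihood estimation.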
More details about the Kullback-Leibler divergence can be found in Pardo [28]. For the sake of simplicity, let us denote L_i^α(γ) = ∫ f_i(y, γ)^{α+1} dy; the expression (3) can then be rewritten in terms of L_i^α(γ) and V_i(Y_i, γ). Based on the previous idea, we define an objective function averaging the RPs over all observations. Since minimizing R_α(f_i(y, γ), g_i) in γ is equivalent to maximizing log V_i(Y_i, γ), we define a loss function averaging those quantities, given in (5). Based on (5), we define the MRPE of the unknown parameter γ, denoted γ̂_α, as the maximizer of this loss. At α = 0, the MRPE coincides with the MLE, and therefore the proposed family can be considered a natural extension of the classical MLE. Now, since the MRPE is defined as a maximum, it must annul the first derivatives of the loss function given in (5), yielding the estimating equations for the parameters β and φ. The partial derivatives appearing in the first equation can be simplified; see Ghosh and Basu [13] for more details. Using the simplified expressions, we can write the estimating equation for β and, subsequently, the estimating equation for φ. Under some regularity conditions, Castilla et al. [27] established the consistency and asymptotic normality of the MRPEs under the INIDO setup. Before stating the consistency and asymptotic distribution of the MRPEs for the GLM, let us introduce some useful notation: the quantities m_{ji}(γ) and l_{jli}(γ), defined for all j, l = 1, 2 and i = 1, . . . , n, whose explicit forms for the Poisson model are given in Appendix B.
Theorem 1. Let Y_1, . . . , Y_n be a random sample from the GLM defined in (1). The MRPE γ̂_α = (β̂_α^T, φ̂_α)^T is consistent, and its asymptotic distribution is normal, with asymptotic covariance matrix determined by the matrices Ψ_n and Ω_n, where X denotes the design matrix, I_k is the k-dimensional identity matrix, and Ψ_n and Ω_n are defined through the diagonal matrices D_j^* = diag(m_{ji}(γ))_{i=1,...,n}, j = 1, 2.
Proof. The consistency is proved for general statistical models in Castilla et al. [27] and the asymptotic distribution of the MRPEs for GLM is derived in Jaenada and Pardo [3].

Wald-Type Tests for GLMs
In this section, we define Wald-type tests for linear null hypotheses of the form

H_0: M^T γ = m,   (11)

where γ = (β^T, φ)^T, M is a (k + 1) × r full rank matrix, and m = (m_1, . . . , m_r)^T is an r-dimensional vector (r ≤ k + 1). If the nuisance parameter φ is known, as in logistic and Poisson regression, the matrix M reduces to a k × r matrix acting on β alone. Additionally, suitable choices of M give rise to null hypotheses defined by linear combinations of the regression coefficients, β, with φ known or unknown. Further, the simple null hypothesis is the particular case obtained by choosing M as the identity matrix of rank k. In the following, we assume that there exists a matrix A_α(γ) characterizing the asymptotic covariance matrix of the MRPE. The Wald-type tests, based on the MRPE, for testing (11) are defined by

W_n(γ̂_α) = n (M^T γ̂_α − m)^T (M^T Σ_α(γ̂_α) M)^{-1} (M^T γ̂_α − m),   (13)

where Σ_α(γ) denotes the asymptotic covariance matrix of the MRPE. The following theorem presents the asymptotic distribution of the Wald-type test statistics, W_n(γ̂_α).

Theorem 2.
Under the null hypothesis presented in (11), the Wald-type test statistic W_n(γ̂_α) asymptotically follows a chi-square distribution with r degrees of freedom, r being the dimension of the vector m in (12).
Proof. The result follows from the asymptotic normality in Theorem 1, taking into account that γ̂_α is a consistent estimator of γ_0.
Based on the previous convergence, the null hypothesis in (11) is rejected if W_n(γ̂_α) > χ²_{r,α}, where χ²_{r,α} is the 100(1 − α) percentile of a chi-square distribution with r degrees of freedom (here α denotes the significance level, not the tuning parameter). Finally, let γ_1 be a parameter point verifying M^T γ_1 ≠ m, i.e., γ_1 does not lie in the null hypothesis. The next result establishes that the Wald-type tests given in (14) are consistent (see Fraser [29]). Theorem 3. Let γ_1 be a parameter point verifying M^T γ_1 ≠ m. Then the Wald-type tests given in (14) are consistent, i.e., lim_{n→∞} P_{γ_1}(W_n(γ̂_α) > χ²_{r,α}) = 1. Proof. See Appendix A.
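As a concrete illustration of this rejection rule, the following is a minimal Python sketch for a single linear constraint (r = 1). The function names and the supplied variance `sigma2` are illustrative placeholders, not quantities from the paper; 3.8415 is the 95th percentile of the chi-square distribution with 1 degree of freedom.

```python
# Sketch of the Wald-type rejection rule for a single linear constraint
# H0: m_vec^T gamma = m0 (r = 1). The asymptotic variance `sigma2` of
# m_vec^T gamma_hat (the M^T Sigma_alpha M term) is assumed to be supplied;
# all names here are illustrative, not taken from the paper.

CHI2_1_095 = 3.8415  # 95th percentile of the chi-square with 1 df


def wald_statistic(n, gamma_hat, m_vec, m0, sigma2):
    """W_n = n * (m_vec^T gamma_hat - m0)^2 / sigma2 for r = 1."""
    diff = sum(mj * gj for mj, gj in zip(m_vec, gamma_hat)) - m0
    return n * diff * diff / sigma2


def reject_h0(w_n, critical=CHI2_1_095):
    """Reject H0 when the statistic exceeds the chi-square percentile."""
    return w_n > critical


# Toy usage: test H0: beta_2 = 0 with estimate gamma_hat = (1.1, 0.4).
w = wald_statistic(n=200, gamma_hat=(1.1, 0.4), m_vec=(0.0, 1.0),
                   m0=0.0, sigma2=2.0)
```

For r > 1 the scalar division is replaced by a quadratic form with the inverse of M^T Σ_α M, but the rejection logic is identical.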

Remark 1.
In the proof of the previous theorem, the approximate power function of the Wald-type tests defined in (13) was established. From that expression, the necessary sample size n for the Wald-type tests to attain a predetermined power, π_0, is given by n = [n*] + 1, with n* determined from the approximate power function and [·] denoting the integer part.
In accordance with Maronna et al. [30], the breakdown point of an estimator γ̂_α of a parameter γ is the largest amount of contamination that the data may contain such that γ̂_α still gives meaningful information about γ. The derivation of a general breakdown point is not easy, and it may deserve a separate paper, in which the replacement finite-sample breakdown point introduced by Donoho and Huber [31] could be jointly considered. Although the breakdown point is an important theoretical concept in robust statistics, perhaps more useful is its finite-sample counterpart: the replacement finite-sample breakdown point. More details can be found in Section 3.2.5 of Maronna et al. [30].

Influence Function
We derive in this section the IF of the MRPEs of the parameters γ = (β^T, φ)^T and of the Wald-type statistics based on these MRPEs, W_n(γ̂_α). The influence function (IF) of an estimator quantifies the impact of an infinitesimal perturbation of the true distribution of the data on the asymptotic value of the resulting parameter estimate (in terms of the corresponding statistical functional). An estimator is said to be robust if its IF is bounded. If we denote by G = (G_1, . . . , G_n) the true distributions underlying the data, the functional T_α(G) associated with the MRPE of the parameters γ is defined as the minimizer of the average RP. The IF of an estimator is defined as the limiting standardized bias due to infinitesimal contamination at a point. In the following, let us denote the functionals associated with the parameters β and φ, respectively; they must satisfy the estimating equations of the MRPE, where the quantities N_i^*(y_i, γ) are defined in Section 2. Now, evaluating the previous equations at the contaminated distribution G_ε, implicitly differentiating the estimating equations in ε, and evaluating them at ε = 0, we obtain the expression of the IF for the GLM.
We first derive the expression of the IF of the MRPEs in the i_0-th direction. For this purpose, we consider contaminated distributions G_{i_0,ε} in which only the i_0-th component of the vector of distributions is contaminated. If the true density function g_i of each variable belongs to the exponential model, the MRPE functional is evaluated at the contaminated distribution G_{i_0,ε}. Based on Remark 5.2 in Castilla et al. [27], the IF of the MRPE in the i_0-th direction, with (y_{i_0}, x_{i_0}) the contamination point, is then obtained. In a similar manner, the IF in all directions (i.e., when all components of the vector of distributions are contaminated) follows, with (y_1, x_1), . . . , (y_n, x_n) the contamination points. We next derive the expression of the IF for the Wald-type tests presented in Section 3. The statistical functional associated with the Wald-type tests for the linear null hypothesis (11) at the distributions G = (G_1, . . . , G_n), ignoring the constant n, is W_α(G). Again, evaluating the Wald-type test functional at the contaminated distribution G_ε and implicitly differentiating the expression, we can obtain its IF. In particular, the IF of the Wald-type test statistics in the i_0-th direction at the contamination point (y_{i_0}, x_{i_0}) can be computed. Evaluating this expression at the null hypothesis, M^T T_α(G) = m, the IF becomes identically zero. Therefore, it is necessary to consider the second-order IF of the proposed Wald-type tests, obtained by twice differentiating W_α(G_ε). The second-order IF of the Wald-type tests in all directions follows analogously. To assess the robustness of the MRPEs and Wald-type test statistics, we must discuss the boundedness of the corresponding IFs. The boundedness of the second-order IF of the Wald-type test statistics is determined by the boundedness of the IF of the MRPEs.
Further, the matrix Ψ_n(γ) is assumed to be bounded, so the robustness of the estimators depends only on the second factor of the IF. Most standard GLMs enjoy such boundedness for positive values of α, but the influence function is unbounded at α = 0, corresponding to the MLE. Figure 1 provides an illustrative example.
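This downweighting effect can be illustrated numerically. The sketch below assumes the usual density-power weighting f(y; μ)^α of the Poisson score (the exact constants of the paper's IF are omitted); it shows that the score contribution y − μ grows without bound for α = 0 but is driven back to zero for α > 0:

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability mass function."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

def if_weight(y, mu, alpha):
    """Score component (y - mu) downweighted by f(y; mu)^alpha.
    For alpha = 0 this is the unweighted MLE score contribution."""
    return (y - mu) * poisson_pmf(y, mu) ** alpha

mu = 5.0
# MLE (alpha = 0): contributions grow linearly with the outlier size.
mle = [if_weight(y, mu, alpha=0.0) for y in (10, 30, 60)]
# MRPE-style weighting (alpha = 0.3): the density weight kills large outliers.
robust = [if_weight(y, mu, alpha=0.3) for y in (10, 30, 60)]
```

Running this, `mle` grows without bound in y while the entries of `robust` shrink towards zero, mirroring the bounded-IF discussion above.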

Numerical Analysis: Poisson Regression Model
We illustrate the proposed robust method for the Poisson regression model. As pointed out in Section 1, the Poisson regression model belongs to the GLM family with known shape parameter φ = 1, location parameter θ_i = x_i^T β, and known functions b(θ_i) = exp(x_i^T β) and c(y_i) = − log(y_i!). Since the nuisance parameter is known, for the sake of simplicity we write γ = β in the following. In Poisson regression, the mean of the response variable is linked to the linear predictor through the natural logarithm, i.e., μ_i = exp(x_i^T β). Thus, we can apply the proposed method to estimate the vector of regression parameters β with the objective function given in Equation (5).
The results provided are computed in the software R. The minimization of the objective function is performed using the built-in optim() function, which applies the Nelder-Mead iterative algorithm (Nelder and Mead [32]); this algorithm is robust although relatively slow. The objective function T_n^α(γ) given in (5) is highly nonlinear and requires the evaluation of nontrivial quantities. Further, the computation of the Wald-type test statistics defined in (13) requires evaluating the covariance matrix of the MRPEs, which involves nontrivial integrals. Simplified expressions of the main quantities defined throughout the paper for the Poisson regression model, such as L_i^α(β), K_{1i}(y, β), N_i(y, β), m_{1i}(β), m_{11i}(β), and l_{11i}(β), are given in Appendix B. There is no closed expression for these quantities, and they need to be approximated numerically. Since the minimization is performed iteratively, computing such expressions at each step of the algorithm and for each observation may entail an increased computational burden; nonetheless, the cost is not significant for low-dimensional data. On the other hand, the optimum in (5) need not be uniquely defined, since the objective function may have several local minima, so the choice of the initial value of the iterative algorithm is crucial. Ideally, a good initial point should be consistent and robust. In our results, the MLE is used as the initial estimate for the algorithm.
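A minimal Python sketch of this computation follows. The paper uses R's optim(); here a crude grid search stands in for Nelder-Mead, a single covariate is used, and the per-observation loss form (1/(1+α)) log L_i − log f_i(y_i) is a reconstruction consistent with the limiting argument of Section 2, not the paper's verbatim objective. The truncation bound for the series approximating L_i is likewise an illustrative choice.

```python
import math

def poisson_logpmf(y, mu):
    """Log of the Poisson pmf, stable for large mu."""
    return -mu + y * math.log(mu) - math.lgamma(y + 1)

def l_alpha(mu, alpha, y_max=300):
    """Truncated approximation of L = sum_y f(y; mu)^(1+alpha),
    which has no closed form for the Poisson model."""
    return sum(math.exp((1.0 + alpha) * poisson_logpmf(y, mu))
               for y in range(y_max + 1))

def rp_loss(beta, x, y, alpha):
    """Average per-observation loss (1/(1+alpha)) log L_i - log f_i(y_i);
    single-covariate sketch with mu_i = exp(x_i * beta).  As alpha -> 0
    the log L_i term vanishes and this reduces to the negative
    log-likelihood."""
    total = 0.0
    for xi, yi in zip(x, y):
        mu = math.exp(xi * beta)
        total += math.log(l_alpha(mu, alpha)) / (1.0 + alpha)
        total -= poisson_logpmf(yi, mu)
    return total / len(y)

# Crude grid search standing in for optim()'s Nelder-Mead (toy data):
x = [0.1, 0.5, 1.0, 1.5, 2.0]
y = [1, 2, 3, 4, 8]  # roughly consistent with beta close to 1
grid = [b / 100.0 for b in range(0, 201)]
beta_hat = min(grid, key=lambda b: rp_loss(b, x, y, alpha=0.3))
```

Because the loss may have several local minima, in practice the search would be started from a reasonable initial value such as the MLE, as described above.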
We analyze the performance of the proposed methods in Poisson regression through a simulation study. We assess the behavior of the MRPE under a sparse Poisson regression model with k = 12 covariates but only 3 significant variables. We set the 12-dimensional regression parameter to β = (1.8, 1, 0, 0, 1.5, 0, . . . , 0), and we generate the explanatory variables, x_i, from the standard uniform distribution with variance-covariance matrix having a Toeplitz structure, whose (j, l)-th element is 0.5^{|j−l|}, j, l = 1, . . . , k. The response variables are generated from the Poisson regression model with mean μ_i = exp(x_i^T β). To evaluate the robustness of the proposed estimators, we contaminate the responses using a perturbed distribution of the form (1 − b)P(μ_i) + bP(2μ_i), where b is a realization of a Bernoulli variable with parameter ε, the so-called contamination level. That is, the distribution of the contaminated responses lies in a small neighbourhood of the assumed model. We repeat the process R = 1000 times for each value of α. Furthermore, it is to be expected that the error of the estimate decreases with larger sample sizes. In this regard, Figure 3 shows the MSE for α = 0, 0.1, 0.3, 0.5, and 0.7 against the sample size, in the absence of contamination (left) and under 5% contamination (right). Our proposed estimators are more robust than the classical MLE in almost all contaminated scenarios, since the MSE is lower for all positive values of α than for α = 0 (corresponding to the MLE), except for very small sample sizes. Conversely, the MLE is, as expected, the most efficient estimator in the absence of contamination, closely followed by our proposed estimators with α = 0.1 and 0.3, highlighting the role of α in controlling the trade-off between efficiency and robustness. In this regard, values of α around 0.3 perform best, taking into account the low loss of efficiency and the gain in robustness.
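The contamination mechanism (1 − b)P(μ) + bP(2μ) can be sketched as follows; the sampler uses Knuth's multiplication method, and all names are illustrative:

```python
import math
import random

def poisson_sample(mu, rng):
    """Poisson draw via Knuth's multiplication method (fine for moderate mu)."""
    limit = math.exp(-mu)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def contaminated_response(mu, eps, rng):
    """Draw from (1 - b) * Poisson(mu) + b * Poisson(2 * mu),
    with b ~ Bernoulli(eps), mirroring the simulation design."""
    if rng.random() < eps:
        return poisson_sample(2.0 * mu, rng)
    return poisson_sample(mu, rng)

rng = random.Random(42)
sample = [contaminated_response(3.0, 0.05, rng) for _ in range(1000)]
```

With contamination level eps, the sample mean inflates from μ towards (1 + eps)·μ, which is exactly the perturbation the MSE curves in Figure 3 are probing.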
Finally, note that small sample sizes affect larger values of α more adversely. On the other hand, one may be interested in testing the significance of the selected variables. For this purpose, we simplify the true model and examine the performance of the proposed Wald-type test statistics under different true coefficient values. In particular, let us consider a Poisson regression model with only two covariates, generated from the uniform distribution as before, and the linear null hypothesis H_0: β_2 = 0. That is, we are interested in assessing the significance of the second variable. The sample size is fixed at n = 200, and the true value of the first component of the regression vector is set to β_1 = 1. We study the power of the tests under increasing signal of the second parameter β_2 and increasing contamination level. Here, the model is contaminated by perturbing the true distribution with (1 − b)P(μ_i) + bP(μ̃_i), where μ_i = exp(x_i^T β) is the mean of the Poisson variable in the absence of contamination, μ̃_i = exp(x_i^T β̃) is the contaminated mean, with β̃ = (1, 0), and b is a realization of a Bernoulli variable with probability of success ε. Table 1 presents the rejection rates of the Wald-type test statistics for different true values of β_2 under different contamination scenarios. As expected, stronger signals produce higher power for all Wald-type tests. Moreover, the power of the Wald-type test statistics based on the MLE decreases as the contamination increases, whereas the power of the statistics based on the MRPEs with positive values of α remains sufficiently high. Thus, our proposed robust estimators are able to detect the significance of the variable even in heavily contaminated scenarios.

Example I: Poisson Regression
We now apply our proposed estimators to a real dataset on Crohn's disease. The data were first studied in Lô and Ronchetti [33] to assess the adverse events of a drug. The clinical study included 117 patients affected by the disease, for whom 7 explanatory variables were recorded: BMI (body mass index), HEIGHT, COUNTRY (one of the two countries where the patient lives), SEX, AGE, WEIGHT, and TREAT (the drug taken by the patient in factor form: placebo, Dose 1, Dose 2), in addition to the response variable AE (number of adverse events). Lô and Ronchetti [33] considered a Poisson regression model for the Crohn data and determined that only the variables Dose 1, BMI, HEIGHT, SEX, AGE, and COUNTRY may be essentially significant. Further, they flagged observations 23, 49, and 51 as highly influential on the classical analysis. Table 2 presents the estimated coefficients of the explanatory variables when fitting the Poisson regression model. Robust methods suggest larger coefficients for the variables BMI and AGE, and smaller values for the coefficients of the categorical variables COUNTRY, SEX, and Dose 1. Following the discussion in Lô and Ronchetti [33], classical tests may not select the variable AGE as significant. We therefore test the significance of that variable using Wald-type test statistics based on different values of α. Table 3 shows the p-values of the corresponding tests with null hypothesis H_0: AGE = 0, with the original data and after removing the outlying observations. With the original data, the Wald-type test based on the MLE fails to detect the significance of the variable AGE, whereas the Wald-type test statistics with positive values of α indicate strong evidence against the null hypothesis. In contrast, if the influential observations are removed, all Wald-type test statistics agree on the significance of the variable. This example illustrates the robustness of the proposed statistics.

Example II: Binomial Regression
We finally illustrate the applicability of the MRPE for robust inference in the binomial regression model. We examine the damaged-carrots dataset, first studied in Phelps [34] and later discussed by Cantoni and Ronchetti [8] and Ghosh and Basu [13] to illustrate robust procedures for binomial regression. The data contain 24 samples, among which the 14th observation was flagged as an outlier in the y-space, though not a leverage point. The data come from a soil experiment and give the proportion of carrots showing insect damage in a trial with three blocks and eight dose levels of insecticide. The explanatory variables are the logarithm of the dose (Logdose) and two dummy variables for Blocks 1 and 2.
Binomial regression is a natural extension of logistic regression to the case where the response variable Y does not follow a Bernoulli distribution but a binomial distribution, counting the number of successes in a series of m independent Bernoulli trials. The binomial regression model belongs to the GLM family with known shape parameter φ = 1, location parameter θ_i = x_i^T β, and functions b(θ_i) = m log(1 + exp(x_i^T β)) and c(y_i) = log C(m, y_i), where C(m, y) denotes the binomial coefficient. The mean of the response variable is then linked to the linear predictor through the logit function, i.e., μ_i = m exp(x_i^T β)/(1 + exp(x_i^T β)). Table 4 presents the estimated coefficients of the regression vector for the carrots data using the MLE and the robust MRPEs, when the model is fitted with the original data and without the outlying observation. The results are computed in the same manner as in Section 5, adapting the corresponding quantities in Equation (5) to the binomial model. All integrals involved were numerically approximated, and the MLE is used as the initial estimate for the optimization algorithm. The influence of observation 14 stands out when using the MLE: the estimated coefficients are remarkably different when fitting the model with and without observation 14. In contrast, all methods estimate similar coefficients after removing the outlying observation, coinciding with the robust estimates obtained for moderately large values of the tuning parameter α.
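One convenient feature of the binomial case is that the "integral" term in the objective is a finite sum over y = 0, . . . , m, so it can be computed exactly rather than by truncation. The following Python sketch uses a reconstructed per-observation loss (1/(1+α)) log L_i − log f_i(y_i), consistent with Section 2 but not the paper's verbatim objective, with a single covariate and illustrative names:

```python
import math

def binom_logpmf(y, m, p):
    """Log of the binomial pmf via lgamma, stable for all 0 <= y <= m."""
    return (math.lgamma(m + 1) - math.lgamma(y + 1) - math.lgamma(m - y + 1)
            + y * math.log(p) + (m - y) * math.log(1.0 - p))

def l_alpha_binom(m, p, alpha):
    """Exact: sum_{y=0}^{m} f(y; m, p)^(1+alpha) -- a finite sum,
    unlike the Poisson case."""
    return sum(math.exp((1.0 + alpha) * binom_logpmf(y, m, p))
               for y in range(m + 1))

def rp_loss_binom(beta, x, y, m, alpha):
    """Average per-observation loss (1/(1+alpha)) log L_i - log f_i(y_i),
    with success probability p_i = logistic(x_i * beta)."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-xi * beta))
        total += math.log(l_alpha_binom(m, p, alpha)) / (1.0 + alpha)
        total -= binom_logpmf(yi, m, p)
    return total / len(y)
```

At α = 0 the L_i term is log 1 = 0, so the loss reduces exactly to the negative binomial log-likelihood, matching the MLE limiting case discussed earlier.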

Conclusions
In this paper, we presented MRPEs and Wald-type test statistics for GLMs. The proposed MRPEs and statistics have appealing robustness properties when the data are contaminated by outliers or leverage points. MRPEs are consistent and asymptotically normal, and they represent an attractive alternative to the classical nonrobust methods. Additionally, robust Wald-type test statistics, based on the MRPEs, were developed. Through the study of the IFs and an extensive simulation study, we established their robustness from a theoretical and a practical point of view, respectively. In particular, we illustrated the superior performance of the MRPEs and the corresponding Wald-type tests for the Poisson regression model.

Appendix A. Proof of Theorem 3

Here Φ_{N(0,1)}(t) represents the distribution function of a standard normal distribution evaluated at t. Finally, lim_{n→∞} P_{γ_1}(W_n(γ̂_α) > χ²_{r,α}) = 1.

Appendix B. Poisson Regression Model
We derive here some explicit expressions for the particular case of the Poisson regression model. Following the discussion in Section 5, we write γ = β, since the nuisance parameter φ = 1 is known. The Poisson density with parameter e^{x_i^T β} is given by

f_i(y, β) = (1/y!) e^{−e^{x_i^T β}} e^{y x_i^T β},  y = 0, 1, . . . .
Differentiating its logarithm with respect to the regression vector, we get

∂ log f_i(y, β)/∂β = (y − e^{x_i^T β}) x_i,

so we can write K_{1i}(y, β) = y − e^{x_i^T β}.
Further, substituting these quantities yields the estimating equations of the Poisson regression model. For α = 0, we have N_i(y_i, β) = 0 and L_i^α(β) = 1, so the estimating equations reduce to

Σ_{i=1}^n (y_i − e^{x_i^T β}) x_i = 0,

yielding the maximum likelihood estimating equations. On the other hand, the asymptotic distribution of β̂_α follows from Theorem 1.
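The α = 0 case can be checked numerically: for a single covariate the maximum likelihood estimating equation is monotone decreasing in β, so it can be solved by bisection. A small illustrative sketch (the data are made up for the example):

```python
import math

def score(beta, x, y):
    """MLE estimating equation for Poisson regression (alpha = 0),
    single covariate: sum_i (y_i - exp(x_i * beta)) * x_i."""
    return sum((yi - math.exp(xi * beta)) * xi for xi, yi in zip(x, y))

def solve_beta(x, y, lo=-5.0, hi=5.0, tol=1e-10):
    """Bisection on the score, which is monotone decreasing in beta
    (its derivative is -sum_i exp(x_i * beta) * x_i^2 < 0)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid, x, y) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

x = [0.1, 0.5, 1.0, 1.5, 2.0]
y = [1, 2, 3, 4, 8]
beta_mle = solve_beta(x, y)  # root of the estimating equation
```

The solution annuls the score to numerical precision, confirming that the α = 0 estimating equation is exactly the likelihood score equation.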