Robust Model Selection Criteria Based on Pseudodistances

In this paper, we introduce a new class of robust model selection criteria. These criteria are defined by estimators of the expected overall discrepancy using pseudodistances and the minimum pseudodistance principle. Theoretical properties of these criteria are proved, namely asymptotic unbiasedness, robustness, consistency, as well as the limit laws. The case of linear regression models is studied and a specific pseudodistance-based criterion is proposed. Monte Carlo simulations and applications to real data are presented in order to illustrate the performance of the new methodology. These examples show that the new selection criterion for regression models is a good competitor of some well-known criteria and may have superior performance, especially in the case of small and contaminated samples.


Introduction
Model selection is fundamental to the practical applications of statistics and there is a substantial literature on this issue. Classical model selection criteria include, among others, the C_p criterion, the Akaike Information Criterion (AIC), based on the Kullback-Leibler divergence, the Bayesian Information Criterion (BIC), as well as the General Information Criterion (GIC), which corresponds to a general class of criteria that also estimate the Kullback-Leibler divergence. These criteria have been proposed respectively in [1][2][3][4] and represent powerful tools for choosing the best model among different candidate models that can be used to fit a given data set. On the other hand, many classical procedures for model selection are extremely sensitive to outliers and to other departures from the distributional assumptions of the model. Robust versions of classical model selection criteria, which are not strongly affected by outliers, have been proposed, for example, in [5][6][7]. Some recent proposals for robust model selection are criteria based on divergences and minimum divergence estimators. We recall here the Divergence Information Criterion (DIC) based on the density power divergences introduced in [8], the Modified Divergence Information Criterion (MDIC) introduced in [9] and the criteria based on minimum dual divergence estimators introduced in [10].
The interest in statistical methods based on divergence measures has grown significantly in recent years. For a wide variety of models, statistical methods based on divergences have high model efficiency and are also robust, representing attractive alternatives to the classical methods. We refer to the monographs [11,12] for an excellent presentation of such methods, of their importance and applications. The pseudodistances that we use in the present paper were originally introduced in [13], where they are called "type-0" divergences, and corresponding minimum divergence estimators have been studied there. They are also presented and extensively studied in [14], where they are called γ-divergences, as well as in [15] in the context of decomposable pseudodistances. Like divergences, the pseudodistances are not mathematical metrics in the strict sense of the term. They satisfy two properties, namely nonnegativity and the fact that the pseudodistance between two probability measures equals zero if and only if the two measures are equal. The divergences are moreover characterized by the information processing property, that is, complete invariance with respect to statistically sufficient transformations of the observation space. In general, a pseudodistance may not satisfy this property. We have adopted the term pseudodistance for this reason, but in the literature one can also encounter the other terms mentioned above.
The pseudodistances that we consider in this paper have also been used to define robustness and efficiency measures, as well as the corresponding optimal robust M-estimators following Hampel's infinitesimal approach in [16]. The minimum pseudodistance estimators for general parametric models have been studied in [15] and consist of minimizing an empirical version of a pseudodistance between the assumed theoretical model and the true model underlying the data. These estimators have the advantage of not requiring any prior smoothing and reconcile robustness with high efficiency, providing a high degree of stability under model misspecification, often with a minimal loss in model efficiency. Such estimators are also defined and studied in the case of the multivariate normal model, as well as for linear regression models, in [17,18], where applications to portfolio optimization models are also presented.
In the present paper we propose new criteria for model selection, based on pseudodistances and on minimum pseudodistance estimators. These new criteria have robustness properties, are asymptotically unbiased, consistent and compare well with some other known model selection criteria, even for small samples.
The paper is organized as follows. Section 2 is devoted to minimum pseudodistance estimators and to their asymptotic properties, which will be needed in the subsequent sections. Section 3 presents new estimators of the expected overall discrepancy using pseudodistances, together with the corresponding theoretical properties, including robustness, consistency and limit laws. The new asymptotically unbiased model selection criteria are presented in Section 3.3, where the case of the univariate normal model and the case of linear regression models are investigated. Applications based on Monte Carlo simulations and on real data, illustrating the performance of the new methodology in the case of linear regression models, are included in Section 4.

Minimum Pseudodistance Estimators
The construction of the new model selection criteria is based on the following family of pseudodistances (see [15]). For two probability measures P and Q admitting densities p and q, respectively, with respect to the Lebesgue measure λ, the family of pseudodistances of order γ > 0 is defined by

R_γ(P, Q) := (1/(γ+1)) ln ∫ p^{γ+1} dλ + (1/(γ(γ+1))) ln ∫ q^{γ+1} dλ − (1/γ) ln ∫ p^γ q dλ,   (1)

and satisfies the limit relation

lim_{γ↓0} R_γ(P, Q) = R_0(P, Q),

where R_0(P, Q) := ∫ ln(q/p) dQ is the modified Kullback-Leibler divergence. Let (P_θ) be a parametric model indexed by θ ∈ Θ, where Θ is a d-dimensional parameter space, and let p_θ be the corresponding densities with respect to the Lebesgue measure λ. Let X_1, …, X_n be a random sample from P_{θ_0}, θ_0 ∈ Θ. For γ > 0 fixed, a minimum pseudodistance estimator of the unknown parameter θ_0 from the law P_{θ_0} is defined by replacing the measure P_{θ_0} in the pseudodistance R_γ(P_θ, P_{θ_0}) by the empirical measure P_n pertaining to the sample, and then minimizing this empirical quantity with respect to θ over the parameter space. Since the middle term in R_γ(P_θ, P_{θ_0}) does not depend on θ, these estimators are defined by

θ̂_n = arg min_{θ∈Θ} { (1/(γ+1)) ln ∫ p_θ^{γ+1} dλ − (1/γ) ln [ (1/n) ∑_{i=1}^n p_θ^γ(X_i) ] },

or equivalently as

θ̂_n = arg max_{θ∈Θ} (1/n) ∑_{i=1}^n h(X_i, θ),   (3)

where

h(x, θ) := p_θ^γ(x) ( ∫ p_θ^{γ+1} dλ )^{−γ/(γ+1)}.

The optimum given above need not be uniquely defined.
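To make the estimating procedure concrete, here is a minimal numerical sketch (the helper name `min_pseudodistance_mu` is hypothetical). It assumes the maximization form of the estimator given above, specialized to a normal location model with σ known, in which case the normalizing factor of h does not depend on μ and maximizing the empirical mean of h reduces to maximizing the mean of p_μ^γ(X_i):

```python
import numpy as np

def min_pseudodistance_mu(x, gamma, mu_grid):
    # With sigma = 1 known, maximizing (1/n) sum h(X_i, theta) reduces to
    # maximizing the mean of p_mu(X_i)^gamma, i.e. of exp(-gamma (x - mu)^2 / 2),
    # since the normalizing factor does not depend on mu.
    scores = [np.mean(np.exp(-gamma * (x - mu) ** 2 / 2)) for mu in mu_grid]
    return mu_grid[int(np.argmax(scores))]

rng = np.random.default_rng(0)
clean = rng.normal(1.0, 1.0, 1000)
outliers = np.full(100, 10.0)                 # ~10% gross outliers
contaminated = np.concatenate([clean, outliers])

grid = np.linspace(-2.0, 12.0, 1401)          # grid step 0.01
mu_robust = min_pseudodistance_mu(contaminated, gamma=0.5, mu_grid=grid)
mu_mean = contaminated.mean()                 # non-robust benchmark estimate
```

On this toy sample the pseudodistance estimate stays near the bulk of the data while the sample mean is pulled toward the outliers, illustrating the stability under contamination discussed later in the paper.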
On the other hand, at the population level the corresponding criterion is optimized at θ_0, and θ_0 is the unique optimizer, since R_γ(P_θ, P_{θ_0}) ≥ 0 with equality if and only if P_θ = P_{θ_0}. The following regularity conditions on the model will be assumed throughout the rest of the paper. (C1) The density p_θ(x) has continuous partial derivatives with respect to θ up to the third order (for all x, λ-a.e.).
(C2) There exists a neighborhood N_{θ_0} of θ_0 such that the first-, second- and third-order partial derivatives with respect to θ of h(x, θ) are dominated on N_{θ_0} by some P_{θ_0}-integrable functions.
Theorem 1. Assume that conditions (C1), (C2) and (C3) are fulfilled. Then:
(a) Let B := {θ ∈ Θ : ‖θ − θ_0‖ ≤ n^{−1/3}}. Then, as n → ∞, with probability one, the function θ ↦ (1/n) ∑_{i=1}^n h(X_i, θ) attains a local maximum at some point θ̂_n in the interior of B, which implies that the estimator θ̂_n is n^{1/3}-consistent.
(b) √n (θ̂_n − θ_0) converges in distribution to a centered multivariate normal random variable with covariance matrix V.
We refer to [15] for details regarding these estimators and for the proofs of the above asymptotic properties.

Model Selection Criteria Based on Pseudodistances
Model selection is a method for selecting the best model among candidate models that can be used to fit a given data set. A model selection criterion can be considered as an approximately unbiased estimator of the expected overall discrepancy, a nonnegative quantity which measures the distance between the true unknown model and a fitted approximating model. The candidate model attaining the smallest value of the criterion is then chosen. In the following, by applying the same methodology used for AIC, we construct new criteria for model selection using the pseudodistances (1) and minimum pseudodistance estimators.
Let X_1, …, X_n be a random sample from the distribution associated with the true model Q with density q, and let p_θ be the density of a candidate model P_θ from a parametric family (P_θ), where θ ∈ Θ ⊂ R^d.

The Expected Overall Discrepancy
For γ > 0 fixed, we consider the quantity

W_θ := (1/(γ+1)) ln ∫ p_θ^{γ+1} dλ − (1/γ) ln ∫ p_θ^γ dQ,

which is the pseudodistance R_γ(P_θ, Q) without the middle term, the latter remaining constant irrespective of the model (P_θ) used.
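As a quick numerical sanity check, the quantity above can be estimated from data by replacing the Q-integral with a sample average. The sketch below (helper names hypothetical) assumes W_θ has the form stated above and uses toy data from Q = N(0, 1); the empirical curve μ ↦ Ŵ_{(μ,1)} should then be minimized near the true location:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def W_hat(sample, mu, sigma, gamma):
    # Empirical version of W_theta: the Q-integral is replaced by a sample mean.
    # For the normal family: int p^{gamma+1} dlambda = (2 pi sigma^2)^(-gamma/2) / sqrt(gamma+1)
    log_int = -(gamma / 2) * np.log(2 * np.pi * sigma**2) - 0.5 * np.log(gamma + 1)
    return log_int / (gamma + 1) - np.log(np.mean(normal_pdf(sample, mu, sigma) ** gamma)) / gamma

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 20000)              # data from the true model Q = N(0, 1)
mus = np.linspace(-2.0, 2.0, 41)             # candidate location parameters
vals = [W_hat(x, mu, 1.0, gamma=0.2) for mu in mus]
best_mu = mus[int(np.argmin(vals))]          # should sit near the true mu = 0
```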
The target theoretical quantity that will be approximated by an asymptotically unbiased estimator is given by

E[W_{θ̂_n}],   (10)

where θ̂_n is a minimum pseudodistance estimator defined as in (3). The same pseudodistance is used for both W_θ and θ̂_n. The quantity (10) can be seen as an average distance between Q and (P_θ) up to a constant and is called the expected overall discrepancy between Q and (P_θ). The next lemma gives the gradient vector and the Hessian matrix of W_θ and is useful for the evaluation of E[W_{θ̂_n}] through a Taylor expansion. Throughout this paper, for a scalar function ϕ_θ(·), the quantity (∂/∂θ) ϕ_θ(·) denotes the d-dimensional gradient vector of ϕ_θ(·) with respect to the vector θ and (∂²/∂θ²) ϕ_θ(·) denotes the corresponding d × d Hessian matrix. We also use the notations ϕ̇_θ and ϕ̈_θ for the first- and second-order derivatives of ϕ_θ with respect to θ.
Then, using the continuous mapping theorem, since g(t) = −(1/γ) ln t is a continuous function, we obtain the consistency of Q_{θ̂_n}. On the other hand, using the asymptotic normality of the estimator (according to Theorem 1 (c)) together with the univariate delta method, we obtain the asymptotic normality of Q_{θ̂_n}. The proposition below summarizes the above asymptotic results.

Robustness Properties of the Estimator
The influence function is a useful tool for describing the robustness of an estimator. Recall that a map T, defined on a set of probability measures and taking values in the parameter space, is a statistical functional corresponding to an estimator θ̂_n of the parameter θ whenever θ̂_n = T(P_n), where P_n is the empirical measure associated with the sample. The influence function of T at P_θ is defined by

IF(x; T, P_θ) := lim_{ε→0} [T((1−ε)P_θ + εδ_x) − T(P_θ)] / ε,

δ_x being the Dirac measure putting all its mass at x. The gross error sensitivity of the estimator is defined by

γ*(T, P_θ) := sup_x ‖IF(x; T, P_θ)‖.

Whenever the influence function is bounded with respect to x, the corresponding estimator is called B-robust (see [19]).
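The definition above can be approximated numerically by replacing the limit with a finite contamination weight ε. The following sketch (helper names `empirical_IF` and `weighted_median` are hypothetical, not from the paper) encodes (1−ε)P_n + εδ_x through observation weights and contrasts the sample mean, whose influence function is unbounded, with the median, which is B-robust:

```python
import numpy as np

def weighted_median(values, weights):
    # Smallest value whose cumulative (normalized) weight reaches 1/2
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cdf = np.cumsum(w) / np.sum(w)
    return v[np.searchsorted(cdf, 0.5)]

def empirical_IF(T, sample, x, eps=1e-3):
    # Finite-eps influence function: [T((1-eps) P_n + eps delta_x) - T(P_n)] / eps,
    # with the contaminated measure represented by reweighted observations.
    n = len(sample)
    values = np.append(sample, x)
    weights = np.append(np.full(n, (1 - eps) / n), eps)
    base = T(sample, np.full(n, 1.0 / n))
    return (T(values, weights) - base) / eps

mean_T = lambda v, w: np.sum(v * w) / np.sum(w)

rng = np.random.default_rng(2)
sample = rng.normal(0.0, 1.0, 100001)
if_mean_far = empirical_IF(mean_T, sample, x=50.0)        # grows linearly in x
if_med_far = empirical_IF(weighted_median, sample, x=50.0)  # stays bounded
```

For a distant contamination point x = 50, the mean's empirical influence is near 50 while the median's stays close to 1/(2φ(0)) ≈ 1.25, matching the B-robustness dichotomy described above.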
In what follows, for a given γ > 0, we derive the influence function of the estimator Q_{θ̂_n}. The statistical functional associated with this estimator, which we denote by U, is defined by

U(Q) := (1/(γ+1)) ln ∫ p_{T(Q)}^{γ+1} dλ − (1/γ) ln ∫ p_{T(Q)}^γ dQ,

where T is the statistical functional corresponding to the used minimum pseudodistance estimator θ̂_n. Due to the Fisher consistency of the functional T, according to (6), we have T(P_{θ_0}) = θ_0.

Proposition 5. When Q = P_{θ_0}, the influence function of Q_{θ̂_n} is given by

IF(x; U, P_{θ_0}) = −(1/γ) [ p_{θ_0}^γ(x) / ∫ p_{θ_0}^{γ+1} dλ − 1 ].   (23)

Note that the influence function of the estimator Q_{θ̂_n} does not depend on the estimator θ̂_n, but depends on the used pseudodistance. Usually, p_{θ_0}^γ(x) is bounded with respect to x and therefore Q_{θ̂_n} is a robust estimator with respect to W_{θ_0}. For comparison at the level of the influence function, we consider the AIC criterion, which is defined by

AIC = −2 ln L(θ̂_n) + 2d,

where L(θ̂_n) is the maximum value of the likelihood function for the model, θ̂_n is the maximum likelihood estimator and d is the dimension of the parameter. The statistical functional corresponding to the statistic −2 ln L(θ̂_n) is

V(Q) := −2 ∫ ln p_{T(Q)} dQ,

where T here is the statistical functional corresponding to the maximum likelihood estimator. The influence function of the functional V is given by

IF(x; V, P_{θ_0}) = −2 [ ln p_{θ_0}(x) − ∫ ln p_{θ_0} dP_{θ_0} ].   (24)

This influence function is not bounded with respect to x; therefore the statistic −2 ln L(θ̂_n) is not robust.
For example, in the case of the univariate normal model, with θ_0 = (m, σ) and a positive γ, the influence function (23) becomes

IF(x; U, P_{θ_0}) = −(1/γ) [ √(γ+1) exp(−γ(x−m)²/(2σ²)) − 1 ],   (25)

while the influence function (24) becomes

IF(x; V, P_{θ_0}) = (x−m)²/σ² − 1.   (26)

For all the pseudodistances, the influence function (25) is bounded with respect to x; therefore the selection criteria based on the statistic Q_{θ̂_n} will be robust. On the other hand, the influence function (26) is not bounded with respect to x, showing the non-robustness of AIC in this case. Moreover, the gross error sensitivities corresponding to these influence functions are

γ*(U, P_{θ_0}) = 1/γ   and   γ*(V, P_{θ_0}) = ∞.

These results show that, in the case of the normal model, when γ increases the gross error sensitivity decreases. Therefore, larger values of γ are associated with more robust procedures. For the particular case m = 0 and σ = 1, the influence functions (25) and (26) are represented in Figure 1.
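These boundedness claims are easy to verify numerically. The sketch below assumes the closed-form normal-model influence functions discussed above, namely −(1/γ)(√(γ+1) e^{−γ(x−m)²/(2σ²)} − 1) for the pseudodistance statistic and (x−m)²/σ² − 1 for the −2 ln L statistic (both are assumptions restated from this section, not independent derivations):

```python
import numpy as np

def IF_pseudo(x, gamma, m=0.0, sigma=1.0):
    # Pseudodistance-based influence function: bounded in x,
    # with gross-error sensitivity 1/gamma for the gammas used here
    z = (x - m) / sigma
    return -(1.0 / gamma) * (np.sqrt(gamma + 1.0) * np.exp(-gamma * z**2 / 2.0) - 1.0)

def IF_aic(x, m=0.0, sigma=1.0):
    # Influence function of the -2 ln L statistic: quadratic, hence unbounded
    return ((x - m) / sigma) ** 2 - 1.0

xs = np.linspace(-50.0, 50.0, 10001)
# maximum absolute influence over a wide grid, per gamma
sens = {g: float(np.max(np.abs(IF_pseudo(xs, g)))) for g in (0.1, 0.2, 0.5)}
```

The computed maxima stay below the bound 1/γ and shrink as γ grows, while the AIC influence grows quadratically, in line with the comparison above.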

The Case of Univariate Normal Family
The criteria that we propose in this section correspond to the case where the candidate model is a univariate normal model from the family of normal models (P θ ) indexed by θ = (µ, σ). We also suppose that the true model Q belongs to (P θ ).
In the case of the univariate normal model, M_γ(θ_0) defined in (14) can be expressed in terms of the asymptotic covariance matrix V given by (8) and a matrix A(γ) depending only on γ.
For small positive values of γ, the matrix A(γ) can be approximated by the identity matrix I. According to Theorem 1, √n(θ̂_n − θ_0) is asymptotically multivariate normal, so the associated quadratic form is asymptotically χ²-distributed. Also, for the normal model, the remaining terms in (18) can be computed explicitly. Using the central limit theorem and the asymptotic properties of θ̂_n given in Theorem 1, together with (30), (31) and (32), we obtain:

Proposition 6. For the univariate normal family, an asymptotically unbiased estimator of the expected overall discrepancy is given by (33), where θ̂_n is a minimum pseudodistance estimator given by (3).
Under the hypothesis that (P_θ) is the univariate normal model, as supposed in this subsection, the function h can be written explicitly and it can easily be checked that all the conditions (C1)-(C5) are fulfilled. Therefore we can use all the results presented in the preceding subsections, so that Proposition 6 is fully justified. Moreover, the selection criteria based on (33) are consistent on the basis of Proposition 4. It should also be noted that the bias correction term in (33) decreases slowly as the parameter γ increases, always staying very close to zero (∼10⁻²). As expected, the larger the sample size, the smaller the bias correction. As we saw in Section 3.2.2, since the gross error sensitivity of Q_{θ̂_n} is γ*(U, P_{θ_0}) = 1/γ, larger values of γ are associated with more robust procedures. On the other hand, the approximation of A(γ) by the identity matrix holds for values of γ close to zero. Thus, positive values of γ smaller than, for example, 0.5 represent choices satisfying both the robustness requirement and the approximation of A(γ) by the identity matrix, an approximation which is necessary to construct the criterion in this case.

The Case of Linear Regression Models
In the following, we adapt the pseudodistance-based model selection criterion to the case of linear regression models. Consider the linear regression model

Y = α + X⊤β + e,

where e ∼ N(0, σ) and e is independent of X. Suppose we have a sample given by the i.i.d. random vectors Z_i = (X_i, Y_i), i = 1, …, n. We consider the joint distribution of the entire data and write a pseudodistance between the theoretical model and the true model corresponding to the data. Let P_θ, θ := (α, β, σ), be the probability measure associated with the theoretical model given by the random vector Z = (X, Y), and Q the probability measure associated with the true model corresponding to the data. Denote by p_θ, respectively q, the corresponding densities. For γ > 0, the pseudodistance R_γ(P_θ, Q) between P_θ and Q is defined as in (1). Similarly to [18], since the middle term does not depend on P_θ, a minimum pseudodistance estimator of the parameter θ_0 = (α_0, β_0, σ_0) is defined by

θ̂_n = arg min_θ { (1/(γ+1)) ln ∫ p_θ^{γ+1} dλ − (1/γ) ln ∫ p_θ^γ dP_n },

where P_n is the empirical measure associated with the sample. This estimator can be written in terms of the residuals, as a maximizer of the normalized empirical mean of φ_σ^γ(Y_i − α − X_i⊤β), where φ_σ is the density of the random variable e ∼ N(0, σ). Then, the estimator Q_{θ̂_n} can be written accordingly, replacing in W_θ the measure Q by P_n and θ by θ̂_n. In order to construct an asymptotically unbiased estimator of the expected overall discrepancy in the case of linear regression models, we evaluated the second and the third terms from (18).
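A numerical sketch of such a regression fit follows. It assumes the objective reduces, as in the i.i.d. case, to maximizing the mean of φ_σ^γ over the residuals times the normalizing constant of the normal family (an assumption modeled on the earlier h(x, θ); the helper names are hypothetical). The optimizer starts from an ordinary least squares fit:

```python
import numpy as np
from scipy.optimize import minimize

def neg_objective(theta, x, y, gamma):
    # Negative of the assumed objective: mean of phi_sigma^gamma over residuals,
    # times ( int phi_sigma^{gamma+1} dz )^(-gamma/(gamma+1))
    alpha, beta, log_sigma = theta
    sigma = np.exp(log_sigma)
    resid = y - alpha - beta * x
    phi = np.exp(-resid**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    C = (2 * np.pi * sigma**2) ** (-gamma / 2) / np.sqrt(gamma + 1)
    return -np.mean(phi**gamma) * C ** (-gamma / (gamma + 1))

rng = np.random.default_rng(3)
x = rng.normal(0.0, 3.0, 400)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, 400)
y[:40] += 15.0                                # 10% contaminated responses

slope_ols, intercept_ols = np.polyfit(x, y, 1)  # non-robust benchmark fit
resid_ols = y - (slope_ols * x + intercept_ols)
x0 = [intercept_ols, slope_ols, np.log(resid_ols.std())]
res = minimize(neg_objective, x0=x0, args=(x, y, 0.3), method="Nelder-Mead",
               options={"maxiter": 10000, "xatol": 1e-9, "fatol": 1e-10})
alpha_hat, beta_hat = res.x[0], res.x[1]
```

On this toy contaminated sample, the OLS intercept is pulled upward by the shifted responses while the pseudodistance fit stays near the uncontaminated values, which is the behavior motivating the robust criterion of this section.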
For values of γ close to 0 (γ smaller than 0.3), we found an approximation of the corresponding matrix in terms of the asymptotic covariance matrix V of θ̂_n and the identity matrix I. We refer to [15] for the asymptotic properties of the minimum pseudodistance estimators in the case of linear regression models. Since √n(θ̂_n − θ_0) is asymptotically multivariate normally distributed, using the χ² distribution we obtain an approximation of the second term in (18). The third term in (18) can also be computed explicitly. Then, according to Proposition 3, an asymptotically unbiased estimator of the expected overall discrepancy is given by (43), where Q_{θ̂_n} is given by (39). Note that, using the asymptotic properties of θ̂_n and the central limit theorem, the last two terms in (18) of Proposition 3 are o_P(1/n). When we compare different linear regression models, as in Section 4 below, we can ignore the terms depending only on n and γ in (43). Therefore, we can use as model selection criterion the simplified expression (44), which we call the Pseudodistance-based Information Criterion (PIC).

Simulation Study
In order to illustrate the performance of the PIC criterion (44) for linear regression models, we performed a simulation study using for comparison the model selection criteria AIC, BIC and MDIC. These criteria are defined respectively by

AIC = n ln σ̂²_p + 2(p + 2)  and  BIC = n ln σ̂²_p + (p + 2) ln n,

where n is the sample size, p is the number of covariates of the model and σ̂²_p is the classical unbiased estimator of the variance of the model, and by

MDIC = n MQ_{θ̂} + (2π)^{−α/2} (1 + α)^{2+p/2} p,

with α = 0.25, where θ̂ is a consistent estimate of the vector of unknown parameters involved in the model with p covariates and f_{θ̂} is the associated probability density function. Note that MDIC is based on the well-known BHHJ family of divergence measures indexed by a parameter α > 0 and on the minimum divergence estimating method for robust parameter estimation (see [20]). The value α = 0.25 was found in [9] to be an ideal one for a great variety of settings. The above three criteria have been chosen for this comparative study with PIC not only due to their popularity, but also due to their special characteristics. Indeed, AIC is the classical representative of asymptotically efficient criteria, BIC is known to be consistent, while MDIC is associated with robust estimation (see, e.g., [20][21][22][23]).

Let X_1, X_2, X_3, X_4 be four variables following respectively the normal distributions N(0, 3), N(1, 3), N(2, 3) and N(3, 3). We consider the model

Y = a_0 + a_1 X_1 + a_2 X_2 + ε,

with a_0 = a_1 = a_2 = 1 and ε ∼ N(0, 1). This is the uncontaminated model. In order to evaluate the robustness of the new PIC criterion, we also consider the contaminated model in which the error is drawn from the mixture d_1 N(0, 1) + d_2 N(5, 1) of the distributions of ε and ε* ∼ N(5, 1), where d_1, d_2 ∈ [0, 1] are such that d_1 + d_2 = 1. Note that for d_1 = 1 and d_2 = 0 the uncontaminated model is obtained.
With a set of four possible regressors there are 2⁴ − 1 = 15 possible model specifications that include at least one regressor. These 15 possible models constitute the set of candidate models in our study. More precisely, this set contains the full model (X_1, X_2, X_3, X_4), as well as all 14 possible subsets of the full model consisting of one (X_{j_1}), two (X_{j_1}, X_{j_2}) or three (X_{j_1}, X_{j_2}, X_{j_3}) of the four regressors X_1, X_2, X_3 and X_4, with j_1, j_2, j_3 distinct and j_i ∈ {1, 2, 3, 4}.
In our simulation study, for several values of the parameter γ associated with the pseudodistance, we compared the new criterion PIC with the other model selection criteria. Different levels of contamination and different sample sizes have been considered. In the examples presented in this work, d 1 ∈ {0.8, 0.9, 0.95, 1} and n ∈ {20, 50, 100}. Additional examples for n = 30, 75, 200, 500 have been analyzed (results not shown) with similar findings (see below). For each setting, fifty experiments were performed in order to select the best model among the available candidate models. In the framework of each of the fifty experiments, on the basis of the simulated observations, the value of each of the above model selection criteria was calculated for each of the 15 possible models. Then, for each criterion, the 15 candidate models were ranked from 1st to 15th according to the value of the criterion. The model chosen by a given criterion is the one for which the value of the criterion is the lowest among all the 15 candidate models.
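The enumerate-score-rank protocol described above can be sketched as follows. Since the full PIC formula from (44) is not reproduced here, a BIC-type score computed from OLS fits stands in as the ranked criterion (an assumption for illustration only); the subset enumeration and ranking logic are the same for any criterion:

```python
import itertools
import numpy as np

def bic_score(X_full, y, subset):
    # OLS fit on the chosen columns; BIC-type score n*ln(RSS/n) + (p+2)*ln(n)
    n = len(y)
    Z = np.column_stack([np.ones(n)] + [X_full[:, j] for j in subset])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = np.sum((y - Z @ coef) ** 2)
    return n * np.log(rss / n) + (len(subset) + 2) * np.log(n)

rng = np.random.default_rng(4)
n = 200
X = rng.normal([0, 1, 2, 3], 3.0, size=(n, 4))     # X1..X4 as in the study
y = 1.0 + X[:, 0] + X[:, 1] + rng.normal(0, 1, n)  # true model uses X1, X2

# all 15 non-empty subsets of the four regressors
subsets = [s for r in (1, 2, 3, 4) for s in itertools.combinations(range(4), r)]
best = min(subsets, key=lambda s: bic_score(X, y, s))  # lowest score wins
```

In the paper's study, this inner loop is repeated over fifty simulated samples per setting and the selection proportions are tabulated per criterion.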
Tables 1-12 present the proportions of models selected by the considered criteria. Among the 15 candidate models only 4 were chosen at least once. These four models are the same in all instances and appear in the 2nd column of all Tables.
For small sample sizes (n = 20, n = 30), the criteria PIC and MDIC yield the best results. When the level of contamination is 10% or 20%, the PIC criterion yields very good results and beats the other competitors almost all the time. When the level of contamination is small, for example 5%, or when there is no contamination, the two criteria are comparable, in the sense that in many cases the proportions of selected models by the two criteria are very close, so that sometimes PIC wins and sometimes MDIC wins. Tables 1-4 present these results for n = 20, but similar results are obtained for n = 30, too.
For medium sample sizes (n = 50, n = 75), the criteria PIC and BIC yield the best results. The results for n = 50 are given in Tables 5-8. Note that the PIC criterion yields the best results for 0% and 10% contamination. For the other levels of contamination, there are values of γ for which PIC is the best among all the considered criteria. On the other hand, in most cases when BIC wins, the proportions of selections of the true model by BIC and PIC are close.
When the sample size is large (n = 100, n = 200, n = 500), BIC generally yields better results than PIC which stays relatively close behind, but sometimes BIC and PIC have the same performance. Tables 9-12 present the results obtained for n = 100.
Thus, the new PIC criterion works very well for small to medium sample sizes and for levels of contamination up to 20%, but falls behind BIC for large sample sizes. Note that for contaminated data, PIC with γ = 0.15 prevails in most of the considered cases. On the other hand, for uncontaminated data, it is PIC with γ = 0.2 that prevails in all the considered instances. It is also worth mentioning that PIC with γ = 0.3 behaves very satisfactorily in most cases, irrespective of the proportion of contamination (0%-20%) and the sample size. Observe also that in all cases AIC has the highest overestimation rate, which is somewhat expected (see [24]).
Although consistency is the main focus of the applications presented in this work, one should point out that if prediction is part of the objective of a regression analysis, then model selection carried out using criteria such as the ones used in this work has desirable properties. In fact, the case of finite-dimensional normal regression models has been shown to be associated with satisfactory prediction errors for criteria such as AIC and BIC (see [25]). Furthermore, it should be pointed out that in many instances PIC behaves quite similarly to the above criteria, choosing the same models. Also, according to the presented simulation results, the proportion of choosing the true model by PIC is always better than that of AIC (even in the case of non-contaminated data) and sometimes better than that of BIC. These results imply a satisfactory prediction ability for the proposed PIC criterion.
In conclusion, the new PIC criterion is a good competitor of the well-known model selection criteria AIC, BIC and MDIC and may have superior performance, especially in the case of small and contaminated samples.

Table 2. Proportions of selected models by the considered criteria (n = 20, d_1 = 0.9).

Table 6. Proportions of selected models by the considered criteria (n = 50, d_1 = 0.9).

Table 9. Proportions of selected models by the considered criteria (n = 100, d_1 = 0.8).

Table 11. Proportions of selected models by the considered criteria (n = 100, d_1 = 0.95).

Real Data Example

Table 13. Hald cement data.

Since 4 variables are available, there are 15 possible candidate models (involving at least one regressor) for this data set. Note that the 4 single-variable models should be excluded from the analysis, because cement involves a mixture of at least two components that react chemically (see [27], p. 102). The model selection criteria that have been used are PIC for several values of γ, AIC, BIC and MDIC with α = 0.25. Table 14 shows the model selected by each of the considered criteria.

Table 14. Selected models by model selection criteria.

Criteria            Selected Variables
PIC, γ = 0.05       X_1, X_2, X_4
PIC, γ = 0.15       X_1, X_2, X_4
PIC, γ = 0.2        X_1, X_2, X_3
PIC, γ = 0.25       X_1, X_2, X_4
PIC, γ = 0.3        X_1, X_2, X_4
AIC                 X_1, X_2, X_4
BIC                 X_1, X_2
MDIC                X_1, X_2, X_3

Observe that, in this example, PIC behaves similarly to AIC and MDIC, having a slight tendency toward overestimation. Note that for this specific dataset the collinearity is quite strong, with X_1 and X_3, as well as X_2 and X_4, being seriously correlated. It should be pointed out that the model (X_1, X_2, X_4) is chosen not only by AIC and PIC, but also by Mallows' C_p criterion ([1]), with (X_1, X_2, X_3) coming a very close second. Note further that (X_1, X_2, X_4) has also been chosen by cross-validation ([28], p. 33) and PRESS ([26], p. 325). Finally, it is worth noticing that these two models share the highest adjusted R² values, which are almost identical (0.976 for (X_1, X_2, X_4) and 0.974 for (X_1, X_2, X_3)), making the distinction between them extremely hard. Thus, in this example, the new PIC criterion gives results as good as other recognized classical model selection criteria.

Conclusions
In this work, by applying the same methodology as for AIC to a family of pseudodistances, we constructed new model selection criteria using minimum pseudodistance estimators. We proved theoretical properties of these criteria, including asymptotic unbiasedness, robustness, consistency, as well as the limit laws. The case of linear regression models was studied in detail and specific selection criteria based on pseudodistances were proposed.
For linear regression models, a comparative study based on Monte Carlo simulations illustrates the performance of the new methodology. Thus, for small sample sizes, the criteria PIC and MDIC yield the best results, and in many cases PIC wins, for example when the level of contamination is 10% or 20%. For medium sample sizes, the criteria PIC and BIC yield the best results. When the sample size is large, BIC generally yields better results than PIC, which stays relatively close behind, but sometimes BIC and PIC have the same performance.
Based on the results of the simulation study and on the real data example, we conclude that the new PIC criterion is a good competitor of the well known criteria AIC, BIC and MDIC with an overall performance which is very satisfactory for all possible settings according to the sample size and contamination rate. Also PIC may have superior performance, especially in the case of small and contaminated samples.
An important issue that needs further investigation is the choice of the appropriate value for the parameter γ associated with the procedure. The findings of the presented simulation study show that, for contaminated data, the value γ = 0.15 leads to very good results, irrespective of the sample size. Also, γ = 0.3 produces overall very satisfactory results, irrespective of the sample size and the contamination rate. We hope to explore this further and provide a clear solution to this problem in future work. We also intend to extend this methodology to other types of models, including nonlinear models or time series models.