Article

Generalized Empirical Likelihood-Based Focused Information Criterion and Model Averaging

Naoya Sueishi
Graduate School of Economics, Kyoto University, Yoshida-Hommachi, Sakyo-ku, Kyoto 606-8501, Japan
Econometrics 2013, 1(2), 141-156; https://doi.org/10.3390/econometrics1020141
Submission received: 13 May 2013 / Revised: 26 June 2013 / Accepted: 27 June 2013 / Published: 3 July 2013
(This article belongs to the Special Issue Econometric Model Selection)

Abstract

This paper develops model selection and averaging methods for moment restriction models. We first propose a focused information criterion based on the generalized empirical likelihood estimator. We address the issue of selecting an optimal model, rather than a correct model, for estimating a specific parameter of interest. Then, this study investigates a generalized empirical likelihood-based model averaging estimator that minimizes the asymptotic mean squared error. A simulation study suggests that our averaging estimator can be a useful alternative to existing post-selection estimators.

1. Introduction

This paper develops model selection and averaging methods for moment restriction models. We first propose a focused information criterion (FIC) based on the generalized empirical likelihood (GEL) estimator [1,2], which nests the empirical likelihood (EL) [3,4] and exponential tilting (ET) [5,6] estimators as special cases. Motivated by Claeskens and Hjort [7], we address the issue of selecting an optimal model for estimating a specific parameter of interest, rather than identifying a correct model or selecting a model with good global fit. Then, as an extension of FIC, this study presents a GEL-based frequentist model averaging (FMA) estimator that is designed to minimize the mean squared error (MSE) of the estimator.
Traditional model selection methods, such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), select a single model regardless of the specific goal of inference [8,9]. AIC selects a model that is close to the true data generating process (DGP) in terms of Kullback-Leibler discrepancy, while BIC selects the model with the highest posterior probability. However, a model with good global fit is not necessarily a good model for estimating a specific parameter. For instance, Hansen [10] considers the problem of deciding the order of autoregressive models. His simulation study demonstrates that the AIC-selected model does not necessarily produce a good estimate of the impulse response. This result reveals that the best model generally differs for different intended uses of the model.
In their seminal work, Claeskens and Hjort [7] established an FIC that is designed to select the optimal model depending on its intended use. Their goal is to select the model that attains the minimum MSE of the maximum likelihood estimator for the parameter of interest, which they call the focus parameter. The FIC is constructed from an asymptotic estimate of the MSE.
Since then, an FIC has been derived for several models. Claeskens, Croux and Kerckhoven [11] proposed an FIC for logistic regressions. Hjort and Claeskens [12] proposed an FIC for the Cox hazard regression model. Zhang and Liang [13] developed an FIC for the generalized additive partial linear model. Models studied in those papers are likelihood-based. However, econometric models are often specified via moment restrictions rather than parametric density functions. This paper indicates that the idea of Claeskens and Hjort [7] is applicable to moment restriction models. Our FIC is constructed using an asymptotic estimate of the MSE of the GEL estimator.
Model selection for moment restriction models is still underdeveloped. Andrews and Lu [14] proposed selection criteria based on the J-statistic of the generalized method of moments (GMM) estimator [15]. Hong, Preston and Shum [16] extended the results of Andrews and Lu to GEL estimation. Sueishi [17] developed information criteria similar to the AIC. The goal of Andrews and Lu [14] and Hong, Preston and Shum [16] is to identify the correct model, whereas Sueishi [17] selects the best approximating model in terms of Cressie-Read discrepancy. Although these criteria are useful in many applications, they do not address the issue of selecting the model that best serves its intended purpose.
Model averaging is an alternative to model selection. Inference after model selection is typically conducted as if the selected model is the true DGP. However, this ignores uncertainty introduced by model selection. Rather than conditioning on the single selected model, the averaging technique uses all candidate models to incorporate model selection uncertainty. Although Bayesian methods are predominant in the literature [18], there is also a growing FMA literature for likelihood-based models [19,20,21]. See also Yang [22], Leung and Barron [23] and Goldenshluger [24] for related issues.
In the FMA literature, it is often of particular interest to obtain an optimal averaging estimator in terms of a certain loss [25,26,27,28]. This study investigates a GEL-based averaging method that minimizes the asymptotic mean squared error in a framework similar to that of Hjort and Claeskens [21]. A simulation study indicates that our averaging estimator outperforms existing post-model-selection estimators.
Although this study investigates GEL-based methods, its results apply readily to the two-step GMM estimator, because they rely only on first-order asymptotic theory. However, the two-step GMM estimator often suffers from a large bias that is not captured by first-order asymptotics, even if the model is correctly specified. Because the FIC addresses a trade-off between misspecification bias and estimation variance, the GEL estimator is more suitable for our framework.
We now review related work. DiTraglia [29] proposes an instrument selection criterion for GMM that is based on the concept of the FIC. Our approach resembles DiTraglia's, but his interest is instrument selection, whereas ours is model selection: DiTraglia intentionally uses a large but possibly invalid set of instruments to improve efficiency; we intentionally use a small, misspecified model to improve efficiency. Liu [30] proposes an averaging estimator for the linear regression model using a local asymptotic framework. Although Liu considers exogenous regressors, we allow endogenous regressors. Martins and Gabriel [31] consider GMM-based model averaging estimators under a framework different from ours.
The remainder of the paper is organized as follows. Section 2 describes our local misspecification framework. Section 3 derives the FIC. Section 4 discusses the FMA estimator. Section 5 provides a simple example to which our methods are applicable. Section 6 presents the results of a Monte Carlo study. Section 7 concludes.

2. Local Misspecification Framework

We first introduce our setup. The basic construction follows Claeskens and Hjort [7]. There are a smallest and a largest model in our set of candidate models. The smallest, which we call the reduced model, has a $p$-dimensional unknown parameter vector, $\theta = (\theta_1, \dots, \theta_p)'$. The largest, or the full model, has an additional $q$-dimensional unknown parameter vector, $\gamma = (\gamma_1, \dots, \gamma_q)'$. The full model is assumed to be correctly specified and nests the reduced model; i.e., the reduced model corresponds to the special case of the full model in which $\gamma = \gamma_0 = (\gamma_{0,1}, \dots, \gamma_{0,q})'$ for some known $\gamma_0$. Typically, $\gamma_0$ is a vector of zeros: $\gamma_0 = (0, \dots, 0)'$. An example is given in Section 5.
There are up to $2^q$ submodels, all of which have $\theta$ as the common parameter vector. A submodel treats some elements of $\gamma$ as unknown parameters and is indexed by a subset, $S$, of $\{1, \dots, q\}$. The model, S, contains the parameters $\gamma_j$ such that $j \in S$. Thus, the reduced and full models correspond to $S = \emptyset$ and $S = \{1, \dots, q\}$, respectively. We use "red" and "full" to denote the reduced and full models, respectively.
The focus parameter, $\mu$, which is the parameter of interest, is a function of $\theta$ and $\gamma$: $\mu = \mu(\theta, \gamma)$. It could be merely an element of $\theta$. Prior knowledge or economic theory suggests that $\theta$ should be estimated, but we are unsure which elements of $\gamma$ should be treated as unknown parameters. Estimating a larger model usually implies a smaller modeling bias but a larger estimation variance. However, if the reduced model is globally misspecified, in the sense that the violation of the moment restriction does not disappear even in the limit, then the misspecification bias asymptotically dominates the variance of the GEL estimator. Thus, we cannot make a reasonable comparison of bias and variance in the asymptotic framework.
A local misspecification framework is introduced to take the bias-variance trade-off into account. Let $y_1, \dots, y_n$ be i.i.d. random vectors from an unknown density, $f_n(y)$, which depends on the sample size, $n$.¹ The functional form of $f_n(y)$ is not specified. The full model is defined via the following moment restriction:
$$E_n\big[m(y_i, \theta_0, \gamma_0 + \delta/\sqrt{n})\big] \equiv \int m(y, \theta_0, \gamma_0 + \delta/\sqrt{n}) f_n(y)\, dy = 0, \qquad (1)$$
where $m : \mathbb{R}^{d_y} \times \Theta \times \Gamma \to \mathbb{R}^l$ is a vector-valued function that is known up to the parameters. For each $n$, the true parameter values of $\theta$ and $\gamma$ are $\theta_0$ and $\gamma_0 + \delta/\sqrt{n}$, respectively. Note that $\gamma_0$ is known, but $\theta_0$ and $\delta$ are unknown. We assume that $l > p + q$; i.e., the model is over-identified.
The moment function of the reduced model is $m(y, \theta, \gamma_0)$. The reduced model is misspecified in the sense that there is no value, $\theta^* \in \Theta$, such that $E_n[m(y_i, \theta^*, \gamma_0)] = 0$ for any fixed $n$. However, if the moment function is differentiable with respect to $\gamma$, then (1) implies that the reduced model satisfies:
$$E_n\big[m(y_i, \theta_0, \gamma_0)\big] = -E_n\bigg[\frac{\partial m(y_i, \theta_0, \bar\gamma)}{\partial \gamma'}\bigg]\delta/\sqrt{n} = O(1/\sqrt{n})$$
for some vector, $\bar\gamma$, between $\gamma_0$ and $\gamma_0 + \delta/\sqrt{n}$. Thus, even though the moment restriction is violated at $(\theta_0, \gamma_0)$, the violation disappears in the limit. A similar relationship also holds for the other submodels. As the next section reveals, under this framework, the squared bias and the variance of the GEL estimator are both of order $O(1/n)$. Hence, the trade-off between bias and variance can be considered. If $\delta$ is sufficiently small, it might be better to set $\gamma = \gamma_0$ rather than estimate $\gamma$.
In general, the dimension of the moment function can differ among submodels. For instance, consider a linear instrumental variable model. The model (structural form) can be estimated as long as the number of instruments exceeds or equals the number of unknown parameters. Thus, it is possible to use only a subset of instruments to estimate a submodel. For ease of exposition, however, we consider only the case where the dimension of the moment function is fixed for all submodels.

3. Focused Information Criterion

To construct an FIC, we first derive the asymptotic distribution of the GEL estimator under the local misspecification framework. Newey [32] and Hall [33] obtained similar results for the GMM estimator when analyzing the local power properties of specification tests.
A model, S, contains $p + q_S$ unknown parameters, where $q_S$ is the number of elements of $S$. The moment function of the model is denoted as $m(y, \theta, \gamma_S) = m(y, \theta, \gamma_S, \gamma_{0,S^C})$, where $S^C$ is the complement of $S$. The values of $\gamma_j$ are set to their null values, $\gamma_{0,j}$, for $j \in S^C$.
Let $\rho(v)$ be a concave function on its domain, $\mathcal{V}$, which is an open interval containing zero. We normalize $\rho(v)$ so that $\rho_1(0) = \rho_2(0) = -1$, where $\rho_j(v) = d^j \rho(v)/dv^j$. The GEL estimator of $(\theta, \gamma_S)$ is obtained by solving the saddle-point problem:
$$(\hat\theta_S, \hat\gamma_S) = \arg\min_{\theta \in \Theta,\, \gamma_S \in \Gamma_S}\ \max_{\tau \in \mathcal{T}}\ \frac{1}{n}\sum_{i=1}^n \rho\big(\tau' m(y_i, \theta, \gamma_S)\big),$$
where $\Gamma_S \subset \mathbb{R}^{q_S}$ is the parameter space of $\gamma_S$ and $\mathcal{T} \subset \mathbb{R}^l$ is the set of feasible values of $\tau$. The EL and ET estimators are special cases with $\rho(v) = \log(1 - v)$ and $\rho(v) = -\exp(v)$, respectively. Although $\hat\theta_S$ has $p$ elements for any $S$, we adopt the subscript, $S$, to emphasize that the value of the estimator depends on $S$.
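To make the estimation procedure concrete, the following minimal sketch solves the saddle-point problem for the EL case, $\rho(v) = \log(1 - v)$, by profiling out the inner maximization over $\tau$. The `moments` interface (returning an $n \times l$ matrix) and the derivative-free solver are illustrative assumptions, not the author's implementation; a gradient-based inner solver would be preferable in practice.

```python
# A minimal sketch of the GEL saddle-point problem for the EL case.
import numpy as np
from scipy.optimize import minimize

def profiled_objective(params, moments, y):
    """Inner maximization over tau for fixed (theta, gamma_S)."""
    m = moments(y, params)                  # n x l moment evaluations (assumed interface)
    l = m.shape[1]

    def neg_inner(tau):
        v = m @ tau
        if np.any(v >= 1.0):                # enforce 1 - tau'm_i > 0 (EL domain)
            return np.inf
        return -np.mean(np.log(1.0 - v))

    inner = minimize(neg_inner, np.zeros(l), method="Nelder-Mead")
    return -inner.fun                       # = max_tau (1/n) sum rho(tau'm_i)

def gel_estimate(moments, y, start):
    """Outer minimization over the model parameters."""
    outer = minimize(lambda p: profiled_objective(p, moments, y), start,
                     method="Nelder-Mead")
    return outer.x
```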
Let $m_i = m(y_i, \theta_0, \gamma_0)$, $m_{\theta i} = \partial m(y_i, \theta, \gamma)/\partial \theta'\big|_{\theta = \theta_0, \gamma = \gamma_0}$ and $m_{\gamma i} = \partial m(y_i, \theta, \gamma)/\partial \gamma'\big|_{\theta = \theta_0, \gamma = \gamma_0}$. Furthermore, let $m_{\gamma_S i} = \partial m(y_i, \theta, \gamma)/\partial \gamma_S'\big|_{\theta = \theta_0, \gamma = \gamma_0}$. We define:
$$J_S = \begin{pmatrix} J_{00} & J_{01,S} \\ J_{10,S} & J_{11,S} \end{pmatrix} = \begin{pmatrix} E[m_{\theta i}]' E[m_i m_i']^{-1} E[m_{\theta i}] & E[m_{\theta i}]' E[m_i m_i']^{-1} E[m_{\gamma_S i}] \\ E[m_{\gamma_S i}]' E[m_i m_i']^{-1} E[m_{\theta i}] & E[m_{\gamma_S i}]' E[m_i m_i']^{-1} E[m_{\gamma_S i}] \end{pmatrix},$$
where $E$ denotes the expectation with respect to $f(y) \equiv \lim_{n \to \infty} f_n(y)$. It is assumed that $f(y)$ satisfies:
$$E[m_i] = \int m(y, \theta_0, \gamma_0) f(y)\, dy = 0.$$
For the full model, we denote:
$$J_{\text{full}} = \begin{pmatrix} J_{00} & J_{01} \\ J_{10} & J_{11} \end{pmatrix}.$$
Then, we can write $J_{01,S} = J_{01} \pi_S'$ and $J_{11,S} = \pi_S J_{11} \pi_S'$, where $\pi_S$ is the $q_S \times q$ projection matrix that maps $\gamma$ to the subvector, $\gamma_S$: $\pi_S \gamma = \gamma_S$.
Let $\hat{Q}_n(\theta, \gamma, \tau) = n^{-1} \sum_{i=1}^n \rho(\tau' m(y_i, \theta, \gamma))$ and $Q_n(\theta, \gamma, \tau) = E_n[\rho(\tau' m(y_i, \theta, \gamma))]$. Furthermore, let $Q(\theta, \gamma, \tau) = E[\rho(\tau' m(y_i, \theta, \gamma))]$. To obtain the asymptotic distribution of the GEL estimator, we impose the following conditions:
Assumption 3.1
1. $\Theta \subset \mathbb{R}^p$, $\Gamma \subset \mathbb{R}^q$, and $\mathcal{T} \subset \mathbb{R}^l$ are compact.
2. $m(y, \theta, \gamma)$ is continuous in $\theta \in \Theta$ and $\gamma \in \Gamma$ for almost every $y$.
3. $\sup_{\theta \in \Theta, \gamma \in \Gamma, \tau \in \mathcal{T}} |\hat{Q}_n(\theta, \gamma, \tau) - Q_n(\theta, \gamma, \tau)| \to_p 0$ under the sequence of $f_n(y)$.
4. $Q_n(\theta, \gamma, \tau) - Q(\theta, \gamma, \tau) \to 0$ as $n \to \infty$ for all $\theta \in \Theta$, $\gamma \in \Gamma$, and $\tau \in \mathcal{T}$.
5. $E[m(y_i, \theta, \gamma) m(y_i, \theta, \gamma)']$ is nonsingular for all $\theta \in \Theta$ and $\gamma \in \Gamma$.
6. $(\theta_0, \gamma_0)$ is the unique solution to $E[m(y_i, \theta, \gamma)] = 0$, and $(\theta_0, \gamma_0) \in \operatorname{int}(\Theta \times \Gamma)$.
7. $\rho(v)$ is twice continuously differentiable in a neighborhood of zero.
8. $E[m_{\theta i}]$ and $E[m_{\gamma i}]$ are of full rank.
9. $\sup_n E_n[\|m_i\|^{2+\alpha}] < \infty$ for some $\alpha > 0$.
10. $m(y, \theta, \gamma)$ is continuously differentiable in $\theta$ and $\gamma$ in a neighborhood, $\mathcal{N}$, of $(\theta_0, \gamma_0)$.
11. $\sup_{(\theta, \gamma) \in \mathcal{N}} \| n^{-1} \sum_{i=1}^n \partial m(y_i, \theta, \gamma)/\partial \theta' - E_n[\partial m(y_i, \theta, \gamma)/\partial \theta'] \| \to_p 0$ and $\sup_{(\theta, \gamma) \in \mathcal{N}} \| n^{-1} \sum_{i=1}^n \partial m(y_i, \theta, \gamma)/\partial \gamma' - E_n[\partial m(y_i, \theta, \gamma)/\partial \gamma'] \| \to_p 0$ under the sequence of $f_n(y)$.
12. $E_n[m_{\theta i}] - E[m_{\theta i}] \to 0$ and $E_n[m_{\gamma i}] - E[m_{\gamma i}] \to 0$ as $n \to \infty$.
13. $E_n[m_i m_i'] - E[m_i m_i'] \to 0$ as $n \to \infty$.
These conditions are rather high-level and strong; some of them can be replaced with weaker, more primitive conditions [34].
We obtain the following lemma.
Lemma 3.1 Suppose Assumption 3.1 holds. Then, under the sequence of $f_n(y)$, we have:
$$\sqrt{n}\begin{pmatrix} \hat\theta_S - \theta_0 \\ \hat\gamma_S - \gamma_{0,S} \end{pmatrix} \to_d N\bigg(J_S^{-1}\begin{pmatrix} J_{01} \\ \pi_S J_{11} \end{pmatrix}\delta,\ J_S^{-1}\bigg).$$
The proof is given in the Appendix.
If the model, S, is correctly specified, then the limiting distribution of the GEL estimator is N ( 0 , J S 1 ) . Therefore, as usual, local misspecification affects only the mean of the limiting distribution.
Next, we derive the asymptotic distribution of the GEL estimator of the focus parameter. Some additional notation is needed. Let $Q = (J_{11} - J_{10} J_{00}^{-1} J_{01})^{-1}$ and $Q_S = (\pi_S Q^{-1} \pi_S')^{-1}$; i.e., $Q$ and $Q_S$ are the lower-right blocks of $J_{\text{full}}^{-1}$ and $J_S^{-1}$, respectively. Let $G_S = \pi_S' Q_S \pi_S Q^{-1}$. We assume that $\mu(\theta, \gamma)$ is differentiable with respect to $\theta$ and $\gamma$. Let:
$$w = J_{10} J_{00}^{-1} \frac{\partial \mu}{\partial \theta} - \frac{\partial \mu}{\partial \gamma} \qquad \text{and} \qquad \tau_0^2 = \bigg(\frac{\partial \mu}{\partial \theta}\bigg)' J_{00}^{-1} \frac{\partial \mu}{\partial \theta},$$
where the partial derivatives are evaluated at $(\theta_0, \gamma_0)$. The true value of the focus parameter is denoted as $\mu_{\text{true}} = \mu(\theta_0, \gamma_0 + \delta/\sqrt{n})$. Moreover, the GEL estimator of $\mu$ for the model, S, is denoted as $\hat\mu_S = \mu(\hat\theta_S, \hat\gamma_S, \gamma_{0,S^C})$. Lemma 3.1 and the delta method imply the following theorem:
Theorem 3.1 Suppose Assumption 3.1 holds. Then, under the sequence of $f_n(y)$, we have:
$$D_n \equiv \sqrt{n}(\hat\gamma_{\text{full}} - \gamma_0) \to_d D \sim N(\delta, Q)$$
and:
$$\sqrt{n}\big(\hat\mu_S - \mu_{\text{true}}\big) \to_d \Lambda_S \equiv \Lambda_0 + w'(\delta - G_S D),$$
where $\Lambda_0 \sim N(0, \tau_0^2)$ is independent of $D$.
The proof is almost the same as that of Lemma 3.3 in Hjort and Claeskens [21], so it is omitted.
Because $G_{\text{full}} = I$ and $G_{\text{red}} = 0$, as special cases of the theorem, we have:
$$\sqrt{n}\big(\hat\mu_{\text{full}} - \mu_{\text{true}}\big) \to_d N(0, \tau_0^2 + w'Qw), \qquad \sqrt{n}\big(\hat\mu_{\text{red}} - \mu_{\text{true}}\big) \to_d N(w'\delta, \tau_0^2).$$
Therefore, in terms of the asymptotic MSE, the reduced model is better than the full model if $(w'\delta)^2 < w'Qw$, which is the case when the deviation of the reduced model from the true DGP is small.
More generally, Theorem 3.1 implies that the MSE of the limiting distribution of $\hat\mu_S$ is:
$$\mathrm{mse}(S, \delta) = \tau_0^2 + w' \pi_S' Q_S \pi_S w + w'(I - G_S)\delta\delta'(I - G_S)' w. \qquad (2)$$
The idea behind FIC is to estimate (2) for each model and select the model that attains the minimum estimated MSE.
All components in (2) except $\delta$ can be estimated easily by using their sample analogs. However, a consistent estimator for $\delta$ is unavailable, because $D_n$ converges in distribution to a normal random variable. This difficulty is inevitable as long as we utilize the local misspecification framework. Because the mean of $DD'$ is $\delta\delta' + Q$, following Claeskens and Hjort [7], we use $D_n D_n' - \hat{Q}$ to estimate $\delta\delta'$. Then, the sample counterpart of (2) is:
$$\widehat{\mathrm{mse}}(S) = \hat\tau_0^2 + \hat{w}' \pi_S' \hat{Q}_S \pi_S \hat{w} + \hat{w}'(I - \hat{G}_S)(D_n D_n' - \hat{Q})(I - \hat{G}_S)' \hat{w} = \hat{w}'(I - \hat{G}_S) D_n D_n' (I - \hat{G}_S)' \hat{w} + 2\hat{w}' \pi_S' \hat{Q}_S \pi_S \hat{w} + \hat\tau_0^2 - \hat{w}' \hat{Q} \hat{w},$$
which is an asymptotically unbiased estimator for (2). Because the last two terms do not depend on the model, we can ignore them for the purpose of model selection. Let $\hat\psi_{\text{full}} = \hat{w}' D_n$ and $\hat\psi_S = \hat{w}' \hat{G}_S D_n$. Then, our FIC for the model, S, is:
$$\mathrm{FIC}_S = \hat{w}'(I - \hat{G}_S) D_n D_n' (I - \hat{G}_S)' \hat{w} + 2 \hat{w}' \pi_S' \hat{Q}_S \pi_S \hat{w} = (\hat\psi_{\text{full}} - \hat\psi_S)^2 + 2 \hat{w}_S' \hat{Q}_S \hat{w}_S, \qquad (3)$$
where $\hat{w}_S = \pi_S \hat{w}$. The bigger the model, the smaller the first term and the larger the second term in (3). Since $w$ depends on $\mu$, the FIC can be used to select an appropriate submodel, depending on the parameter of interest.
Although we consider only the case where μ is a scalar, our FIC is also applicable to a vector-valued focus parameter by viewing each element of the vector as a different scalar-valued focus parameter. Different models might be used to estimate different elements of the vector.
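As an illustration of how $\mathrm{FIC}_S$ in (3) could be computed from plug-in estimates, consider the following hedged sketch. The inputs `w_hat`, `Q_hat` and `D_n`, and the representation of $S$ as a list of indices, are assumptions about how the pieces would be stored after fitting the full model; this is not the author's code.

```python
# A sketch of computing FIC_S in (3) from plug-in estimates.
import numpy as np

def fic(S, w_hat, Q_hat, D_n):
    """S: indices (subset of range(q)) of the gamma components estimated."""
    q = len(w_hat)
    pi_S = np.eye(q)[S, :]                                  # q_S x q projection matrix
    Q_S = np.linalg.inv(pi_S @ np.linalg.inv(Q_hat) @ pi_S.T)
    G_S = pi_S.T @ Q_S @ pi_S @ np.linalg.inv(Q_hat)
    psi_full = w_hat @ D_n                                  # psi_hat_full
    psi_S = w_hat @ G_S @ D_n                               # psi_hat_S
    w_S = pi_S @ w_hat
    return (psi_full - psi_S) ** 2 + 2.0 * w_S @ Q_S @ w_S

# Model selection: evaluate fic(S, ...) over all candidate S and pick the minimizer.
```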
We conclude this section with a remark on the estimation of $\delta\delta'$. Because we estimate $\delta\delta'$ by $D_n D_n' - \hat{Q}$, the estimate can be negative definite in finite samples, which means that the squared-bias term can be negative. To avoid such cases, as suggested by Claeskens and Hjort [35], we can also use the following bias-corrected FIC:
$$\mathrm{FIC}_S^* = \begin{cases} \mathrm{FIC}_S & \text{if } N_n(S) \text{ does not take place} \\ \hat{w}'(I + \hat{G}_S)\hat{Q}\hat{w} & \text{if } N_n(S) \text{ takes place}, \end{cases}$$
where $N_n(S)$ is the event of negligible bias:
$$\big\{\hat{w}'(I - \hat{G}_S)\hat\delta_{\text{full}}\big\}^2 < \hat{w}'\big(\hat{Q} - \pi_S' \hat{Q}_S \pi_S\big)\hat{w}.$$
See Section 6.4 of Claeskens and Hjort [35] for details.
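A sketch of the bias-corrected rule might look as follows; taking $\hat\delta_{\text{full}} = D_n$ as the estimate of $\delta$ in the event $N_n(S)$ is an assumption consistent with the discussion above, and the snippet reuses `fic()` from the previous sketch.

```python
# A sketch of the bias-corrected criterion FIC*_S (assumes numpy and fic()).
def fic_star(S, w_hat, Q_hat, D_n):
    q = len(w_hat)
    pi_S = np.eye(q)[S, :]
    Q_S = np.linalg.inv(pi_S @ np.linalg.inv(Q_hat) @ pi_S.T)
    G_S = pi_S.T @ Q_S @ pi_S @ np.linalg.inv(Q_hat)
    I = np.eye(q)
    sq_bias = (w_hat @ (I - G_S) @ D_n) ** 2                # delta_hat_full = D_n (assumption)
    cutoff = w_hat @ (Q_hat - pi_S.T @ Q_S @ pi_S) @ w_hat
    if sq_bias < cutoff:                                    # event N_n(S): negligible bias
        return w_hat @ (I + G_S) @ Q_hat @ w_hat
    return fic(S, w_hat, Q_hat, D_n)
```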

4. Model Averaging

This section extends the result of Section 3 to the averaging problem. In the FMA literature, it is often of particular interest to obtain an optimal averaging estimator in terms of a certain loss. We consider the possibility of obtaining the best averaging weights, i.e., those that minimize the asymptotic MSE in the local misspecification framework. A similar analysis is presented in Liu [30] for the linear regression case.
Let $\mathcal{A}$ be the set of all candidate models. We consider an averaging estimator of the focus parameter of the form:
$$\hat\mu = \sum_{S \in \mathcal{A}} c(S) \hat\mu_S,$$
where the weights, $c(S)$, add up to unity. Note that a post-selection estimator of $\mu$ can also be written in this form. Let $S_{\mathrm{FIC}}$ be the FIC-selected model. Then the post-selection estimator based on the FIC is:
$$\hat\mu_{\mathrm{FIC}} = \sum_{S \in \mathcal{A}} 1(S = S_{\mathrm{FIC}}) \hat\mu_S,$$
where 1 ( · ) is the indicator function. Thus, the post-selection estimator is a special case of the averaging estimator.
If the weights are not random, then it is straightforward from Theorem 3.1 that:
$$\sqrt{n}(\hat\mu - \mu_{\text{true}}) \to_d \Lambda \equiv \sum_{S \in \mathcal{A}} c(S) \Lambda_S \overset{d}{=} \Lambda_0 + w'\big(\delta - \hat\delta(D)\big),$$
where $\hat\delta(D) = \sum_{S \in \mathcal{A}} c(S) G_S D$. Therefore, the asymptotic mean and variance of the averaging estimator are given by:
$$E[\Lambda] = \sum_{S \in \mathcal{A}} c(S)\, w'(I - G_S)\delta, \qquad \operatorname{Var}[\Lambda] = \tau_0^2 + \sum_{S, S' \in \mathcal{A}} c(S) c(S')\, w' G_S Q G_{S'}' w.$$
Thus, there is a set of weights that minimizes the asymptotic MSE of $\hat\mu$.
Suppose there are $M$ candidate models: $S_1, \dots, S_M$. Let $C = (c(S_1), \dots, c(S_M))'$ be a vector of averaging weights, which lies in the unit simplex in $\mathbb{R}^M$:
$$\mathcal{H} = \Big\{ C \in [0,1]^M : \sum_{i=1}^M c(S_i) = 1 \Big\}.$$
Ignoring $\tau_0^2$, which does not depend on the model, the optimal weight vector, $C^*$, that minimizes the asymptotic MSE is:
$$C^* = \arg\min_{C \in \mathcal{H}} C' A C,$$
where $A$ is the $M \times M$ matrix whose $(i,j)$ element is given by:
$$A[i,j] = w'(I - G_{S_i}) \delta \delta' (I - G_{S_j})' w + w' G_{S_i} Q G_{S_j}' w.$$
If we replace $A$ with an appropriate estimate, $\hat{A}$, we obtain a feasible estimator:
$$\hat{C} = \arg\min_{C \in \mathcal{H}} C' \hat{A} C. \qquad (4)$$
For instance, if we estimate $\delta\delta'$ by $D_n D_n' - \hat{Q}$, then:
$$\hat{A}[i,j] = \hat{w}'(I - \hat{G}_{S_i})(D_n D_n' - \hat{Q})(I - \hat{G}_{S_j})' \hat{w} + \hat{w}' \hat{G}_{S_i} \hat{Q} \hat{G}_{S_j}' \hat{w}.$$
Although there is no closed-form solution to (4), it can be solved numerically by a standard quadratic programming algorithm.
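For example, (4) could be handled with an off-the-shelf constrained optimizer. The following sketch uses SciPy's SLSQP solver with a uniform starting point; both are implementation assumptions, not part of the paper.

```python
# A sketch of solving (4): min_{C in H} C' A_hat C over the unit simplex.
import numpy as np
from scipy.optimize import minimize

def optimal_weights(A_hat):
    M = A_hat.shape[0]
    constraints = ({"type": "eq", "fun": lambda C: np.sum(C) - 1.0},)
    bounds = [(0.0, 1.0)] * M               # C in the unit simplex H
    result = minimize(lambda C: C @ A_hat @ C, np.full(M, 1.0 / M),
                      method="SLSQP", bounds=bounds, constraints=constraints)
    return result.x
```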
Unfortunately, $\hat{C}$ cannot be a consistent estimator of $C^*$, because there is no consistent estimator of $A$. Suppose that $C' \hat{A} C \to_d C' \tilde{A} C$ for a random matrix, $\tilde{A}$, and for all $C \in \mathcal{H}$. Then, we have:
$$\hat{C} \to_d \tilde{C} \equiv \arg\min_{C \in \mathcal{H}} C' \tilde{A} C.$$
Thus, $\hat{C}$ is random, even in the limit.
Let $\hat{c}(S_i)$ and $\tilde{c}(S_i)$ be the $i$-th elements of $\hat{C}$ and $\tilde{C}$, respectively. Furthermore, let $\hat\mu_{\text{opt}} = \sum_{i=1}^M \hat{c}(S_i) \hat\mu_{S_i}$ denote the averaging estimator using $\hat{C}$. Because $\hat{c}(S_i)$ and $\hat\mu_{S_i}$ are both determined through $D_n$, $\hat{c}(S_i)$ and $\sqrt{n}(\hat\mu_{S_i} - \mu_{\text{true}})$ converge jointly to $\tilde{c}(S_i)$ and $\Lambda_{S_i}$. Therefore, the limiting distribution of $\hat\mu_{\text{opt}}$ is given by:
$$\sqrt{n}(\hat\mu_{\text{opt}} - \mu_{\text{true}}) \to_d \sum_{i=1}^M \tilde{c}(S_i) \Lambda_{S_i} \overset{d}{=} \Lambda_0 + w'\Big(\delta - \sum_{i=1}^M \tilde{c}(S_i) G_{S_i} D\Big). \qquad (5)$$
Because the weights are random, the limiting distribution is no longer normal. Thus, (5) is not readily applicable for inference. However, as suggested by Hjort and Claeskens [21], (5) implies that:
$$\frac{1}{\hat\kappa}\bigg[\sqrt{n}(\hat\mu_{\text{opt}} - \mu_{\text{true}}) - \hat{w}'\Big(D_n - \sum_{i=1}^M \hat{c}(S_i) \hat{G}_{S_i} D_n\Big)\bigg] \to_d N(0, 1),$$
where $\hat\kappa$ is a consistent estimator of $(\tau_0^2 + w' Q w)^{1/2}$. This result can be used to construct a confidence interval for $\mu_{\text{true}}$.
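A sketch of the resulting interval is given below, assuming the plug-in quantities ($\hat{w}$, $D_n$, the matrices $\hat{G}_{S_i}$, the weights $\hat{c}(S_i)$ and $\hat\kappa$) are available from the full-model fit; all names are illustrative.

```python
# A sketch of the confidence interval implied by the studentized limit above.
import numpy as np
from scipy.stats import norm

def averaging_ci(mu_opt, w_hat, D_n, G_hat_list, c_hat, kappa_hat, n, level=0.95):
    # bias-correction term: w_hat'(D_n - sum_i c_hat_i G_hat_{S_i} D_n)
    shrunk = sum(c * (G @ D_n) for c, G in zip(c_hat, G_hat_list))
    correction = w_hat @ (D_n - shrunk)
    z = norm.ppf(0.5 + level / 2.0)
    center = mu_opt - correction / np.sqrt(n)
    half_width = z * kappa_hat / np.sqrt(n)
    return center - half_width, center + half_width
```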

5. Example

This section gives a simple example to which our methods are applicable. One of the most popular models described by moment restrictions is the linear instrumental variable model. The full model we consider here is:
$$y_i = x_i'\theta + z_{1i}'\gamma + u_i, \qquad E[z_i u_i] = 0,$$
where $x_i$ and $z_{1i}$ are $p \times 1$ and $q \times 1$ vectors of explanatory variables. Some elements of $x_i$ are potentially correlated with $u_i$. The vector of instruments, $z_i$, is $l \times 1$ and may contain elements of $x_i$ and $z_{1i}$. Economic theory suggests that $x_i$ should be included in the model, but we are unsure which components of $z_{1i}$ should be included. Thus, the reduced model corresponds to the case that $\gamma = \gamma_0 = (0, \dots, 0)'$.
In this model, $J_{\text{full}}$ is given by:
$$J_{\text{full}} = \begin{pmatrix} J_{00} & J_{01} \\ J_{10} & J_{11} \end{pmatrix} = \begin{pmatrix} E[x_i z_i'] E[z_i z_i' u_i^2]^{-1} E[z_i x_i'] & E[x_i z_i'] E[z_i z_i' u_i^2]^{-1} E[z_i z_{1i}'] \\ E[z_{1i} z_i'] E[z_i z_i' u_i^2]^{-1} E[z_i x_i'] & E[z_{1i} z_i'] E[z_i z_i' u_i^2]^{-1} E[z_i z_{1i}'] \end{pmatrix}.$$
Let $\hat{u}_i$ be the residual from the full model: $\hat{u}_i = y_i - x_i'\hat\theta_{\text{full}} - z_{1i}'\hat\gamma_{\text{full}}$. Then, for instance, $J_{00}$ can be estimated by:
$$\hat{J}_{00} = \bigg(\frac{1}{n}\sum_{i=1}^n x_i z_i'\bigg) \bigg(\frac{1}{n}\sum_{i=1}^n z_i z_i' \hat{u}_i^2\bigg)^{-1} \bigg(\frac{1}{n}\sum_{i=1}^n z_i x_i'\bigg). \qquad (6)$$
Other components of $J_{\text{full}}$ can be estimated in a similar manner. It is also possible to replace the empirical probability, $n^{-1}$, with the GEL-induced probability.
If the focus parameter is the $k$-th element of $\theta$, then we have:
$$\hat{w} = \hat{J}_{10} \hat{J}_{00}^{-1} e_k,$$
where $e_k$ is the $k$-th unit vector, which has one in the $k$-th element and zeros elsewhere. On the other hand, if the focus parameter is $\mu(\theta, \gamma) = x'\theta + z_1'\gamma$ for a fixed covariate value, $(x, z_1)$, then:
$$\hat{w} = \hat{J}_{10} \hat{J}_{00}^{-1} x - z_1.$$
To obtain a good estimate of $x'\theta + z_1'\gamma$ over a range of covariate values, rather than at a single covariate value, we can utilize the idea of Claeskens and Hjort [36], who address minimizing an averaged risk over a range of covariates rather than the pointwise risk.
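To fix ideas, the following sketch computes the plug-in blocks of $\hat{J}_{\text{full}}$ and $\hat{w}$ for a focus parameter $\mu = \theta_k$ in this linear IV example. The array layout (rows of `X`, `Z1` and `Z` are observations, `u_hat` holds full-model residuals) and the function names are assumptions.

```python
# A sketch of the plug-in blocks of J_full and of w_hat in the linear IV example.
import numpy as np

def j_blocks(X, Z1, Z, u_hat):
    n = Z.shape[0]
    W = np.linalg.inv((Z * u_hat[:, None] ** 2).T @ Z / n)  # (E_n[z z' u^2])^{-1}
    Sxz = X.T @ Z / n                                       # E_n[x z']
    Sgz = Z1.T @ Z / n                                      # E_n[z_1 z']
    J00 = Sxz @ W @ Sxz.T
    J01 = Sxz @ W @ Sgz.T
    J11 = Sgz @ W @ Sgz.T
    return J00, J01, J01.T, J11                             # J10 = J01'

def w_hat_for_theta_k(J00, J10, k):
    e_k = np.zeros(J00.shape[0])
    e_k[k] = 1.0                                            # k-th unit vector
    return J10 @ np.linalg.inv(J00) @ e_k
```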

6. Monte Carlo Study

We now investigate the performance of post-selection and averaging estimators in a simple Monte Carlo study. Our EL-based methods are compared with the EL-based selection methods of Hong, Preston and Shum [16]. The following post-selection and averaging estimators are considered: (i) AIC-like model selection, (ii) BIC-like model selection, (iii) FIC model selection and (iv) an averaging estimator whose weights are given by (4). The AIC- and BIC-like criteria are those proposed by Hong, Preston and Shum [16] and are given by:
$$\mathrm{AIC}_S = 2\sum_{i=1}^n \log\big(1 - \hat\tau_S' m(y_i, \hat\theta_S, \hat\gamma_S)\big) - 2(l - p - q_S), \qquad \mathrm{BIC}_S = 2\sum_{i=1}^n \log\big(1 - \hat\tau_S' m(y_i, \hat\theta_S, \hat\gamma_S)\big) - (l - p - q_S)\log n.$$
We use (6) to estimate J.
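For concreteness, the criteria could be computed from the fitted GEL quantities as in the following sketch, where `m_fitted` denotes the $n \times l$ matrix of moments evaluated at $(\hat\theta_S, \hat\gamma_S)$ and `tau_hat` the fitted $\hat\tau_S$; both are assumed outputs of the estimation step.

```python
# A sketch of the EL-based AIC/BIC-like criteria for a submodel S.
import numpy as np

def el_criteria(m_fitted, tau_hat, l, p, q_S):
    n = m_fitted.shape[0]
    lr = 2.0 * np.sum(np.log(1.0 - m_fitted @ tau_hat))     # EL ratio-type term
    aic = lr - 2.0 * (l - p - q_S)
    bic = lr - (l - p - q_S) * np.log(n)
    return aic, bic
```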
We consider the linear instrumental variable model. The DGP is specified by the following equations:
$$y_i = \theta_0 + \theta_1 x_i + \theta_2 z_{1i} + \sum_{k=1}^4 \gamma_{kn} z_{k+1,i} + u_i, \qquad x_i = 0.3 z_{6i} + 0.2 z_{7i} + 0.5 u_i,$$
where $(\theta_0, \theta_1, \theta_2) = (1, 1, 1)$ and $(\gamma_{1n}, \gamma_{2n}, \gamma_{3n}, \gamma_{4n}) = \delta/\sqrt{n}$ for some vector, $\delta = (\delta_1, \delta_2, \delta_3, \delta_4)$. The exogenous variables, $z_{1i}, \dots, z_{7i}$, are normally distributed with mean zero and variance one, and the correlation between $z_{ki}$ and $z_{li}$ is $0.5^{|k-l|}$ for $k \neq l$. The vector of instruments is fixed at $z_i = (1, z_{1i}, \dots, z_{7i})'$. The error term, $u_i$, is independent of $z_{1i}, \dots, z_{7i}$ and is generated from a standard normal distribution. Thus, the moment restriction for the full model is:
$$E_n\bigg[z_i \Big(y_i - \theta_0 - \theta_1 x_i - \theta_2 z_{1i} - \sum_{k=1}^4 \gamma_{kn} z_{k+1,i}\Big)\bigg] = 0.$$
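A sketch of one draw from this DGP is given below; the Cholesky construction of the correlated exogenous variables is an implementation detail, not taken from the paper.

```python
# A sketch generating one Monte Carlo sample from the DGP above.
import numpy as np

def simulate(n, delta, rng):
    gamma_n = np.asarray(delta) / np.sqrt(n)
    # (z_1,...,z_7) ~ N(0, Sigma) with Sigma[k,l] = 0.5**|k-l|
    Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(7), np.arange(7)))
    Z_ex = rng.standard_normal((n, 7)) @ np.linalg.cholesky(Sigma).T
    u = rng.standard_normal(n)
    x = 0.3 * Z_ex[:, 5] + 0.2 * Z_ex[:, 6] + 0.5 * u       # z_6, z_7 columns
    y = 1.0 + x + Z_ex[:, 0] + Z_ex[:, 1:5] @ gamma_n + u   # theta = (1, 1, 1)
    z = np.column_stack([np.ones(n), Z_ex])                 # instruments (1, z_1,...,z_7)
    return y, x, Z_ex, z

y, x, Z_ex, z = simulate(50, [1.0, 0.75, 0.5, 0.25], np.random.default_rng(0))
```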
Table 1. Estimation results; DGP, data generating process; AIC, Akaike information criterion; BIC, Bayesian information criterion; FIC, focused information criterion.
                     DGP (1)   DGP (2)   DGP (3)   DGP (4)
Full       Bias      -0.104    -0.109    -0.089    -0.076
           Std        0.544     0.533     0.509     0.489
           RMSE       0.554     0.544     0.516     0.495
Reduced    Bias      -0.279    -0.057    -0.148    -0.048
           Std        0.780     0.473     0.955     0.448
           RMSE       0.828     0.477     0.965     0.450
AIC        Bias      -0.113    -0.099    -0.101    -0.079
           Std        0.559     0.557     0.497     0.509
           RMSE       0.570     0.566     0.507     0.515
BIC        Bias      -0.136    -0.088    -0.104    -0.073
           Std        0.689     0.552     0.499     0.502
           RMSE       0.702     0.559     0.510     0.507
FIC        Bias      -0.139    -0.095    -0.112    -0.076
           Std        0.530     0.509     0.464     0.452
           RMSE       0.548     0.517     0.477     0.458
Averaging  Bias      -0.139    -0.092    -0.107    -0.074
           Std        0.511     0.476     0.455     0.444
           RMSE       0.529     0.484     0.468     0.450
The focus parameter is $\mu = \theta_1$. In many applications, the only parameter of interest in a linear model is the coefficient on the endogenous regressor; exogenous regressors are included simply to avoid omitted variable bias. Thus, if the bias is small, it may be better to exclude some regressors to reduce the variance. In this simulation, we include the constant term, $x_i$ and $z_{1i}$ in all candidate models, but some elements of $(z_{2i}, z_{3i}, z_{4i}, z_{5i})$ may be excluded; that is, some elements of $(\gamma_{1n}, \gamma_{2n}, \gamma_{3n}, \gamma_{4n})$ are set to zero. Therefore, there are $2^4 = 16$ submodels in total.
To evaluate the performance of the post-selection and averaging estimators, we calculate the bias, standard deviation and root MSE (RMSE) of each estimator over 1,000 repetitions. For reference, we also report the results for the full and reduced models. The sample size is $n = 50$.² We consider four DGPs: (1) $\delta = (1, 1, 1, 1)$, (2) $\delta = (1/8, 1/8, 1/8, 1/8)$, (3) $\delta = (1, 3/4, 1/2, 1/4)$ and (4) $\delta = (1/4, 3/16, 1/8, 1/16)$. DGPs (1) and (3) are favorable for the full model, while (2) and (4) are favorable for the reduced model. The results are summarized in Table 1.
Table 1 indicates that there are cases where we should avoid using the full model, even though it is correctly specified. The full model performs worse than the FIC-selected model for all DGPs. As the theory suggests, the efficiency gain of the FIC over the full model is large when $\delta$ is small. The averaging estimator outperforms all post-selection estimators, including the FIC-based one. Consistent with findings in the literature, averaging is a useful way to reduce the risk of the estimator.

7. Conclusions

This paper studied GEL-based model selection and averaging methods that are designed to obtain an efficient estimator for the parameter of interest. We modified the local misspecification framework of Claeskens and Hjort [7], so that an FIC can be obtained for moment restriction models. Then, we proposed the averaging estimator by extending the idea of FIC.
In the simulation study, we considered the model selection/averaging problem for the linear instrumental variable model. Although several methods have been advocated for selecting or averaging instruments in the literature, there are few studies on the model selection/averaging problem itself. The simulation results suggest that our averaging estimator can be a useful alternative to existing post-selection estimators.

Acknowledgments

The author thanks Ryo Okui for his comments and suggestions. The author also thanks three referees, seminar participants at the University of Tokyo and participants of a summer workshop on economic theory at Otaru University of Commerce for their comments. The author acknowledges financial support from the Japan Society for the Promotion of Science under KAKENHI 23730215.

References

  1. R.J. Smith. “Alternative Semi-Parametric Likelihood Approaches to Generalised Method of Moments Estimation.” Econ. J. 107 (1997): 503–519. [Google Scholar]
  2. W.K. Newey, and R.J. Smith. “Higher Order Properties of GMM and Generalized Empirical Likelihood Estimators.” Econometrica 72 (2004): 219–255. [Google Scholar]
  3. A.B. Owen. “Empirical Likelihood Ratio Confidence Intervals for a Single Functional.” Biometrika 75 (1988): 237–249. [Google Scholar]
  4. J. Qin, and J. Lawless. “Empirical Likelihood and General Estimating Equations.” Ann. Stat. 22 (1994): 300–325. [Google Scholar]
  5. Y. Kitamura, and M. Stutzer. “An Information-Theoretic Alternative to Generalized Method of Moments Estimation.” Econometrica 65 (1997): 861–874. [Google Scholar]
  6. G.W. Imbens, R.H. Spady, and P. Johnson. “Information Theoretic Approaches to Inference in Moment Condition Models.” Econometrica 66 (1998): 333–357. [Google Scholar]
  7. G. Claeskens, and N.L. Hjort. “The Focused Information Criterion.” J. Am. Stat. Assoc. 98 (2003): 900–916. [Google Scholar]
  8. H. Akaike. “Information Theory and an Extension of the Maximum Likelihood Principle.” In Second International Symposium on Information Theory. Edited by B.N. Petrov and F. Csáki. Budapest: Akadémiai Kiadó, 1973, pp. 267–281. [Google Scholar]
  9. G. Schwarz. “Estimating the Dimension of a Model.” Ann. Stat. 6 (1978): 461–464. [Google Scholar]
  10. B.E. Hansen. “Challenges for Econometric Model Selection.” Economet. Theor. 21 (2005): 60–68. [Google Scholar]
  11. G. Claeskens, C. Croux, and J.V. Kerckhoven. “Variable Selection for Logistic Regression Using a Prediction-Focused Information Criterion.” Biometrics 62 (2006): 972–979. [Google Scholar]
  12. N.L. Hjort, and G. Claeskens. “Focused Information Criteria and Model Averaging for the Cox Hazard Regression Model.” J. Am. Stat. Assoc. 101 (2006): 1449–1464. [Google Scholar]
  13. X. Zhang, and H. Liang. “Focused Information Criterion and Model Averaging for Generalized Additive Partial Linear Models.” Ann. Stat. 39 (2011): 174–200. [Google Scholar]
  14. D.W. Andrews, and B. Lu. “Consistent Model and Moment Selection Procedures for GMM Estimation with Application to Dynamic Panel Data Models.” J. Econometrics 101 (2001): 123–164. [Google Scholar]
  15. L.P. Hansen. “Large Sample Properties of Generalized Method of Moments Estimators.” Econometrica 50 (1982): 1029–1054. [Google Scholar]
  16. H. Hong, B. Preston, and M. Shum. “Generalized Empirical Likelihood-Based Model Selection Criteria for Moment Condition Models.” Economet. Theor. 19 (2003): 923–943. [Google Scholar]
  17. N. Sueishi. “Information Criteria for Moment Restriction Models.” Unpublished Manuscript. Kyoto University, 2013. [Google Scholar]
  18. J.A. Hoeting, D. Madigan, A.E. Raftery, and C.T. Volinsky. “Bayesian Model Averaging: A Tutorial.” Stat. Sci. 14 (1999): 382–417. [Google Scholar]
  19. S.T. Buckland, K.P. Burnham, and N.H. Augustin. “Model Selection: An Integral Part of Inference.” Biometrics 53 (1997): 603–618. [Google Scholar]
  20. K.P. Burnham, and D.R. Anderson. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer, 2002. [Google Scholar]
  21. N.L. Hjort, and G. Claeskens. “Frequentist Model Average Estimators.” J. Am. Stat. Assoc. 98 (2003): 879–899. [Google Scholar]
  22. Y. Yang. “Adaptive Regression by Mixing.” J. Am. Stat. Assoc. 96 (2001): 574–588. [Google Scholar]
  23. G. Leung, and A.R. Barron. “Information Theory and Mixing Least-Squares Regressions.” IEEE T. Inform. Theory 52 (2006): 3396–3410. [Google Scholar]
  24. A. Goldenshluger. “A Universal Procedure for Aggregating Estimators.” Ann. Stat. 37 (2009): 542–568. [Google Scholar]
  25. B.E. Hansen. “Least Squares Model Averaging.” Econometrica 75 (2007): 1175–1189. [Google Scholar]
  26. A.T.K. Wan, X. Zhang, and G. Zou. “Least Squares Model Averaging by Mallows Criterion.” J. Econometrics 156 (2010): 277–283. [Google Scholar]
  27. B.E. Hansen, and J.S. Racine. “Jackknife Model Averaging.” J. Econometrics 167 (2012): 38–46. [Google Scholar]
  28. Q. Liu, and R. Okui. “Heteroskedasticity-Robust Cp Model Averaging.” Economet. J., 2013. Forthcoming. [Google Scholar]
  29. F.J. DiTraglia. “Using Invalid Instruments on Purpose: Focused Moment Selection and Averaging for GMM.” Unpublished Manuscript. University of Pennsylvania, 2012. [Google Scholar]
  30. C.A. Liu. “A Plug-In Averaging Estimator for Regressions with Heteroskedastic Errors.” Unpublished Manuscript. National University of Singapore, 2012. [Google Scholar] [Green Version]
  31. L.F. Martins, and V.J. Gabriel. “Linear Instrumental Variables Model Averaging Estimation.” Comput. Stat. Data An., 2013. Forthcoming. [Google Scholar]
  32. W.K. Newey. “Generalized Method of Moments Specification Testing.” J. Econometrics 29 (1985): 229–256. [Google Scholar]
  33. A.R. Hall. “Hypothesis Testing in Models Estimated by Generalized Method of Moments.” In Generalized Method of Moments Estimation. Edited by L. Mátyás. Cambridge University Press, 1999, pp. 75–101. [Google Scholar]
  34. P.M. Parente, and R.J. Smith. “GEL Methods for Nonsmooth Moment Indicators.” Economet. Theor. 27 (2011): 74–113. [Google Scholar]
  35. G. Claeskens, and N.L. Hjort. Model Selection and Model Averaging. Cambridge University Press, 2008. [Google Scholar]
  36. G. Claeskens, and N.L. Hjort. “Minimizing Average Risk in Regression Models.” Economet. Theor. 24 (2008): 493–527. [Google Scholar]

A. Appendix

This appendix provides a proof of Lemma 3.1. Throughout the appendix, the symbols $\to_p$ and $\to_d$ denote convergence in probability and convergence in distribution, respectively, under the local sequence, $f_n(y)$.
Let $m_i(\theta, \gamma_S, \gamma_{S^C}) = m(y_i, \theta, \gamma_S, \gamma_{S^C})$ and $m_i(\theta, \gamma_S) = m(y_i, \theta, \gamma_S, \gamma_{0,S^C})$. We define:
$$\tau(\theta, \gamma_S, \gamma_{S^C}) = \arg\max_{\tau \in \mathcal{T}} E\big[\rho\big(\tau' m_i(\theta, \gamma_S, \gamma_{S^C})\big)\big].$$
Condition 5 implies that $\tau(\theta, \gamma_S, \gamma_{S^C})$ is continuous with respect to $(\theta, \gamma_S, \gamma_{S^C})$. Moreover:
$$\frac{\partial E[\rho(\tau' m_i(\theta_0, \gamma_{0,S}, \gamma_{0,S^C}))]}{\partial \tau}\bigg|_{\tau = 0} = \rho_1(0) E\big[m_i(\theta_0, \gamma_{0,S}, \gamma_{0,S^C})\big] = 0.$$
Thus, by the concavity of $\rho(v)$, $\tau(\theta_0, \gamma_{0,S}, \gamma_{0,S^C}) = 0$.
Let $\hat\tau_S(\theta, \gamma_S) = \arg\max_{\tau \in \mathcal{T}} n^{-1} \sum_{i=1}^n \rho(\tau' m_i(\theta, \gamma_S))$. Then, by construction:
$$\frac{1}{n}\sum_{i=1}^n \rho\big(\hat\tau_S(\theta, \gamma_S)' m_i(\theta, \gamma_S)\big) \geq \frac{1}{n}\sum_{i=1}^n \rho\big(\tau(\theta, \gamma_S, \gamma_{0,S^C})' m_i(\theta, \gamma_S)\big).$$
Also, let $L = E\big[\rho\big(\tau(\theta_0, \gamma_{0,S}, \gamma_{0,S^C})' m_i(\theta_0, \gamma_{0,S})\big)\big] = \rho(0)$. Then, Condition 6 and the saddle-point property imply that:
$$E\big[\rho\big(\tau(\theta, \gamma_S, \gamma_{0,S^C})' m_i(\theta, \gamma_S)\big)\big] > L$$
for $\theta \neq \theta_0$ and $\gamma_S \neq \gamma_{0,S}$. Let $B(\theta_0, \gamma_{0,S}, \epsilon)$ be an open ball of radius, $\epsilon$, around $(\theta_0, \gamma_{0,S})$. Conditions 1–4 imply:
$$\frac{1}{n}\sum_{i=1}^n \rho\big(\tau(\theta, \gamma_S, \gamma_{0,S^C})' m_i(\theta, \gamma_S)\big) - E\big[\rho\big(\tau(\theta, \gamma_S, \gamma_{0,S^C})' m_i(\theta, \gamma_S)\big)\big] \to_p 0$$
uniformly over $\theta \in \Theta$ and $\gamma_S \in \Gamma_S$. Thus, for any $\epsilon > 0$, there exists $\delta > 0$, such that:
$$P_n\bigg(\inf_{(\theta, \gamma_S) \in \Theta \times \Gamma_S \setminus B(\theta_0, \gamma_{0,S}, \epsilon)}\ \frac{1}{n}\sum_{i=1}^n \rho\big(\hat\tau_S(\theta, \gamma_S)' m_i(\theta, \gamma_S)\big) < L + \delta\bigg) \to 0, \qquad (7)$$
where $P_n$ is the probability under $f_n(y)$. Conditions 1–4 also imply $\hat\tau_S(\theta_0, \gamma_{0,S}) \to_p \tau(\theta_0, \gamma_{0,S}, \gamma_{0,S^C}) = 0$. Therefore, we obtain:
$$P_n\bigg(\frac{1}{n}\sum_{i=1}^n \rho\big(\hat\tau_S(\theta_0, \gamma_{0,S})' m_i(\theta_0, \gamma_{0,S})\big) > L + \delta\bigg) \to 0. \qquad (8)$$
Combining (7) and (8), we have $\hat\theta_S \to_p \theta_0$ and $\hat\gamma_S \to_p \gamma_{0,S}$. Moreover, we have:
$$\frac{1}{n}\sum_{i=1}^n \rho\big(\tau' m_i(\hat\theta_S, \hat\gamma_S)\big) - E\big[\rho\big(\tau' m_i(\theta_0, \gamma_{0,S})\big)\big] \to_p 0$$
uniformly over $\tau \in \mathcal{T}$. Thus, $\hat\tau_S \equiv \hat\tau_S(\hat\theta_S, \hat\gamma_S) \to_p \tau(\theta_0, \gamma_{0,S}, \gamma_{0,S^C}) = 0$.
Next, we derive the asymptotic distribution. The first-order conditions for $(\hat\theta_S, \hat\gamma_S, \hat\tau_S)$ are:
$$0 = \frac{1}{n}\sum_{i=1}^n \rho_1\big(\hat\tau_S' m_i(\hat\theta_S, \hat\gamma_S)\big) m_{\theta i}(\hat\theta_S, \hat\gamma_S)' \hat\tau_S, \quad 0 = \frac{1}{n}\sum_{i=1}^n \rho_1\big(\hat\tau_S' m_i(\hat\theta_S, \hat\gamma_S)\big) m_{\gamma_S i}(\hat\theta_S, \hat\gamma_S)' \hat\tau_S, \quad 0 = \frac{1}{n}\sum_{i=1}^n \rho_1\big(\hat\tau_S' m_i(\hat\theta_S, \hat\gamma_S)\big) m_i(\hat\theta_S, \hat\gamma_S),$$
where $m_{\theta i}(\theta, \gamma_S) = \partial m_i(\theta, \gamma_S)/\partial \theta'$ and $m_{\gamma_S i}(\theta, \gamma_S) = \partial m_i(\theta, \gamma_S)/\partial \gamma_S'$. By Condition 11 and the consistency of the estimator, expanding the first-order conditions around $(\theta_0, \gamma_{0,S}, 0)$, we obtain:
$$\sqrt{n}\begin{pmatrix} \hat\theta_S - \theta_0 \\ \hat\gamma_S - \gamma_{0,S} \\ \hat\tau_S \end{pmatrix} = -\begin{pmatrix} 0 & 0 & E_n[m_{\theta i}]' \\ 0 & 0 & E_n[m_{\gamma_S i}]' \\ E_n[m_{\theta i}] & E_n[m_{\gamma_S i}] & E_n[m_i m_i'] \end{pmatrix}^{-1} \begin{pmatrix} 0 \\ 0 \\ \frac{1}{\sqrt{n}}\sum_{i=1}^n m_i \end{pmatrix} + o_p(1).$$
Let $w_{in} = \eta'(m_i - E_n[m_i])$, where $\eta$ is any $l \times 1$ vector such that $\eta'\eta = 1$. Then, we have:
$$\sigma_n^2 \equiv E_n[w_{in}^2] = \eta' E_n[m_i m_i'] \eta + O(1/n)$$
and:
$$\frac{1}{\sigma_n^2} E_n\big[w_{in}^2 1\{|w_{in}| \geq \sqrt{n}\sigma_n \epsilon\}\big] = \frac{1}{\sigma_n^2} E_n\bigg[\frac{|w_{in}|^{2+\alpha}}{|w_{in}|^{\alpha}} 1\{|w_{in}| \geq \sqrt{n}\sigma_n \epsilon\}\bigg] \leq \frac{1}{n^{\alpha/2} \sigma_n^{2+\alpha} \epsilon^{\alpha}} E_n\big[|w_{in}|^{2+\alpha}\big] \to 0$$
by Condition 9. Thus, by the Lindeberg-Feller Theorem and Condition 13:
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n w_{in} \to_d N\big(0,\ \eta' E[m_i m_i'] \eta\big).$$
Furthermore, by Condition 12, we have:
$$\sqrt{n}\, E_n[m_i] = -E[m_{\gamma i}]\delta + o(1).$$
Therefore, by the Cramér–Wold device, we obtain:
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n m_i \to_d N\big({-E[m_{\gamma i}]\delta},\ E[m_i m_i']\big),$$
which implies the desired result.
¹ Although $y_1, \dots, y_n$ form a triangular array, we suppress the additional subscript, $n$, on $y$ for notational simplicity.
² Simulations were also conducted for other sample sizes. The results are not reported here, because the differences among candidate models are so small for large $n$ that the RMSEs are almost identical across models.
