Article

Parametric and Nonparametric Frequentist Model Selection and Model Averaging

Department of Economics, University of California, Riverside, CA 92521-0427, USA
* Author to whom correspondence should be addressed.
Econometrics 2013, 1(2), 157-179; https://doi.org/10.3390/econometrics1020157
Submission received: 27 June 2013 / Revised: 17 July 2013 / Accepted: 13 September 2013 / Published: 20 September 2013
(This article belongs to the Special Issue Econometric Model Selection)

Abstract

This paper presents recent developments in model selection and model averaging for parametric and nonparametric models. While there is an extensive literature on model selection under parametric settings, we present recently developed results in the context of nonparametric models. In applications, estimation and inference are often conducted under the selected model without considering the uncertainty from the selection process. This often leads to inefficient results and misleading confidence intervals. Thus an alternative to model selection is model averaging, where the estimated model is a weighted sum of all the submodels; this reduces model uncertainty. In recent years there has been significant interest in model averaging, and some important developments have taken place in this area. We present results for both the parametric and nonparametric cases. Some possible topics for future research are also indicated.

1. Introduction

Over the last several years many econometricians and statisticians have persistently devoted their efforts to finding various paths to the true model. The uncertainty in correctly specifying the regression model has generated a large literature in two major directions: first, which variables are to be included and, second, how they are related to the dependent variable in the model. Thus “what” refers to determining the variables to be included in constructing the model, and “how” refers to finding the correct functional form, e.g., parametric specifications (linear, quadratic, etc.) or, more generally, nonparametric smoothing methods that do not require specifying a parametric functional form but instead let the data search for a suitable function that describes the available data well; see [1,2] among others.
To determine “what”, model selection was first introduced, and it has a huge literature in statistics and econometrics. In fact, in recent years, model (variable) selection procedures have become more popular due to the emergence of econometric and statistical models with a large number of variables. As examples, in labor economics wage equations can have a large number of regressors [3], and in financial econometrics portfolio allocation may be among hundreds or thousands of stocks [4]. Such models raise additional challenges of econometric modeling and inference along with the selection of variables. Different tools have been developed based on various estimation criteria. The majority of such procedures involve variable selection by minimizing penalized loss functions based on the least squares and the log-likelihood, and their variants. The adjusted $R^2$ and the residual sum of squares are the usual variable selection criteria without any penalization. Among the penalized procedures we have the Akaike information criterion (AIC) [5], Mallows $C_p$ procedure [6], the Bayesian information criterion (BIC) [7], the cross-validation method [8], generalized cross-validation (GCV) [9], and the focused information criterion (FIC) [10]. We note that the traditional AIC and BIC are based on least squares (LS), maximum likelihood (ML), or Bayesian principles, and the penalization is based on the $l_0$-norm of the parameters entering the model, so that the penalty is proportional to the number of nonzero parameters. Both AIC and BIC are variable selection procedures and do not provide estimators simultaneously. On the other hand, the bridge estimator in [11,12] uses the $l_q$-norm ($q > 0$), and for $0 < q \le 1$ provides a way to combine variable selection and parameter estimation simultaneously. Within this class the least absolute shrinkage and selection operator (LASSO; $q = 1$) has become the most popular. For $q = 2$ we get the ridge estimator [13]. For a detailed review of model selection in high-dimensional modeling, see [14] and the books [15,16]. Similarly, in the context of empirical likelihood estimation and generalized method of moments estimators, model selection criteria have been introduced by [17,18], among others.
Model selection is an important step for empirical policy evaluation and forecasting. However, it may produce unstable estimators because of bias in model selection. For example, a small data perturbation or an alternative selection procedure may give a different model. Reference [19] shows that AIC selection results in distorted inference, and [20] explores the negative impact on confidence regions. Reference [21] gives conditions under which post-model-selection estimators are adaptive, but see [22,23] for comments that their distributions cannot be estimated uniformly. For a selected model with unstable estimators, [24] provides a bagging (bootstrap averaging) procedure to reduce their variances for i.i.d. data, and [25] does so for dependent time series data. But this averaging does not always work, e.g., for large samples and/or over the entire parameter space.
Taking the above reasons into consideration, model averaging has been introduced as an alternative to model selection. Unlike model selection, where the econometrician deals with model uncertainty by selecting one model from a set of models, in model averaging we resolve the uncertainty by averaging over the set of models. There is a large recent literature on Bayesian model averaging (BMA) and, more recently, on frequentist model averaging (FMA). Among the BMA contributions, model uncertainty is handled by assigning a prior probability to each candidate model, see [26,27,28,29,30]; for interesting applications in econometrics, see, e.g., [31,32,33]. Also, see [10] for comments on the BMA approach. The main focus here is on the FMA method, which is determined by the data alone and assumes no priors; it has received much attention in recent years, see [34,35,36,37,38,39,40,41]. Reference [10] provides asymptotic theory. For applications, see [16,42,43]. The concept behind FMA estimators is related to the idea of combining procedures based on the same data, which has been considered before in several research areas. For instance, [44] introduces forecast combination and [45,46] suggest combining parametric and kernel estimators of the density and the regression, respectively. Other works include bootstrap-based averaging (“stacking”) by [24,47,48], information-theoretic methods to combine densities by [49,50], and mixtures-of-experts models by [51,52]. Similar kinds of combining have been used in computational learning theory by [53,54] and in information theory by [55].
Related to “how”, or rather determining the unknown functional forms of econometric models, we use data-based nonparametric procedures (e.g., kernel, smoothing spline, series approximation). See, for example, [1,2,56,57] for kernel smoothing procedures, [58] for spline methods, and [59,60] for series methods. These procedures help in dealing with the problems of bias and inconsistency in estimation and testing due to misspecified functional forms. Because of this, recent developments in nonparametric model selection and model averaging have taken place.
The current paper is hence focused on a review of parametric and nonparametric approaches to model selection and model averaging, mainly from a frequentist point of view and for independently and identically distributed (i.i.d.) observations. Earlier, [14] provides a review of parametric model selection, [61] surveys FMA estimation, and [62] reviews variable selection in semiparametric regression models. To distinguish itself, our paper concentrates on a review of frequentist model selection and model averaging under both parametric and nonparametric settings.
The paper is organized as follows. We first introduce a review of parametric model selection and parametric model averaging in Section 2. Then, in Section 3 we present nonparametric model selection and model averaging procedures. A conclusion follows in Section 4.

2. Parametric Model Selection and Model Averaging

2.1. Model Selection

Let us consider $y_i$ as a dependent variable and $x_i = (x_{i1}, \ldots, x_{iq})'$ a $q \times 1$ vector of explanatory variables/covariates. Then the linear regression model can be written as
$$y_i = x_i'\beta + u_i = \sum_{j=1}^{q} x_{ij}\beta_j + u_i, \quad i = 1, \ldots, n$$
or
$$y = X\beta + u$$
where $y$ is $n \times 1$, $X$ is $n \times q$, $\beta = (\beta_1, \ldots, \beta_q)'$, and $u$ is $n \times 1$.
Among the well known procedures for model selection, often used routinely, are the goodness-of-fit $R^2$, the adjusted $R^2$ ($R_a^2$), and the residual sum of squares (RSS), given by
$$R^2 = 1 - \frac{\sum \hat{u}_i^2}{\sum (y_i - \bar{y})^2}, \qquad R_a^2 = 1 - \frac{(n-1)\sum \hat{u}_i^2}{(n-q)\sum (y_i - \bar{y})^2}, \qquad RSS = \sum \hat{u}_i^2$$
where $0 \le R^2 \le 1$. The model with the highest $R^2$ (or $R_a^2$) or smallest RSS is chosen. However, $R^2$ increases, and RSS decreases, monotonically as $q$ increases. Further, between $R^2$ and $R_a^2$, $Bias(R_a^2) \le Bias(R^2)$ but $V(R_a^2) \ge V(R^2)$. Thus $R_a^2$ may not always be statistically more efficient ($MSE(R_a^2) \ge MSE(R^2)$ is possible); see [63] for further detail. Thus $R_a^2$ and RSS are not preferred measures of goodness of fit or model selection. Recently, [64] developed a model selection procedure based on the “mean squared prediction error” (MSPE). Consider $(x_{i1}, \ldots, x_{iq}, z_i)$, $i = 1, \ldots, n$, as a new observed sample in which $z_i$ is the “new observed value” and $\hat{y}_i$ is its prediction, so that $MSPE = E\sum(z_i - \hat{y}_i)^2/n = \sigma_u^2(n+q+1)/n$. When a model has $q = 0$ (no explanatory variable), $MSPE = \sigma_y^2(n+1)/n$. Then, using the unbiased estimator of $MSPE_0$, $FPE_0 = s_y^2(n+1)/n$, and of $MSPE$, $FPE = s_{\hat{u}}^2(n+q+1)/n$, [64] introduces
$$R_{FPE}^2 = 1 - \frac{FPE}{FPE_0} = \frac{(n-1)(n+q+1)R^2 - 2qn}{(n-q-1)(n+1)}$$
such that $R_{FPE}^2 \le R_a^2 \le R^2$, where FPE represents the final prediction error. The statistical properties of the bias and MSE of $R_{FPE}^2$, compared with those of $R_a^2$ and $R^2$, are analyzed in [65]. Reference [64] demonstrates that one advantage of $R_{FPE}^2$ is that it can be used for choosing the model with the best prediction ability. Furthermore, $R_{FPE}^2$ not only overcomes the inflation in $R^2$, it also avoids the problem of selecting an overfitted model with some irrelevant explanatory variables that arises from using $R_a^2$. In addition, [64] indicates that $R_{FPE}^2$ and AIC, discussed below, are asymptotically equivalent, and in model selection $R_{FPE}^2$ is fully consistent with using AIC and closest to BIC. Thus $R_{FPE}^2$ can be used simultaneously for goodness of fit and for model selection.

2.1.1. AIC, TIC, and BIC

Now we turn to the methods of model selection: AIC [5], the Takeuchi information criterion (TIC) [66], and BIC [7]. For this, we first note that if $f(y)$ is the unknown true density, and $g(y, \theta)$ is an assumed density, then the Kullback-Leibler information criterion (KLIC) is given by
$$D(f, g) = KLIC(f, g) = E_f \log\!\left(\frac{f(y)}{g(y, \theta)}\right) = E_f \log f(y) - E_f \log g(y, \theta),$$
where $E_f$ is the expectation with respect to $f(y)$. This is the expected “surprise” from learning that $f$ is in fact the true density of $y$. We note that $D(f, g) \ge 0$, where equality holds if and only if $g = f$ almost everywhere. Further, $-E_f \log f(y)$ is the entropy of the distribution $f$; for more on entropy and information, see [67,68].
A concept related to the KLIC is the quasi maximum likelihood estimator (QMLE) $\hat{\theta}_{QML}$, which maximizes the quasi log-likelihood function
$$L(\theta) = L_n(\theta) = \frac{1}{n}\sum_{i=1}^{n}\log g(y_i, \theta)$$
based on the random sample $Y = (y_1, \ldots, y_n)$ from $f(y)$. Since $L_n(\theta) \stackrel{p}{\to} E_f[\log g(y_1, \theta)]$, it is expected that $\hat{\theta}_{QML}$ converges in probability to the maximizer $\theta^*$ of $E_f[\log g(y_1, \theta)]$ under suitable conditions. Since $E_f[\log f(y_1)]$ does not depend on $\theta$, the QMLE minimizes a random function which converges to
$$KLIC(f, g) = E_f \log f(y_1) - E_f \log g(y_1, \theta) = D(f, g).$$
Thus $\hat{\theta}_{QML} \stackrel{p}{\to} \theta^*$, where $\theta^* = \arg\min_{\theta} D(f, g(\theta))$ is often referred to as the pseudo-true value of $\theta$. It is well known that under some regularity conditions
$$\sqrt{n}\,(\hat{\theta}_{QML} - \theta^*) \stackrel{d}{\to} N\big(0,\ G(\theta^*)^{-1} I(\theta^*) G(\theta^*)^{-1}\big)$$
where $G(\theta) = -E_f[\partial^2 \log g(y_1, \theta)/\partial\theta\,\partial\theta']$ and $I(\theta) = E_f[\partial\log g(y_1, \theta)/\partial\theta \cdot \partial\log g(y_1, \theta)/\partial\theta']$. When $f(\cdot) = g(\cdot, \theta^*)$, $G(\theta^*) = I(\theta^*)$, $\hat{\theta}_{QML}$ is the MLE, and it is asymptotically efficient.
Now consider the fitted density $\hat{g}(y) = g(y, \hat{\theta}_{QML})$ and
$$KLIC(f, \hat{g}) = E_f \log\!\left(\frac{f(y)}{\hat{g}(y)}\right) = c - E_y \log g(y, \hat{\theta}_{QML})$$
where $c = \int f(y)\log f(y)\,dy$ is free of the fitted model and $E_y(\cdot)$ denotes the expectation with respect to the true density of $y$, i.e., $f(y)$ here. Then $E[KLIC(f, \hat{g})] = c - E_Y E_y[\log g(y, \hat{\theta}_{QML})] = c - n^{-1}\sum_i E_Y E_{y_i}[\log g(y_i, \hat{\theta}_{QML})]$, where $Y$ and $y$ are independent. The expected KLIC can be interpreted as the expected likelihood when $Y$ is used to obtain $\hat{\theta}_{QML}$ and an independent sample $y$ (with one observation here) is used for evaluation. In linear regression, the expected KLIC corresponds to the expected squared prediction error. Dropping $c$, and using a second-order Taylor expansion, it can be shown that
$$nT = E[KLIC(f, \hat{g})] = -E[L_n(\hat{\theta})] + tr[I(\theta^*)G(\theta^*)^{-1}].$$
Further, an asymptotically unbiased estimator of $T$ can be written as
$$\hat{T} = -n^{-1}\{L_n(\hat{\theta}) - tr(\hat{I}\hat{G}^{-1})\}$$
where $L_n(\hat{\theta}) = \log g(Y, \hat{\theta})$ and $\hat{I}\hat{G}^{-1}$ is a consistent estimator of $I(\theta^*)G(\theta^*)^{-1}$, in which $\hat{I} = \frac{1}{n}\sum \frac{\partial \log g(y_i, \hat{\theta})}{\partial\theta}\frac{\partial \log g(y_i, \hat{\theta})}{\partial\theta'}$ and $\hat{G} = -\frac{1}{n}\sum \partial^2 \log g(y_i, \hat{\theta})/\partial\theta\,\partial\theta'$.
When the model is correctly specified, that is $g(y, \theta^*) = f(y)$, we have $G(\theta^*) = I(\theta^*)$ and $tr(I(\theta^*)G(\theta^*)^{-1}) = q$, so that
$$\hat{T} = -n^{-1}L_n(\hat{\theta}) + n^{-1}q$$
which is related to AIC, given by $2\hat{T}$:
$$AIC = -\frac{2L_n(\hat{\theta})}{n} + \frac{2q}{n}.$$
Thus, we can think of AIC as an estimate of twice the expected KLIC under the assumption that the model is correctly specified. Therefore, selecting the model with the smallest AIC amounts to choosing the best-fitting model in the sense of having the smallest KLIC. A robust version of AIC due to Takeuchi [66], known as the Takeuchi information criterion (TIC), is
$$TIC = -\frac{2L_n(\hat{\theta})}{n} + \frac{2\,tr(\hat{I}\hat{G}^{-1})}{n},$$
which, unlike AIC, does not require $g(y, \theta)$ to be correctly specified. In general, picking the model with the smallest AIC/TIC selects the fitted model whose density is closest to the true density.
We note that in a linear regression model, the minimization of the AIC reduces to the minimization of
$$AIC = \log\hat{\sigma}^2 + \frac{2q}{n}$$
where $\hat{\sigma}^2 = \hat{u}'\hat{u}/n$. It can be shown that $G(\theta^*) = I(\theta^*)$ if $u_i|x_i \sim N(0, \sigma^2)$. Thus AIC is more appropriate under normality; otherwise it is an approximation for the non-normal and heteroskedastic regression cases.
Further, in the linear regression case, the minimization of TIC reduces to the minimization of
$$TIC = \log\hat{\sigma}^2 + \frac{2}{n\hat{\sigma}^2}\sum_{i=1}^{n} h_i \hat{u}_i^2 + \frac{\hat{k}_4}{n}$$
where $\hat{k}_4 = \frac{1}{n\hat{\sigma}^4}\sum_{i=1}^{n}(\hat{u}_i^2 - \hat{\sigma}^2)^2$ and $h_i = x_i'(X'X)^{-1}x_i$. When the errors are homoskedastic and normal,
$$TIC \approx \log\hat{\sigma}^2 + \frac{2(q+1)}{n}$$
which is close to AIC, although differences may arise under heteroskedasticity and nonnormality. However, as we move across models, $\sum\hat{u}_i^2$, and hence $\hat{k}_4$, typically do not change much. In this case, TIC and AIC tend to give similar model selection results.
We note that the BIC due to [7] is
$$BIC = \log\hat{\sigma}^2 + \frac{(\log n)\, q}{n}$$
in which the penalty term depends on the sample size and is generally larger than the penalty term appearing in the AIC. BIC provides a large-sample estimator of a transformation of the Bayesian posterior probability associated with the approximating model. In general, by choosing the fitted candidate model that minimizes the BIC, one is selecting the candidate model with the highest posterior probability. A good property of BIC selection is that it is consistent, see for example [69]. That is, when the true model is of finite dimension, BIC chooses the true model with probability tending to 1 as the sample size $n$ increases.
In general, a penalized criterion can only be consistent if its penalty term ($\log n$ in BIC) increases fast enough with $n$ (see [70]). Thus AIC is not consistent, as it always has some probability of selecting models that are too large. However, we note that in finite samples, adjusted versions of AIC can behave much better, see for example [71]. Further, since the penalty term of BIC is more stringent than that of AIC, BIC tends to select smaller models than AIC. BIC provides a large-sample estimator of the transformation of the Bayesian posterior probability associated with the approximating model, whereas AIC provides an asymptotically unbiased estimator of the expected Kullback discrepancy between the generating model and the fitted approximating model. In addition, AIC is asymptotically efficient in the sense that it asymptotically selects the fitted candidate model that minimizes the MSE of prediction, but BIC is not asymptotically efficient. For this reason AIC can be advocated when the primary goal of the model is to identify meaningful factors influencing the outcome based on their relative importance.
In summary, both AIC and BIC provide well-founded and self-contained approaches to model selection, although with different motivations and penalty objectives. Both are typically good approximations of their own theoretical target quantities. Often this also means that they will identify good models for the observed data, but both criteria can still fail in this respect. For a detailed simulation and empirical comparison of these two approaches, see [72], and for their properties see [69,73,74]. Both the AIC and the TIC are designed for the likelihood or quasi-likelihood context, and they perform in a similar way. Their relationship is analogous to that between the conventional and the White covariance matrix estimators for the MLE/QMLE or LS. Unfortunately, despite its theoretical merit, TIC does not appear to be widely used, perhaps because it needs a very large sample to obtain good estimates.
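To make the regression forms of these criteria concrete, the following is a minimal sketch (not taken from any of the cited papers) that computes $AIC = \log\hat{\sigma}^2 + 2q/n$ and $BIC = \log\hat{\sigma}^2 + q\log(n)/n$ for nested candidate linear models; the data-generating process, sample size, and candidate set are illustrative assumptions.
```python
# Hedged sketch: AIC/BIC model selection for nested linear regressions.
import numpy as np

def aic_bic(y, X):
    """Return (AIC, BIC) for the least squares fit of y on X."""
    n, q = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / n
    aic = np.log(sigma2) + 2 * q / n
    bic = np.log(sigma2) + q * np.log(n) / n
    return aic, bic

# Hypothetical data: y depends only on the first two of five candidate regressors.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

# Nested candidate models: intercept plus the first k regressors, k = 0,...,5.
for k in range(6):
    Xk = np.column_stack([np.ones(n), X[:, :k]])
    a, b = aic_bic(y, Xk)
    print(f"k = {k}: AIC = {a:.3f}, BIC = {b:.3f}")
# The model minimizing AIC (or BIC) is selected; BIC's heavier log(n) penalty
# tends to favor the smaller model when the two disagree.
```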

2.1.2. FIC

Let us start from the model
$$y_i = x_i'\beta + z_i'\gamma + u_i, \quad i = 1, \ldots, n$$
or
$$y = X\beta + Z\gamma + u$$
where $X$ is an $n \times p$ matrix of variables intended (focused) to be included all the time, while the variables in the $n \times q$ matrix $Z$ may or may not be included. From the ML estimators $(\hat{\beta}_l, \hat{\gamma}_l)$ corresponding to the $l$-th submodel, the predictor of $m_l = x'\beta_l + z'\gamma_l$ at $(x, z)$ can be written as $\hat{m}_l = x'\hat{\beta}_l + z'\hat{\gamma}_l$. Reference [10] provides the MSE of $\hat{m}_l$. The basic idea of FIC is to develop a model selection criterion that chooses the model with the smallest estimated MSE. Such an MSE-based FIC for the $l$-th submodel is
$$\widehat{FIC}_l = \big(\hat{\omega}'(I - \hat{\Psi}_l\hat{L}^{-1})\hat{\gamma}\big)^2 + 2\hat{\omega}'\hat{\Psi}_l\hat{\omega}$$
where $\hat{\Psi}_l = \pi_l'(\pi_l\hat{L}^{-1}\pi_l')^{-1}\pi_l$, $\hat{L} = (Z'M_x Z)^{-1}$ with $M_x = I - X(X'X)^{-1}X'$, $\hat{\omega} = Z'X(X'X)^{-1}x - z$, and $\pi_l$ captures the projection mapping from the full model to the $l$-th submodel, such that $\omega_l = \pi_l\omega$.
In contrast, [10] also gives
$$AIC_l = -\hat{\gamma}'\hat{L}^{-1}\hat{\Psi}_l\hat{L}^{-1}\hat{\gamma} + 2l$$
where $l$ is the number of uncertain parameters in the $l$-th submodel, and shows that when the estimand is $m = \log f(y, \beta, \gamma)$, with $f(y, \beta, \gamma)$ the probability density function of the data, the MSE-based FIC is asymptotically equivalent to AIC.

2.1.3. Mallows Model Selection

Let us write the regression model (2) as
$$y = m + u$$
where $m = X\beta$. Then $\hat{m} = \hat{m}(q) = P(q)y$, where $P(q) = X(X'X)^{-1}X'$.
The objective is to choose $q$ such that the average mean squared error (risk) $E[L(q)|X]$ is minimized, where
$$L(q) = \frac{1}{n}[m - \hat{m}(q)]'[m - \hat{m}(q)] = \frac{1}{n}(\hat{\beta} - \beta)'X'X(\hat{\beta} - \beta) = \frac{1}{n}u'P(q)u$$
such that
$$R(q) = E[L(q)|X] = \frac{1}{n}\sigma^2\, tr(P(q)) = \frac{\sigma^2 q}{n}.$$
Mallows' criterion for selecting $q$ is to minimize
$$C(q) = \frac{\hat{u}'\hat{u}}{n} + \frac{2\sigma^2 q}{n}$$
where the second term on the right-hand side is a penalty.
In fact, Mallows' criterion is, up to the constant $\sigma^2$, an unbiased estimator of the MSE of the predictive estimator $\hat{m}$ of $m$. This is because $E[L(q)|X] = E[(\hat{m} - m)'(\hat{m} - m)/n] = E[u'P(q)u/n] = \sigma^2\, tr P(q)/n$ and $E[C(q)|X] = \sigma^2(n - q)/n + 2\sigma^2 q/n = \sigma^2 + \sigma^2\, tr P(q)/n$. But the minimization of $E[L(q)|X]$ with respect to $q$ is the same as the minimization of $E[C(q)|X]$ since $\sigma^2$ does not depend on $q$.
Alternatively,
$$\frac{1}{n}(\hat{m} - m)'(\hat{m} - m) = \frac{1}{n}(\hat{m} - y + y - m)'(\hat{m} - y + y - m) = \frac{1}{n}[\hat{u}'\hat{u} + u'u - 2\hat{u}'u]$$
and $E\big[\frac{1}{n}(\hat{m} - m)'(\hat{m} - m)\big] = \frac{1}{n}E[\hat{u}'\hat{u} + 2\sigma^2\, tr P - n\sigma^2]$. So an unbiased estimator is $(\hat{u}'\hat{u} + 2\sigma^2 q - n\sigma^2)/n$, and its minimization is equivalent to the Mallows criterion.
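As an illustration of the criterion above, the sketch below evaluates $C(q) = \hat{u}'\hat{u}/n + 2\sigma^2 q/n$ over nested candidate models; estimating $\sigma^2$ from the largest candidate model is a common practical choice assumed here, not something prescribed by the text.
```python
# Hedged sketch: Mallows' criterion over nested linear models.
import numpy as np

def mallows_c(y, X, sigma2):
    n, q = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return resid @ resid / n + 2 * sigma2 * q / n

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 6))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

# sigma^2 estimated from the largest model (with a degrees-of-freedom correction).
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
resid_full = y - X @ beta_full
sigma2 = resid_full @ resid_full / (n - X.shape[1])

scores = [mallows_c(y, X[:, :k], sigma2) for k in range(1, 7)]
q_hat = 1 + int(np.argmin(scores))
print("selected number of regressors:", q_hat)
```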

2.1.4. Cross-Validation (CV)

CV is a commonly used procedure for model selection. According to this, the selection of $q$ is made by minimizing
$$CV(q) = \frac{1}{n}\sum_{i=1}^{n}(y_i - x_i'\hat{\beta}_{-i})^2$$
where $\hat{\beta}_{-i}$ is the LS estimator of $\beta$ obtained after dropping the $i$-th observation $(y_i, x_i)$ from the sample. It can be shown that $E[CV(q)] \approx MSPE(q)$, where
$$MSPE(q) \equiv E(y_{n+1} - x_{n+1}'\hat{\beta})^2 = E\hat{u}_{n+1}^2$$
is the MSE of the forecast error $\hat{u}_{n+1} = y_{n+1} - \hat{y}_{n+1}$ with $\hat{y}_{n+1} = x_{n+1}'\hat{\beta}$. Thus, CV is an almost unbiased estimator of $MSPE(q)$.
This can be shown by first writing the MSPE, based on an out-of-sample observation from the same distribution as the in-sample observations, as
$$MSPE(q) = E(y_{n+1} - x_{n+1}'\hat{\beta})^2 = E\hat{u}_{n+1}^2 = Eu_{n+1}^2 + E[(\hat{\beta} - \beta)'x_{n+1}x_{n+1}'(\hat{\beta} - \beta)] = Eu_{n+1}^2 + MSE(q)$$
where $MSE(q) = E[(\hat{m}(x_{n+1}) - m(x_{n+1}))'(\hat{m}(x_{n+1}) - m(x_{n+1}))] = E[(\hat{\beta} - \beta)'x_{n+1}x_{n+1}'(\hat{\beta} - \beta)]$. Since $Eu_{n+1}^2 = \sigma^2$ does not depend on $q$, selection of $q$ by $MSPE(q)$ and by $MSE(q)$ are equivalent.
We observe that $\hat{u}_{n+1} = y_{n+1} - x_{n+1}'\hat{\beta}$ is a prediction error based on first estimating $\hat{\beta}$ from the $n$ in-sample observations, and then calculating the error using the out-of-sample observation $n+1$. Therefore, $MSPE(q)$ is the expectation of a squared leave-one-out prediction error when the sample length is $n+1$. Using this idea we can also obtain a similar leave-one-out prediction error for each observation $i$, given by $\hat{u}_i = y_i - x_i'\hat{\beta}_{-i}$ based on $n$ observations. Thus, $E\hat{u}_i^2 = MSPE(q)$ for each $i$, and
$$E[CV(q)] = E\Big[\frac{1}{n}\sum_{i=1}^{n}\hat{u}_i^2\Big] = MSPE(q).$$
Further, since $E\hat{u}_{n+1}^2$ based on $n+1$ observations will be close to $E\hat{u}_i^2$ based on $n$ observations, $CV(q)$ is an almost unbiased estimator of $MSPE(q)$.
The $CV(q)$ written above can be rewritten as
$$CV(q) = \frac{1}{n}\sum_{i=1}^{n}\frac{\tilde{u}_i^2}{(1 - h_{ii})^2}$$
where $\tilde{u}_i = y_i - x_i'\hat{\beta}$ and $h_{ii}$, referred to as the leverage, is the $i$-th diagonal element of the projection matrix $X(X'X)^{-1}X'$, see [75]. This expression is useful for computation. Also, see [74] for a link of $CV(q)$ with AIC.
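The leverage expression above makes leave-one-out CV cheap to compute: the sketch below (illustrative data and model set assumed) uses $\tilde{u}_i/(1 - h_{ii})$ so that no model has to be refitted $n$ times.
```python
# Hedged sketch: leave-one-out CV for linear regression via the hat-matrix shortcut.
import numpy as np

def loo_cv(y, X):
    n = X.shape[0]
    H = X @ np.linalg.inv(X.T @ X) @ X.T    # projection ("hat") matrix
    resid = y - H @ y                        # full-sample residuals u_tilde
    loo_resid = resid / (1.0 - np.diag(H))   # leave-one-out prediction errors
    return np.mean(loo_resid ** 2)

rng = np.random.default_rng(2)
n = 150
X = rng.normal(size=(n, 4))
y = X[:, 0] + rng.normal(size=n)

# Compare candidate models with 1,...,4 regressors (plus an intercept).
for k in range(1, 5):
    Xk = np.column_stack([np.ones(n), X[:, :k]])
    print(f"k = {k}: CV = {loo_cv(y, Xk):.4f}")
```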

2.1.5. Model Selection by Other Penalty Functions

The issue of model selection has received more attention in recent years because of the challenging problem of estimating models with large numbers of regressors, which may increase with the sample size: for example, earnings models in labor economics with a large number of regressors, financial portfolio models with a large number of stocks, and VAR models with hundreds of macro variables.
A different method of variable selection and estimation for such models is penalized least squares (PLS); see [14] for a review. In this literature, estimation of parameters and variable selection are carried out by using a criterion function combining a loss function with a penalty function. Using the $l_p$ penalty, the PLS estimation and variable selection problem is
$$\min_{\beta}\Big[\sum_{i=1}^{n}(y_i - x_i'\beta)^2 + \lambda\Big(\sum_{j=1}^{q}|\beta_j|^p\Big)^{1/p}\Big]$$
where $\lambda$ is a tuning or shrinkage parameter and the penalty corresponds to the restriction $\big(\sum_{j=1}^{q}|\beta_j|^p\big)^{1/p} \le c$ ($c$ being another tuning parameter). For $p = 0$, the $l_0$-norm becomes $\sum_{j=1}^{q} I(\beta_j \ne 0)$, with $I(\cdot)$ the usual indicator function, which counts the number of nonzero $\beta_j$ for $j = 1, \ldots, q$; AIC and BIC belong to this norm. For $p = 1$, the penalty becomes $\sum_{j=1}^{q}|\beta_j| \le c$, which is used in the LASSO for simultaneous shrinkage estimation [76] and variable selection. It can be shown analytically that the LASSO method estimates a zero coefficient as exactly zero with positive probability as $n \to \infty$. Next, for $p = 2$ the $l_2$-norm uses $\sum_{j=1}^{q}\beta_j^2 \le c$ and provides ridge-type [13] shrinkage estimation but not variable selection. However, if we consider the generalized ridge estimator under $\sum\hat{\lambda}_j\beta_j^2 \le c$, then the coefficient estimates corresponding to large $\hat{\lambda}_j$ tend to zero, see [77].
Further, when $0 < p \le 1$ we get the bridge estimator [11,12], which provides a way to combine variable selection and parameter estimation, with $p = 1$ giving the LASSO. For the adaptive LASSO and other forms of LASSO, see [62,78,79,80]. Also, see the link of LASSO with least angle regression selection (LARS) in [81].
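For intuition about how the $l_1$ penalty produces exact zeros, here is a minimal coordinate-descent sketch of the LASSO with a soft-thresholding update; the objective is scaled as $n^{-1}\sum(y_i - x_i'\beta)^2 + \lambda\sum_j|\beta_j|$, and the simulated data, value of $\lambda$, and iteration count are illustrative assumptions rather than recommendations.
```python
# Hedged sketch: LASSO by cyclic coordinate descent with soft-thresholding.
# Objective: (1/n) * sum (y_i - x_i'beta)^2 + lam * sum_j |beta_j|.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(y, X, lam, n_iter=500):
    n, q = X.shape
    beta = np.zeros(q)
    col_ss = (X ** 2).sum(axis=0)                    # sum of squares of each column
    for _ in range(n_iter):
        for j in range(q):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding x_j
            z = X[:, j] @ r_j
            beta[j] = soft_threshold(z, n * lam / 2) / col_ss[j]
    return beta

rng = np.random.default_rng(3)
n, q = 200, 10
X = rng.normal(size=(n, q))
beta_true = np.r_[2.0, -1.0, np.zeros(q - 2)]        # sparse truth (assumed)
y = X @ beta_true + rng.normal(size=n)

beta_hat = lasso_cd(y, X, lam=0.5)
print(np.round(beta_hat, 3))   # coefficients on irrelevant regressors shrink to exactly 0
```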

2.2. Model Averaging

Let $m$ be a parametric or nonparametric model, which can be a conditional mean or a conditional variance. Let $\hat{m}_l$, $l = 1, \ldots, M$, be the set of estimators of $m$ corresponding to the different sets of regressors considered in the problem of model selection. Consider $w_l$, $l = 1, \ldots, M$, to be the weights corresponding to $\hat{m}_l$, where $0 \le w_l \le 1$ and $\sum_{l=1}^{M} w_l = 1$. We can then define a model averaging estimator of $m$ as
$$\hat{m}(w) = \sum_{l=1}^{M} w_l\hat{m}_l.$$
Below we present the choice of $w_l$ in linear regression models. For the linear regression model, consider the model in (1) or (2), where the dimension of $\beta$ may tend to $\infty$ as $n \to \infty$. We take $M$ models, where the $l$-th model contains $q_l$ regressors forming a subvector of $x_i$. The corresponding model can be written as
$$y = X_l\beta_l + u,$$
and the LS estimator of $\beta_l$ is
$$\hat{\beta}_l = (X_l'X_l)^{-1}X_l'y.$$
This gives
$$\hat{m}_l = X_l\hat{\beta}_l = P_l y$$
where $P_l = X_l(X_l'X_l)^{-1}X_l'$. The model averaging estimator (MAE) of $m$ is given as
$$\hat{m}(w) = \sum_{l=1}^{M} w_l\hat{m}_l = P(w)y$$
where $P(w) = \sum_{l=1}^{M} w_l P_l$. An alternative expression is
$$\hat{m}(w) = \sum_{l=1}^{M} w_l\hat{m}_l = \sum_{l=1}^{M} w_l X_l\hat{\beta}_l = X\hat{\beta}(w)$$
where we write $\tilde{\beta}_l = (\hat{\beta}_l',\ 0')'$ so that $X_l\hat{\beta}_l = [X_l\ X_{-l}](\hat{\beta}_l',\ 0')' = X\tilde{\beta}_l$, and $\hat{\beta}(w) = \sum_{l=1}^{M} w_l\tilde{\beta}_l$ is the MAE of $\beta$. Thus, for the linear model, the MAE of $m$ corresponds to the MAE of $\beta$, but this may not hold for models that are nonlinear in parameters.
Now we consider the ways to determine weights.

2.2.1. Bayesian and FIC Weights

Under the Bayesian procedure we assume that there are M potential models and one of the models is the true model. Then, using the prior probabilities that each of the potential models is the true model, and considering the prior probability distributions of the parameters, the posterior probability distribution is obtained as the weighted average of the submodels where weights are the posterior probabilities that the given model is the true model given the data.
The two types of weights considered are then
$$w_l = \frac{\exp\{-\tfrac{1}{2}AIC_l\}}{\sum_{l=1}^{M}\exp\{-\tfrac{1}{2}AIC_l\}} \qquad\text{and}\qquad w_l = \frac{\exp\{-\tfrac{1}{2}BIC_l\}}{\sum_{l=1}^{M}\exp\{-\tfrac{1}{2}BIC_l\}}$$
where $AIC_l = -2\log L_l + 2q_l$ and $BIC_l = -2\log L_l + q_l\log n$. These are known as smoothed AIC (SAIC) and smoothed BIC (SBIC) weights. While the Bayesian model averaging estimator (BMAE) has a neat interpretation, it searches for the true model instead of selecting an estimator of a model with a low loss function. In simulations it has been found that SAIC and SBIC averaging tend to outperform AIC and BIC selection estimators, see [82].
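A minimal sketch of the SAIC/SBIC weights is given below; centering the scores before exponentiating is only a numerical-stability device, and the AIC values shown are hypothetical.
```python
# Hedged sketch: smoothed AIC/BIC weights, w_l proportional to exp(-score_l/2).
import numpy as np

def smoothed_weights(scores):
    s = np.asarray(scores, dtype=float)
    w = np.exp(-0.5 * (s - s.min()))   # centering avoids underflow; weights unchanged
    return w / w.sum()

# Hypothetical AIC values for M = 4 candidate models.
aic_values = [310.2, 308.7, 309.1, 315.4]
w_saic = smoothed_weights(aic_values)
print(np.round(w_saic, 3))
# The model-averaged fit is then m_hat(w) = sum_l w_l * m_hat_l.
```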
As for the FIC, consider the model averaging estimator
$$\tilde{m} = \sum_{l=1}^{M} w_l\hat{m}_l$$
where
$$w_l = \exp\!\Big(-\frac{\kappa}{2}\,\frac{FIC_l}{\hat{\omega}'\hat{L}\hat{\omega}}\Big) \Big/ \sum_{\text{all } l'}\exp\!\Big(-\frac{\kappa}{2}\,\frac{FIC_{l'}}{\hat{\omega}'\hat{L}\hat{\omega}}\Big)$$
and $\kappa$ is an algorithmic parameter bridging from uniform weighting ($\kappa$ close to 0) to hard-core FIC selection ($\kappa$ large). For this and further properties and applications of FIC, see [10] and [82].

2.2.2. Mallows Weight Selection Method

In the linear regression model, $\hat{m}(w) = P(w)y$ is a linear estimator with $w \in W_M$. So an optimal choice of $w$ can be found following the Mallows criterion described above. The Mallows criterion for choosing the weights $w$ is
$$C(w) = \hat{u}(w)'\hat{u}(w) + 2\sigma^2\, tr(P(w))$$
where $\hat{u}(w) = y - \hat{m}(w) = y - \sum_{l=1}^{M} w_l\hat{m}_l = \sum_{l=1}^{M} w_l(y - \hat{m}_l) = \sum_{l=1}^{M} w_l\hat{u}_l = \hat{U}w$ and
$$tr(P(w)) = \sum_{l=1}^{M} w_l\, tr P_l = \sum_{l=1}^{M} w_l q_l = q'w$$
in which $q = (q_1, \ldots, q_M)'$, $w = (w_1, \ldots, w_M)'$, $\hat{u}_l$ is the residual vector from the $l$-th model, and $\hat{U} = (\hat{u}_1, \ldots, \hat{u}_M)$ is an $n \times M$ matrix of residuals from all the models. Thus
$$C(w) = w'\hat{U}'\hat{U}w + 2\sigma^2 q'w$$
is quadratic in $w$. Thus
$$\hat{w} = \arg\min_{w \in W_M} C(w),$$
which can be obtained by quadratic programming with inequality constraints, using, for example, GAUSS or MATLAB. Then Hansen's Mallows model averaging (MMA) estimator is
$$\hat{m}(\hat{w}) = \sum_{l=1}^{M}\hat{w}_l\hat{m}_l.$$
Following [83], [39] shows that
$$\frac{L(\hat{w})}{\inf_{w \in W_M^*} L(w)} \to 1$$
as $n \to \infty$, so that $\hat{w}$ is asymptotically optimal in Li's sense, where $L(\hat{w}) = (m - \hat{m}(\hat{w}))'(m - \hat{m}(\hat{w}))$. However, Hansen's result requires the weights to belong to a discrete set and the models to be nested. Reference [41] improves the result by relaxing discreteness and by not assuming that the models are nested. Their approach is based on deriving an unbiased estimator of the exact MSE of $\hat{m}(w)$.
Reference [84] also proposes a corresponding forecasting method using Mallows model averaging (MMA). He proves that the criterion is an asymptotically unbiased estimator of both the in-sample and the out-of-sample one-step-ahead MSE.
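The quadratic program defining $\hat{w}$ can be solved with any constrained optimizer; the sketch below (illustrative, with simulated nested models and $\sigma^2$ taken from the largest model) uses SciPy's SLSQP routine in place of the GAUSS/MATLAB quadratic programming mentioned in the text.
```python
# Hedged sketch: MMA weights by minimizing C(w) = w'U'U w + 2*sigma2*q'w over the simplex.
import numpy as np
from scipy.optimize import minimize

def mma_weights(U_hat, q_vec, sigma2):
    """U_hat: n x M residual matrix from the M models; q_vec: their numbers of regressors."""
    M = U_hat.shape[1]
    A = U_hat.T @ U_hat
    b = 2.0 * sigma2 * np.asarray(q_vec, dtype=float)

    def crit(w):
        return w @ A @ w + b @ w

    cons = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]   # weights sum to one
    bounds = [(0.0, 1.0)] * M                                  # weights in [0, 1]
    w0 = np.full(M, 1.0 / M)
    res = minimize(crit, w0, method="SLSQP", bounds=bounds, constraints=cons)
    return res.x

rng = np.random.default_rng(4)
n = 200
X = rng.normal(size=(n, 5))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

# Nested candidate models with 1,...,5 regressors.
residuals, sizes = [], []
for k in range(1, 6):
    Xk = X[:, :k]
    beta = np.linalg.lstsq(Xk, y, rcond=None)[0]
    residuals.append(y - Xk @ beta)
    sizes.append(k)
U_hat = np.column_stack(residuals)
sigma2 = residuals[-1] @ residuals[-1] / (n - 5)   # from the largest model (assumption)

w_hat = mma_weights(U_hat, sizes, sigma2)
print(np.round(w_hat, 3))
```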

2.2.3. Jackknife Model Averaging Method (CV)

Utilizing the leave-one-out cross-validation (CV) procedure, also known as the jackknife, the jackknife model averaging (JMA) method of estimating $m(w)$ in [40] relaxes the assumptions in [39]. The submodels are now allowed to be non-nested and the error terms can be heteroskedastic. The sum of squared residuals in the JMA method is
$$CV(w) = \frac{1}{n}(y - \tilde{m}(w))'(y - \tilde{m}(w))$$
where $\tilde{m}(w)$ is the vector of jackknife estimators, each element computed with the corresponding observation deleted. To be more specific, the $i$-th element of $\tilde{m}_l$ is $x_{il}'(X_{l(-i)}'X_{l(-i)})^{-1}X_{l(-i)}'y_{-i}$, where $X_{l(-i)}$ is $X_l$ with its $i$-th row deleted and $y_{-i}$ is $y$ with its $i$-th element deleted. Thus
$$\tilde{u}(w) = \sum_{l=1}^{M} w_l(y - \tilde{m}_l) = \sum_{l=1}^{M} w_l\tilde{u}_l = \tilde{U}w$$
where $\tilde{U} = (\tilde{u}_1, \ldots, \tilde{u}_M)$ is an $n \times M$ matrix and $\tilde{u}_l = (\tilde{u}_{1l}, \ldots, \tilde{u}_{nl})'$ is an $n \times 1$ vector in which $\tilde{u}_{il}$ is computed with the $i$-th observation deleted. Then
$$CV(w) = \frac{1}{n}\tilde{u}(w)'\tilde{u}(w) = \frac{1}{n}w'\tilde{U}'\tilde{U}w$$
and the JMA weights $\tilde{w}$ are obtained by minimizing $CV(w)$ with respect to $w$, giving the JMA estimator $\tilde{m}(\tilde{w}) = \sum_{l=1}^{M}\tilde{w}_l\tilde{m}_l$. Reference [40] shows its asymptotic optimality, using [83,85], in the sense of minimizing the conditional risk, which is equivalent to the out-of-sample prediction MSE.
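For linear submodels the jackknife residuals needed for $\tilde{U}$ have the same closed form as in Section 2.1.4, $\tilde{u}_{il} = \hat{u}_{il}/(1 - h_{ii,l})$, so the JMA weights can be computed without refitting each model $n$ times; the sketch below is illustrative and uses the same simplex-constrained SLSQP setup as the MMA sketch above.
```python
# Hedged sketch: JMA weights from leave-one-out residuals of each submodel.
import numpy as np
from scipy.optimize import minimize

def loo_residuals(y, X):
    """Leave-one-out residuals via u_tilde_i = u_hat_i / (1 - h_ii)."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    return (y - H @ y) / (1.0 - np.diag(H))

def jma_weights(y, X_list):
    U = np.column_stack([loo_residuals(y, Xl) for Xl in X_list])
    A = U.T @ U / len(y)                                   # CV(w) = w'Aw
    M = len(X_list)
    cons = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]
    res = minimize(lambda w: w @ A @ w, np.full(M, 1.0 / M),
                   method="SLSQP", bounds=[(0.0, 1.0)] * M, constraints=cons)
    return res.x
```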
There are many extensions of the JMA method to other econometric models. Reference [86] does this for the quantile regression model. Reference [82] extends it to dependent time series models and models with GARCH errors. Also, using the MMA method of [39] for models with endogeneity, [87] develops MMA-based two-stage least squares (MATSLS), model averaging limited information maximum likelihood (MALIML), and model averaging Fuller (MAF) estimators.
However, it would be useful to have extensions of the MMA and JMA procedures to models estimated by GMM or IV. In addition, the sampling properties of the averaging estimators need to be developed for the purpose of statistical inference.

3. Nonparametric (NP) Model Selection and Model Averaging

3.1. NP Model Selection

Let us write the NP model as
$$y_i = m(x_i) + u_i$$
where $x_i$ is i.i.d. with density $f$ and the error $u_i$ is independent of $x_i$.
We can write the local linear model as
$$y_i = m(x) + (x_i - x)'\beta(x) + u_i = z_i(x)'\delta(x) + u_i$$
or
$$y = Z(x)\delta(x) + u$$
where $z_i(x) = [1\ \ (x_i - x)']'$, so that $Z(x)$ is an $n \times (q+1)$ matrix, and $\delta(x) = [m(x)\ \ \beta(x)']'$. Then the local linear LS estimator (LLLS) of $\delta(x)$ is
$$\hat{\delta}(x) = (Z(x)'K(x)Z(x))^{-1}Z(x)'K(x)y = P(x)y$$
where $P(x) = (Z(x)'K(x)Z(x))^{-1}Z(x)'K(x)$, $K(x) = diag(K((x_1 - x)/h), \ldots, K((x_n - x)/h))$ is a diagonal matrix in which the kernel $K((x_i - x)/h) = \prod_{j=1}^{q}K((x_{ij} - x_j)/h_j)$, and $h_j$ is the window-width for the $j$-th variable. From this, pointwise $\hat{m}(x) = [1\ \ 0]\hat{\delta}(x)$ and $\hat{\beta}(x) = [0\ \ 1]\hat{\delta}(x)$. Further, the profiled $\hat{m} = (\hat{m}(x_1), \ldots, \hat{m}(x_n))'$ can be written as
$$\hat{m} = Py$$
where $P = P(h)$ is the $n \times n$ matrix whose $i$-th row is $[1\ \ 0]P(x_i) = [1\ \ 0](Z(x_i)'K(x_i)Z(x_i))^{-1}Z(x_i)'K(x_i)$, $i = 1, \ldots, n$. If $h$ is fixed, then $\hat{m}$ is a linear estimator in $y$; it is a nonlinear estimator in $y$ if $h = \hat{h}$ is obtained either by a plug-in estimator or by cross-validation.
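A minimal sketch of the LLLS estimator at a single evaluation point is given below, using a product Gaussian kernel; the kernel choice, bandwidths, and simulated data are illustrative assumptions.
```python
# Hedged sketch: local linear least squares (LLLS) at one evaluation point.
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def llls(y, X, x0, h):
    """Local linear fit of y on X at the point x0; h is a vector of bandwidths."""
    n, q = X.shape
    # Product kernel weights K((x_i - x0)/h) = prod_j K((x_ij - x0_j)/h_j)
    w = np.prod(gaussian_kernel((X - x0) / h), axis=1)
    Z = np.column_stack([np.ones(n), X - x0])            # z_i(x) = [1, (x_i - x0)']
    WZ = Z * w[:, None]
    delta = np.linalg.solve(Z.T @ WZ, Z.T @ (w * y))     # (Z'KZ)^{-1} Z'Ky
    return delta[0], delta[1:]                            # m_hat(x0), beta_hat(x0)

rng = np.random.default_rng(5)
n = 300
X = rng.uniform(-2, 2, size=(n, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.3 * rng.normal(size=n)

m_hat, beta_hat = llls(y, X, x0=np.array([0.5, -0.5]), h=np.array([0.4, 0.4]))
print(round(m_hat, 3), np.round(beta_hat, 3))
```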
With respect to goodness-of-fit measures for NP models, we note that
$$V(y) = V(m(x)) + E[\sigma^2(x)].$$
So the global population goodness of fit is
$$\rho^2 = \frac{V(m(x))}{V(y)} = 1 - \frac{E[y - m(x)]^2}{V(y)}, \qquad 0 \le \rho^2 \le 1,$$
and its sample global estimator is given by
$$R^2 = 1 - \frac{\sum\hat{u}_i^2}{\sum(y_i - \bar{y})^2} = 1 - \frac{\hat{u}'\hat{u}}{y'M_2 y} = 1 - \frac{y'M_1(h)y}{y'M_2 y} = \frac{y'M_1^*(h)y}{y'M_2 y}$$
where $\hat{u} = y - \hat{m} = y - P(h)y = M(h)y$ (with $M(h) = I - P(h)$), $M_1(h) = M(h)'M(h)$, $M_1^*(h) = M_2 - M_1(h)$, and $M_2 = I - \iota\iota'/n$ with $\iota$ an $n \times 1$ vector of unit elements. However, $0 \le R^2 \le 1$ may not hold since $\sum(y_i - \bar{y})^2 \ne \sum(\hat{m}(x_i) - \bar{y})^2 + \sum\hat{u}_i^2$ in general. Therefore, one can use the following modification satisfying $0 \le R_1^2 \le 1$:
$$R_1^2 = R^2\, I(a \le 1)$$
where $a = \sum\hat{u}_i^2/\sum(y_i - \bar{y})^2$ and $I(\cdot)$ is an indicator function.
Another way to define a proper global $R^2$ is to first consider a local $R^2(x)$. This is based on the fact that, at the point $x$,
$$\sum(y_i - \bar{y})^2 K\Big(\frac{x_i - x}{h}\Big) = \sum(\hat{m}(x_i) - \bar{y})^2 K\Big(\frac{x_i - x}{h}\Big) + \sum\hat{u}_i^2 K\Big(\frac{x_i - x}{h}\Big)$$
because $\sum\hat{u}_i K\big(\frac{x_i - x}{h}\big) = 0$ and $\sum(x_i - x)\hat{u}_i K\big(\frac{x_i - x}{h}\big) = 0$ by the first-order conditions of local linear LS estimation. Thus a local $R^2(x)$ can be defined as
$$R^2(x) = \frac{\sum(\hat{m}(x_i) - \bar{y})^2 K\big(\frac{x_i - x}{h}\big)}{\sum(y_i - \bar{y})^2 K\big(\frac{x_i - x}{h}\big)} = \frac{SSR(x)}{SST(x)}$$
which satisfies $0 \le R^2(x) \le 1$. A global $R_2^2$ is then
$$R_2^2 = \frac{\int_x SSR(x)\,dx}{\int_x SST(x)\,dx}, \qquad 0 \le R_2^2 \le 1.$$
The goodness of fit $R_1^2$ is considered in [88], where its application to the selection of statistically significant variables in NP regression is shown. $R_2^2$ is introduced in [89,90]. For variable selection it may be more appropriate to consider an adjusted $R_1^2$ given by
$$R_{1a}^2 = R_a^2\, I(b \le 1)$$
where $R_a^2 = 1 - \frac{(n-1)}{tr M_1(h)}\cdot\frac{y'M_1(h)y}{y'M_2 y} = 1 - b$. As a practical matter, the most critical choices in model selection for the nonparametric regression estimation above are the window-width $h$ and the number of variables $q$. Further, if instead of the local linear estimator used above we consider a local polynomial of degree $d$, then $Z(x)$ in $\hat{\delta}(x)$ would be an $n \times (qd + 1)$ matrix and we would need an additional selection of $d$. Thus the nonparametric goodness-of-fit measures described above should be viewed as $R_1^2 = R_1^2(h, q, d)$ and $R_{1a}^2 = R_{1a}^2(h, q, d)$, and they can be used for choosing, say, $h$ for fixed $q$ and $d$ as the value which maximizes $R_{1a}^2(h, q, d)$. We note that $d = 0$ gives the well known Nadaraya-Watson local constant estimator and $d = 1$ gives the local linear estimator. Further, for given $d$ and $h$, $R_1^2 = R_1^2(q)$ and $R_2^2 = R_2^2(q)$ can be used to choose $q$.
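Given the smoother matrix $P(h)$ of any linear-in-$y$ estimator, $R_1^2$ and $R_{1a}^2$ can be computed directly; the sketch below follows the formulas above (including the adjusted form as reconstructed here, which should be treated as an assumption). $P$ could, for example, be assembled from the rows of the LLLS smoother in the earlier sketch.
```python
# Hedged sketch: nonparametric R_1^2 and adjusted R_1a^2 from a smoother matrix P.
import numpy as np

def np_r2_measures(y, P):
    n = len(y)
    M = np.eye(n) - P
    M1 = M.T @ M
    M2 = np.eye(n) - np.ones((n, n)) / n
    a = (y @ M1 @ y) / (y @ M2 @ y)           # residual SS relative to total SS
    b = (n - 1) / np.trace(M1) * a            # degrees-of-freedom adjusted version
    r1 = (1 - a) * (a <= 1)                   # R_1^2 = R^2 * I(a <= 1)
    r1a = (1 - b) * (b <= 1)                  # R_1a^2 = R_a^2 * I(b <= 1)
    return r1, r1a
```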

3.1.1. AIC, BIC, and GCV

In the NP case, model selection (choosing $q$) using AIC is proposed by [91]. Based on the LCLS estimator,
$$AIC = \log\hat{\sigma}^2 + \frac{1 + tr P(h)/n}{1 - (tr P(h) + 2)/n}$$
where $\hat{\sigma}^2 = \hat{u}'\hat{u}/n = y'M_1(h)y/n$, in which $M_1(h) = M(h)'M(h)$ and $M(h) = I - P(h)$, where the $(i,j)$-th element of $P(h)$ is $P_{i,j}(h) = K_{ij}/\sum_{l=1}^{n}K_{il}$ and $K_{ij} = \prod_{s=1}^{q}h_s^{-1}K((x_{is} - x_{js})/h_s)$.
In the same way, we note that $AIC = AIC(h, q, d)$, and it can be used to select, for example, $h$ given $q$ and $d$ ([92]), or $q$ given $h$ and $d$; in the latter case $AIC = AIC(q)$. A result for a $BIC = BIC(q)$ procedure in the NP model is not yet known. However, if one considers NP sieve regression of the type $m(x) = \sum_{j=1}^{q}z_j(x)\beta_j$, where the $z_j(x)$ are nonlinear functions of $x$ and $q \to \infty$, then BIC is similar to the BIC given in [96]. This includes, for example, the special case of a series expansion in which $z_j(x) = x_j$, and a spline regression in which $m(x) = \sum_{j=1}^{p}x_j\beta_j + \sum_{j=1}^{r}\beta_{p+j}(x - t_j)I(x \ge t_j)$, with $q = p + r$, $t_j$ the $j$-th knot, and $I(x \ge t_j) = 1$ if $x \ge t_j$ and 0 otherwise.
In [9] an estimate of the minimizer of $E L(q)$, called the GCV, is proposed which does not require knowledge of $\sigma^2$. This can be written as the minimization of
$$V(q) = \frac{n^{-1}\sum_{i=1}^{n}(y_i - \hat{m}(x_i))^2}{(1 - n^{-1}tr P)^2}$$
with respect to $q$. It has been shown by [9] that $E[V(q)|x] - \sigma^2 \approx E[L(q)|x]$ for large $n$, and that the minimizer $\hat{q}$ of $E V(q)$ is asymptotically optimal in the sense that $E L(\hat{q})/\min_q E L(q) \to 1$ as $n \to \infty$. That is, the risk of $\hat{q}$ approaches the minimum as $n \to \infty$. We note that $L(q)$ in the parametric and nonparametric cases is given in Section 2.1.3 and Section 3.1.2, respectively.
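For a linear smoother $\hat{m} = Py$, the GCV score is a one-line computation, as in the minimal sketch below.
```python
# Hedged sketch: GCV score V = (RSS/n) / (1 - tr(P)/n)^2 for a linear smoother m_hat = P y.
import numpy as np

def gcv(y, P):
    n = len(y)
    resid = y - P @ y
    return (resid @ resid / n) / (1.0 - np.trace(P) / n) ** 2
# The q (or bandwidth) giving the smallest GCV score is selected.
```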

3.1.2. Mallows Model Selection

Let us write the regression model
$$y_i = m(x_i) + u_i$$
where $E[u_i|x_i] = 0$ and $E(u_i^2|x_i) = \sigma^2$. Then, for $m = (m(x_1), \ldots, m(x_n))'$, $y = (y_1, \ldots, y_n)'$, and $u = (u_1, \ldots, u_n)'$,
$$y = m + u.$$
Let us consider the LLLS estimator of $m$, which is linear in $y$,
$$\hat{m} = \hat{m}(q) = P(q)y$$
where $P = P(h) = P(q)$ is as defined in Section 3.1. When $\hat{h} \to h$ for large $n$, $\hat{m}$ becomes asymptotically linear.
Our objective is to choose $q$ such that the average mean squared error (risk) $E[L(q)|x]$ is minimized, where
$$L(q) = \frac{1}{n}(m - \hat{m}(q))'(m - \hat{m}(q)).$$
We note that for $\hat{u} = y - \hat{m}(q)$,
$$L(q) = \frac{1}{n}(\hat{m}(q) - y + y - m)'(\hat{m}(q) - y + y - m) = \frac{1}{n}[\hat{u}'\hat{u} + u'u - 2\hat{u}'u]$$
and
$$R(q) = E(L(q)|x) = \frac{1}{n}E[\hat{u}'\hat{u} + 2\sigma^2\, tr P(q) - n\sigma^2].$$
Further, Mallows' criterion selects $q$ (the number of variables in $x_i$) by minimizing
$$C(q) = \frac{1}{n}(y - \hat{m}(q))'(y - \hat{m}(q)) + \frac{2\sigma^2}{n}tr P(q)$$
where the second term on the right-hand side is the penalty. Essentially, the minimization of $C(q)$ is the same as the minimization of the unbiased estimator of $E[L(q)|x] = R(q)$, since $\sigma^2$ does not depend on $q$; see Section 2.1.3 and [6,9].

3.1.3. Cross Validation (CV)

The CV method is one of the most widely used window-width selectors for NP kernel smoothing. We note that the cross-validation estimator of the integrated squared error weighted by the density $f(x)$,
$$ISE(q) = \int_x(\hat{m}(x) - m(x))^2 f(x)\,dx,$$
is given by
$$CV(q) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{m}_{-i}(x_i))^2$$
where $\hat{m}_{-i}(x_i)$ is $\hat{m}(x_i)$ computed after deleting the $i$-th observation $(y_i, x_i)$ from the sample. In fact,
$$CV(q) = \frac{1}{n}\sum_{i=1}^{n}(m(x_i) - \hat{m}_{-i}(x_i))^2 + \frac{2}{n}\sum_{i=1}^{n}(m(x_i) - \hat{m}_{-i}(x_i))u_i + \frac{1}{n}\sum_{i=1}^{n}u_i^2$$
where the first term on the right-hand side is a good approximation to $ISE(h)$, the second term is generally negligibly small, and the third term converges to a constant $\sigma^2 = E[\sigma^2(x)]$ free of $h$. Therefore $CV(q) = ISE(q) + \sigma^2$ asymptotically.
Also, in the case where $m(x)$ is a sieve regression, [96] shows that CV is an unbiased estimator of the MSE of the prediction error (MSEPE) of $m$, $MSEPE = E[y_{n+1} - \hat{m}(x_{n+1})]^2$, see Section 2.1.4. In addition, the minimization of MSEPE is equivalent to the minimization of the MSE and the integrated MSE (IMSE) of the estimated $m$ for conditional and unconditional $x$, respectively.
If, instead of the local linear estimator of $m(x_i)$, we consider the local polynomial of order $d$, then $\hat{m}(x_i)$ is the LPLS estimator [2], and $CV(q) = CV(h, q, d)$ continues to hold. For $d = 0$ we have the local constant LS (LCLS) estimator developed by [98,99]. For $d = 1$ we have the LLLS estimator considered above. In practice, the values of $h$ and $d$ can be determined by minimizing $CV(h, q, d)$ with respect to $h$ and $d$ for given $q$, as developed by [100]. For a vector $x_i$, if the chosen $h_j = \hat{h}_j$ for some $j$ tends to infinity (is very large), then the corresponding variable is irrelevant. This can be observed from a simple example. For two variables $x_{i1}, x_{i2}$, the LCLS estimator is $\hat{m}(x_1, x_2) = \hat{m}(x) = \sum y_i K\big(\frac{x_{i1} - x_1}{h_1}\big)K\big(\frac{x_{i2} - x_2}{h_2}\big)\big/\sum K\big(\frac{x_{i1} - x_1}{h_1}\big)K\big(\frac{x_{i2} - x_2}{h_2}\big)$. Thus, if $h_2 \to \infty$, then $K\big(\frac{x_{i2} - x_2}{h_2}\big) = K(0)$ is constant and $\hat{m}(x) = \hat{m}(x_1, x_2) = \sum y_i K\big(\frac{x_{i1} - x_1}{h_1}\big)\big/\sum K\big(\frac{x_{i1} - x_1}{h_1}\big)$. Thus a large estimated value of the window-width leads to the exclusion of the corresponding variable, and hence to variable selection.
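The following sketch illustrates this variable-removal behavior: leave-one-out CV for the local constant estimator is minimized over a bandwidth grid for two regressors, the second of which is irrelevant by construction, so the CV-chosen $h_2$ tends to land at the largest grid value. The Gaussian kernel, grid, and data-generating process are illustrative assumptions.
```python
# Hedged sketch: CV bandwidth choice for the local constant (Nadaraya-Watson) estimator.
import numpy as np
from itertools import product

def nw_loo_cv(y, X, h):
    # pairwise product-kernel weights (Gaussian kernel)
    U = (X[:, None, :] - X[None, :, :]) / h
    K = np.exp(-0.5 * (U ** 2).sum(axis=2))
    np.fill_diagonal(K, 0.0)                 # leave-one-out: drop own observation
    m_loo = (K @ y) / K.sum(axis=1)
    return np.mean((y - m_loo) ** 2)

rng = np.random.default_rng(6)
n = 300
X = rng.uniform(-2, 2, size=(n, 2))
y = np.sin(np.pi * X[:, 0]) + 0.3 * rng.normal(size=n)   # x_2 is irrelevant

grid = [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 10.0, 100.0]
best = min(product(grid, grid), key=lambda h: nw_loo_cv(y, X, np.array(h)))
print("CV-chosen (h1, h2):", best)   # h2 typically lands at the largest grid value
```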
In a seminal paper, [83] shows that the Mallows, GCV, and CV procedures are asymptotically equivalent and all of them lead to optimal smoothing in the sense that
$$\frac{\int(\hat{m}(x, \hat{q}) - m(x))^2\,dF(x)}{\inf_q\int(\hat{m}(x, q) - m(x))^2\,dF(x)} \stackrel{p}{\to} 1$$
where $\hat{m}(x) = \hat{m}(x, \hat{q})$, given $h$ and $d$, is an estimator of $m(x)$ with $\hat{q}$ obtained using one of the above procedures.
Also, [101] demonstrates that for the local constant estimator ($d = 0$ and given $q$), the $CV = CV(h, q, 0)$ smoothing selectors of $h$ are asymptotically equivalent to GCV selectors. In an important paper, [92] shows the asymptotic normality of $\hat{m}(x) = \hat{m}(x, \hat{h})$, where $\hat{h}$ is obtained by the CV method and $x_i$ is a vector of mixed continuous and discrete variables. Their extensive simulation results suggest (without a theoretical proof) that the AIC window-width selection criterion is asymptotically equivalent to the CV method, but that in small samples AIC tends to perform better than CV. Further, with respect to the comparison of NP and parametric models, their results explain the observation of [102] that NP estimators with smoothing parameters $h$ chosen by CV can yield better predictions than commonly used parametric methods for the datasets of several countries. Reference [85] shows that CV is optimal under heteroskedasticity. For GMM model selection, which involves selecting moment conditions, see [93]. Also, see [94] for the use of minimization of empirical likelihood/KLIC, and the comments by [95] claiming a fundamental flaw in that application of the KLIC.

3.2. NP Model Averaging

Let us consider $\hat{m}_l$, $l = 1, \ldots, M$, to be the set of estimators of $m$ corresponding to the different sets of regressors considered in model selection. Then
$$\hat{m}(w) = \sum_{l=1}^{M} w_l\hat{m}_l = P(w)y$$
where $\hat{m}_l = P_l y$, $P(w) = \sum_{l=1}^{M} w_l P_l$, and $P_l$ is the $P$ matrix defined before, based here on the variables in the $l$-th model. The choice of $w$ can then be determined by applying the Mallows criterion (see Section 2.2.2) as
$$C(w) = w'\hat{U}'\hat{U}w + 2\sigma^2 q^{*\prime}w$$
where $q^* = (tr P(q_1), \ldots, tr P(q_M))'$ and $\hat{U} = (\hat{u}_1, \ldots, \hat{u}_M)$ is the matrix of NP residuals from all the models. Thus we get $\hat{m}(\hat{w}) = \sum_{l=1}^{M}\hat{w}_l\hat{m}_l$.
Similarly, as in Section 2.2.3, if we calculate $\tilde{m}_l$ on a leave-one-out basis, then $w$ can be determined by minimizing
$$CV(w) = \frac{1}{n}w'\tilde{U}'\tilde{U}w$$
in which the NP residual matrix is $\tilde{U} = (\tilde{u}_1, \ldots, \tilde{u}_M)$ with $\tilde{u}_l = (\tilde{u}_{1l}, \ldots, \tilde{u}_{nl})'$, and $\tilde{u}_{il}$ is computed with the $i$-th observation deleted.
For the fixed window-width the optimality result of w ^ can be shown to follow from [83]. However, for h = h ^ the validity of Li’s result needs further investigation.

4. Conclusions

Nonparametric and parametric models are both widely used in econometric theory and practice. In all applications, an important issue is to reduce model uncertainty by using model selection or model averaging. This paper selectively reviews frequentist results on model selection and model averaging in the regression context.
Most of the results presented here are under the i.i.d. assumption. It would be useful to relax this assumption to allow dependence or heterogeneity in the data; see [103] for model selection in dependent time series models using various CV procedures. A systematic study of the properties of estimators based on FMA is warranted. Further, results need to be developed for more complicated nonparametric models, e.g., panel data models and models with endogenous variables, although for the parametric case see [104,105,106,107,108]. Also, the properties of NP model averaging estimators when the window-width in kernel regression is estimated remain to be developed, although readers can see [96] for NP results for estimators based on the sieve method.

Acknowledgements

The authors are thankful to L. Su, A. Wan, X. Zhang, and G. Zou for discussions and references on the subject matter of this paper. They are also grateful to the guest editor, Tomohiro Ando, and anonymous referees for their constructive suggestions and comments. The first author is also thankful to the Academic Senate, UCR, for its financial support.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. A. Pagan, and A. Ullah. Nonparametric Econometrics. Cambridge, UK: Cambridge University Press, 1999. [Google Scholar]
  2. Q. Li, and J.S. Racine. Nonparametric Econometrics: Theory and Practice. Princeton, NJ, USA: Princeton University Press, 2007. [Google Scholar]
  3. A. Belloni, and V. Chernozhukov. “L1-penalized quantile regression in high-dimensional sparse models.” Ann. Stat. 39 (2011): 82–130. [Google Scholar] [CrossRef]
  4. C. Zhang, J. Fan, and T. Yu. “Multiple testing via FDRL for large-scale imaging data.” Ann. Stat. 39 (2011): 613–642. [Google Scholar] [CrossRef] [PubMed]
  5. H. Akaike. “Information Theory and An Extension of the Maximum Likelihood Principle.” In International Symposium on Information Theory. Edited by B.N. Petrov and F. Csaki. New York, USA: Springer-Verlag, 1973, pp. 267–281. [Google Scholar]
  6. C.L. Mallows. “Some comments on Cp.” Technometrics 15 (1973): 661–675. [Google Scholar]
  7. G. Schwarz. “Estimating the dimension of a model.” Ann. Stat. 6 (1978): 461–464. [Google Scholar] [CrossRef]
  8. M. Stone. “Cross-validatory choice and assessment of statistical predictions.” J. R. Stat. Soc. 36 (1974): 111–147. [Google Scholar]
  9. P. Craven, and G. Wahba. “Smoothing noisy data with spline functions.” Numer. Math. 31 (1979): 377–403. [Google Scholar] [CrossRef]
  10. G. Claeskens, and N.L. Hjort. “The focused information criterion.” J. Am. Stat. Assoc. 98 (2003): 900–945. [Google Scholar] [CrossRef]
  11. I.E. Frank, and J.H. Friedman. “A statistical view of some chemometrics regression tools.” Technometrics 35 (1993): 109–135. [Google Scholar] [CrossRef]
  12. W. Fu, and K. Knight. “Asymptotics for lasso-type estimators.” Ann. Stat. 28 (2000): 1356–1378. [Google Scholar] [CrossRef]
  13. A.E. Hoerl, and R.W. Kennard. “Ridge regression: Biased estimation for nonorthogonal problems.” Technometrics 12 (1970): 55–67. [Google Scholar] [CrossRef]
  14. J. Fan, and J. Lv. “A selective overview of variable selection in high dimensional feature space.” Stat. Sin. 20 (2010): 101–148. [Google Scholar] [PubMed]
  15. P. Bühlmann, and S. Van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. New York, NY, USA: Springer, 2011. [Google Scholar]
  16. G. Claeskens, and N.L. Hjort. Model Selection and Model Averaging. Cambridge, UK: Cambridge University Press, 2008. [Google Scholar]
  17. D. Andrews, and B. Lu. “Consistent model and moment selection procedures for GMM estimation with application to dynamic panel data models.” J. Econom. 101 (2001): 123–164. [Google Scholar] [CrossRef]
  18. A.R. Hall, A. Inoue, K. Jana, and C. Shin. “Information in generalized method of moments estimation and entropy-based moment selection.” J. Econom. 138 (2007): 488–512. [Google Scholar] [CrossRef]
  19. B.M. Pötscher. “Effects of model selection on inference.” Econom. Theory 7 (1991): 163–185. [Google Scholar] [CrossRef]
  20. P. Kabaila. “The Effect of Model Selection on Confidence Regions and Prediction Regions.” Econom. Theory 11 (1995): 537–549. [Google Scholar] [CrossRef]
  21. P. Bühlmann. “Efficient and adaptive post-model-selection estimators.” J. Stat. Plan. Inference 79 (1999): 1–9. [Google Scholar] [CrossRef]
  22. H. Leeb, and B.M. Pötscher. “The finite-sample distribution of post-model-selection estimators and uniform versus nonuniform approximations.” Econom. Theory 19 (2003): 100–142. [Google Scholar] [CrossRef]
  23. H. Leeb, and B.M. Pötscher. “Can one estimate the conditional distribution of post-model-selection estimators? ” Ann. Stat. 34 (2006): 2554–2591. [Google Scholar] [CrossRef]
  24. L. Breiman. “Heuristics of instability and stabilization in model selection.” Ann. Stat. 24 (1996): 2350–2383. [Google Scholar] [CrossRef]
  25. S. Jin, L. Su, and A. Ullah. “Robustify financial time series forecasting.” Econom. Rev., 2013, in press. [Google Scholar] [CrossRef]
  26. J.F. Geweke. Contemporary Bayesian Econometrics and Statistics. Hoboken, NJ, USA: John Wiley and Sons Inc., 2005. [Google Scholar]
  27. J.F. Geweke. “Bayesian model comparison and validation.” Am. Econ. Rev. Pap. Proc. 97 (2007): 60–64. [Google Scholar] [CrossRef]
  28. D. Draper. “Assessment and propagation of model uncertainty.” J. R. Stat. Soc. 57 (1995): 45–97. [Google Scholar]
  29. J.A. Hoeting, D. Madigan, A.E. Raftery, and C.T. Volinsky. “Bayesian model averaging: A tutorial (with discussion).” Stat. Sci. 14 (1999): 382–417. [Google Scholar]
  30. M. Clyde, and E.I. George. “Model uncertainty.” Stat. Sci. 19 (2004): 81–94. [Google Scholar]
  31. W.A. Brock, S.N. Durlauf, and K.D. West. “Policy evaluation in uncertain economic environments.” Brook. Pap. Econ. Act. 2003 (2003): 235–301. [Google Scholar] [CrossRef]
  32. X. Sala-i-Martin, G. Doppelhofer, and R.I. Miller. “Determinants of long-term growth: A Bayesian Averaging of Classical Estimates (BACE) approach.” Am. Econ. Rev. 94 (2004): 813–835. [Google Scholar] [CrossRef]
  33. J.R. Magnus, O. Powell, and P. Prüfer. “A comparison of two model averaging techniques with an application to growth empirics.” J. Econom. 154 (2010): 139–153. [Google Scholar] [CrossRef]
  34. S.T. Buckland, K.P. Burnham, and N.H. Augustin. “Model selection: An integral part of inference.” Biometrics 53 (1997): 603–618. [Google Scholar] [CrossRef]
  35. Y. Yang. “Adaptive regression by mixing.” J. Am. Stat. Assoc. 96 (2001): 574–586. [Google Scholar] [CrossRef]
  36. K.P. Burnham, and D.R. Anderson. Model Selection and Multimodel Inference: A Practical Information-Theoretical Approach. New York, NY, USA: Springer-Verlag, 2002. [Google Scholar]
  37. G. Leung, and A.R. Barron. “Information theory and mixing least-squares regressions.” IEEE Trans. Inf. Theory 52 (2006): 3396–3410. [Google Scholar] [CrossRef]
  38. Z. Yuan, and Y. Yang. “Combining linear regression models: When and how? ” J. Bus. Econ. Stat. 100 (2005): 1202–1204. [Google Scholar] [CrossRef]
  39. B.E. Hansen. “Notes and comments least squares model averaging.” Econometrica 75 (2007): 1175–1189. [Google Scholar] [CrossRef]
  40. B.E. Hansen, and J. Racine. “Jackknife model averaging.” J. Econom. 167 (2012): 38–46. [Google Scholar] [CrossRef]
  41. A.T.K. Wan, X. Zhang, and G. Zou. “Least squares model averaging by mallows criterion.” J. Econom. 156 (2010): 277–283. [Google Scholar] [CrossRef]
  42. G. Kapetanios, V. Labhard, and S. Price. “Forecasting using predictive likelihood model averaging.” Econ. Lett. 91 (2006): 373–379. [Google Scholar] [CrossRef]
  43. A.T.K. Wan, and X. Zhang. “On the use of model averaging in tourism research.” Ann. Tour. Res. 36 (2009): 525–532. [Google Scholar] [CrossRef]
  44. J.M. Bates, and C.W. Granger. “The combination of forecasts.” Oper. Res. Q. 20 (1969): 451–468. [Google Scholar] [CrossRef]
  45. I. Olkin, and C.H. Speigelman. “A semiparametric approach to density estimation.” J. Am. Stat. Assoc. 82 (1987): 858–865. [Google Scholar] [CrossRef]
  46. Y. Fan, and A. Ullah. “Asymptotic normality of a combined regression estimator.” J. Multivar. Anal. 71 (1999): 191–240. [Google Scholar] [CrossRef]
  47. D.H. Wolpert. “Stacked generalization.” Neural Netw. 5 (1992): 241–259. [Google Scholar] [CrossRef]
  48. M. LeBlanc, and R. Tibshirani. “Combining estimates in regression and classification.” J. Am. Stat. Assoc. 91 (1996): 1641–1650. [Google Scholar] [CrossRef]
  49. Y. Yang. “Mixing strategies for density estimation.” Ann. Stat. 28 (2000): 75–87. [Google Scholar] [CrossRef]
  50. O. Catoni. The Mixture Approach to Universal Model Selection. Technical Report; Paris, France: Ecole Normale Superieure, 1997. [Google Scholar]
  51. M.I. Jordan, and R.A. Jacobs. “Hierarchical mixtures of experts and the EM algorithm.” Neural Comput. 6 (1994): 181–214. [Google Scholar] [CrossRef]
  52. X. Jiang, and M.A. Tanner. “On the asymptotic normality of hierarchical mixtures-of-experts for generalized linear models.” IEEE Trans. Inf. Theory 46 (2000): 1005–1013. [Google Scholar] [CrossRef]
  53. V.G. Vovk. “Aggregating Strategies.” In Proceedings of the 3rd Annual Workshop on Computational Learning Theory, Rochester, NY, USA, 06–08 August 1990; Volume 56, pp. 371–383.
  54. V.G. Vovk. “A game of prediction with expert advice.” J. Comput. Syst. Sci. 56 (1998): 153–173. [Google Scholar] [CrossRef]
  55. N. Merhav, and M. Feder. “Universal prediction.” IEEE Trans. Inf. Theory 44 (1998): 2124–2147. [Google Scholar] [CrossRef]
  56. A. Ullah. “Nonparametric estimation of econometric functionals.” Can. J. Econ. 21 (1988): 625–658. [Google Scholar] [CrossRef]
  57. J. Fan, and I. Gijbels. Local Polynomial Modelling and Its Applications. London, UK: Chapman and Hall, 1996. [Google Scholar]
  58. R.L. Eubank. Nonparametric Regression and Spline Smoothing. New York, NY, USA: CRC Press, 1999. [Google Scholar]
  59. S. Geman, and C. Hwang. “Diffusions for global optimization.” SIAM J. Control Optim. 24 (1982): 1031–1043. [Google Scholar] [CrossRef]
  60. W.K. Newey. “Convergence rates and asymptotic normality for series estimators.” J. Econom. 79 (1997): 147–168. [Google Scholar] [CrossRef]
  61. H. Wang, X. Zhang, and G. Zou. “Frequentist model averaging estimation: A review.” J. Syst. Sci. Complex. 22 (2009): 732–748. [Google Scholar] [CrossRef]
  62. L. Su, and Y. Zhang. “Variable Selection in Nonparametric and Semiparametric Regression Models.” In Handbook of Applied Nonparametric and Semiparametric Econometrics and Statistics. Edited by A. Ullah, J. Racine and L. Su. Oxford, UK: Oxford University Press, 2013, in press. [Google Scholar]
  63. A.K. Srivastava, V.K. Srivastava, and A. Ullah. “The coefficient of determination and its adjusted version in linear regression models.” Econom. Rev. 14 (1995): 229–240. [Google Scholar] [CrossRef]
  64. V. Rousson, and N.F. Gosoniu. “An R-square coefficient based on final prediction error.” Stat. Methodol. 4 (2007): 331–340. [Google Scholar] [CrossRef]
  65. Y. Wang. On Efficiency Properties of An R-square Coefficient Based on Final Prediction Error. Working Paper; Beijing, China: School of International Trade and Economics, University of International Business and Economics, 2013. [Google Scholar]
  66. K. Takeuchi. “Distribution of information statistics and criteria for adequacy of models.” Math. Sci. 153 (1976): 12–18, In Japanese. [Google Scholar]
  67. E. Maasoumi. “A compendium to information theory in economics and econometrics.” Econom. Rev. 12 (1993): 137–181. [Google Scholar] [CrossRef]
  68. A. Ullah. “Entropy, divergence and distance measures with econometric applications.” J. Stat. Plan. Inference 49 (1996): 137–162. [Google Scholar] [CrossRef]
  69. R. Nishii. “Asymptotic properties of criteria for selection of variables in multiple regression.” Ann. Stat. 12 (1984): 758–765. [Google Scholar] [CrossRef]
  70. E.J. Hannan, and B.G. Quinn. “The determination of the order of an autoregression.” J. R. Stat. Soc. 41 (1979): 190–195. [Google Scholar]
  71. C.M. Hurvich, and C.L. Tsai. “Regression and time series model selection in small samples.” Biometrika 76 (1989): 297–307. [Google Scholar] [CrossRef]
  72. J. Kuha. “AIC and BIC: Comparisons of assumptions and performance.” Sociol. Methods Res. 33 (2004): 188–229. [Google Scholar] [CrossRef]
  73. M. Stone. “An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion.” J. R. Stat. Soc. 39 (1977): 44–47. [Google Scholar]
  74. M. Stone. “Comments on model selection criteria of Akaike and Schwarz.” J. R. Stat. Soc. 41 (1979): 276–278. [Google Scholar]
  75. G.S. Maddala. Introduction to Econometrics. New York, NY, USA: Macmillan, 1988. [Google Scholar]
  76. R. Tibshirani. “Regression shrinkage and selection via the lasso.” J. R. Stat. Soc. 58 (1996): 267–288. [Google Scholar]
  77. A. Ullah, A.T.K. Wan, H. Wang, X. Zhang, and G. Zou. A Semiparametric Generalized Ridge Estimator and Link with Model Averaging. Working Paper; Riverside, CA, USA: Department of Economics, University of California, 2013. [Google Scholar]
  78. H. Zou. “The adaptive lasso and its oracle properties.” J. Am. Stat. Assoc. 101 (2006): 1418–1429. [Google Scholar] [CrossRef]
  79. C. Zhang. “Nearly unbiased variable selection under minimax concave penalty.” Ann. Stat. 38 (2010): 894–942. [Google Scholar] [CrossRef]
  80. J. Fan, and R. Li. “Variable selection via nonconcave penalized likelihood and its oracle properties.” J. Am. Stat. Assoc. 96 (2001): 1348–1360. [Google Scholar] [CrossRef]
  81. B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. “Least angle regression.” Ann. Stat. 32 (2004): 407–499. [Google Scholar]
  82. X. Zhang, A.T.K. Wan, and S.Z. Zhou. “Focused information criteria, model selection, and model averaging in a tobit model with a nonzero threshold.” J. Bus. Econ. Stat. 30 (2012): 132–143. [Google Scholar] [CrossRef]
  83. K.C. Li. “Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: discrete index set.” Ann. Stat. 15 (1987): 958–975. [Google Scholar] [CrossRef]
  84. B. Hansen. “Least-squares forecast averaging.” J. Econom. 146 (2008): 342–350. [Google Scholar] [CrossRef]
  85. D.W.K. Andrews. “Asymptotic optimality of generalized CL, cross-validation, and generalized cross-validation in regression with heteroskedastic errors.” J. Econom. 47 (1991): 359–377. [Google Scholar] [CrossRef]
  86. X. Lu, and L. Su. Jackknife Model Averaging for Quantile Regressions. Working Paper; Singapore: School of Economics, Singapore Management University, 2012. [Google Scholar]
  87. G. Kuersteiner, and R. Okui. “Constructing optimal instruments by first-stage prediction averaging.” Econometrica 78 (2010): 697–718. [Google Scholar]
  88. F. Yao, and A. Ullah. “A nonparametric R2 test for the presence of relevant variables.” J. Stat. Plan. Inference 143 (2013): 1527–1547. [Google Scholar] [CrossRef]
  89. L. Su, and A. Ullah. “A nonparametric goodness-of-fit-based test for conditional heteroskedasticity.” Econom. Theory 29 (2013): 187–212. [Google Scholar] [CrossRef]
  90. L.H. Huang, and J. Chen. “Analysis of variance, coefficient of determination and F-test for local polynomial regression.” Ann. Stat. 36 (2008): 2085–2109. [Google Scholar] [CrossRef]
  91. C. Hurvich, J. Simonoff, and C. Tsai. “Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion.” J. R. Stat. Soc. 60 (1998): 271–293. [Google Scholar] [CrossRef]
  92. J. Racine, and Q. Li. “Nonparametric estimation of regression functions with both categorical and continuous data.” J. Econom. 119 (2004): 99–130. [Google Scholar] [CrossRef]
  93. D.W.K. Andrews. “Consistent moment selection procedures for generalized method of moments estimation.” Econometrica 67 (1999): 543–564. [Google Scholar] [CrossRef]
  94. X. Chen, H. Hong, and M. Shum. “Nonparametric likelihood ratio model selection tests between parametric likelihood and moment condition models.” J. Econom. 141 (2007): 109–140. [Google Scholar] [CrossRef]
  95. S.M. Schennach. “Instrumental variable estimation of nonlinear errors-in-variables models.” Econometrica 75 (2007): 201–239. [Google Scholar] [CrossRef]
  96. B. Hansen. Nonparametric Sieve Regression: Least Squares, Averaging Least Squares, and Cross-validation. Working Paper; Madison, WI, USA: University of Wisconsin, 2012. [Google Scholar]
  97. H. Liang, G. Zou, A.T.K. Wan, and X. Zhang. “Optimal weight choice for frequentist model average estimators.” J. Am. Stat. Assoc. 106 (2011): 1053–1066. [Google Scholar] [CrossRef]
  98. E.A. Nadaraya. “Some new estimates for distribution functions.” Theory Probab. Its Appl. 9 (1964): 497–500. [Google Scholar] [CrossRef]
  99. G.S. Watson. “Smooth regression analysis.” Sankhya Ser. A 26 (1964): 359–372. [Google Scholar]
  100. P.G. Hall, and J.S. Racine. Infinite Order Cross-validated Local Polynomial Regression. Working Paper; Ontario, Canada: Department of Economics, McMaster University, 2013. [Google Scholar]
  101. W. Härdle, P. Hall, and J.S. Marron. “How far are automatically chosen regression smoothing parameters from their optimum?” J. Am. Stat. Assoc. 83 (1988): 86–99. [Google Scholar] [CrossRef]
  102. Q. Li, and J. Racine. Empirical Applications of Smoothing Categorical Variables. Working Paper; Ontario, Canada: Department of Economics, McMaster University, 2001. [Google Scholar]
  103. J. Racine. “Consistent cross-validatory model-selection for dependent data: Hv-block cross-validation.” J. Econom. 99 (2000): 39–61. [Google Scholar] [CrossRef]
  104. M. Caner. “A lasso type GMM estimator.” Econom. Theory 25 (2009): 270–290. [Google Scholar] [CrossRef]
  105. M. Caner, and M. Fan. A Near Minimax Risk Bound: Adaptive Lasso with Heteroskedastic Data in Instrumental Variable Selection. Working Paper; Raleigh, NC, USA: North Carolina State University, 2011. [Google Scholar]
  106. P.E. Garcia. Instrumental Variable Estimation and Selection with Many Weak and Irrelevant Instruments. Working Paper; Madison, WI, USA: University of Wisconsin, 2011. [Google Scholar]
  107. Z. Liao. “Adaptive GMM shrinkage estimation with consistent moment selection.” Econom. Theory FirstView (2013): 1–48. [Google Scholar] [CrossRef]
  108. E. Gautier, and A. Tsybakov. High-Dimensional Instrumental Variables Regression and Confidence Sets. Working Paper; Malakoff Cedex, France: Centre de Recherche en Economie et Statistique, 2011. [Google Scholar]
