Abstract
This paper presents recent developments in model selection and model averaging for parametric and nonparametric models. While there is extensive literature on model selection under parametric settings, we present recently developed results in the context of nonparametric models. In applications, estimation and inference are often conducted under the selected model without considering the uncertainty from the selection process. This often leads to inefficiency in results and misleading confidence intervals. Thus an alternative to model selection is model averaging where the estimated model is the weighted sum of all the submodels. This reduces model uncertainty. In recent years, there has been significant interest in model averaging and some important developments have taken place in this area. We present results for both the parametric and nonparametric cases. Some possible topics for future research are also indicated.
1. Introduction
Over the last several years, many econometricians and statisticians have devoted their efforts to finding various paths to the true model. The uncertainty in correctly specifying the regression model has generated a large literature in two major directions: first, what variables are to be included, and second, how they are related to the dependent variable in the model. Thus "what" refers to determining the variables to be included in constructing the model, and "how" refers to finding the correct functional form, e.g., a parametric specification (linear, quadratic, etc.) or, more generally, nonparametric smoothing methods that do not require specifying a parametric functional form but instead let the data search for a suitable function that describes the available data well; see [1,2], among others.
To determine "what", model selection was first introduced, and it has a huge literature in statistics and econometrics. In fact, in recent years, model selection (variable selection) procedures have become more popular due to the emergence of econometric and statistical models with a large number of variables (high-dimensional models). As examples, in labor economics, wage equations can have a large number of regressors [3], and in financial econometrics, portfolio allocation may be among hundreds or thousands of stocks [4]. Such models raise additional challenges of econometric modeling and inference along with the selection of variables. Different tools have been developed based on various estimation criteria. The majority of such procedures involve variable selection by minimizing penalized loss functions based on least squares or the log-likelihood, and their variants. The adjusted $R^2$ and the residual sum of squares are the usual variable selection criteria without any penalization. Among the penalized procedures we have the Akaike information criterion (AIC) [5], the Mallows $C_p$ procedure [6], the Bayesian information criterion (BIC) of [7], the cross-validation method of [8], generalized cross-validation (GCV) of [9], and the focused information criterion (FIC) of [10]. We note that the traditional AIC and BIC are based on least squares (LS), maximum likelihood (ML), or Bayesian principles, and the penalization is based on the $L_0$-norm of the parameters entering the model, so that the penalty is proportional to the number of nonzero parameters. Both AIC and BIC are variable selection procedures and do not provide shrinkage estimators simultaneously. On the other hand, the bridge estimator in [11,12] uses the $L_\gamma$-norm ($\gamma > 0$), and for $0 < \gamma \le 1$ it provides a way to combine variable selection and parameter estimation simultaneously. Within this class the least absolute shrinkage and selection operator (LASSO; $\gamma = 1$) has become the most popular. For $\gamma = 2$ we get the ridge estimator [13]. For a detailed review of model selection in high-dimensional modeling, see [14] and the books [15,16]. Similarly, in the context of empirical likelihood estimation and generalized method of moments estimators, model selection criteria have been introduced by [17,18], among others.
Model selection is an important step for empirical policy evaluation and forecasting. However, it may produce unstable estimators because of the uncertainty inherent in the selection step: a small data perturbation or an alternative selection procedure may give a different model. Reference [19] shows that AIC selection results in distorted inference, and [20] explores the negative impact on confidence regions. Reference [21] gives conditions under which post-model-selection estimators are adaptive, but see [22,23] for arguments that the distribution of post-model-selection estimators cannot be estimated uniformly. For a selected model with unstable estimators, [24] provides a bagging (bootstrap averaging) procedure to reduce their variances for i.i.d. data, and [25] does so for dependent time series data. However, this averaging does not always work, e.g., in large samples and/or over the entire parameter space.
Taking the above considerations into account, model averaging has been introduced as an alternative to model selection. Unlike model selection, where the model uncertainty is dealt with by the econometrician selecting one model from a set of models, in model averaging we resolve the uncertainty by averaging over the set of models. There is a large recent literature on Bayesian model averaging (BMA) and, more recently, on frequentist model averaging (FMA). Among the BMA contributions, model uncertainty is handled by setting a prior probability on each candidate model, see [26,27,28,29,30]; for interesting applications in econometrics, see, e.g., [31,32,33]. Also, see [10] for comments on the BMA approach. The main focus here is on the FMA method, which is determined by the data alone and assumes no priors, and which has received much attention in recent years, see [34,35,36,37,38,39,40,41]. Reference [10] provides asymptotic theory. For applications, see [16,42,43]. The concept behind the FMA estimators is related to ideas of combining procedures based on the same data, which have been considered before in several research areas. For instance, [44] introduces forecast combination, and [45,46] suggest combining parametric and kernel estimators of the density and the regression, respectively. Other works include bootstrap-based averaging ("stacking") by [24,47,48], information-theoretic methods to combine densities by [49,50], and the mixtures-of-experts models of [51,52]. Similar kinds of combining have been used in computational learning theory by [53,54] and in information theory by [55].
Related to "how", or determining the unknown functional forms of econometric models, we use data-based nonparametric procedures (e.g., kernel, smoothing spline, series approximation). See, for example, [1,2,56,57] for kernel smoothing procedures, [58] for spline methods, and [59,60] for series methods. These procedures help in dealing with the problems of bias and inconsistency in estimation and testing due to misspecified functional forms. Because of this, recent developments in nonparametric model selection and model averaging have taken place.
The current paper is hence focused on a review of parametric and nonparametric approaches to model selection and model averaging, mainly from a frequentist point of view, and for independently and identically distributed (i.i.d.) observations. Earlier, [14] provides a review of parametric model selection, [61] surveys FMA estimation, and [62] reviews variable selection in semiparametric regression models. To distinguish itself from these, our paper concentrates on the review of frequentist model selection and model averaging under both parametric and nonparametric settings.
2. Parametric Model Selection and Model Averaging
2.1. Model Selection
Let us consider $y_i$ as a dependent variable and $x_i$ a $q \times 1$ vector of explanatory variables/covariates. Then the linear regression model can be written as
$$y_i = x_i'\beta + u_i, \qquad i = 1, \dots, n, \qquad (1)$$
or
$$y = X\beta + u, \qquad (2)$$
where $y$ is $n \times 1$, $X$ is $n \times q$, $\beta$ is $q \times 1$, and $u$ is $n \times 1$.
Among the well-known procedures for model selection, often used routinely, are the goodness-of-fit measures $R^2$, adjusted $R^2$ ($\bar R^2$), and the residual sum of squares (RSS), given by
$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}, \qquad \bar R^2 = 1 - \frac{n-1}{n-q}\,(1 - R^2),$$
where $\text{RSS} = \sum_{i=1}^n (y_i - x_i'\hat\beta)^2$, $\text{TSS} = \sum_{i=1}^n (y_i - \bar y)^2$, and $\hat\beta = (X'X)^{-1}X'y$ is the LS estimator. The model with the highest $R^2$ (or $\bar R^2$) or smallest RSS is chosen. However, $R^2$ increases, and RSS decreases, monotonically as $q$ increases. Further, $\bar R^2 \le R^2$, but $\bar R^2$ may not always be statistically more efficient as an estimator of the population goodness of fit; see [63] for further detail. Thus $R^2$, $\bar R^2$, and RSS are not preferred measures of goodness of fit or model selection. Recently, [64] developed a model selection procedure based on the mean squared prediction error (MSPE). Consider $(y_{n+1}, x_{n+1})$ as a new observation from the same model, in which $y_{n+1}$ is the "new observed value". The MSPE is $E(y_{n+1} - x_{n+1}'\hat\beta)^2$; when a model has no explanatory variable ($q = 0$), the corresponding quantity is $\text{MSPE}_0$. Then, using unbiased estimators of the MSPE and $\text{MSPE}_0$, [64] introduces an $R^2$-type measure, $R^2_{\text{FPE}}$, where FPE represents the final prediction error. The statistical properties of the bias and MSE of $R^2_{\text{FPE}}$ compared to those of $R^2$ and $\bar R^2$ are analyzed in [65]. Reference [64] demonstrates that one of the exciting advantages of $R^2_{\text{FPE}}$ is that it can be used for choosing the model with the best prediction ability. Furthermore, $R^2_{\text{FPE}}$ not only overcomes the inflation in $R^2$, it also avoids the problem of selecting an overfitted model with irrelevant explanatory variables. In addition, they indicate that $R^2_{\text{FPE}}$ and AIC, discussed below, are asymptotically equivalent, so that model selection by $R^2_{\text{FPE}}$ is consistent with using AIC and close to BIC. Thus $R^2_{\text{FPE}}$ can be used simultaneously for goodness of fit as well as for model selection.
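As an illustration of why in-sample fit alone is a poor selection device, the following numpy sketch (hypothetical simulated data, not from [64,65]) contrasts $R^2$ and $\bar R^2$ with an out-of-sample MSPE as the number of regressors grows; only the first two regressors are relevant.

```python
# Minimal sketch: in-sample R^2 rises mechanically with q, while an out-of-sample
# MSPE-type measure favors the correctly sized model. Data are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n, q_max = 200, 10
X = rng.normal(size=(n, q_max))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)      # only 2 relevant regressors
X_new = rng.normal(size=(n, q_max))                                # independent evaluation sample
y_new = 1.0 + 2.0 * X_new[:, 0] - 1.5 * X_new[:, 1] + rng.normal(size=n)

for q in range(1, q_max + 1):
    Xq = np.column_stack([np.ones(n), X[:, :q]])                   # intercept + first q regressors
    beta = np.linalg.lstsq(Xq, y, rcond=None)[0]
    rss = np.sum((y - Xq @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - rss / tss
    r2_adj = 1.0 - (n - 1) / (n - q - 1) * (1.0 - r2)
    mspe = np.mean((y_new - np.column_stack([np.ones(n), X_new[:, :q]]) @ beta) ** 2)
    print(f"q={q:2d}  R2={r2:.3f}  adjR2={r2_adj:.3f}  MSPE={mspe:.3f}")
```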
2.1.1. AIC, TIC, and BIC
Now we turn to the methods of model selection: AIC in [5], the Takeuchi information criterion (TIC) in [66], and BIC in [7]. For this, we first note that if $f(y)$ is an unknown true density and $g(y;\theta)$ is an assumed density, then the Kullback–Leibler information criterion (KLIC) is given by
$$\text{KLIC} = E_f\!\left[\log\frac{f(y)}{g(y;\theta)}\right] = E_f[\log f(y)] - E_f[\log g(y;\theta)],$$
where $E_f$ is the expectation with respect to $f$. This is the expected "surprise" from learning that $f$ is in fact the true density of $y$. We note that $\text{KLIC} \ge 0$, where equality holds if and only if $g(y;\theta) = f(y)$ almost everywhere. Further, $-E_f[\log f(y)]$ is called the entropy of the distribution $f$; for more on entropy and information, see [67,68].
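As a quick numerical illustration (not from the paper), the KLIC can be approximated by Monte Carlo when $f$ and $g$ are known; the heavy-tailed t(5) truth and the moment-based normal fit below are assumptions made purely for the example.

```python
# Minimal Monte Carlo sketch of KLIC = E_f[log f(y) - log g(y; theta)], with a Student-t(5)
# "true" density f and a normal g evaluated at pseudo-true (moment-based) parameters.
import numpy as np
from scipy import stats

y = stats.t.rvs(df=5, size=100_000, random_state=0)              # draws from the true density f
mu_hat, sigma_hat = y.mean(), y.std()                            # pseudo-true normal parameters
klic = np.mean(stats.t.logpdf(y, df=5) - stats.norm.logpdf(y, loc=mu_hat, scale=sigma_hat))
print(f"Monte Carlo KLIC between t(5) and the fitted normal: {klic:.4f}")  # >= 0 up to MC error
```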
A concept related to entropy is the quasi maximum likelihood estimator (QMLE) $\hat\theta$, which maximizes the quasi log-likelihood function
$$\bar L_n(\theta) = \frac{1}{n}\sum_{i=1}^n \log g(y_i;\theta)$$
based on the random sample $y_1,\dots,y_n$ from $f$. Since $\bar L_n(\theta)$ converges in probability to $E_f[\log g(y;\theta)]$ by the law of large numbers, it is expected that $\hat\theta$ converges in probability to the maximizer of $E_f[\log g(y;\theta)]$ under suitable conditions. Since $E_f[\log f(y)]$ does not depend on $\theta$, the QMLE minimizes a random function which converges to the KLIC.

Thus $\hat\theta \to_p \theta^*$, where $\theta^* = \arg\max_\theta E_f[\log g(y;\theta)]$ is often referred to as the pseudo-true value of $\theta$. It is well known that, under some regularity conditions,
$$\sqrt{n}\,(\hat\theta - \theta^*) \to_d N\!\left(0,\; A^{-1}BA^{-1}\right),$$
where $A = -E_f\!\left[\partial^2 \log g(y;\theta^*)/\partial\theta\,\partial\theta'\right]$ and $B = E_f\!\left[\left(\partial \log g(y;\theta^*)/\partial\theta\right)\left(\partial \log g(y;\theta^*)/\partial\theta\right)'\right]$. When $g = f$, $A = B$, $\hat\theta$ is the MLE, and it is asymptotically efficient.
Now consider the fitted density $g(y;\hat\theta)$ and the expected KLIC
$$E_{\hat\theta}[\text{KLIC}] = E_f[\log f(y)] - E_{\hat\theta}\,E_y[\log g(y;\hat\theta)],$$
where $E_f[\log f(y)]$ is free of the fitted model and $E_y$ denotes the expectation with respect to the true density of $y$, i.e., $f$ here. Then $T = -E_{\hat\theta}\,E_y[\log g(y;\hat\theta)]$, where $\hat\theta$ and $y$ are independent. The expected KLIC can be interpreted as an expected likelihood in which $\hat\theta$ is estimated from one sample and an independent sample $y$ (with one observation here) is used for evaluation. In linear regression, the expected KLIC is the expected squared prediction error. Dropping $E_f[\log f(y)]$ and using a second-order Taylor expansion, it can be shown that
$$T \approx -E\!\left[\bar L_n(\hat\theta)\right] + \frac{1}{n}\,\mathrm{tr}\!\left(A^{-1}B\right).$$
Further, an asymptotically unbiased estimator of $T$ can be written as
$$\hat T = -\bar L_n(\hat\theta) + \frac{1}{n}\,\mathrm{tr}\!\left(\hat A^{-1}\hat B\right),$$
where $\hat A^{-1}\hat B$ is a consistent estimator of $A^{-1}B$, in which $\hat A = -n^{-1}\sum_{i=1}^n \partial^2\log g(y_i;\hat\theta)/\partial\theta\,\partial\theta'$ and $\hat B = n^{-1}\sum_{i=1}^n \left[\partial\log g(y_i;\hat\theta)/\partial\theta\right]\left[\partial\log g(y_i;\hat\theta)/\partial\theta\right]'$.

When the model is correctly specified, that is, $g = f$, we have $A = B$ and $\mathrm{tr}(A^{-1}B) = p$, the number of parameters, so that
$$\hat T = -\bar L_n(\hat\theta) + \frac{p}{n},$$
which is related to AIC, given by
$$\text{AIC} = -2\sum_{i=1}^n \log g(y_i;\hat\theta) + 2p = 2n\,\hat T.$$
Thus, we can think of AIC as an estimate of ($2n$ times) the expected KLIC, up to a term free of the fitted model, based on the assumption that the model is correctly specified. Therefore, selecting a model based on the smallest AIC amounts to choosing the best-fitting model in the sense of having the smallest KLIC. A robust version of AIC due to Takeuchi [66], known as the Takeuchi information criterion (TIC), is
$$\text{TIC} = -2\sum_{i=1}^n \log g(y_i;\hat\theta) + 2\,\mathrm{tr}\!\left(\hat A^{-1}\hat B\right),$$
which, unlike AIC, does not require the model to be correctly specified. In general, picking models with the smallest AIC/TIC is selecting fitted models whose densities are close to the true density.
We note that in a linear regression model the minimization of the AIC reduces to the minimization of
$$\text{AIC}(q) = n\log\hat\sigma_q^2 + 2q,$$
where $\hat\sigma_q^2 = n^{-1}\sum_{i=1}^n (y_i - x_i'\hat\beta_q)^2$. It can be shown that this form obtains if the errors are i.i.d. normal; thus AIC is most appropriate under normality, and otherwise it is an approximation for the non-normal and heteroskedastic regression cases.
Further, in the linear regression case, the minimization of TIC can be shown to reduce to the minimization of
$$\text{TIC}(q) = n\log\hat\sigma_q^2 + 2\,\mathrm{tr}\!\left(\hat A^{-1}\hat B\right),$$
where $\hat A$ and $\hat B$ are the sample analogues of $A$ and $B$ evaluated at the LS estimates. When the errors are homoskedastic and normal, $\mathrm{tr}(\hat A^{-1}\hat B)$ is close to $q$, so that TIC is close to AIC, although differences may arise under heteroskedasticity and nonnormality. However, as we change models, $\hat A$ and $\hat B$, and hence $\mathrm{tr}(\hat A^{-1}\hat B)$, typically do not change much. In this case, TIC and AIC may give similar model selection results.
We note that the BIC due to [7] is
$$\text{BIC} = -2\sum_{i=1}^n \log g(y_i;\hat\theta) + p\log n,$$
in which the penalty term $p\log n$ depends on the sample size $n$ and is generally larger than the penalty term $2p$ appearing in the AIC. BIC provides a large-sample estimator of a transformation of the Bayesian posterior probability associated with the approximating model. In general, by choosing the fitted candidate model with the smallest BIC, one is selecting the candidate model with the highest (approximate) posterior probability. A good property of BIC selection is that it is consistent; see, for example, [69]. That is, when the true model is of finite dimension, BIC will choose it with probability tending to 1 as the sample size $n$ increases.
In general, a penalized criterion can only be consistent if its penalty term ($\log n$ in BIC) is a fast enough increasing function of $n$ (see [70]). Thus AIC is not consistent, as it always has some probability of selecting models that are too large. However, we note that in finite samples adjusted versions of AIC can behave much better, see for example [71]. Further, since the penalty term of BIC is more stringent than that of AIC, BIC tends to select smaller models than AIC. However, BIC provides a large-sample estimator of a transformation of the Bayesian posterior probability associated with the approximating model, whereas AIC provides an asymptotically unbiased estimator of the expected Kullback discrepancy between the generating model and the fitted approximating model. In addition, AIC is asymptotically efficient in the sense that it asymptotically selects the fitted candidate model which minimizes the MSE of prediction, but BIC is not asymptotically efficient. Accordingly, AIC can be advocated when the primary goal of the model is to identify meaningful factors influencing the outcome based on their relative importance.
In summary, both AIC and BIC provide well-founded and self-contained approaches to model selection, although with different motivations and penalty objectives. Both are typically good approximations of their own theoretical target quantities. Often, this also means that they will identify good models for the observed data, but both criteria can still fail in this respect. For a detailed simulation and empirical comparison of the two approaches, see [72], and for their properties see [69,73,74]. Both the AIC and the TIC are designed for the likelihood or quasi-likelihood context, and they perform in a similar way. Their relationship is similar to that between the conventional and the White (robust) covariance matrix estimators for the MLE/QMLE or LS. Unfortunately, despite its theoretical merit, TIC does not appear to be widely used, perhaps because it needs a very large sample to estimate $\mathrm{tr}(A^{-1}B)$ well.
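To make the regression forms of these criteria concrete, the following numpy sketch (hypothetical simulated data) applies $\text{AIC}(q) = n\log\hat\sigma_q^2 + 2q$ and $\text{BIC}(q) = n\log\hat\sigma_q^2 + q\log n$ to a sequence of nested models; it is an illustration of the formulas above, not the implementation used in any of the cited papers.

```python
# Minimal sketch of AIC and BIC selection over nested linear regressions (hypothetical data).
import numpy as np

rng = np.random.default_rng(1)
n, q_max = 150, 8
X = rng.normal(size=(n, q_max))
y = 0.5 + X[:, 0] - 0.8 * X[:, 1] + rng.normal(size=n)        # true mean uses 2 regressors

aic, bic = [], []
for q in range(1, q_max + 1):
    Xq = np.column_stack([np.ones(n), X[:, :q]])              # intercept + first q regressors
    beta = np.linalg.lstsq(Xq, y, rcond=None)[0]
    sigma2 = np.mean((y - Xq @ beta) ** 2)                    # ML estimate of the error variance
    k = Xq.shape[1]                                           # number of mean parameters
    aic.append(n * np.log(sigma2) + 2 * k)
    bic.append(n * np.log(sigma2) + k * np.log(n))

print("AIC picks q =", int(np.argmin(aic)) + 1)               # lighter penalty, larger model
print("BIC picks q =", int(np.argmin(bic)) + 1)               # heavier penalty, smaller model
```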
2.1.2. FIC
Let us start from the model
$$y_i = x_i'\beta + z_i'\gamma + u_i, \qquad i = 1,\dots,n,$$
or
$$y = X\beta + Z\gamma + u,$$
where $X$ is an $n\times p$ matrix of variables intended (focused) to be included all the time, while the variables in the $n\times r$ matrix $Z$ may or may not be included. From the ML estimators corresponding to the $l$-th submodel, a predictor of a focus parameter (estimand) $\mu = \mu(\beta,\gamma)$ can be constructed, and [10] provides the (asymptotic) MSE of this predictor. The basic idea of FIC is to develop a model selection criterion that chooses the submodel with the smallest estimated MSE of the focus parameter. Such an MSE-based FIC is computed for each submodel $l$, using a projection mapping from the full model to the $l$-th submodel that determines which components of $\gamma$ are set to zero.

In contrast, [10] also shows that when the estimand is the probability density function of the data itself, the MSE-based FIC, with $q_l$ the number of uncertain parameters in the $l$-th submodel, is asymptotically equivalent to AIC.
2.1.3. Mallows Model Selection
Let us write the regression model (2) as
$$y = m + u,$$
where $m = E(y|X) = X\beta$. Then $\hat m_q = X_q\hat\beta_q = P_q\,y$, where $X_q$ contains the $q$ included regressors and $P_q = X_q(X_q'X_q)^{-1}X_q'$.

The objective is to choose $q$ such that the average mean squared error (risk)
$$R_n(q) = \frac{1}{n}\,E\!\left[(\hat m_q - m)'(\hat m_q - m)\right]$$
is minimum.

Mallows criterion for selecting $q$ is to minimize
$$C_n(q) = \frac{1}{n}\,(y - \hat m_q)'(y - \hat m_q) + \frac{2\sigma^2 q}{n},$$
where the second term on the right-hand side is a penalty.

In fact, Mallows criterion is an unbiased estimator of the MSE of the predictive estimator of $m$, up to a constant. This is because
$$E\!\left[\frac{1}{n}(y - \hat m_q)'(y - \hat m_q)\right] = R_n(q) + \sigma^2 - \frac{2\sigma^2 q}{n},$$
and hence $E[C_n(q)] = R_n(q) + \sigma^2$. But the minimization of $C_n(q)$ with respect to $q$ is the same as the minimization of an unbiased estimator of $R_n(q)$, since $\sigma^2$ does not depend on $q$.

Alternatively, consider an independent copy $y^0 = m + u^0$ of $y$. Then
$$E\!\left[\frac{1}{n}\,(y^0 - \hat m_q)'(y^0 - \hat m_q)\right] = R_n(q) + \sigma^2$$
and $E\!\left[\frac{1}{n}(y - \hat m_q)'(y - \hat m_q)\right] = R_n(q) + \sigma^2 - \frac{2\sigma^2 q}{n}$. So an unbiased estimator of the out-of-sample prediction MSE is $\frac{1}{n}(y - \hat m_q)'(y - \hat m_q) + \frac{2\sigma^2 q}{n}$, and its minimization is equivalent to the Mallows criterion.
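A minimal numpy sketch of the criterion $C_n(q)$ over nested candidate models follows (hypothetical data); estimating $\sigma^2$ from the largest candidate model is an assumption made for the illustration, a common practical convention rather than part of the criterion itself.

```python
# Minimal sketch of the Mallows criterion C_n(q) = RSS(q)/n + 2*q*sigma2_hat/n for nested models.
import numpy as np

rng = np.random.default_rng(2)
n, q_max = 200, 10
X = rng.normal(size=(n, q_max))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)              # hypothetical data, 2 relevant regressors

def rss(q):
    Xq = X[:, :q]
    beta = np.linalg.lstsq(Xq, y, rcond=None)[0]
    return np.sum((y - Xq @ beta) ** 2)

sigma2_hat = rss(q_max) / (n - q_max)                         # error variance from the largest model
c = [rss(q) / n + 2 * q * sigma2_hat / n for q in range(1, q_max + 1)]
print("Mallows criterion picks q =", int(np.argmin(c)) + 1)
```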
2.1.4. Cross-Validation (CV)
CV is a commonly used procedure for model selection. According to this, the selection of $q$ is made by minimizing
$$\text{CV}(q) = \frac{1}{n}\sum_{i=1}^n\left(y_i - x_i'\hat\beta_{-i}\right)^2,$$
where $\hat\beta_{-i}$ is the LS estimator of $\beta$ obtained by dropping the $i$-th observation from the sample. It can be shown that $E[\text{CV}(q)] \simeq \text{MSPE}(q)$, where
$$\text{MSPE}(q) = E\!\left(y_{n+1} - x_{n+1}'\hat\beta_q\right)^2$$
is the MSE of the forecast error, with $(y_{n+1}, x_{n+1})$ an out-of-sample observation. Thus, CV is an almost unbiased estimator of $\text{MSPE}(q)$.

This can be shown by first writing the MSPE, based on an out-of-sample observation from the same distribution as the in-sample observations, as
$$\text{MSPE}(q) = \sigma^2 + E\!\left(x_{n+1}'\hat\beta_q - E(y_{n+1}|x_{n+1})\right)^2,$$
where $\sigma^2$ is the error variance. Since $\sigma^2$ does not depend on $q$, selection by $\text{MSPE}(q)$ and by the MSE of the estimated regression function are equivalent.

We observe that $y_{n+1} - x_{n+1}'\hat\beta_q$ is a prediction error based on first estimating $\beta$ from the in-sample $n$ observations, and then calculating the error using the out-of-sample observation $(y_{n+1}, x_{n+1})$. Therefore, $\text{MSPE}(q)$ is the expectation of a squared leave-one-out prediction error when the sample length is $n+1$. Using this idea, we can also obtain a similar leave-one-out prediction error for each observation $i$; this is given by $y_i - x_i'\hat\beta_{-i}$, with $\hat\beta_{-i}$ based on the remaining $n-1$ observations. Thus, for each $i$, $E(y_i - x_i'\hat\beta_{-i})^2$ is the MSPE of a sample of size $n-1$. Further, since $\hat\beta_{-i}$ based on $n-1$ observations will be close to $\hat\beta$ based on $n$ observations, $\text{CV}(q)$ is an almost unbiased estimator of $\text{MSPE}(q)$.

The $\text{CV}(q)$ written above can be rewritten as
$$\text{CV}(q) = \frac{1}{n}\sum_{i=1}^n\left(\frac{y_i - x_i'\hat\beta}{1 - h_{ii}}\right)^2,$$
where $h_{ii} = x_i'(X'X)^{-1}x_i$ is referred to as the leverage effect and is the $i$-th diagonal element of the projection matrix $P = X(X'X)^{-1}X'$; see [75]. This expression is useful for calculations. Also, see [74] for a link of CV with AIC.
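The leverage expression makes CV cheap to compute, as the following numpy sketch on hypothetical nested models illustrates; no model is refitted $n$ times.

```python
# Minimal sketch of leave-one-out CV via the leverage shortcut CV(q) = mean[((y - yhat)/(1 - h_ii))^2].
import numpy as np

rng = np.random.default_rng(3)
n, q_max = 120, 6
X = rng.normal(size=(n, q_max))
y = X[:, 0] - X[:, 2] + rng.normal(size=n)                    # hypothetical data

def cv(q):
    Xq = X[:, :q]
    H = Xq @ np.linalg.inv(Xq.T @ Xq) @ Xq.T                  # projection ("hat") matrix
    resid = y - H @ y                                         # full-sample LS residuals
    h = np.diag(H)                                            # leverages h_ii
    return np.mean((resid / (1.0 - h)) ** 2)

print("CV picks q =", int(np.argmin([cv(q) for q in range(1, q_max + 1)])) + 1)
```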
2.1.5. Model Selection by Other Penalty Functions
The issue of model selection has received more attention in recent years because of the challenging problem of estimating models with large numbers of regressors, which may increase with the sample size; for example, earnings models in labor economics with a large number of regressors, financial portfolio models with a large number of stocks, and VAR models with hundreds of macro variables.
A different method of variable selection and estimation for such models is penalized least squares (PLS); see [14] for a review. In this literature, parameter estimation and variable selection are carried out by using a criterion function combining a loss function with a penalty function. Using an $L_\gamma$ penalty, the PLS estimation and variable selection problem is
$$\hat\beta = \arg\min_\beta\left\{\sum_{i=1}^n\left(y_i - x_i'\beta\right)^2 + \lambda\sum_{j=1}^q |\beta_j|^\gamma\right\},$$
where $\lambda$ is a tuning or shrinkage parameter and $\gamma \ge 0$ governs the form of the penalty (another tuning parameter). For $\gamma = 0$, the $L_\gamma$-norm becomes $\sum_{j=1}^q 1(\beta_j \ne 0)$, with $1(\cdot)$ as the usual indicator function, which counts the number of nonzero $\beta_j$; the AIC and BIC belong to this norm. For $\gamma = 1$, the $L_\gamma$-norm becomes $\sum_{j=1}^q|\beta_j|$, which is used in the LASSO for simultaneous shrinkage estimation [76] and variable selection. It can be shown analytically that the LASSO method estimates the zero coefficients as exactly zero with positive probability as $n\to\infty$. Next, for $\gamma = 2$, the $L_\gamma$-norm uses $\sum_{j=1}^q\beta_j^2$ and provides ridge-type [13] shrinkage estimation but not variable selection. However, if we consider the generalized ridge estimator, with coefficient-specific shrinkage parameters, then the coefficient estimates corresponding to the irrelevant regressors will tend to zero; see [77].

Further, for $0 < \gamma \le 1$ we get the bridge estimator [11,12], which provides a way to combine variable selection and parameter estimation, with $\gamma = 1$ giving the LASSO. For the adaptive LASSO and other forms of LASSO, see [62,78,79,80]. Also, see the link of LASSO with least angle regression selection (LARS) in [81].
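As a brief illustration of $L_1$-penalized estimation, the following sketch assumes scikit-learn is available and uses its LassoCV to choose the shrinkage parameter (called alpha there) by cross-validation; the simulated sparse design is hypothetical.

```python
# Minimal sketch of LASSO (L1-penalized LS) for simultaneous shrinkage and variable selection.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(4)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]                                   # only 3 nonzero coefficients
y = X @ beta + rng.normal(size=n)

fit = LassoCV(cv=5).fit(X, y)                                 # tuning parameter chosen by 5-fold CV
print("selected penalty:", fit.alpha_)
print("estimated nonzero coefficients at positions:", np.flatnonzero(fit.coef_))
```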
2.2. Model Averaging
Let us consider $m$ to be a parametric or nonparametric model, which can be a conditional mean or a conditional variance. Let $\{\hat m_1,\dots,\hat m_M\}$ be the set of estimators of $m$ corresponding to the different sets of regressors considered in the problem of model selection. Consider $w_l$, $l = 1,\dots,M$, to be the weights corresponding to $\hat m_l$, where $0 \le w_l \le 1$ and $\sum_{l=1}^M w_l = 1$. We can then define a model averaging estimator of $m$ as
$$\hat m(w) = \sum_{l=1}^M w_l\,\hat m_l.$$
Below we present the choice of $w = (w_1,\dots,w_M)'$ in linear regression models. For the linear regression model, consider the model in (1) or (2), where the dimension of $\beta$ can tend to $\infty$ as $n\to\infty$. We take $M$ models, where the $l$-th model contains $q_l$ regressors $x_{l,i}$, a subvector of $x_i$. The corresponding model can be written as
$$y = X_l\beta_l + u_l,$$
and the LS estimator of $\beta_l$ is
$$\hat\beta_l = (X_l'X_l)^{-1}X_l'y.$$
This gives
$$\hat m_l = X_l\hat\beta_l = P_l\,y,$$
where $P_l = X_l(X_l'X_l)^{-1}X_l'$. The model averaging estimator (MAE) of $m$ is given as
$$\hat m(w) = \sum_{l=1}^M w_l\,\hat m_l = \left(\sum_{l=1}^M w_l P_l\right)y = P(w)\,y,$$
where $P(w) = \sum_{l=1}^M w_l P_l$. An alternative expression is
$$\hat m(w) = X\,\hat\beta(w),$$
where we write $\hat\beta(w) = \sum_{l=1}^M w_l\,\tilde\beta_l$, such that $\tilde\beta_l$ is $\hat\beta_l$ filled out with zeros for the excluded regressors, and $\hat\beta(w)$ is the MAE of $\beta$. Thus, for the linear model, the MAE of $m$ corresponds to the MAE of $\beta$, but this may not hold for models that are nonlinear in the parameters.
Now we consider the ways to determine weights.
2.2.1. Bayesian and FIC Weights
Under the Bayesian procedure we assume that there are M potential models and one of the models is the true model. Then, using the prior probabilities that each of the potential models is the true model, and considering the prior probability distributions of the parameters, the posterior probability distribution is obtained as the weighted average of the submodels where weights are the posterior probabilities that the given model is the true model given the data.
The two types of weights considered are then
$$w_l^{\text{SAIC}} = \frac{\exp(-\text{AIC}_l/2)}{\sum_{j=1}^M \exp(-\text{AIC}_j/2)} \qquad \text{and} \qquad w_l^{\text{SBIC}} = \frac{\exp(-\text{BIC}_l/2)}{\sum_{j=1}^M \exp(-\text{BIC}_j/2)},$$
where $\text{AIC}_l$ and $\text{BIC}_l$ are the AIC and BIC values of the $l$-th model. These are known as smoothed AIC (SAIC) and smoothed BIC (SBIC) weights. While the Bayesian model averaging estimator (BMAE) has a neat interpretation, it searches for the true model instead of selecting an estimator of a model with a low loss function. In simulations it has been found that SAIC and SBIC estimators tend to outperform AIC and BIC estimators, see [82].
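A minimal sketch of the SAIC weight computation above follows (the listed AIC values are hypothetical); SBIC weights are obtained the same way with BIC values in place of AIC.

```python
# Minimal sketch of smoothed-AIC (SAIC) weights w_l proportional to exp(-AIC_l / 2).
import numpy as np

def smoothed_weights(criterion_values):
    c = np.asarray(criterion_values, dtype=float)
    c = c - c.min()                              # subtract the minimum for numerical stability
    w = np.exp(-c / 2.0)
    return w / w.sum()

aic_values = [310.2, 308.7, 309.5, 315.0]        # hypothetical AICs for M = 4 candidate models
print("SAIC weights:", np.round(smoothed_weights(aic_values), 3))
```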
As for the FIC, one can consider a model averaging estimator whose weights decrease in the submodels' estimated FIC values, with the rate of decrease governed by an algorithmic parameter $\kappa$ that bridges from uniform weighting ($\kappa$ close to 0) to hard-core FIC selection of the single best submodel ($\kappa$ large). For this and further properties and applications of FIC, see [10,82].
2.2.2. Mallows Weight Selection Method
In the linear regression model, $\hat m(w) = P(w)\,y$ is a linear estimator with $P(w) = \sum_{l=1}^M w_l P_l$. So an optimal choice of $w$ can be found following the Mallows criterion described above. The Mallows criterion for choosing the weights $w$ is
$$C_n(w) = \frac{1}{n}\left(y - \hat m(w)\right)'\left(y - \hat m(w)\right) + \frac{2\sigma^2}{n}\,k(w),$$
where $k(w) = \sum_{l=1}^M w_l\,q_l$ and
$$y - \hat m(w) = \sum_{l=1}^M w_l\left(y - \hat m_l\right) = \bar E\,w,$$
in which $\bar e_l = y - \hat m_l$ is the residual vector from the $l$-th model and $\bar E = (\bar e_1,\dots,\bar e_M)$ is an $n\times M$ matrix of residuals from all the models. Thus
$$C_n(w) = \frac{1}{n}\,w'\bar E'\bar E\,w + \frac{2\sigma^2}{n}\,k(w)$$
is quadratic in $w$. Thus
$$\hat w = \arg\min_{w\in\mathcal{W}} C_n(w), \qquad \mathcal{W} = \Big\{w\in[0,1]^M: \sum_{l=1}^M w_l = 1\Big\},$$
which is obtained by using a quadratic programming procedure with inequality constraints, for example in GAUSS or MATLAB. Then Hansen's Mallows model averaging (MMA) estimator is
$$\hat m(\hat w) = \sum_{l=1}^M \hat w_l\,\hat m_l.$$
Following [83], [39] shows that
$$\frac{L_n(\hat w)}{\inf_{w\in\mathcal{W}} L_n(w)} \to_p 1$$
as $n\to\infty$, so that $\hat w$ is asymptotically optimal in Li's sense, where $L_n(w) = \left(\hat m(w) - m\right)'\left(\hat m(w) - m\right)$ is the squared error loss. However, Hansen's result requires the weights to belong to a discrete set and the models to be nested. Reference [41] improves the result by relaxing the discreteness and by not assuming that the models are nested. Their approach is based on deriving an unbiased estimator of the exact MSE of $\hat m(w)$.

Reference [84] also proposes a corresponding forecasting method using Mallows model averaging (MMA). He proves that the Mallows criterion is an asymptotically unbiased estimator of both the in-sample MSE and the out-of-sample one-step-ahead MSE.
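A minimal numpy/scipy sketch of the weight computation described above follows, on hypothetical nested linear models; the SLSQP solver and the convention of estimating $\sigma^2$ from the largest model are assumptions made for the illustration rather than the exact implementation of [39] or [41].

```python
# Minimal sketch of Mallows model averaging (MMA): minimize w'E'Ew/n + 2*sigma2*k'w/n over the simplex,
# where column l of E holds the residuals of candidate model l and k_l is its parameter count.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n, q_max = 200, 6
X = rng.normal(size=(n, q_max))
y = X[:, 0] + 0.7 * X[:, 1] + rng.normal(size=n)              # hypothetical data

E = np.column_stack([y - X[:, :q] @ np.linalg.lstsq(X[:, :q], y, rcond=None)[0]
                     for q in range(1, q_max + 1)])           # n x M residual matrix
k = np.arange(1, q_max + 1)                                   # parameter counts of the M nested models
sigma2 = np.sum(E[:, -1] ** 2) / (n - q_max)                  # error variance from the largest model
M = E.shape[1]

def mallows(w):
    return (w @ (E.T @ E) @ w) / n + 2.0 * sigma2 * (k @ w) / n

res = minimize(mallows, x0=np.full(M, 1.0 / M), method="SLSQP",
               bounds=[(0.0, 1.0)] * M,
               constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
w_hat = res.x
m_hat = y - E @ w_hat                                         # averaged fit, since the weights sum to 1
print("MMA weights:", np.round(w_hat, 3))
```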
2.2.3. Jackknife Model Averaging Method (CV)
Utilizing the leave-one-out cross-validation (CV) procedure, also known as the jackknife, the jackknife model averaging (JMA) method of [40] relaxes the assumptions in [39]: the submodels are now allowed to be non-nested and the error terms can be heteroskedastic. The sum of squared residuals in the JMA method is
$$CV_n(w) = \frac{1}{n}\left(y - \tilde m(w)\right)'\left(y - \tilde m(w)\right),$$
where $\tilde m(w) = \sum_{l=1}^M w_l\,\tilde m_l$ and $\tilde m_l$ is the vector of jackknife fitted values of the $l$-th model, whose $i$-th element is computed with the $i$-th observation deleted. To be more specific, the $i$-th element of $\tilde m_l$ is $x_{l,i}'\tilde\beta_{l,-i}$, where $\tilde\beta_{l,-i}$ is the LS estimator computed from $X_l$ with its $i$-th row deleted and $y$ with its $i$-th element deleted. Thus
$$CV_n(w) = \frac{1}{n}\,w'\tilde E'\tilde E\,w,$$
where $\tilde E = (\tilde e_1,\dots,\tilde e_M)$ is an $n\times M$ matrix and $\tilde e_l = y - \tilde m_l$ is an $n\times 1$ vector whose $i$-th element is computed with the $i$-th observation deleted. Then the JMA weights $\hat w$ are obtained by minimizing $CV_n(w)$ with respect to $w\in\mathcal{W}$, and the JMA estimator is $\hat m(\hat w) = \sum_{l=1}^M \hat w_l\,\hat m_l$. Reference [40] shows the asymptotic optimality of this choice, using [83,85], in the sense of minimizing the conditional risk, which is equivalent to minimizing the out-of-sample prediction MSE.
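A minimal sketch of the jackknife criterion follows, again on hypothetical data with heteroskedastic errors; the leave-one-out residuals of each linear submodel are built with the leverage shortcut from Section 2.1.4, and the simplex-constrained minimization mirrors the MMA sketch above.

```python
# Minimal sketch of jackknife model averaging (JMA): minimize w'E~'E~w/n over the weight simplex,
# where E~ stacks the leave-one-out residuals e_i/(1 - h_ii) of each candidate model.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
n, q_max = 200, 6
X = rng.normal(size=(n, q_max))
y = X[:, 0] + 0.7 * X[:, 1] + (1.0 + 0.5 * np.abs(X[:, 0])) * rng.normal(size=n)  # heteroskedastic errors

def loo_residuals(q):
    Xq = X[:, :q]
    H = Xq @ np.linalg.inv(Xq.T @ Xq) @ Xq.T
    return (y - H @ y) / (1.0 - np.diag(H))                   # e_i / (1 - h_ii)

E_tilde = np.column_stack([loo_residuals(q) for q in range(1, q_max + 1)])
M = E_tilde.shape[1]
res = minimize(lambda w: (w @ (E_tilde.T @ E_tilde) @ w) / n,
               x0=np.full(M, 1.0 / M), method="SLSQP",
               bounds=[(0.0, 1.0)] * M,
               constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
print("JMA weights:", np.round(res.x, 3))
```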
There are many extensions of the JMA method to various other econometric models. Reference [86] does this for the quantile regression model. Reference [82] extends it to dependent time series models and models with GARCH errors. Also, using the MMA method of [39] for models with endogeneity, [87] develops MMA-based two-stage least squares (MATSLS), model averaging limited information maximum likelihood (MALIML), and model averaging Fuller (MAF) estimators.
However, it would be useful to have extensions of the MMA and JMA procedures to models with GMM or IV estimators. In addition, the sampling properties of the averaging estimators need to be developed for purposes of statistical inference.
3. Nonparametric (NP) Model Selection and Model Averaging
3.1. NP Model Selection
Let us write the NP model as
$$y_i = m(x_i) + u_i, \qquad i = 1,\dots,n,$$
where $x_i$ is i.i.d. with density $f(x)$ and the error $u_i$ is independent of $x_i$.

We can write the local linear model, for $x_i$ in a neighborhood of a point $x$, as
$$y_i \simeq m(x) + (x_i - x)'\beta(x) + u_i = z_i(x)'\delta(x) + u_i,$$
or
$$y \simeq Z(x)\,\delta(x) + u,$$
where $z_i(x) = \left(1, (x_i - x)'\right)'$, so that $Z(x)$ is an $n\times(q+1)$ matrix and $\delta(x) = \left(m(x), \beta(x)'\right)'$. Then the local linear LS estimator (LLLS) of $\delta(x)$ is
$$\hat\delta(x) = \left(Z(x)'K(x)Z(x)\right)^{-1}Z(x)'K(x)\,y,$$
where $K(x)$ is an $n\times n$ diagonal matrix whose $i$-th diagonal element is the kernel $K\!\left(\frac{x_i - x}{h}\right) = \prod_{j=1}^q k\!\left(\frac{x_{ij} - x_j}{h_j}\right)$ and $h_j$ is the window-width for the $j$-th variable. From this, pointwise, $\hat m(x) = (1, 0,\dots,0)\,\hat\delta(x)$. Further, the profiled vector $\hat m = \left(\hat m(x_1),\dots,\hat m(x_n)\right)'$ can be written as
$$\hat m = P(h)\,y,$$
where $P(h)$ is an $n\times n$ matrix whose $i$-th row is $(1,0,\dots,0)\left(Z(x_i)'K(x_i)Z(x_i)\right)^{-1}Z(x_i)'K(x_i)$, for $i = 1,\dots,n$. If $h$ is fixed, then $\hat m$ is a linear estimator in $y$. But it will be a nonlinear estimator in $y$ if $h$ is either obtained by a plug-in estimator or by cross-validation.
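A minimal numpy sketch of the pointwise LLLS fit follows, using a Gaussian product kernel on hypothetical one-dimensional data; the bandwidth value is arbitrary and, in practice, is chosen by the criteria discussed below.

```python
# Minimal sketch of the local linear LS (LLLS) estimator m_hat(x0): weighted regression of y on
# (1, X - x0) with kernel weights; the intercept is the fit at x0.
import numpy as np

def local_linear(x0, X, y, h):
    u = (X - x0) / h                                          # scaled distances from x0
    w = np.exp(-0.5 * np.sum(u ** 2, axis=1))                 # Gaussian product kernel weights
    Z = np.column_stack([np.ones(len(y)), X - x0])            # local design: intercept + (X - x0)
    W = np.diag(w)
    coef = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ y)
    return coef[0]                                            # intercept = m_hat(x0)

rng = np.random.default_rng(7)
n = 300
X = rng.uniform(-2, 2, size=(n, 1))
y = np.sin(np.pi * X[:, 0]) + 0.3 * rng.normal(size=n)        # hypothetical curve plus noise
print("m_hat(0) =", local_linear(np.array([0.0]), X, y, h=np.array([0.3])))   # true value sin(0) = 0
```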
With respect to goodness-of-fit measures for the NP models, we note that, since $u_i$ is independent of $x_i$,
$$V(y_i) = V\!\left(m(x_i)\right) + V(u_i).$$
So the global population goodness of fit is
$$\rho^2 = \frac{V\!\left(m(x_i)\right)}{V(y_i)} = 1 - \frac{V(u_i)}{V(y_i)},$$
and its sample global estimator is given by
$$R^2 = 1 - \frac{\hat u'\hat u}{y'\left(I - \iota\iota'/n\right)y},$$
where $\hat u = y - \hat m$ ($\hat m = P(h)y$), and with $\iota$ being an $n\times 1$ vector of unit elements. However, $0 \le R^2 \le 1$ may not be valid, since the usual decomposition of the total sum of squares into explained and residual parts does not hold for the NP residuals. Therefore, one can use a modified $R^2$ which, by means of an indicator function, is restricted to lie between 0 and 1.

Another way to define a proper global $R^2$ is to first consider a local $R^2$. This is based on the fact that, at the point $x$, a kernel-weighted analysis-of-variance decomposition holds, because the local linear LS residuals have kernel-weighted mean zero and are orthogonal, in the kernel-weighted metric, to the local regressors. Thus a local $R^2(x)$ can be defined as the ratio of the locally explained variation to the local total variation, which satisfies $0 \le R^2(x) \le 1$. A global $R^2$ is then obtained by aggregating $R^2(x)$ over the sample points.

The goodness of fit $R^2$ is considered in [88], where its application to the selection of the statistically significant variables in NP regression is shown. The local and global versions are introduced in [89,90]. For variable selection it may be more appropriate to consider an adjusted $R^2$, $\bar R^2$, which penalizes the fit by the effective number of parameters, $\mathrm{tr}(P(h))$. As a practical matter, the most critical choices in model selection for the nonparametric regression estimation above are the window-width $h$ and the number of variables $q$. Further, if instead of the local linear estimator taken above and often used, we consider a local polynomial of degree $d$, then $Z(x)$ in $\hat\delta(x)$ would be a larger matrix containing higher-order terms, and we would need an additional selection of $d$. Thus the nonparametric goodness-of-fit measures described above should be considered as $R^2(h, q, d)$ and $\bar R^2(h, q, d)$, and they can be used for choosing, say, $h$ for fixed $q$ and $d$ as the value which maximizes $\bar R^2$. We note that for $d = 0$, $\hat m(x)$ is the well-known Nadaraya–Watson local constant estimator and, for $d = 1$, it is the local linear estimator. Further, for given $d$ and $h$, $R^2$ and $\bar R^2$ can be used to choose $q$.
3.1.1. AIC, BIC, and GCV
In the NP case, model selection (choosing $q$) using AIC is proposed by [91]. This is based on the LCLS estimator
$$\hat m(x) = \frac{\sum_{i=1}^n y_i\,K\!\left(\frac{x_i - x}{h}\right)}{\sum_{i=1}^n K\!\left(\frac{x_i - x}{h}\right)},$$
so that $\hat m = P(h)y$, in which the $(i,j)$-th element of $P(h)$ is $K\!\left(\frac{x_j - x_i}{h}\right)\big/\sum_{l=1}^n K\!\left(\frac{x_l - x_i}{h}\right)$, and the corrected criterion is
$$\text{AIC}_c = \log\hat\sigma^2 + \frac{1 + \mathrm{tr}\!\left(P(h)\right)/n}{1 - \left(\mathrm{tr}\!\left(P(h)\right) + 2\right)/n}, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n\left(y_i - \hat m(x_i)\right)^2.$$
In the same way, we note that $\text{AIC}_c$ can be used to select, for example, $h$ given $q$ and $d$ ([92]), or $q$ given $h$ and $d$. The corresponding result for the BIC procedure in the NP kernel model is not yet known. However, if one considers NP sieve regression of the type $m(x) = \sum_{j=1}^J \beta_j\,\phi_j(x)$, where the $\phi_j(x)$ are nonlinear functions of $x$, then the BIC is similar to the BIC given in [96]. This includes, for example, the special cases of a series expansion, in which the $\phi_j(x)$ are power series terms, and a spline regression, in which $\phi_j(x) = (x - t_j)_+$ with $t_j$ as the $j$-th knot, where $(x - t_j)_+ = x - t_j$ if $x > t_j$ and 0 otherwise.
In [9], an estimate of the minimizer of the risk, called the GCV, is proposed, which does not require knowledge of $\sigma^2$. It can be written as the minimization of
$$\text{GCV}(h) = \frac{n^{-1}\sum_{i=1}^n\left(y_i - \hat m(x_i)\right)^2}{\left(1 - \mathrm{tr}\!\left(P(h)\right)/n\right)^2}$$
with respect to $h$. It has been shown by [9] that, for large $n$, GCV behaves like the Mallows criterion, and the minimizer $\hat h$ of $\text{GCV}(h)$ is asymptotically optimal in the sense that $R_n(\hat h)/\inf_h R_n(h) \to 1$ as $n\to\infty$. That is, the MSE of $\hat m$ based on $\hat h$ tends to the minimum as $n\to\infty$. We note that the risk $R_n$ and the Mallows criteria in the parametric and nonparametric cases are given in Section 2.1.3 and Section 3.1.2, respectively.
3.1.2. Mallows Model Selection
Let us write the regression model as
$$y = m + u,$$
where $m = \left(m(x_1),\dots,m(x_n)\right)'$ and $u = (u_1,\dots,u_n)'$. Then, for $E(u|X) = 0$ and $V(u|X) = \sigma^2 I_n$,
$$E(y|X) = m.$$
Let us consider the LLLS estimator of $m$, which is linear in $y$, as
$$\hat m = P(h)\,y,$$
where $P(h)$ is as defined in Section 3.1. When $h$ is data-dependent, for large $n$, $\hat m$ can become asymptotically linear.

Our objective is to choose $q$ such that the average mean squared error (risk)
$$R_n(q) = \frac{1}{n}\,E\!\left[(\hat m - m)'(\hat m - m)\right]$$
is minimum. We note that, for $\hat m = P(h)y$,
$$E\!\left[\frac{1}{n}\,(y - \hat m)'(y - \hat m)\right] = \frac{1}{n}\,E\!\left[(\hat m - m)'(\hat m - m)\right] + \sigma^2 - \frac{2\sigma^2}{n}\,\mathrm{tr}\!\left(P(h)\right),$$
and hence the Mallows criterion for selecting $q$ (the number of variables in $x$) is to minimize
$$C_n(q) = \frac{1}{n}\,(y - \hat m)'(y - \hat m) + \frac{2\sigma^2}{n}\,\mathrm{tr}\!\left(P(h)\right),$$
where the second term on the right-hand side is the penalty. Essentially, the minimization of $C_n(q)$ is the same as the minimization of an unbiased estimator of $R_n(q) + \sigma^2$, since $\sigma^2$ does not depend on $q$; see Section 2.1.3 and [6,9].
3.1.3. Cross Validation (CV)
The CV method is one of the most widely used window-width selectors for NP kernel smoothing. We note that the cross-validation estimator of the integrated squared error weighted by the density $f(x)$,
$$I(h) = \int\left(\hat m(x) - m(x)\right)^2 f(x)\,dx,$$
is given by
$$\text{CV}(h) = \frac{1}{n}\sum_{i=1}^n\left(y_i - \hat m_{-i}(x_i)\right)^2,$$
where $\hat m_{-i}(x_i)$ is $\hat m(x_i)$ computed after deleting the $i$-th observation from the sample. In fact,
$$\text{CV}(h) = \frac{1}{n}\sum_{i=1}^n\left(\hat m_{-i}(x_i) - m(x_i)\right)^2 - \frac{2}{n}\sum_{i=1}^n u_i\left(\hat m_{-i}(x_i) - m(x_i)\right) + \frac{1}{n}\sum_{i=1}^n u_i^2,$$
where the first term on the right-hand side is a good approximation to $I(h)$, the second term is generally negligibly small, and the third term converges to a constant free of $h$. Therefore, minimizing $\text{CV}(h)$ is asymptotically equivalent to minimizing $I(h)$.

Also, in the case where $m(x)$ is approximated by a sieve regression, [96] shows that CV is an unbiased estimator of the MSE of the prediction error (MSEPE) of $m$; see Section 2.1.4. In addition, the minimization of the MSEPE is equivalent to the minimization of the MSE and the integrated MSE (IMSE) of the estimated $m$, for conditional and unconditional $x$, respectively.

If, instead of the local linear estimator of $m(x)$, we consider the local polynomial of order $d$, then $\hat m = P(h)y$ is the local polynomial LS (LPLS) estimator [2], and the above analysis continues to hold. For $d = 0$ we have the local constant LS (LCLS) estimator developed by [98,99]; for $d = 1$ we have the LLLS estimator considered above. In practice, the values of $h$ and $d$ can be determined by minimizing $\text{CV}(h, d)$ with respect to $h$ and $d$ for given $q$, as developed by [100]. For a vector $h = (h_1,\dots,h_q)'$, if the choice of $h_j$ for any $j$ tends to infinity (is very large), then the corresponding variable is an irrelevant variable. This can be observed from a simple example: suppose the LCLS estimator for two variables is
$$\hat m(x_1, x_2) = \frac{\sum_{i=1}^n y_i\,k\!\left(\frac{x_{1i} - x_1}{h_1}\right)k\!\left(\frac{x_{2i} - x_2}{h_2}\right)}{\sum_{i=1}^n k\!\left(\frac{x_{1i} - x_1}{h_1}\right)k\!\left(\frac{x_{2i} - x_2}{h_2}\right)}.$$
If $h_2 \to \infty$, then $k\!\left(\frac{x_{2i} - x_2}{h_2}\right) \to k(0)$ is constant and cancels from the numerator and denominator, so that $\hat m(x_1, x_2)$ reduces to a function of $x_1$ alone. Thus a large estimated value of the window-width leads to the exclusion of the corresponding variable, and hence to variable selection.
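A minimal sketch of leave-one-out CV bandwidth choice for the local constant estimator follows, on hypothetical one-dimensional data with an illustrative bandwidth grid; with several regressors, a very large selected $h_j$ would flag the $j$-th variable as irrelevant, as discussed above.

```python
# Minimal sketch of leave-one-out CV for the bandwidth h of the local constant (Nadaraya-Watson)
# estimator with a Gaussian kernel.
import numpy as np

rng = np.random.default_rng(8)
n = 200
x = rng.uniform(-2, 2, size=n)
y = np.sin(np.pi * x) + 0.3 * rng.normal(size=n)              # hypothetical curve plus noise

def cv_score(h):
    u = (x[:, None] - x[None, :]) / h                         # pairwise scaled distances
    K = np.exp(-0.5 * u ** 2)
    np.fill_diagonal(K, 0.0)                                  # leave-one-out: drop own observation
    m_loo = (K @ y) / K.sum(axis=1)
    return np.mean((y - m_loo) ** 2)

grid = np.linspace(0.05, 1.0, 40)
h_cv = grid[np.argmin([cv_score(h) for h in grid])]
print("CV-selected bandwidth:", round(h_cv, 3))
```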
In a seminal paper, [83] shows that the Mallows, GCV, and CV procedures are asymptotically equivalent, and all of them lead to optimal smoothing in the sense that
$$\frac{I(\hat h)}{\inf_h I(h)} \to_p 1 \qquad \text{as } n\to\infty,$$
where, for given $q$ and $d$, $\hat m$ based on $\hat h$ is an estimator of $m$ with $\hat h$ obtained using one of the above procedures.
Also, [101] demonstrates that for the local constant estimator ($d = 0$ and given $q$), data-driven smoothing selectors of $h$ are asymptotically equivalent to GCV selectors. In an important paper, [92] shows the asymptotic normality of $\hat m(x)$, where $h$ is obtained by the CV method and $x$ is a vector of mixed continuous and discrete variables. Their extensive simulation results reveal (without a theoretical proof) that the AIC window-width selection criterion is asymptotically equivalent to the CV method, but that in small samples AIC tends to perform better than the CV method. Further, with respect to the comparison of NP and parametric models, their results explain the observations of [102], which finds that NP estimators with smoothing parameters $h$ chosen by CV can yield better predictions relative to commonly used parametric methods for the datasets of several countries. Reference [85] shows that CV is optimal under heteroskedasticity. For GMM model selection, which involves selecting moment conditions, see [93]. Also, see [94] for using the minimization of empirical likelihood/KLIC, and the comments by [95] claiming a fundamental flaw in the application of KLIC.
3.2. NP Model Averaging
Let us consider $\{\hat m_1,\dots,\hat m_M\}$ to be the set of NP estimators of $m$ corresponding to the different sets of regressors considered in the model selection. Then
$$\hat m(w) = \sum_{l=1}^M w_l\,\hat m_l = \sum_{l=1}^M w_l\,P_l(h_l)\,y,$$
where $0 \le w_l \le 1$, $\sum_{l=1}^M w_l = 1$, and $P_l(h_l)$ is the $P$ matrix, as defined before, based here on the variables in the $l$-th model. Then the choice of $w$ can be determined by applying the Mallows criterion (see Section 2.2.2) as
$$C_n(w) = \frac{1}{n}\,w'\bar E'\bar E\,w + \frac{2\sigma^2}{n}\sum_{l=1}^M w_l\,\mathrm{tr}\!\left(P_l(h_l)\right),$$
where $\bar E = (\bar e_1,\dots,\bar e_M)$, with $\bar e_l = y - \hat m_l$, is an $n\times M$ matrix of NP residuals from all the models. Thus we get
$$\hat w = \arg\min_{w\in\mathcal{W}} C_n(w).$$
Similarly, as in Section 2.2.3, if we calculate the leave-one-out NP fitted values by deleting one observation at a time, then $w$ can be determined by minimizing
$$CV_n(w) = \frac{1}{n}\,w'\tilde E'\tilde E\,w,$$
in which $\tilde E = (\tilde e_1,\dots,\tilde e_M)$ is the NP leave-one-out residual matrix, with $\tilde e_l = y - \tilde m_l$, where the $i$-th element of $\tilde m_l$ is computed with the $i$-th observation deleted.

For a fixed window-width, the optimality result for $\hat w$ can be shown to follow from [83]. However, for a data-dependent (estimated) window-width, the validity of Li's result needs further investigation.
4. Conclusions
Nonparametric and parametric models are widely studied in econometrics and used in practice. In all applications, an important issue is to reduce model uncertainty by using model selection or model averaging. This paper selectively reviews frequentist results on model selection and model averaging in the regression context.
It is clear that most of the results presented are under the i.i.d. assumption. It would be useful to relax this assumption to allow dependence or heterogeneity in the data; see [103] for model selection in dependent time series models using various CV procedures. A systematic study of the properties of estimators based on FMA is warranted. Further, results need to be developed for more complicated nonparametric models, e.g., panel data models and models where variables are endogenous, although for the parametric case see [104,105,106,107,108]. Also, the properties of NP model averaging estimators when the window-width in kernel regression is estimated remain to be developed, although readers can see [96] for NP results for estimators based on the sieve method.
Acknowledgements
The authors are thankful to L. Su, A. Wan, X. Zhang, and G. Zou for discussions and references on the subject matter of this paper. They are also grateful to the guest editor, Tomohiro Ando, and anonymous referees for their constructive suggestions and comments. The first author is also thankful to the Academic Senate, UCR, for its financial support.
Conflicts of Interest
The authors declare no conflict of interest.
References
- A. Pagan, and A. Ullah. Nonparametric Econometrics. Cambridge, UK: Cambridge University Press, 1999. [Google Scholar]
- Q. Li, and J.S. Racine. Nonparametric Econometrics: Theory and Practice. Princeton, NJ, USA: Princeton University Press, 2007. [Google Scholar]
- A. Belloni, and V. Chernozhukov. “L1-penalized quantile regression in high-dimensional sparse models.” Ann. Stat. 39 (2011): 82–130. [Google Scholar] [CrossRef]
- C. Zhang, J. Fan, and T. Yu. “Multiple testing via FDRL for large-scale imaging data.” Ann. Stat. 39 (2011): 613–642. [Google Scholar] [CrossRef] [PubMed]
- H. Akaike. “Information Theory and An Extension of the Maximum Likelihood Principle.” In International Symposium on Information Theory. Edited by B.N. Petrov and F. Csaki. New York, USA: Springer-Verlag, 1973, pp. 267–281. [Google Scholar]
- C.L. Mallows. “Some comments on Cp.” Technometrics 15 (1973): 661–675. [Google Scholar]
- G. Schwarz. “Estimating the dimension of a model.” Ann. Stat. 6 (1978): 461–464. [Google Scholar] [CrossRef]
- M. Stone. “Cross-validatory choice and assessment of statistical predictions.” J. R. Stat. Soc. 36 (1974): 111–147. [Google Scholar]
- P. Craven, and G. Wahba. “Smoothing noisy data with spline functions.” Numer. Math. 31 (1979): 377–403. [Google Scholar] [CrossRef]
- G. Claeskens, and N.L. Hjort. “The focused information criterion.” J. Am. Stat. Assoc. 98 (2003): 900–945. [Google Scholar] [CrossRef]
- I.E. Frank, and J.H. Friedman. “A statistical view of some chemometrics regression tools.” Technometrics 35 (1993): 109–135. [Google Scholar] [CrossRef]
- W. Fu, and K. Knight. “Asymptotics for lasso-type estimators.” Ann. Stat. 28 (2000): 1356–1378. [Google Scholar] [CrossRef]
- A.E. Hoerl, and R.W. Kennard. “Ridge regression: Biased estimation for nonorthogonal problems.” Technometrics 12 (1970): 55–67. [Google Scholar] [CrossRef]
- J. Fan, and J. Lv. “A selective overview of variable selection in high dimensional feature space.” Stat. Sin. 20 (2010): 101–148. [Google Scholar] [PubMed]
- P. Bühlmann, and S. Van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. New York, NY, USA: Springer, 2011. [Google Scholar]
- G. Claeskens, and N.L. Hjort. Model Selection and Model Averaging. Cambridge, UK: Cambridge University Press, 2008. [Google Scholar]
- D. Andrews, and B. Lu. “Consistent model and moment selection procedures for GMM estimation with application to dynamic panel data models.” J. Econom. 101 (2001): 123–164. [Google Scholar] [CrossRef]
- A.R. Hall, A. Inoue, K. Jana, and C. Shin. “Information in generalized method of moments estimation and entropy-based moment selection.” J. Econom. 138 (2007): 488–512. [Google Scholar] [CrossRef]
- B.M. Pötscher. “Effects of model selection on inference.” Econom. Theory 7 (1991): 163–185. [Google Scholar] [CrossRef]
- P. Kabaila. “The Effect of Model Selection on Confidence Regions and Prediction Regions.” Econom. Theory 11 (1995): 537–549. [Google Scholar] [CrossRef]
- P. Bühlmann. “Efficient and adaptive post-model-selection estimators.” J. Stat. Plan. Inference 79 (1999): 1–9. [Google Scholar] [CrossRef]
- H. Leeb, and B.M. Pötscher. “The finite-sample distribution of post-model-selection estimators and uniform versus nonuniform approximations.” Econom. Theory 19 (2003): 100–142. [Google Scholar] [CrossRef]
- H. Leeb, and B.M. Pötscher. “Can one estimate the conditional distribution of post-model-selection estimators? ” Ann. Stat. 34 (2006): 2554–2591. [Google Scholar] [CrossRef]
- L. Breiman. “Heuristics of instability and stabilization in model selection.” Ann. Stat. 24 (1996): 2350–2383. [Google Scholar] [CrossRef]
- S. Jin, L. Su, and A. Ullah. “Robustify financial time series forecasting.” Econom. Rev., 2013, in press. [Google Scholar] [CrossRef]
- J.F. Geweke. Contemporary Bayesian Econometrics and Statistics. Hoboken, NJ, USA: John Wiley and Sons Inc., 2005. [Google Scholar]
- J.F. Geweke. “Bayesian model comparison and validation.” Am. Econ. Rev. Pap. Proc. 97 (2007): 60–64. [Google Scholar] [CrossRef]
- D. Draper. “Assessment and propagation of model uncertainty.” J. R. Stat. Soc. 57 (1995): 45–97. [Google Scholar]
- J.A. Hoeting, D. Madigan, A.E. Raftery, and C.T. Volinsky. “Bayesian model averaging: A tutorial (with discussion).” Stat. Sci. 14 (1999): 382–417. [Google Scholar]
- M. Clyde, and E.I. George. “Model uncertainty.” Stat. Sci. 19 (2004): 81–94. [Google Scholar]
- W.A. Brock, S.N. Durlauf, and K.D. West. “Policy evaluation in uncertain economic environments.” Brook. Pap. Econ. Act. 2003 (2003): 235–301. [Google Scholar] [CrossRef]
- X. Sala-i-Martin, G. Doppelhofer, and R.I. Miller. “Determinants of long-term growth: A Bayesian Averaging of Classical Estimates (BACE) approach.” Am. Econ. Rev. 94 (2004): 813–835. [Google Scholar] [CrossRef]
- J.R. Magnus, O. Powell, and P. Prüfer. “A comparison of two model averaging techniques with an application to growth empirics.” J. Econom. 154 (2010): 139–153. [Google Scholar] [CrossRef]
- S.T. Buckland, K.P. Burnham, and N.H. Augustin. “Model selection: An integral part of inference.” Biometrics 53 (1997): 603–618. [Google Scholar] [CrossRef]
- Y. Yang. “Adaptive regression by mixing.” J. Am. Stat. Assoc. 96 (2001): 574–586. [Google Scholar] [CrossRef]
- K.P. Burnham, and D.R. Anderson. Model Selection and Multimodel Inference: A Practical Information-Theoretical Approach. New York, NY, USA: Springer-Verlag, 2002. [Google Scholar]
- G. Leung, and A.R. Barron. “Information theory and mixing least-squares regressions.” IEEE Trans. Inf. Theory 52 (2006): 3396–3410. [Google Scholar] [CrossRef]
- Z. Yuan, and Y. Yang. “Combining linear regression models: When and how? ” J. Bus. Econ. Stat. 100 (2005): 1202–1204. [Google Scholar] [CrossRef]
- B.E. Hansen. “Notes and comments least squares model averaging.” Econometrica 75 (2007): 1175–1189. [Google Scholar] [CrossRef]
- B.E. Hansen, and J. Racine. “Jackknife model averaging.” J. Econom. 167 (2012): 38–46. [Google Scholar] [CrossRef]
- A.T.K. Wan, X. Zhang, and G. Zou. “Least squares model averaging by mallows criterion.” J. Econom. 156 (2010): 277–283. [Google Scholar] [CrossRef]
- G. Kapetanios, V. Labhard, and S. Price. “Forecasting using predictive likelihood model averaging.” Econ. Lett. 91 (2006): 373–379. [Google Scholar] [CrossRef]
- A.T.K. Wan, and X. Zhang. “On the use of model averaging in tourism research.” Ann. Tour. Res. 36 (2009): 525–532. [Google Scholar] [CrossRef]
- J.M. Bates, and C.W. Granger. “The combination of forecasts.” Oper. Res. Q. 20 (1969): 451–468. [Google Scholar] [CrossRef]
- I. Olkin, and C.H. Speigelman. “A semiparametric approach to density estimation.” J. Am. Stat. Assoc. 82 (1987): 858–865. [Google Scholar] [CrossRef]
- Y. Fan, and A. Ullah. “Asymptotic normality of a combined regression estimator.” J. Multivar. Anal. 71 (1999): 191–240. [Google Scholar] [CrossRef]
- D.H. Wolpert. “Stacked generalization.” Neural Netw. 5 (1992): 241–259. [Google Scholar] [CrossRef]
- M. LeBlanc, and R. Tibshirani. “Combining estimates in regression and classification.” J. Am. Stat. Assoc. 91 (1996): 1641–1650. [Google Scholar] [CrossRef]
- Y. Yang. “Mixing strategies for density estimation.” Ann. Stat. 28 (2000): 75–87. [Google Scholar] [CrossRef]
- O. Catoni. The Mixture Approach to Universal Model Selection. Technical Report; Paris, France: Ecole Normale Superieure, 1997. [Google Scholar]
- M.I. Jordan, and R.A. Jacobs. “Hierarchical mixtures of experts and the EM algorithm.” Neural Comput. 6 (1994): 181–214. [Google Scholar] [CrossRef]
- X. Jiang, and M.A. Tanner. “On the asymptotic normality of hierarchical mixtures-of-experts for generalized linear models.” IEEE Trans. Inf. Theory 46 (2000): 1005–1013. [Google Scholar] [CrossRef]
- V.G. Vovk. “Aggregating Strategies.” In Proceedings of the 3rd Annual Workshop on Computational Learning Theory, Rochester, NY, USA, 06–08 August 1990; Volume 56, pp. 371–383.
- V.G. Vovk. “A game of prediction with expert advice.” J. Comput. Syst. Sci. 56 (1998): 153–173. [Google Scholar] [CrossRef]
- N. Merhav, and M. Feder. “Universal prediction.” IEEE Trans. Inf. Theory 44 (1998): 2124–2147. [Google Scholar] [CrossRef]
- A. Ullah. “Nonparametric estimation of econometric functionals.” Can. J. Econ. 21 (1988): 625–658. [Google Scholar] [CrossRef]
- J. Fan, and I. Gijbels. Local Polynomial Modelling and Its Applications. London, UK: Chapman and Hall, 1996. [Google Scholar]
- R.L. Eubank. Nonparametric Regression and Spline Smoothing. New York, NY, USA: CRC Press, 1999. [Google Scholar]
- S. Geman, and C. Hwang. “Diffusions for global optimization.” SIAM J. Control Optim. 24 (1982): 1031–1043. [Google Scholar] [CrossRef]
- W.K. Newey. “Convergence rates and asymptotic normality for series estimators.” J. Econom. 79 (1997): 147–168. [Google Scholar] [CrossRef]
- H. Wang, X. Zhang, and G. Zou. “Frequentist model averaging estimation: A review.” J. Syst. Sci. Complex. 22 (2009): 732–748. [Google Scholar] [CrossRef]
- L. Su, and Y. Zhang. “Variable Selection in Nonparametric and Semiparametric Regression Models.” In Handbook of Applied Nonparametric and Semiparametric Econometrics and Statistics. Edited by A. Ullah, J. Racine and L. Su. Oxford, UK: Oxford University Press, 2013, in press. [Google Scholar]
- A.K. Srivastava, V.K. Srivastava, and A. Ullah. “The coefficient of determination and its adjusted version in linear regression models.” Econom. Rev. 14 (1995): 229–240. [Google Scholar] [CrossRef]
- V. Rousson, and N.F. Gosoniu. “An R-square coefficient based on final prediction error.” Stat. Methodol. 4 (2007): 331–340. [Google Scholar] [CrossRef]
- Y. Wang. On Efficiency Properties of An R-square Coefficient Based on Final Prediction Error. Working Paper; Beijing, China: School of International Trade and Economics, University of International Business and Economics, 2013. [Google Scholar]
- K. Takeuchi. “Distribution of information statistics and criteria for adequacy of models.” Math. Sci. 153 (1976): 12–18, In Japanese. [Google Scholar]
- E. Maasoumi. “A compendium to information theory in economics and econometrics.” Econom. Rev. 12 (1993): 137–181. [Google Scholar] [CrossRef]
- A. Ullah. “Entropy, divergence and distance measures with econometric applications.” J. Stat. Plan. Inference 49 (1996): 137–162. [Google Scholar] [CrossRef]
- R. Nishi. “Asymptotic properties of criteria for selection of variables in multiple regression.” Ann. Stat. 12 (1984): 758–765. [Google Scholar] [CrossRef]
- E.J. Hannan, and B.G. Quinn. “The determination of the order of an autoregression.” J. R. Stat. Soc. 41 (1979): 190–195. [Google Scholar]
- C.M. Hurvich, and C.L. Tsai. “Regression and time series model selection in small samples.” Biometrika 76 (1989): 297–307. [Google Scholar] [CrossRef]
- J. Kuha. “AIC and BIC: Comparisons of assumptions and performance.” Sociol. Methods Res. 33 (2004): 188–229. [Google Scholar] [CrossRef]
- M. Stone. “An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion.” J. R. Stat. Soc. 39 (1977): 44–47. [Google Scholar]
- M. Stone. “Comments on model selection criteria of Akaike and Schwarz.” J. R. Stat. Soc. 41 (1979): 276–278. [Google Scholar]
- G.S. Maddala. Introduction to Econometrics. New York, NY, USA: Macmillan, 1988. [Google Scholar]
- R. Tibshirani. “Regression shrinkage and selection via the lasso.” J. R. Stat. 58 (1996): 267–288. [Google Scholar]
- A. Ullah, A.T.K. Wan, H. Wang, X. Zhang, and G. Zou. A Semiparametric Generalized Ridge Estimator and Link with Model Averaging. Working Paper; Riverside, CA, USA: Department of Economics, University of California, 2013. [Google Scholar]
- H. Zou. “The adaptive lasso and its oracle properties.” J. Am. Stat. Assoc. 101 (2006): 1418–1429. [Google Scholar] [CrossRef]
- C. Zhang. “Nearly unbiased variable selection under minimax concave penalty.” Ann. Stat. 38 (2010): 894–942. [Google Scholar] [CrossRef]
- J. Fan, and R. Li. “Variable selection via nonconcave penalized likelihood and its oracle properties.” J. Am. Stat. Assoc. 96 (2001): 1348–1360. [Google Scholar] [CrossRef]
- B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. “Least angle regression.” Ann. Stat. 32 (2004): 407–499. [Google Scholar]
- X. Zhang, A.T.K. Wan, and S.Z. Zhou. “Focused information criteria, model selection, and model averaging in a tobit model with a nonzero threshold.” J. Bus. Econ. Stat. 30 (2012): 132–143. [Google Scholar] [CrossRef]
- K.C. Li. “Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: discrete index set.” Ann. Stat. 15 (1987): 958–975. [Google Scholar] [CrossRef]
- B. Hansen. “Least-squares forecast averaging.” J. Econom. 146 (2008): 342–350. [Google Scholar] [CrossRef]
- D.W.K. Andrews. “Asymptotic optimality of generalized CL, cross-validation, and generalized cross-validation in regression with heteroskedastic errors.” J. Econom. 47 (1991): 359–377. [Google Scholar] [CrossRef]
- X. Lu, and L. Su. Jackknife Model Averaging for Quantile Regressions. Working Paper; Singapore: School of Economics, Singapore Management University, 2012. [Google Scholar]
- G. Kuersteiner, and R. Okui. “Constructing optimal instruments by first-stage prediction averaging.” Econometrica 78 (2010): 697–718. [Google Scholar]
- F. Yao, and A. Ullah. “A nonparametric R2 test for the presence of relevant variables.” J. Stat. Plan. Inference, 143 (2013): 1527–1547. [Google Scholar] [CrossRef]
- L. Su, and A. Ullah. “A nonparametric goodness-of-fit-based test for conditional heteroskedasticity.” Econom. Theory 29 (2013): 187–212. [Google Scholar] [CrossRef]
- L.H. Huang, and J. Chen. “Analysis of variance, coefficient of determination and f-test for local polynomial regression.” Ann. Stat. 36 (2008): 2085–2109. [Google Scholar] [CrossRef]
- C. Hurvich, J. Simonoff, and C. Tsai. “Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion.” J. R. Stat. Soc. 60 (1998): 271–293. [Google Scholar] [CrossRef]
- J. Racine, and Q. Li. “Nonparametric estimation of regression functions with both categorical and continuous data.” J. Econom. 119 (2004): 99–130. [Google Scholar] [CrossRef]
- D.W.K. Andrews. “Consistent moment selection procedures for generalized method of moments estimation.” Econometrica 67 (1999): 543–564. [Google Scholar] [CrossRef]
- X. Chen, H. Hong, and M. Shum. “Nonparametric likelihood ratio model selection tests between parametric likelihood and moment condition models.” J. Econom. 141 (2007): 109–140. [Google Scholar] [CrossRef]
- S.M. Schennach. “Instrumental variable estimation of nonlinear errors-in-variables models.” Econometrica 75 (2007): 201–239. [Google Scholar] [CrossRef]
- B. Hansen. Nonparametric Sieve Regression: Least Squares Averaging Least Squares, and Cross-validation. Working Paper; Madison, WI, USA: University of Wisconsin, 2012. [Google Scholar]
- H. Liang, G. Zou, A.T.K. Wan, and X. Zhang. “Optimal weight choice for frequentist model average estimators.” J. Am. Stat. Assoc. 106 (2011): 1053–1066. [Google Scholar] [CrossRef]
- E.A. Nadaraya. “Some new estimates for distribution functions.” Theory Probab. Its Appl. 9 (1964): 497–500. [Google Scholar] [CrossRef]
- G.S. Watson. “Smooth regression analysis.” Sankhya Ser. A 26 (1964): 359–372. [Google Scholar]
- P.G. Hall, and J.S. Racine. Infinite Order Cross-validated Local Polynomial Regression. Working Paper; Ontario, Canada: Department of Economic, McMaster University, 2013. [Google Scholar]
- W. Härdle, P. Hall, and J.S. Marron. “How far are automatically chosen regression smoothing parameters from their optimum? ” J. Am. Stat. Assoc. 83 (1988): 86–99. [Google Scholar] [CrossRef]
- Q. Li, and J. Racine. Empirical Applications of Smoothing Categorical Variables. Working Paper; Ontario, Canada: Department of Economic, McMaster University, 2001. [Google Scholar]
- J. Racine. “Consistent cross-validatory model-selection for dependent data: Hv-block cross-validation.” J. Econom. 99 (2000): 39–61. [Google Scholar] [CrossRef]
- M. Caner. “A lasso type GMM estimator.” Econom. Theory 25 (2009): 270–290. [Google Scholar] [CrossRef]
- M. Caner, and M. Fan. A Near Minimax Risk Bound: Adaptive Lasso with Heteroskedastic Data in Instrumental Variable Selection. Working Paper; Raleigh, USA: North Carolina State University, 2011. [Google Scholar]
- P.E. Garcia. Instrumental Variable Estimation and Selection with Many Weak and Irrelevant Instruments. Working Paper; Madison, WI, USA: University of Wisconsin, 2011. [Google Scholar]
- Z. Liao. “Adaptive GMM shrinkage estimation with consistent moment selection.” Econom. Theory FirstView (2013): 1–48. [Google Scholar] [CrossRef]
- E. Gautier, and A. Tsybakov. High-Dimensional Instrumental Variables Regression and Confidence Sets. Working Paper; Malakoff Cedex, France: Centre de Recherche en Economie et Statistique, 2011. [Google Scholar]
© 2013 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).