1. Introduction
Model averaging has been developed as an alternative to model selection. In many situations, model-averaging methods perform better than alternative model-selection methods. The main reason for this is that model selection delivers a pretest estimator that has inferior properties, and its use can be harmful (see Danilov and Magnus, 2004). Yuan and Yang (2005) provided a detailed discussion on the choice between model averaging and model selection. As one of the pioneers of frequentist model averaging, Hansen (2007) proposed Mallows model averaging (MMA) based on the ordinary least-squares (OLS) estimator for linear regression models with homoscedastic errors. Wan et al. (2010) extended the results to non-nested models with homoscedastic errors. Zhao et al. (2018) is the most recent work in this area. For linear regression models with heteroscedastic errors, Hansen and Racine (2012), Liu and Okui (2013) and Zhang et al. (2013, 2015) proposed model-averaging methods that are still based on the OLS estimator, while Liu et al. (2016) proposed a method based on the generalized least squares (GLS) estimator. They demonstrated that their methods are optimal in the sense of Li (1987) for homoscedastic or heteroscedastic models. For model averaging in big datasets, Xie (2017) proposed the use of model screening (before averaging) in order to deal with the large number of candidate models/regressors.
However, all previous papers assumed that it was known whether the errors of the true data-generating process are homoscedastic or heteroscedastic. Due to this assumption, the previous averaging methods were based only on estimators constructed using the same estimation method, either OLS or GLS estimators (with different regressor sets). This assumption can be unrealistic in empirical applications. Usually, researchers do not know the structure of the error term; therefore, this assumption leads to possible misspecification. A natural solution is to combine OLS and GLS estimators. Combinations of different methods are routinely used in the applied forecast combination literature. In a recent forecasting competition that included 100,000 series, Makridakis et al. (2018) found that, out of the 17 most accurate methods, 12 were combinations. All combinations used different models/methods varying from simple exponential smoothing models to sophisticated machine-learning algorithms.
We propose a combination method based on OLS and GLS estimators to reduce the risk of misspecification between homoscedastic and heteroscedastic linear models. More precisely, the proposed estimator is a weighted average of mixtures of OLS and GLS estimators. The OLS mixture is constructed using the MMA of Hansen (2007) or the heteroscedasticity-robust Cp (HRCp) model averaging of Liu and Okui (2013). The GLS mixture is constructed using the GLS model averaging (GLSMA) of Liu et al. (2016).
We propose the use of two criteria, MMA-GLSMA and HRCp-GLSMA, to choose the weight vector for combining estimators. The optimality of the chosen weight vector in the sense of Li (1987) is investigated. Our method works in situations with an unknown variance-covariance matrix of the error term if an estimate based on the nonparametric k-nearest neighbours (k-NN) method is used. The results of the simulation experiments show that our combination method is adaptive in the sense that it can achieve almost the same estimation accuracy as if the homoscedasticity or heteroscedasticity of the error term were known.
The rest of the paper is organized as follows. In Section 2, we describe the theoretical setup and introduce the new combination method with the criteria for choosing the weight vector. In Section 3, we investigate the optimality of the proposed criteria. Section 4 presents the results of the Monte Carlo simulations. Section 5 concludes the paper, and all proofs are provided in Appendix A.
2. Method
Suppose that we have an independent random sample of 
 for 
, where 
 is a countably infinite real-valued vector and 
 is a real-valued scalar random variable generated from an infinite dimensional linear regression model:
 where:
 is an unobserved error term that can be homoscedastic or heteroscedastic with ,  and  and  for  are unknown parameters. We also define , , ,  and denote the variance-covariance matrix as . We state the theoretical results considering the distributions conditional on X and omit all notations for those conditional on X hereafter.
2.1. Infeasible Combination Estimator and Information Criteria
Suppose 
 is known; we have a candidate set of 
 linear models with different numbers of independent variables for OLS estimation and a candidate set of 
 linear models with different numbers of independent variables for GLS estimation. Then, we can obtain a set of OLS estimates 
 for 
 and a set of GLS estimates 
 for 
. Therein, 
 is the projection matrix of the 
 regression model for OLS with 
, and 
 with 
 being the independent variable matrix of the 
 regression model for GLS with 
. In this paper, we only consider the situation with nested models for both OLS and GLS estimators. This means that the 
 model is nested in the 
 model. Theoretical results may be extended to non-nested candidate models using the approach of 
Wan et al. (
2010). Moreover, 
 and 
 can be fixed or go to infinity when the sample size 
n is increasing.
Based on those OLS and GLS estimates, we construct a combination estimator as follows:
 where 
 belongs to:
 where 
 and 
I denote a 
 vector having all elements equal to one.
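The construction described above can be sketched in code. This is a minimal illustration and not the authors' implementation: it assumes independent errors, so that the variance-covariance matrix is diagonal and GLS reduces to weighted least squares, and it assumes the weight set is the unit simplex (non-negative weights summing to one). The function names `nested_ols_fits`, `nested_gls_fits` and `combine` are hypothetical.

```python
import numpy as np

def nested_ols_fits(X, y, sizes):
    """Fitted values of nested OLS models; model m uses the first sizes[m] columns of X."""
    cols = []
    for k in sizes:
        Xm = X[:, :k]
        beta = np.linalg.lstsq(Xm, y, rcond=None)[0]
        cols.append(Xm @ beta)
    return np.column_stack(cols)

def nested_gls_fits(X, y, sizes, sigma2):
    """Fitted values of nested GLS models, ASSUMING Omega = diag(sigma2).

    With a diagonal Omega, GLS is least squares on rows rescaled by 1/sigma_i.
    """
    scale = 1.0 / np.sqrt(np.asarray(sigma2, dtype=float))
    cols = []
    for k in sizes:
        Xm = X[:, :k]
        beta = np.linalg.lstsq(Xm * scale[:, None], y * scale, rcond=None)[0]
        cols.append(Xm @ beta)
    return np.column_stack(cols)

def combine(ols_fits, gls_fits, w):
    """Weighted average of all OLS and GLS fits; w stacks OLS weights, then GLS weights."""
    w = np.asarray(w, dtype=float)
    if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to one")
    return np.column_stack([ols_fits, gls_fits]) @ w
```

Stacking the OLS and GLS fits side by side means a single weight vector decides both which regressor sets matter and how much to trust each estimation method.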
In order to reduce the risk of the combination-estimation method proposed above, we need to select a suitable weight vector. To do that, in this subsection, we propose two versions, MMA-GLSMA and HRCp-GLSMA, with infeasible criteria. In the next subsection, we provide their feasible counterparts for the situation with unknown  values.
HRCp-GLSMA: The first information criterion for selecting a weight vector is defined as: 
 where 
, 
, 
, 
, 
 and 
.
Note that:
 where 
 is the HRCp model-averaging criterion proposed by 
Liu and Okui (
2013) with the weight vector 
 and 
 is a GLSMA information criterion proposed by 
Liu et al. (
2016) with the weight vector 
. 
 can be regarded as a combination of the HRCp and the GLSMA. Hence, we call 
 the HRCp-GLSMA-type criterion.
MMA-GLSMA: Second, we propose an MMA-GLSMA-type criterion for weight selection. The infeasible MMA-GLSMA-type criterion is defined as follows:
 where 
 is the MMA criterion proposed by 
Hansen (
2007) with the weight vector 
.
Suppose the variance-covariance matrix 
 is known; we can then choose the weight by minimizing the criteria 
 or 
, as follows:
 or
        
However, since the variance-covariance matrix  is unknown,  and  are infeasible.
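Criterion-based weight selection can be illustrated for the OLS-only MMA component. The sketch below is not the combined criterion of this paper: it uses the Mallows criterion of Hansen (2007), the residual sum of squares plus twice the error variance times the weighted number of regressors, with the variance estimated from the largest model. It also assumes, for simplicity, that weights lie on a discrete grid with step 1/N, so the minimizer can be found by enumeration. All function names are hypothetical.

```python
import itertools
import numpy as np

def weight_grid(n_models, N):
    """Yield every weight vector with entries in {0, 1/N, ..., 1} that sums to one."""
    # Stars and bars: place n_models - 1 bars among N + n_models - 1 slots.
    for bars in itertools.combinations(range(N + n_models - 1), n_models - 1):
        counts, prev = [], -1
        for b in bars:
            counts.append(b - prev - 1)
            prev = b
        counts.append(N + n_models - 2 - prev)
        yield np.array(counts, dtype=float) / N

def nested_ols_fits(X, y, sizes):
    """Fitted values of nested OLS models; model m uses the first sizes[m] columns of X."""
    cols = []
    for k in sizes:
        beta = np.linalg.lstsq(X[:, :k], y, rcond=None)[0]
        cols.append(X[:, :k] @ beta)
    return np.column_stack(cols)

def mallows_criterion(y, fits, sizes, w, sigma2):
    """Residual sum of squares of the averaged fit plus the Mallows penalty."""
    resid = y - fits @ w
    return float(resid @ resid + 2.0 * sigma2 * np.dot(sizes, w))

def select_mma_weights(X, y, sizes, N=10):
    """Grid search for the weights minimizing the Mallows criterion."""
    fits = nested_ols_fits(X, y, sizes)
    largest = int(np.argmax(sizes))
    resid = y - fits[:, largest]
    # Variance estimate from the largest candidate model.
    sigma2 = float(resid @ resid) / (len(y) - sizes[largest])
    best = min(weight_grid(len(sizes), N),
               key=lambda w: mallows_criterion(y, fits, sizes, w, sigma2))
    return best, mallows_criterion(y, fits, sizes, best, sigma2)
```

Enumeration is only practical for a handful of candidate models; with many models, a quadratic programming solver over the continuous simplex would be the usual choice.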
2.2. Feasible Combination Estimator and Information Criteria
For a situation with unknown variance, a feasible combination estimator can be constructed using feasible GLS (FGLS) estimators. FGLS estimators and the feasible combination estimator are defined below:
 where the FGLS estimator is 
. Therein, the estimator 
 is based on the 
k-NN estimator 
 of 
Liu et al. (
2016).
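A rough sketch of a k-NN variance estimate followed by an FGLS fit is given below. It is a generic nearest-neighbour smoother with uniform weights over the k nearest observations (including the observation itself), and it assumes a diagonal variance-covariance matrix; the estimator actually used in Liu et al. (2016) may weight neighbours differently. The names `knn_variance` and `fgls_fit` are hypothetical.

```python
import numpy as np

def knn_variance(Z, resid, k):
    """Estimate sigma_i^2 as the mean squared residual over the k nearest neighbours of Z[i].

    Z holds the covariates used to measure closeness (one row per observation);
    resid holds residuals from a preliminary regression.
    """
    resid = np.asarray(resid, dtype=float)
    Z = np.asarray(Z, dtype=float).reshape(len(resid), -1)
    # Pairwise Euclidean distances between observations.
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    # Indices of the k closest observations for each row (self included, distance 0).
    nearest = np.argsort(dist, axis=1)[:, :k]
    return (resid ** 2)[nearest].mean(axis=1)

def fgls_fit(X, y, sigma2_hat):
    """Feasible GLS fitted values, ASSUMING Omega-hat = diag(sigma2_hat)."""
    scale = 1.0 / np.sqrt(sigma2_hat)
    beta = np.linalg.lstsq(X * scale[:, None], y * scale, rcond=None)[0]
    return X @ beta
```

The choice of k trades off bias and variance: a small k tracks local heteroscedasticity closely but is noisy, while k equal to the sample size collapses to a single homoscedastic variance estimate.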
We propose two feasible counterparts of 
 and 
. The feasible HRCp-GLSMA-type criterion 
 is defined as:
 and the feasible MMA-GLSMA-type criterion 
 is defined as:
 where 
 denotes the 
 element of 
 with 
 denoting the number of independent variables in the largest OLS model:
  denotes the projection matrix of that model, and 
 denotes the 
 diagonal element of 
. 
, while 
 is defined as one of the estimators suggested by 
Hansen (
2007). 
 is calculated by plugging in the 
k-NN estimator 
 used in 
Liu et al. (
2016).
3. Properties of the Criteria
The following lemma demonstrates the significant fact that the two infeasible criteria proposed above are unbiased estimates of the risk function  plus a constant;  is the loss function.
Lemma 1. For any real-valued vector W,  and , where  and  are constants.
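For intuition, the unbiasedness property can be checked directly for the OLS-only averaging component. The derivation below is a sketch under assumptions that go beyond what is recoverable from the text: it takes $y = \mu + e$ with $E(e \mid X) = 0$ and $\mathrm{Var}(e \mid X) = \Omega$, a symmetric averaging matrix $P(W) = \sum_m w_m P_m$, the squared-error loss $L(W) = \|P(W)y - \mu\|^2$, and a criterion of the form $C(W) = \|y - P(W)y\|^2 + 2\,\mathrm{tr}\{\Omega P(W)\}$. The combined criteria of this paper may differ in detail.

```latex
\begin{align*}
\|y - P(W)y\|^2 &= \|\mu - P(W)y\|^2 + \|e\|^2 + 2\,e'\{\mu - P(W)y\}, \\
E\big[e'\{\mu - P(W)y\}\big] &= -E\big[e'P(W)e\big] = -\operatorname{tr}\{\Omega P(W)\}, \\
E\big[\|y - P(W)y\|^2\big] &= E\big[L(W)\big] + \operatorname{tr}(\Omega) - 2\operatorname{tr}\{\Omega P(W)\}, \\
E\big[C(W)\big] &= E\big[L(W)\big] + \operatorname{tr}(\Omega).
\end{align*}
```

The constant $\operatorname{tr}(\Omega)$ does not depend on $W$, so minimizing the expected criterion is equivalent to minimizing the risk. Under homoscedasticity, $\Omega = \sigma^2 I$ and the penalty reduces to the Mallows form $2\sigma^2 \sum_m w_m k_m$, where $k_m$ is the number of regressors in model $m$.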
 Another useful property is that all criteria are asymptotically optimal in the sense of 
Li (
1987). Proofs of the asymptotic optimality for all of the above-mentioned criteria can be performed by extending the proofs of 
Hansen (
2007); 
Liu and Okui (
2013); 
Liu et al. (
2016). As an example, we demonstrate the optimality of the feasible MMA-GLSMA method with the nonparametric estimator of 
 used in 
Liu et al. (
2016).
To do that, we define the feasible loss function and the risk function as follows:
We employ some notations and assumptions from 
Liu et al. (
2016), reproduced in the Appendix, and add the following additional assumption:  
Assumption 1’. As  and , , where ,  is the number of regressors used in the regression model for the k-NN estimator and  denotes the approximation error of that model.
 Assumption 1’ guarantees that when the number of regressors used in the regression model (adjusted with the approximation errors of that model) increases with the sample size n, it increases at a rate slower than the lower bound of the risk across all possible weights. In practice, this assumption requires us to moderate the increase in the number of regressors  (relative to the sample size) to reduce the approximation errors.
Following 
Hansen (
2007), we restrict the elements of the weighting vector to belong to set 
 for some integer 
.
Theorem 1. Under Assumption 1’ and Assumptions 1–3, 6, and 10–14 of Liu et al. (2016), as , ,  and , we have: where . In other words, as the sample size increases, our method can achieve the infimum of the loss.
4. Simulation Study
To investigate the finite sample performance of the proposed MMA-GLSMA and HRCp-GLSMA versions and to compare them with other alternative methods, we performed a Monte Carlo simulation. Alternative methods include MMA, 
 and GLSMA with 
, where 
 is the feasible counterpart of 
 proposed in 
Liu et al. (
2016).
The data-generating process (DGP) is:
 where:
 for 
, with 
. We used parameter 
 with 
j truncated at 10,000 and a positive constant 
c. 
 and 
 for 
 are independent with respect to 
i. We conducted three simulations. The first case was a simulation with a homoscedastic error term 
. The second case was a mild heteroscedastic case with heteroscedastic error term 
, with 
. The third case was a strong heteroscedastic case with 
. In all cases, 
 was independent with respect to 
i. The number of regressors in the largest approximation model or the number of nested models was 
. The 
 model contained the first 
m regressors, including the constant term. We varied 
c to change the 
 of the DGP from 
–
 with an increment of 
.
We considered two cases of GLSMA with different estimation methods of 
, one based on the maximum likelihood estimation (MLE), and the other based on the nonparametric method 
k-NN. For details of these two cases, see 
Liu et al. (
2016). Because the true specification of 
 is usually unknown in practice, we misspecified 
 for GLSMA. For the MLE-based method, we set 
, where 
a and 
b are the unknown parameters to be estimated. For the nonparametric case, we only used 
 and 
 for 
k-NN estimation.
The number of replications for all simulations was 1000. We evaluated the performance of each method by the sample mean squared error (MSE) 
, where 
 and 
 are the realized vector of the estimated value 
 and true value 
 in the 
 replication, respectively. The simulation results are shown in 
Table 1, 
Table 2 and 
Table 3. 
Figure 1 presents the same information relative to the MSE of the GLSMA method.
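The evaluation step can be sketched as follows. The normalization is an assumption: here the MSE is the average over replications of the squared Euclidean distance between the estimated and true mean vectors, and the paper's definition may include an additional scaling. The helper names and the dictionary layout are hypothetical.

```python
import numpy as np

def sample_mse(mu_hat, mu_true):
    """Average over replications of the squared distance between estimate and truth.

    Both inputs have shape (replications, n).
    """
    mu_hat = np.asarray(mu_hat, dtype=float)
    mu_true = np.asarray(mu_true, dtype=float)
    squared_dist = np.sum((mu_hat - mu_true) ** 2, axis=1)
    return float(squared_dist.mean())

def relative_mse(mse_by_method, baseline="GLSMA"):
    """Express each method's MSE as a ratio to the baseline method's MSE."""
    base = mse_by_method[baseline]
    return {name: value / base for name, value in mse_by_method.items()}
```

A relative MSE below one indicates that the method outperformed the baseline in that design.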
The results in 
Table 1 and 
Table 2 show that our combination methods, MMA-GLSMA and HRCp-GLSMA, performed better than the alternatives (GLSMA, HRCp and MMA) when the error term was homoscedastic or had mild heteroscedasticity for 
. When 
, the performance of our methods was slightly worse than that of the alternative methods. For the homoscedastic case, the three alternative methods performed similarly. On the other hand, in the case of mild heteroscedasticity, GLSMA and 
 performed better than MMA (which was expected, as MMA was designed for homoscedastic models).
Table 3 demonstrates that, when the heteroscedasticity of the error term was considerably strong, our combination method HRCp-GLSMA worked much better than the others when the MLE-based estimation of 
 was used. However, MMA-GLSMA and HRCp-GLSMA became worse than GLSMA when 
 was estimated using the nonparametric method.
 Moreover, for most cases, GLSMA with the nonparametric estimator of  outperformed GLSMA with the MLE-based estimator of . This can be explained by a characteristic of the k-NN method that we adopted. In the k-NN estimation, a large weight was placed on the  squared residual to estimate ; therefore, even though  was misspecified, the estimate could catch the heteroscedasticity of the error term to some extent.
The aforementioned simulation results gave us the following indications for practical analysis. If we know that the heteroscedasticity of the data is considerably strong or the population  is large, we should use the nonparametric GLSMA. Otherwise, it is preferable to choose an MMA-GLSMA or HRCp-GLSMA combination.
Table 4 and 
Table 5 give the averages of the estimated weights corresponding to the OLS and GLS parts for HRCp-GLSMA and MMA-GLSMA. 
Table 4 shows that for HRCp-GLSMA, when the heteroscedasticity became stronger, the average weights corresponding to the GLS part increased and the average weights corresponding to the OLS estimates decreased. 
Table 5 does not show a similar trend for MMA-GLSMA. This may explain why MMA-GLSMA performed worse than HRCp-GLSMA.