
Cross-Validation Model Averaging for Generalized Functional Linear Model

by Haili Zhang 1,2 and Guohua Zou 3,*
1 University of Chinese Academy of Sciences, Beijing 100049, China
2 Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
3 School of Mathematical Sciences, Capital Normal University, Beijing 100048, China
* Author to whom correspondence should be addressed.
Econometrics 2020, 8(1), 7; https://doi.org/10.3390/econometrics8010007
Received: 2 September 2019 / Revised: 6 February 2020 / Accepted: 18 February 2020 / Published: 24 February 2020
(This article belongs to the Special Issue Bayesian and Frequentist Model Averaging)

Abstract

Functional data are a common and important data type in econometrics and have become increasingly easy to collect in the big data era. To improve estimation accuracy and reduce forecast risk with functional data, in this paper we propose a novel cross-validation model averaging method for the generalized functional linear model, in which a scalar response is related to a random functional predictor through a link function. We establish an asymptotic optimality result for the weights selected by our method when the true model is not in the candidate model set. Our simulations show that the proposed method often performs better than commonly used model selection and averaging methods. We also apply the proposed method to Beijing second-hand house price data.
Keywords: generalized functional linear model; cross-validation; model averaging; asymptotic optimality

1. Introduction

In recent years, functional data have become increasingly popular in many scientific areas. A common question for functional data is how to quantify the relationship between functional covariates and scalar responses. The functional linear model (FLM) and the generalized functional linear model (GFLM) can take into account associations between the response and different points in the domain of the functional covariates, and are therefore two useful tools in many studies of functional data. These two models have been widely used to solve practical problems, such as exploring the relationship between growth and age in the life sciences, analyzing weather data in different areas, recognizing handwriting data, and conducting diffusion tensor imaging studies. Functional data analysis usually represents functional covariates and coefficient functions by linear combinations of a set of basis functions, such as a prespecified basis system like B-splines, Fourier and wavelet bases (James 2002), or data-adaptive basis functions from functional principal component analysis (FPCA) (Yao et al. 2005). We are concerned with the GFLM because it can estimate flexible and nonlinear relationships between functional covariates and scalar responses for many types of data, such as binary responses, Poisson responses, and multivariate discrete responses. See, for example, James (2002), who extended generalized linear models to generalized functional linear models with the functional principal component methodology and demonstrated that this approach can be applied to linear, logistic and censored regressions in simulations and real data analysis.
In econometrics, the relationship between a time series and a scalar response is often of interest. We can use the GFLM instead of the generalized linear model to handle the case where a time series with dependence across time points serves as the explanatory variable, with dimension tending to infinity. On the other hand, prediction is often the main goal in econometric data analysis. Several approaches have been proposed to select the important principal components in FPCA, such as AIC, BIC, and leave-one-out cross-validation (Müller and Stadtmüller 2005). However, as we will demonstrate, model selection alone, such as by AIC, is not an optimal approach for the purpose of estimation and prediction: relying on the single model selected by AIC or BIC may lose information contained in the other models. Different models often capture different data characteristics, so model averaging generally attains higher estimation or prediction accuracy; it has received extensive attention in recent years.
Model averaging has two research directions: Bayesian Model Averaging (BMA) and Frequentist Model Averaging (FMA). We will focus on the latter in this paper. A key problem with the FMA is the choice of weights assigned to different models. In this regard, various approaches have been developed. See, for example, smoothed AIC, smoothed BIC (Buckland et al. 1997), smoothed FIC (Hjort and Claeskens 2003; Claeskens and Carroll 2007; Zhang and Liang 2011; Zhang et al. 2012; Xu et al. 2014), Adaptive method (Yang 2001), MMA method (Hansen 2007; Wan et al. 2010), OPT method (Liang et al. 2011), JMA method (Hansen and Racine 2012; Zhang et al. 2013), and leave-subject-out cross-validation method (Gao et al. 2016), which apply to independent, or time series, or longitudinal data.
For functional data, some model averaging methods have been studied. Zhu et al. (2018) proposed a model averaging estimator based on Mallows' criterion for partial functional linear models, whose response is a scalar and whose predictors are a random vector together with some functional variables. Zhang et al. (2018) proposed a jackknife model averaging method for fully functional linear models, whose response and predictor are both functional processes. For the generalized functional linear model, designed for the case where the scalar response depends nonlinearly on functional explanatory variables, model averaging is a good alternative to model selection, which may suffer from instability in variable selection or coefficient estimation caused by the randomness of the data collection.
In this article, we consider model averaging methods for GFLM to capture the nonlinear characteristics hidden in the data and to reduce the prediction errors and risks. The contributions of this article are threefold: We first adopt FPCA to reduce the dimensions as it provides a parsimonious representation of functional data, and then present a novel model averaging procedure based on leave-one-out cross-validation criterion (CV). Second, we prove the consistency of parameter estimator under the misspecified model with some mild conditions. The dimension of the parameter can be divergent. Third, we establish the asymptotic optimality of our method in the squared loss sense for generalized linear model with a diverging number of parameters. Our work relaxes the condition that the expectations of estimators need to exist.
The rest of the article is organized as follows. In Section 2, we introduce our proposed model averaging method for GFLM. We then establish the asymptotic property of the proposed method in Section 3. Simulation studies and a real data example of second-hand house price in Beijing are presented in Section 4. Section 5 concludes. Proofs of theoretical results are provided in Appendix A and Appendix B.

2. Model Averaging for Generalized Functional Linear Model

2.1. The Generalized Functional Linear Model

The data collected for the $i$th subject or experimental unit are $\{X_i(t), t \in T; y_i\}$, $i = 1, \dots, n$. We assume these data are generated independently. The predictor variable $\{X(t), t \in T\}$ is a random curve corresponding to a square integrable stochastic process on a real interval $T$. The response variable $y$ is a real-valued random variable that may be continuous or discrete. For example, in a binary regression, one would have $y \in \{0, 1\}$.
Suppose that the given link function $g(\cdot)$ is a strictly monotone and twice continuously differentiable function with bounded derivatives and is thus invertible. This assumption is common in generalized linear models; see, for example, Chen et al. (1999), Müller and Stadtmüller (2005), and Ando and Li (2017). Moreover, we assume a variance function $\sigma^2(\cdot)$, defined on the range of the link function, that is strictly positive and bounded above. The generalized functional linear model, or functional quasi-likelihood model, is determined by a parameter function $\beta(\cdot)$, which is square integrable on its domain $T$, in addition to the link function $g(\cdot)$ and the variance function $\sigma^2(\cdot)$.
Given a real measure $d\omega$ on $T$, we define linear predictors
$$\eta_i = \alpha + \int_T \beta(t) X_i(t)\, d\omega(t), \quad i = 1, \dots, n,$$
and conditional means $\mu_i = g(\eta_i)$, where $E[y_i \mid X_i(t), t \in T] = \mu_i$ and $\mathrm{Var}[y_i \mid X_i(t), t \in T] = \sigma^2(\mu_i) = \tilde\sigma^2(\eta_i)$ with the function $\tilde\sigma^2(\eta_i) = \sigma^2(g(\eta_i))$. In a generalized functional linear model, the distribution of $y_i$ would be specified within the exponential family. Thus, we consider a functional quasi-likelihood model
$$y_i = g\Big( \alpha + \int_T \beta(t) X_i(t)\, d\omega(t) \Big) + e_i, \quad i = 1, \dots, n,$$
where $E[e_i \mid X_i(t), t \in T] = 0$ and $\mathrm{Var}[e_i \mid X_i(t), t \in T] = \sigma^2(\mu_i) = \tilde\sigma^2(\eta_i)$. Note that $\alpha$ is a constant, and the inclusion of an intercept allows us to require $E X_i(t) = 0$ for all $t$. We assume the errors $e_i$ are independent with the same variance. It is easy to obtain $E(e_i) = 0$ and
$$\mathrm{Var}(e_i) = \mathrm{Var}\{ E(e_i \mid X_i(t), t \in T) \} + E\{ \mathrm{Var}(e_i \mid X_i(t), t \in T) \} = E\, \tilde\sigma^2(\eta_i) = \sigma^2.$$
Following Müller and Stadtmüller (2005), we choose an orthonormal basis $\{\rho_j\}$, $j = 1, 2, \dots$, of the function space $L^2(d\omega)$, that is, $\int_T \rho_j(t)\rho_k(t)\, d\omega(t) = \delta_{jk}$, where $\delta_{jk} = 0$ for $j \neq k$ and $\delta_{jk} = 1$ for $j = k$. Then, we can expand the predictor process $X(t)$ and the parameter function $\beta(t)$ as
$$X(t) = \sum_{j=1}^{\infty} \varepsilon_j \rho_j(t)$$
and
$$\beta(t) = \sum_{j=1}^{\infty} \beta_j \rho_j(t)$$
in the $L^2(d\omega)$ sense, with random variables $\varepsilon_j$ and coefficients $\beta_j$ given by $\varepsilon_j = \int X(t)\rho_j(t)\, d\omega(t)$ and $\beta_j = \int \beta(t)\rho_j(t)\, d\omega(t)$, respectively. By the previous assumptions that $X(t)$ and $\beta(t)$ are square integrable, we have $\sum_{j=1}^{\infty} \beta_j^2 < \infty$ and $\sum_{j=1}^{\infty} E\varepsilon_j^2 < \infty$.
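As an illustration of these expansions, the coefficients $\varepsilon_j$ can be recovered numerically by integrating $X(t)$ against each basis function. The following sketch is ours, not the authors' code; the grid size, basis count, and choice of an orthonormal Fourier basis on $[0,1]$ with $d\omega$ the Lebesgue measure are all illustrative assumptions:

```python
import numpy as np

# Sketch: expand a signal in an orthonormal Fourier basis on [0, 1] and
# recover its coefficients by numerical integration, mirroring
# eps_j = ∫ X(t) rho_j(t) dω(t) with dω the Lebesgue measure.
t = np.linspace(0.0, 1.0, 2001)

def rho(j, t):
    # Orthonormal Fourier basis: rho_1 = 1, then sqrt(2)*sin/cos pairs.
    if j == 1:
        return np.ones_like(t)
    k = j // 2
    trig = np.sin if j % 2 == 0 else np.cos
    return np.sqrt(2.0) * trig(2.0 * np.pi * k * t)

true_coef = np.array([1.0, 0.5, -0.3, 0.2, 0.0, 0.1, -0.05])
X = sum(c * rho(j + 1, t) for j, c in enumerate(true_coef))

# Recover eps_j = ∫ X(t) rho_j(t) dt via the trapezoidal rule.
recovered = np.array([np.trapz(X * rho(j + 1, t), t)
                      for j in range(len(true_coef))])
print(np.max(np.abs(recovered - true_coef)))  # prints a value near zero
```

Orthonormality of the basis is what makes each coefficient recoverable by a single inner product, independently of the others.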
From the orthonormality of the basis functions $\rho_j$ and setting
$$\varepsilon_{i,j} = \int X_i(t)\rho_j(t)\, d\omega(t),$$
it follows immediately that
$$\eta_i = \alpha + \int \beta(t) X_i(t)\, d\omega(t) = \alpha + \sum_{j=1}^{\infty} \beta_j \varepsilon_{i,j}.$$
It will be convenient to work with standardized errors
$$e_i' = e_i / \sigma(\mu_i) = e_i / \tilde\sigma(\eta_i),$$
for which $E(e_i' \mid X_i(t)) = 0$, $E(e_i') = 0$, and $E(e_i'^2) = 1$. Then, it will be sufficient to consider the following model,
$$y_i = g\Big( \alpha + \sum_{j=1}^{\infty} \beta_j \varepsilon_{i,j} \Big) + e_i'\, \tilde\sigma\Big( \alpha + \sum_{j=1}^{\infty} \beta_j \varepsilon_{i,j} \Big), \quad i = 1, \dots, n, \tag{2}$$
where the function g ( · ) is known.
The number of parameters in model (2) is infinite. We address the difficulty caused by the infinite dimensionality of the predictors by approximating model (2) with a series of models in which the number of predictors is truncated at $p = p_n$, where the dimension $p_n$ can be as large as desired subject to $p_n < n$. A heuristic truncation strategy is as follows. For the $i$th sample, the $p$-truncated linear predictor $\eta_{i,p}$ is
$$\eta_{i,p} = \alpha + \sum_{j=1}^{p} \beta_j \varepsilon_{i,j}.$$
The approximating model we use is
$$y_i = g\Big( \alpha + \sum_{j=1}^{p} \beta_j \varepsilon_{i,j} \Big) + e_i'\, \tilde\sigma\Big( \alpha + \sum_{j=1}^{p} \beta_j \varepsilon_{i,j} \Big), \quad i = 1, \dots, n.$$
Now, we consider estimation for the generalized functional linear model. First, we use FPCA to obtain a set of orthogonal eigenfunctions as the basis functions of the space $L^2(d\omega)$. Then, we consider a series of $M$ candidate models. For the $m$th candidate model, we adopt the first $p_m$ functional principal components to build the approximating model,
$$y_i = g\Big( \alpha^{(m)} + \sum_{j=1}^{p_m} \beta_j^{(m)} \varepsilon_{i,j} \Big) + e_i'\, \tilde\sigma\Big( \alpha^{(m)} + \sum_{j=1}^{p_m} \beta_j^{(m)} \varepsilon_{i,j} \Big), \quad i = 1, \dots, n.$$
We assume that $p_1 < p_2 < \cdots < p_M$; that is, the candidate models are nested. Denote $\varepsilon_{i,0} = 1$ and $\beta_0^{(m)} = \alpha^{(m)}$; then we estimate the unknown parameter vector $\boldsymbol\beta^{(m)} = (\beta_0^{(m)}, \beta_1^{(m)}, \dots, \beta_{p_m}^{(m)})^T$ by solving the following estimating or score equation
$$U_{n,m}(\boldsymbol\beta^{(m)}) = \frac{1}{n} \sum_{i=1}^{n} \frac{\big[ y_i - g(\eta_{i,p_m}) \big]\, g'(\eta_{i,p_m})}{\sigma^2(\mu_{i,p_m})}\, \varepsilon_{(i,p_m)} = 0,$$
where $\eta_{i,p_m} = \sum_{j=0}^{p_m} \beta_j^{(m)} \varepsilon_{i,j}$ and $\varepsilon_{(i,p_m)} = (\varepsilon_{i,0}, \dots, \varepsilon_{i,p_m})^T$. Let $\hat{\boldsymbol\beta}^{(m)}$ be the solution of the score equation $U_{n,m}(\boldsymbol\beta^{(m)}) = 0$, i.e.,
$$U_{n,m}(\hat{\boldsymbol\beta}^{(m)}) = \frac{1}{n} \sum_{i=1}^{n} \frac{\big[ y_i - g(\hat\eta_{i,p_m}) \big]\, g'(\hat\eta_{i,p_m})}{\sigma^2\big( g(\hat\eta_{i,p_m}) \big)}\, \varepsilon_{(i,p_m)} = 0. \tag{4}$$
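For a concrete instance of the score equation (4), note that under the logistic link with binomial variance $\sigma^2(\mu) = \mu(1-\mu)$, the quasi-score reduces to the ordinary logistic likelihood score, which can be solved by Fisher scoring (iteratively reweighted least squares). The sketch below is our own illustration, not the authors' code; the FPCA scores are replaced by synthetic Gaussian scores, and the function name is ours:

```python
import numpy as np

# Illustrative sketch (not the authors' code): for the logistic link with
# binomial variance, the quasi-score (4) is the logistic likelihood score,
# solved here by Fisher scoring / iteratively reweighted least squares.
# Z plays the role of (eps_{i,0}, ..., eps_{i,p_m}) with eps_{i,0} = 1.
def fit_logistic_score(Z, y, n_iter=50, tol=1e-10):
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(Z @ beta)))     # mu_i = g(eta_i)
        W = mu * (1.0 - mu)                        # binomial variance sigma^2(mu)
        # Fisher scoring step: beta <- beta + (Z' W Z)^{-1} Z'(y - mu)
        step = np.linalg.solve(Z.T @ (W[:, None] * Z), Z.T @ (y - mu))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(0)
n = 500
scores = rng.normal(size=(n, 3))                   # stand-in for FPCA scores
Z = np.column_stack([np.ones(n), scores])          # prepend the intercept column
beta_true = np.array([0.2, 1.0, -0.5, 0.3])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(Z @ beta_true))))
beta_hat = fit_logistic_score(Z, y)                # close to beta_true for large n
```

For other links and variance functions, the same Newton-type iteration applies with the corresponding $g'$ and $\sigma^2$ plugged into the score.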

2.2. Model Averaging Estimation

For each candidate model, we obtain the estimator of the unknown parameter vector by (4). Let
$$w \in H_n = \Big\{ w \in [0,1]^M : \sum_{m=1}^{M} w_m = 1 \Big\};$$
then we obtain the model averaging estimator of $\eta_i$:
$$\hat\eta_i(w) = \sum_{m=1}^{M} w_m \hat\eta_{i,p_m},$$
where $\hat\eta_{i,p_m} = \sum_{j=0}^{p_m} \hat\beta_j^{(m)} \varepsilon_{i,j}$. Thus, a model averaging estimator of the conditional mean $\mu_i$ is given by
$$\hat\mu_i(w) = g\Big( \sum_{m=1}^{M} w_m \hat\eta_{i,p_m} \Big). \tag{7}$$
Let $\tilde{\boldsymbol\beta}_j^{(m)}$ be the estimator of $\boldsymbol\beta^{(m)}$ from (4) without the $j$th observation, that is, the solution of
$$U_{n,m,-j}(\boldsymbol\beta^{(m)}) = \frac{1}{n-1} \sum_{i=1, i \neq j}^{n} \frac{\big[ y_i - g(\eta_{i,p_m}) \big]\, g'(\eta_{i,p_m})}{\sigma^2(\mu_{i,p_m})}\, \varepsilon_{(i,p_m)} = 0.$$
For observation $j$, the leave-one-out truncated linear estimator of $\eta_j$ under the $m$th model is
$$\tilde\eta_{j,p_m} = \tilde{\boldsymbol\beta}_j^{(m)T} \varepsilon_{(j,p_m)},$$
and the leave-one-out model averaging estimator of $\mu_j$ is
$$\tilde\mu_j(w) = g\Big( \sum_{m=1}^{M} w_m \tilde\eta_{j,p_m} \Big).$$
Thus, we propose the following leave-one-out criterion for choosing the weights in the model averaging estimator (7):
$$CV(w) = \sum_{i=1}^{n} \big\{ y_i - \tilde\mu_i(w) \big\}^2 = \sum_{i=1}^{n} \Big\{ y_i - g\Big( \sum_{m=1}^{M} w_m \tilde\eta_{i,p_m} \Big) \Big\}^2.$$
Let
$$\hat w = \arg\min_{w \in H_n} CV(w)$$
be the weight vector chosen by the $CV(w)$ criterion. Then, plugging $\hat w$ into (7), we obtain the final model averaging estimator $\hat\mu_i(\hat w)$, $i = 1, 2, \dots, n$.
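The weight search itself is a low-dimensional constrained optimization over the simplex $H_n$. The sketch below uses SciPy's SLSQP solver as an analogue of the Matlab `fmincon` call mentioned in Section 4; the leave-one-out predictors are synthetic stand-ins rather than fitted values, and all names are ours:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of the CV(w) weight search (our analogue of the Matlab 'fmincon'
# call mentioned in Section 4): given leave-one-out linear predictors
# eta_tilde[i, m] for M candidate models, minimize
#   CV(w) = sum_i ( y_i - g(sum_m w_m * eta_tilde[i, m]) )^2
# over the simplex H_n = {w in [0,1]^M : sum_m w_m = 1}.
# The predictors below are synthetic stand-ins, not fitted values.
def g(eta):
    return 1.0 / (1.0 + np.exp(-eta))              # logistic link

rng = np.random.default_rng(1)
n, M = 200, 4
eta_tilde = rng.normal(size=(n, M))
y = rng.binomial(1, g(eta_tilde[:, 0]))            # model 1 closest to the truth

def cv(w):
    return np.sum((y - g(eta_tilde @ w)) ** 2)

w0 = np.full(M, 1.0 / M)                           # start from uniform weights
res = minimize(cv, w0, method="SLSQP",
               bounds=[(0.0, 1.0)] * M,
               constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
w_hat = res.x                                      # lies on the simplex
```

Because $g$ is nonlinear, $CV(w)$ is not a quadratic form in $w$, so a general nonlinear programming routine is used rather than the quadratic programming solvers common in Mallows-type model averaging.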

3. Asymptotic Property for Model Averaging Estimator

In this section, we establish the optimality property of cross-validation model averaging for the generalized functional linear model. We allow the dimension of each candidate model to diverge as $n$ tends to $\infty$.

Notations and Conditions

We denote the first and second derivatives of the function $g(\cdot)$ by $g'(\cdot)$ and $g''(\cdot)$, respectively; the diagonal matrix $A$ with diagonal elements $a_1, a_2, \dots, a_q$ by $A = \mathrm{diag}(a_1, a_2, \dots, a_q)$; and the minimum singular value of a matrix $A$ by $\lambda_{\min}(A)$. Let
$$\lambda_{n,m} = \lambda_{\min}\Big( \frac{\varepsilon_n^{(m)T} \varepsilon_n^{(m)}}{n} \Big),$$
with
$$\varepsilon_n^{(m)} = \big( \varepsilon_{(1,p_m)}, \varepsilon_{(2,p_m)}, \dots, \varepsilon_{(n,p_m)} \big)^T.$$
For any $\boldsymbol\beta^{(m)} \in \mathbb{R}^{p_m+1}$ and $n \in \mathbb{N}^+$, define
$$\bar U_{n,m}(\boldsymbol\beta^{(m)}) = \frac{1}{n} \sum_{i=1}^{n} \frac{\big[ g(\boldsymbol\beta^{(m)T}\varepsilon_{(i,p_m)}) - \mu_i \big]\, g'(\boldsymbol\beta^{(m)T}\varepsilon_{(i,p_m)})}{\sigma^2\big( g(\boldsymbol\beta^{(m)T}\varepsilon_{(i,p_m)}) \big)}\, \varepsilon_{(i,p_m)}.$$
We assume $|g'(\cdot)| \le c < \infty$ and $|g''(\cdot)| \le c_1 < \infty$, and that $\sigma^2(\cdot)$ is strictly positive and bounded, $0 < d_1 \le \sigma^2(\cdot) \le d_2 < \infty$, with $|(\sigma^2)'(\cdot)| \le d_3 < \infty$.
Consider the squared loss function
$$L_n(w) = \| \mu - \hat\mu(w) \|^2,$$
where $\mu = (\mu_1, \mu_2, \dots, \mu_n)^T$ and $\hat\mu(w) = (\hat\mu_1(w), \hat\mu_2(w), \dots, \hat\mu_n(w))^T$ are $n \times 1$ vectors and $\|\cdot\|$ is the Euclidean norm. Denote
$$R_n(w) = \sum_{i=1}^{n} \Big\{ g(\eta_i) - g\Big( \sum_{m=1}^{M} w_m \varepsilon_{(i,p_m)}^T \boldsymbol\beta^{(m)*} \Big) \Big\}^2$$
and
$$\xi_n = \inf_{w \in H_n} R_n(w),$$
where $\boldsymbol\beta^{(m)*}$ is the pseudo-true parameter, which, as in Flynn et al. (2013) and Lv and Liu (2014), is defined as the solution of the score equation
$$\bar U_{n,m}(\boldsymbol\beta^{(m)*}) = 0,$$
and is a theoretical target under the $m$th (misspecified) candidate model. We assume that such a solution exists and that $\|\boldsymbol\beta^{(m)*}\|^2 / (p_m+1) \le C_b < \infty$. $\xi_n$ represents the minimal bias between the true model and the final model obtained by model averaging, and is an alternative to the risk based on $L_n(w)$. In this work, we do not require the expectation of $L_n(w)$ to exist, which is more relaxed than the common requirement in jackknife model averaging methods for generalized linear models; see, for example, Zhang et al. (2016) and Ando and Li (2017). In the following, we assume that $X_i(t)$, $i = 1, 2, \dots, n$, are non-random with $\sup_i |\eta_i| \le C_\eta < \infty$.
Condition 1.
For some compact set $\Theta_m$ in $\mathbb{R}^{p_m+1}$,
$$\lim_{n \to +\infty} P\big( 0 \in U_{n,m}(\Theta_m) \big) = 1$$
holds.
Condition 2.
(i) $\{e_i\}$, $i = 1, \dots, n$, are mutually independent.
(ii) $E e_i = 0$.
(iii) $C_1 = \sup_i E e_i^2 < \infty$.
Condition 3.
$\sup_i \| \varepsilon_{(i,p_m)} \|^2 / (p_m+1) \le C_2 < \infty$.
Condition 4.
$n p^2 / \xi_n \to 0$ with $p = \max_m p_m$, and $p^4 / n = o(1)$.
Condition 5.
$\sum_{i=1}^{n} \big( \tilde\eta_{i,p_m} - \hat\eta_{i,p_m} \big)^2 = O_p(p_m^4)$.
Condition 6.
$\lambda_{\min}\big( \partial \bar U_{n,m}(\boldsymbol\beta^{(m)}) / \partial \boldsymbol\beta^{(m)} \big) \ge C_0 > 0$.
Condition 1 is a requirement for the generalized model to guarantee the existence of solutions to (4). In general, the existence and consistency of the roots obtained by solving (4) have to be checked, so we list Condition 1. A similar condition can be found in Balan and Schiopu-Kratina (2005). In the special case where the link function is $g(x) = x$, the solution of (4) is a generalized least squares estimator of $\boldsymbol\beta^{(m)}$ and Condition 1 is easy to satisfy.
Condition 2 is common for generalized linear models; see, for instance, Chen et al. (1999) and Ando and Li (2017). The least squares estimator for linear regression models is strongly consistent under Condition 2. This condition is less restrictive than (A1) of Ando and Li (2017) for proving the optimality of the weight selection procedure.
Condition 3 is similar to (2.3) of Theorem 1 in Chen et al. (1999) and is due to the nonlinearity. A counterexample is given in Chen et al. (1999) to show that $\hat{\boldsymbol\beta}^{(m)}$ may not be consistent when this condition is dropped.
Condition 4 means that the speed of $\xi_n$ tending to $\infty$ should be faster than that of $n p^2$. This condition also implies that the true model is not in the candidate model set, which is a condition commonly used for optimal model averaging. It is easy to satisfy when the true model is infinite dimensional. This condition is an alternative to Condition C.3 of Zhang et al. (2016) and (A3) of Ando and Li (2017).
Condition 5 implies $n^{-1} \sum_{i=1}^{n} ( \tilde\eta_{i,p_m} - \hat\eta_{i,p_m} )^2 = o_p(1)$ when $p_m^4 / n = o(1)$. By Lemma A3 in Appendix A and Condition 3, we have
$$\sum_{i=1}^{n} \big( \hat\eta_{i,p_m} - \varepsilon_{(i,p_m)}^T \boldsymbol\beta^{(m)*} \big)^2 \le \sum_{i=1}^{n} \| \varepsilon_{(i,p_m)} \|^2\, \| \hat{\boldsymbol\beta}^{(m)} - \boldsymbol\beta^{(m)*} \|^2 = O_p(p_m^4).$$
Then, with the following standard condition for the application of cross-validation,
$$\frac{\sum_{i=1}^{n} \big( \tilde\eta_{i,p_m} - \varepsilon_{(i,p_m)}^T \boldsymbol\beta^{(m)*} \big)^2}{\sum_{i=1}^{n} \big( \hat\eta_{i,p_m} - \varepsilon_{(i,p_m)}^T \boldsymbol\beta^{(m)*} \big)^2} - 1 = o_p(1),$$
which says that as $n$ gets large, the difference between the ordinary and leave-one-out estimators of $\eta_i$ under the $m$th candidate model becomes small, it can be seen that
$$\sum_{i=1}^{n} \big( \tilde\eta_{i,p_m} - \hat\eta_{i,p_m} \big)^2 \le 2 \sum_{i=1}^{n} \big( \tilde\eta_{i,p_m} - \varepsilon_{(i,p_m)}^T \boldsymbol\beta^{(m)*} \big)^2 + 2 \sum_{i=1}^{n} \big( \hat\eta_{i,p_m} - \varepsilon_{(i,p_m)}^T \boldsymbol\beta^{(m)*} \big)^2 = O_p(p_m^4),$$
which means Condition 5 is reasonable. For one-parameter natural exponential family models, Ando and Li (2017) showed, under some regularity conditions, that $\sum_{i=1}^{n} ( \tilde\eta_{i,p_m} - \hat\eta_{i,p_m} )^2 = O_p(p_m^2/n)$, satisfying our Condition 5. For linear models where $g(x) = x$ and $\sigma^2(\cdot) = 1$, $\sum_{i=1}^{n} ( \tilde\eta_{i,p_m} - \hat\eta_{i,p_m} )^2 = O_p(p_m^2/n)$ under the assumption that $\varepsilon_{(i,p_m)}^T \big( \varepsilon_n^{(m)T} \varepsilon_n^{(m)} \big)^{-1} \varepsilon_{(i,p_m)} \le c\, p_m / n$ for some constant $c < \infty$, which is commonly used to ensure the asymptotic optimality of cross-validation. See, for example, Condition (5.2) of Li (1987), Condition (5.2) of Andrews (1991), Condition (A.9) of Hansen and Racine (2012), Condition (C.2) of Zhang (2015), and Condition (C.3) of Zhao et al. (2018). In general, our Condition 5 is more relaxed than those in the literature for complex candidate models.
Condition 6 ensures that the pseudo-true parameter $\boldsymbol\beta^{(m)*}$ is unique. The consistency of the estimator of $\boldsymbol\beta^{(m)*}$ can also be derived from this condition; see Lemma A3 in Appendix A. In addition, the one-parameter natural exponential family considered in Theorem 1 of Ando and Li (2017) is an example with
$$\lambda_{\min}\Big( \frac{\partial}{\partial \boldsymbol\beta^{(m)}} \bar U_{n,m}(\boldsymbol\beta^{(m)}) \Big) = \lambda_{\min}\Big( \frac{1}{n}\, \varepsilon_n^{(m)T}\, \Gamma(\boldsymbol\beta^{(m)})\, \varepsilon_n^{(m)} \Big),$$
where
$$\Gamma(\boldsymbol\beta^{(m)}) = \mathrm{diag}\big( g'(\varepsilon_{(1,p_m)}^T \boldsymbol\beta^{(m)}), g'(\varepsilon_{(2,p_m)}^T \boldsymbol\beta^{(m)}), \dots, g'(\varepsilon_{(n,p_m)}^T \boldsymbol\beta^{(m)}) \big).$$
By the commonly used assumption that $\lambda_{n,m} \ge c_0 > 0$ for some constant $c_0$, together with assumption (4.3) in Ando and Li (2017), this example satisfies Condition 6.
Theorem 1.
Assume that Conditions 1–6 hold. Then $\hat w$ is asymptotically optimal in the sense that
$$\frac{L_n(\hat w)}{\inf_{w \in H_n} L_n(w)} \stackrel{p}{\longrightarrow} 1,$$
where $\stackrel{p}{\longrightarrow}$ denotes convergence in probability.
Proof. 
See the Appendix B.  □
Remark 1.
When the dimensions of the candidate models are fixed, Condition 4 can be relaxed to $n / \xi_n^2 \to 0$.
Remark 2.
It is easy to see that if we do not require the weights to sum to one, then we can use $M$ instead of 1 as the upper bound of $\sum_{m=1}^{M} w_m^2$ in our proof. Thus, all the proofs remain valid for fixed $M$. This implies that Theorem 1 remains true if we remove the constraint that the weights sum to one. In addition, as the candidate models are not necessarily nested in the proof, the theorem still holds when the candidate models are non-nested.

4. Numerical Examples

4.1. Simulation I: Fixed Number of Candidate Models

In this section, we conduct simulation experiments to compare the finite sample performance of our model averaging method with some commonly used model selection and model averaging methods. For model selection, we consider three methods: AIC, BIC, and FPCA. FPCA is an efficient and common method in functional data analysis, which determines the final model by the cumulative contributions of the functional principal components. For model averaging, we consider S-AIC (smoothed AIC), S-BIC (smoothed BIC), and our cross-validation model averaging method, denoted CV1 when we restrict the weights to sum to 1 as before, and CV2 when no constraint on the sum of weights is imposed.
The data generating process is as follows. The predictor variable is
$$X_i(t) = \sum_{j=1}^{J} \varepsilon_{i,j} \rho_j(t),$$
and the parameter function is
$$\beta(t) = \sum_{j=1}^{J} \beta_j \rho_j(t),$$
where $\rho_j(t)$, $t \in [0,1]$, $j \ge 1$, are basis functions and $J$ is the number of basis functions. Here, we use B-spline and Fourier bases. For the B-spline basis, we choose the order of the basis functions to be 2 and the number of basis functions to be 20. For the Fourier basis, we choose the number of basis functions as 21, with the first basis function a constant.
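Such an order-2 B-spline basis can be constructed, for example, with SciPy. The sketch below is ours; the knot placement (equally spaced breakpoints, clamped ends) is our assumption, since the paper does not specify it:

```python
import numpy as np
from scipy.interpolate import BSpline

# Sketch: an order-2 (piecewise linear) B-spline basis with 20 functions on
# [0, 1], as in the B-spline designs.  Knot placement (equally spaced,
# clamped ends) is our assumption; the paper does not specify it.
order, n_basis = 2, 20
degree = order - 1
# Clamped knot vector: extra endpoint copies plus equally spaced breakpoints.
breakpoints = np.linspace(0.0, 1.0, n_basis - degree + 1)
knots = np.concatenate([np.zeros(degree), breakpoints, np.ones(degree)])
t = np.linspace(0.0, 1.0, 501)
eye = np.eye(n_basis)
basis = np.column_stack([BSpline(knots, eye[j], degree)(t)
                         for j in range(n_basis)])  # shape (501, 20)

# A random curve X_i(t) = sum_j eps_{ij} rho_j(t) is then scores @ basis.T.
```

Note that B-spline bases are nonnegative and form a partition of unity, but they are not orthonormal; in the simulations the FPCA step subsequently supplies the orthogonal basis used by the candidate models.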
In our simulation, the following four cases are considered.
Case 1
For $1 \le j \le 10$, $\beta_j$ are generated from the standard normal distribution $N(0,1)$; for $10 < j \le 20$, $\beta_j = 0$. The basis functions $\rho_j(t)$, $t \in [0,1]$, $1 \le j \le 20$, are B-spline functions with the parameters mentioned above.
Case 2
For $1 \le j \le 20$, $\beta_j = j^{-2}$. The basis functions $\rho_j(t)$, $t \in [0,1]$, $1 \le j \le 20$, are B-spline functions with the parameters mentioned above.
Case 3
For $1 \le j \le 11$, $\beta_j$ are generated from the standard normal distribution $N(0,1)$; for $11 < j \le 21$, $\beta_j = 0$. The basis functions $\rho_j(t)$, $t \in [0,1]$, $1 \le j \le 21$, are Fourier functions with the parameters mentioned above.
Case 4
For $1 \le j \le 21$, $\beta_j = j^{-2}$. The basis functions $\rho_j(t)$, $t \in [0,1]$, $1 \le j \le 21$, are Fourier functions with the parameters mentioned above.
We set $\varepsilon_{i,j}$ to be independently generated from $N(0, R^2/j^2)$, where $R = 1, 2, \dots, 10$. The response variable $y_i$ is generated from the binomial distribution $\mathrm{Binomial}(p(X_i(t)), 1)$ with probability $p(X_i(t)) = g\big( \int_0^1 X_i(t)\beta(t)\, dt \big)$. We consider three types of link function $g(\cdot)$: the logistic link $\exp(\cdot)/(1+\exp(\cdot))$, the probit link, and the Poisson link. For the Poisson model, we only consider simulations with $R = 1$ for Cases 1–4.
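Because the basis is orthonormal, $\int_0^1 X_i(t)\beta(t)\,dt = \sum_j \varepsilon_{i,j}\beta_j$, so the responses can be generated directly from the scores. Below is a hedged sketch of the logistic-link design in the spirit of Case 3; the seed, sample size, and value of $R$ are our own choices:

```python
import numpy as np

# Hedged sketch of the response generation (Case 3 flavor, logistic link):
# with an orthonormal basis, ∫ X_i(t) beta(t) dt = sum_j eps_{ij} beta_j,
# so y_i can be simulated from the scores alone.  Seed, n, and R are ours.
rng = np.random.default_rng(2023)
n, J, R = 200, 21, 2
beta = np.concatenate([rng.standard_normal(11), np.zeros(J - 11)])  # Case 3
sd = R / np.arange(1, J + 1)                 # eps_{ij} ~ N(0, R^2 / j^2)
eps = rng.normal(scale=sd, size=(n, J))
eta = eps @ beta                             # eta_i = ∫ X_i(t) beta(t) dt
prob = 1.0 / (1.0 + np.exp(-eta))            # logistic link
y = rng.binomial(1, prob)                    # Binomial(p(X_i), 1) responses
```

The decaying score variances $R^2/j^2$ mimic the decreasing eigenvalues of a functional principal component decomposition, so later components carry progressively less signal.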
In the simulation, we use FPCA to obtain the nested candidate models. Each candidate model contains the first $p_m$ principal components. The number of candidate models is 18 for Cases 1–2 and 19 for Cases 3–4. We then adopt the iteratively reweighted least squares algorithm, a common approach for generalized linear models, to obtain the estimates for each model. For the weights, we use the 'fmincon' function in Matlab to solve the CV criterion.
The sample size is set as $n = 60, 200, 500$. We use 80% of the data as the training data $(Y_1, X_1)$ with size $n_1$, and the remaining data as the test data $(Y_2, X_2)$ with size $n_2$. Then, we compare the prediction errors. We calculate the prediction accuracy ($\|\hat Y_2 - Y_2\|^2/n_2$), fitting accuracy ($\|\hat Y_1 - Y_1\|^2/n_1$), predictor coefficient prediction accuracy ($\|\hat\eta^{(2)} - \eta^{(2)}\|^2/n_2$), and predictor coefficient fitting accuracy ($\|\hat\eta^{(1)} - \eta^{(1)}\|^2/n_1$). We repeat this process 1000 times and then obtain the mean, median, and variance of these prediction errors for each method. To save space, we present only the results on prediction accuracy; the results on the other types of accuracy are available from the authors upon request. We also report only the results for the logistic link function due to space limitations; the results for the other link functions are likewise available from the authors.
For Case 1, the prediction errors are summarized in Table A1, Table A2 and Table A3. From Table A1, it is seen that as R varies from 1 to 10, the prediction errors decrease, because the difference in probability between the two groups (one group whose response is 1 and the other whose response is 0) becomes larger. Our methods (CV1 and CV2 in the tables) always attain the minimum error means (Mean in the tables), medians (Median), and variances (Var). However, there is no clear tendency between CV1 and CV2, which perform similarly in most situations. When R is small, BIC is always better than AIC, and S-BIC is always better than S-AIC. This may be because fewer parameters are useful for smaller R values, in which case a bigger penalty on the number of parameters in the model is preferred. Moreover, when the candidate models differ significantly, AIC or BIC performs similarly to S-AIC or S-BIC, respectively. As R becomes larger, the difference between AIC and BIC, or between S-AIC and S-BIC, becomes smaller. FPCA is always superior to AIC, BIC, S-AIC, and S-BIC, and their differences become larger as R increases. Turning to Table A2 and Table A3, with the sample size n increasing from 60 to 200 and 500, the prediction errors decrease for each fixed R. The median and variance of the prediction errors also become smaller. AIC and BIC behave increasingly similarly. CV1 and CV2 remain the best among all the methods, followed by FPCA.
For Case 2, the prediction errors are given in Table A4, Table A5 and Table A6. As before, CV1 and CV2 perform the best, followed by FPCA. Likewise, S-AIC and S-BIC are better than AIC and BIC, respectively. In Table A4, with R varying from 1 to 10, the prediction errors decrease except for the FPCA method, which reaches its minimum at R = 7 with a small fluctuation. CV1 and CV2 perform equally well for different R values and sample sizes. The difference between AIC and BIC becomes small as the sample size increases; a similar phenomenon is observed for S-AIC and S-BIC.
For Case 3, the prediction errors are provided in Table A7, Table A8 and Table A9. For n = 60 (Table A7), CV1 or CV2 is the best when R is between 1 and 5. However, when R is between 6 and 10, the two model selection methods, AIC and BIC, are the best. Similar conclusions can be drawn from Table A8 with n = 200 and Table A9 with n = 500, although in the latter case CV1 actually performs the best for all R values. The error rates of all methods become smaller as R increases from 1 to 6 and then larger as R varies from 7 to 10.
For Case 4, the prediction errors are presented in Table A10, Table A11 and Table A12. For n = 60 in Table A10, CV1, CV2, and BIC are the best, followed by AIC. In this design, S-AIC and S-BIC are not better than AIC and BIC. For n = 200 in Table A11, BIC is the best, followed by AIC. For n = 500 in Table A12, CV1 always performs the best, followed by BIC.
In summary, for out-of-sample prediction, our methods CV1 and CV2 perform the best in most cases and have smaller variances and medians of errors. Furthermore, CV1 and CV2 often perform equally well, which indicates that removing the restriction on the sum of weights may not lead to a better model averaging estimator.

4.2. Simulation II: Divergent Number of Candidate Models

We consider situations where the number of candidate models tends to $\infty$ as the sample size increases. We set the sample size $n$ to 200, 400, and 1000, and the number of candidate models to $9n/100$ (so $M = 18, 36$, and 90 for the three sample sizes). The data generating process is as before: the predictor variable is $X_i(t) = \sum_{j=1}^{J} \varepsilon_{i,j}\rho_j(t)$ and the parameter function is $\beta(t) = \sum_{j=1}^{J} \beta_j \rho_j(t)$, where $\rho_j(t)$ is an order-2 B-spline basis function, $t \in [0,1]$, $j \ge 1$, and $J = n/10$. For $1 \le j \le J$, $\beta_j = j^{-1/2}$. We set $\varepsilon_{i,j}$ to be independently generated from $N(0, R^2/j^{1/2})$, where $R = 1, 3, 7$. The response variable $y_i$ is generated from the binomial distribution with the logistic link.
The candidate models are nested. The algorithms used in the calculations are the same as those described in Section 4.1. We report the errors of the seven methods considered in Section 4.1. From Table A13, Table A14 and Table A15, our methods, CV1 and CV2, perform the best in most cases, followed by FPCA and S-AIC. The difference between AIC and BIC, or between S-AIC and S-BIC, decreases with increasing R.

4.3. Application: Beijing Second-Hand House Price Data

We apply our method to Beijing second-hand housing transaction price data, which were collected from the internet by the Guoxinda Group Corporation. Most of the data have passed a manual check. The data include the second-hand housing prices and the surrounding environment variables of 2318 residential areas in Beijing. The second-hand housing prices are monthly data from January 2015 to December 2017 for each residential area.
Our aim is to predict the level of increase in house prices in the next year. We are concerned with the relationship between a high price rise and the past housing price curves. We use the median of the online listing prices of houses in a residential area as the house price for that residential area, and the price curve of each residential area from January 2015 to December 2016 as the predictor variable. The response variable is a binary variable, which takes the value 1 if the rising ratio is high and 0 otherwise. Here, we define the rising ratio for each residential area as the ratio of the average monthly price in 2017 to the average monthly price in 2016. The 25%, 50%, and 75% quantiles of the ratios are 1.31, 1.37, and 1.44, respectively. We focus on the residential areas whose housing prices are rising rapidly, so if the ratio is higher than the 75% quantile over all residential areas, the response variable of that residential area takes the value 1, and 0 otherwise. Of the n = 2318 residential areas, 568 are rising fast and 1750 are not.
For simplicity, we standardize all the price data. For each group, we plot the housing price trajectories in Figure 1. A failure to visually detect differences between the groups could result from overcrowding of these plots with too many curves, but when fewer curves are displayed (lower panels of Figure 1), the same phenomenon remains. With a few exceptions, no clear visual differences between the two groups can be discerned. On the whole, the yearly trajectories from 2015 to 2016 are not very different. Therefore, the discrimination task at hand is difficult.
We randomly select 75% of all residential areas as the training set, with size 1739, and the rest as the test data, with size 579. We use the logistic link and B-spline functions to fit the house price curves; the number of basis functions is 6, and the order of the B-spline basis functions is 2. We then adopt functional principal component analysis (Yao et al. 2005) to build data-adaptive basis functions to reduce the dimension and deal with the correlations in the house price time series.
We compare the out-of-sample prediction errors of the seven methods in Section 4, repeating every method 20 times. The results are summarized in Table 1 and Table 2. It can be observed from the tables that the errors of the CV1 and CV2 methods are about 10% lower on average than those of the other methods, and overall, CV1 and CV2 behave similarly. As in the simulations above, this indicates that the constraint that the weights sum to 1 makes sense in practical cases. AIC and BIC perform equally well, as both choose the largest model in most cases. We also find that FPCA is better than AIC or BIC; FPCA always selects the smallest model because the cumulative reliability of the first principal component is about 98%. Further, it is clear that the fitting error and prediction error of FPCA are similar, whereas for the other methods, the fitting errors are always a little smaller than the prediction errors.
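The weight-selection step behind the cross-validation averaging can be sketched as follows. This is a simplified stand-in, not the paper's implementation: the leave-one-out fitted values are synthetic placeholders, and only the simplex-constrained variant is shown (the paper's CV1/CV2 distinction concerns the weight constraint).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, M = 200, 4
# mu_tilde[:, m]: leave-one-out fitted probabilities from candidate model m
# (synthetic here); model 0 is constructed to be closest to the truth.
mu_tilde = rng.uniform(0.2, 0.8, size=(n, M))
y = rng.binomial(1, mu_tilde[:, 0])

def cv(w):
    """CV criterion: squared error of the weighted combination of LOO fits."""
    return np.sum((y - mu_tilde @ w) ** 2)

# Weights restricted to the simplex: non-negative and summing to one.
cons = ({"type": "eq", "fun": lambda w: w.sum() - 1.0},)
w0 = np.full(M, 1.0 / M)
res = minimize(cv, w0, bounds=[(0.0, 1.0)] * M, constraints=cons)
w_hat = res.x
```

Dropping the equality constraint while keeping the bounds gives an unconstrained-sum variant; in the house price analysis above, the two versions behave similarly.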

5. Concluding Remarks

In this paper, we proposed a model averaging approach under the framework of the generalized functional linear model. We showed that the weight vector chosen by the leave-one-out cross-validation method is asymptotically optimal in the sense of achieving the lowest possible squared error within a class of model averaging estimators. It can be seen from the theoretical proof that our method is also valid for non-nested candidate model sets. The numerical analysis shows that, for the generalized functional linear model, cross-validation model averaging is a powerful tool for estimation and prediction. Further work includes developing model averaging inference procedures based on the generalized functional linear model. In addition, how to incorporate other covariates into the generalized functional linear model is also an interesting problem.

Author Contributions

H.Z. wrote the original draft. G.Z. reviewed and revised the whole paper. All authors have read and agreed to the published version of the manuscript.

Funding

Zou’s work was partially supported by the Ministry of Science and Technology of China (Grant No. 2016YFB0502301) and the National Natural Science Foundation of China (Grant Nos. 11971323 and 11529101).

Acknowledgments

The authors thank the two referees for their constructive comments and suggestions, which have substantially improved the original manuscript. The Beijing second-hand house price data were collected by the Guoxinda Group Corporation. This project was partially supported by the National Natural Science Foundation of China (Grant No. 71571180).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Lemmas and Proofs

The following Definition A1 and Lemma A1 can be found in Kahane (1968); Hoffmann (1974); Hoffmann and Pisier (1976); Zinn (1977); and Wu (1981).
Definition A1.
A linear map $\nu: D \to F$ is of type 2 if $\sum_{i=1}^{n}\varepsilon_i\nu(s_i)$ converges in $F$ a.s. for all sequences $\{s_i\}\subset D$ such that $\sum_{i=1}^{\infty}\|s_i\|_D^2<\infty$, where $D$ and $F$ are Banach spaces, $\{\varepsilon_i\}_{i=1}^{\infty}$ are independent random variables such that $P(\varepsilon_i=1)=P(\varepsilon_i=-1)=1/2$, and a.s. means almost surely. A Banach space $G$ is said to be of type 2 if the identity map on $G$ is of type 2.
Let $(S,d)$ be a compact metric space and $C(S)$ be the Banach space of real-valued continuous functions on $S$ with the supremum norm
$$\|\nu\| = \sup_{s\in S}|\nu(s)|,$$
for any $\nu\in C(S)$. Let $\rho$ be a $d$-continuous metric on $S$. Let $N(S,\rho,\varepsilon)$ denote the minimal number of $\rho$-balls of radius less than or equal to $\varepsilon$ which cover $S$, and set
$$H(S,\rho,\varepsilon) = \log N(S,\rho,\varepsilon).$$
We let
$$\mathrm{Lip}(\rho) = \Big\{\nu\in C(S): \Lambda_\nu = \sup_{s_1\neq s_2\in S}\frac{|\nu(s_1)-\nu(s_2)|}{\rho(s_1,s_2)} < \infty\Big\},$$
and for $\nu\in\mathrm{Lip}(\rho)$, we define
$$\|\nu\|_\rho = \Lambda_\nu + |\nu(s^*)|,$$
where $s^*$ is some fixed point in $S$. In addition, assume that $\{\nu_j: j\ge 1\}\subset\mathrm{Lip}(\rho)$ and that $\{e_j: j\ge 1\}$ are independent real-valued random variables. Then, $\{\nu_j e_j\}$ are independent $\mathrm{Lip}(\rho)$-valued random variables.
Lemma A1.
Let $(S,d)$ denote a compact metric space. Suppose that $\rho$ is a $d$-continuous metric on $S$ with
$$\int_0^{\delta} H^{1/2}(S,\rho,u)\,du < \infty \quad \text{for some } \delta > 0. \qquad (A1)$$
Then there exists $A<\infty$ such that for all $n$,
$$E\|X_1 + X_2 + \cdots + X_n\|^2 \le A\sum_{j=1}^{n} E\|X_j\|_\rho^2, \qquad (A2)$$
where $X_1, X_2, \ldots, X_n$ are independent $\mathrm{Lip}(\rho)$-valued random variables with mean zero.
Lemma A2.
For any $\beta^{(m)}\in\Theta_m$, define
$$v_i\big(\beta^{(m)}\big) = \frac{g'\big(\varepsilon(i,p_m)^T\beta^{(m)}\big)}{\sigma^2\big(g(\varepsilon(i,p_m)^T\beta^{(m)})\big)}.$$
Under Condition 3, we have
$$\sup_{\beta^{(m)}\in\Theta_m}\Big\|\sum_{i=1}^{n}\frac{v_i(\beta^{(m)})}{\sqrt{p_m+1}}\,\varepsilon(i,p_m)\,e_i\Big\| = O_p\big(\sqrt{n\,p_m}\big). \qquad (A3)$$
Proof of Lemma A2.
First note that for any $l\in\{0,\ldots,p_m\}$, we have
$$
\Lambda_{v_i\varepsilon_{i,l}/\sqrt{p_m+1}}
= \sup_{\beta_1^{(m)}\neq\beta_2^{(m)}\in\Theta_m}\frac{\big|v_i(\beta_1^{(m)})-v_i(\beta_2^{(m)})\big|\,|\varepsilon_{i,l}|}{\sqrt{p_m+1}\,\rho(\beta_1^{(m)},\beta_2^{(m)})}
= \sup_{\beta_1^{(m)}\neq\beta_2^{(m)}\in\Theta_m}\frac{\big|v_i(\beta_1^{(m)})-v_i(\beta_2^{(m)})\big|}{\big|\varepsilon(i,p_m)^T\beta_1^{(m)}-\varepsilon(i,p_m)^T\beta_2^{(m)}\big|}\cdot\frac{\big|\varepsilon(i,p_m)^T\beta_1^{(m)}-\varepsilon(i,p_m)^T\beta_2^{(m)}\big|}{\sqrt{p_m+1}\,\rho(\beta_1^{(m)},\beta_2^{(m)})}\,|\varepsilon_{i,l}|
$$
$$
= \sup_{\beta_1^{(m)}\neq\beta_2^{(m)}\in\Theta_m}\left|\frac{g''(\gamma_i)\,\sigma^2(g(\gamma_i))-[g'(\gamma_i)]^2\,(\sigma^2)'(g(\gamma_i))}{\sigma^4(g(\gamma_i))}\right|\cdot\frac{\big|\varepsilon(i,p_m)^T\beta_1^{(m)}-\varepsilon(i,p_m)^T\beta_2^{(m)}\big|}{\sqrt{p_m+1}\,\rho(\beta_1^{(m)},\beta_2^{(m)})}\,|\varepsilon_{i,l}|,
$$
where the last step is by the mean-value theorem and $\gamma_i$ is a point between $\varepsilon(i,p_m)^T\beta_1^{(m)}$ and $\varepsilon(i,p_m)^T\beta_2^{(m)}$. From the assumptions that $g(\cdot)$ is a twice continuously differentiable function with bounded derivatives $|g'(\cdot)|\le c<\infty$ and $|g''(\cdot)|\le c_1<\infty$, and that $\sigma^2(\cdot)$ is strictly positive with bounds $0<d_1\le\sigma^2(\cdot)\le d_2<\infty$ and $|(\sigma^2)'(\cdot)|\le d_3<\infty$, we see that there is a constant $c>0$ such that $|v_i'(\cdot)|\le c<\infty$, and
$$
\Lambda_{v_i\varepsilon_{i,l}/\sqrt{p_m+1}}
\le \sup_{\beta_1^{(m)}\neq\beta_2^{(m)}\in\Theta_m} c\,\frac{\big|\varepsilon(i,p_m)^T(\beta_1^{(m)}-\beta_2^{(m)})\big|}{\sqrt{p_m+1}\,\rho(\beta_1^{(m)},\beta_2^{(m)})}\,|\varepsilon_{i,l}|
\le \frac{c\,\|\varepsilon(i,p_m)\|\,|\varepsilon_{i,l}|}{\sqrt{p_m+1}},
$$
where the second inequality is by the Cauchy–Schwarz inequality. Therefore, we obtain
$$
\Big\|\frac{v_i\varepsilon_{i,l}}{\sqrt{p_m+1}}\Big\|_\rho
= \Lambda_{v_i\varepsilon_{i,l}/\sqrt{p_m+1}} + \frac{|v_i(\beta^{(m)*})\,\varepsilon_{i,l}|}{\sqrt{p_m+1}}
\le \frac{c\,\|\varepsilon(i,p_m)\|\,|\varepsilon_{i,l}|}{\sqrt{p_m+1}} + \frac{c_1\,|\varepsilon_{i,l}|}{\sqrt{p_m+1}}
\le \frac{c'\big(\|\varepsilon(i,p_m)\|+1\big)|\varepsilon_{i,l}|}{\sqrt{p_m+1}} < \infty,
$$
with $c' = \max(c, c_1)$. As $\Theta_m$ is a compact subset of $\mathbb{R}^{p_m+1}$ and $\rho(\beta_1^{(m)},\beta_2^{(m)})$ is the Euclidean metric on $\mathbb{R}^{p_m+1}$, (A1) is satisfied. Thus, by Lemma A1, there is a constant $A>0$, uniform in $l$, such that for any $C>0$, we have
$$
P\Big(\sup_{\beta^{(m)}\in\Theta_m}\Big|\sum_{i=1}^{n}\frac{v_i(\beta^{(m)})}{\sqrt{p_m+1}}\,\varepsilon_{i,l}\,e_i\Big|^2 > Cn\Big)
= P\Big(\Big\|\sum_{i=1}^{n}\frac{v_i\varepsilon_{i,l}}{\sqrt{p_m+1}}\,e_i\Big\|^2 > Cn\Big)
\le \frac{1}{Cn}\,E\Big\|\sum_{i=1}^{n}\frac{v_i\varepsilon_{i,l}}{\sqrt{p_m+1}}\,e_i\Big\|^2
\le \frac{A}{Cn}\sum_{i=1}^{n}\Big\|\frac{v_i\varepsilon_{i,l}}{\sqrt{p_m+1}}\Big\|_\rho^2\,\sup_i Ee_i^2. \qquad (A4)
$$
Notice that
$$
\sup_{\beta^{(m)}\in\Theta_m}\Big\|\sum_{i=1}^{n}\frac{v_i(\beta^{(m)})}{\sqrt{p_m+1}}\,\varepsilon(i,p_m)\,e_i\Big\|^2
\le \sum_{l=0}^{p_m}\sup_{\beta^{(m)}\in\Theta_m}\Big|\sum_{i=1}^{n}\frac{v_i(\beta^{(m)})}{\sqrt{p_m+1}}\,\varepsilon_{i,l}\,e_i\Big|^2.
$$
Therefore, for any $\epsilon>0$, letting $C = Ac^2(C_2+1)^2C_2C_1/\epsilon$, we obtain
$$
P\Big(\sup_{\beta^{(m)}\in\Theta_m}\Big\|\sum_{i=1}^{n}\frac{v_i(\beta^{(m)})}{\sqrt{p_m+1}}\,\varepsilon(i,p_m)\,e_i\Big\|^2 > Cn(p_m+1)\Big)
\le P\Big(\sum_{l=0}^{p_m}\sup_{\beta^{(m)}\in\Theta_m}\Big|\sum_{i=1}^{n}\frac{v_i(\beta^{(m)})}{\sqrt{p_m+1}}\,\varepsilon_{i,l}\,e_i\Big|^2 > Cn(p_m+1)\Big)
\le \sum_{l=0}^{p_m}P\Big(\sup_{\beta^{(m)}\in\Theta_m}\Big|\sum_{i=1}^{n}\frac{v_i(\beta^{(m)})}{\sqrt{p_m+1}}\,\varepsilon_{i,l}\,e_i\Big|^2 > Cn\Big)
$$
$$
\le \sum_{l=0}^{p_m}\frac{A}{Cn}\sum_{i=1}^{n}\Big\|\frac{v_i\varepsilon_{i,l}}{\sqrt{p_m+1}}\Big\|_\rho^2\,\sup_i Ee_i^2
\le \frac{Ac^2}{Cn(p_m+1)}\sum_{i=1}^{n}\big(\|\varepsilon(i,p_m)\|+1\big)^2\|\varepsilon(i,p_m)\|^2\,\sup_i Ee_i^2
\le \frac{Ac^2(C_2+1)^2C_2C_1}{C} = \epsilon,
$$
which implies (A3).  □
Lemma A3.
Under Conditions 1–3 and 6, we have
$$\big\|\hat\beta^{(m)} - \beta^{(m)*}\big\|^2 = O_p\big(p_m^3/n\big), \qquad (A5)$$
where $\hat\beta^{(m)}$, belonging to $\Theta_m$, is the root of (4).
Proof of Lemma A3.
By the definition of $\beta^{(m)*}$ and Condition 6, we have
$$
\big\|\bar U_{n,m}(\beta^{(m)})\big\|^2
= \big\|\bar U_{n,m}(\beta^{(m)}) - \bar U_{n,m}(\beta^{(m)*})\big\|^2
= \Big\|\frac{\partial \bar U_{n,m}(\beta^{(m)})}{\partial\beta^{(m)}}\Big|_{\beta^{(m)}=\bar\beta^{(m)}}\big(\beta^{(m)}-\beta^{(m)*}\big)\Big\|^2
\ge C_0^2\big\|\beta^{(m)}-\beta^{(m)*}\big\|^2, \qquad (A6)
$$
where $\bar\beta^{(m)}$ is a point between $\beta^{(m)}$ and $\beta^{(m)*}$. Recalling that
$$
U_{n,m}\big(\hat\beta^{(m)}\big)
= \frac{1}{n}\sum_{i=1}^{n}\big(y_i - g(\hat\eta_{i,p_m})\big)\frac{g'(\hat\eta_{i,p_m})}{\sigma^2\big(g(\hat\eta_{i,p_m})\big)}\,\varepsilon(i,p_m)
= \frac{1}{n}\sum_{i=1}^{n}\big(\mu_i + e_i - g(\hat\eta_{i,p_m})\big)\frac{g'(\hat\eta_{i,p_m})}{\sigma^2\big(g(\hat\eta_{i,p_m})\big)}\,\varepsilon(i,p_m)
= \bar U_{n,m}\big(\hat\beta^{(m)}\big) + \frac{1}{n}\sum_{i=1}^{n} e_i\,\frac{g'(\hat\eta_{i,p_m})}{\sigma^2\big(g(\hat\eta_{i,p_m})\big)}\,\varepsilon(i,p_m) = 0,
$$
we obtain
$$
\bar U_{n,m}\big(\hat\beta^{(m)}\big) = -\frac{1}{n}\sum_{i=1}^{n} e_i\,\frac{g'(\hat\eta_{i,p_m})}{\sigma^2\big(g(\hat\eta_{i,p_m})\big)}\,\varepsilon(i,p_m)
= -\frac{1}{n}\,\varepsilon_n^{(m)T}V_n\big(\hat\beta^{(m)}\big)e,
$$
where $V_n(\hat\beta^{(m)}) = \mathrm{diag}\big\{g'(\hat\eta_{i,p_m})/\sigma^2(g(\hat\eta_{i,p_m}))\big\}_{1\le i\le n}$. From (A6), we get
$$
\big\|\bar U_{n,m}(\beta^{(m)})\big\| \ge C_0\delta \quad \text{whenever } \big\|\beta^{(m)}-\beta^{(m)*}\big\| \ge \delta, \quad \forall\,\delta>0. \qquad (A7)
$$
By Condition 1, for any $\kappa>0$, there is an $N_1$ such that for all $n>N_1$, we have
$$P\big(0\in U_{n,m}(\Theta_m)\big) > 1-\kappa.$$
From (A7), it can be seen that
$$
\big\{0\in U_{n,m}(\Theta_m)\big\}
= \Big\{\text{there is a }\hat\beta^{(m)}\in\Theta_m\text{ such that }\frac{1}{n}\varepsilon_n^{(m)T}V_n(\hat\beta^{(m)})e = -\bar U_{n,m}(\hat\beta^{(m)})\Big\}
\subseteq \Big\{\sup_{\beta^{(m)}\in\Theta_m}\big\|\varepsilon_n^{(m)T}V_n(\beta^{(m)})e\big\| \ge C_0 n\delta\Big\} \cup \Big\{\big\|\hat\beta^{(m)}-\beta^{(m)*}\big\| < \delta\Big\}.
$$
Then for any $C>0$ and $n>N_1$, letting $\delta = C(p_m+1)^{3/2}/\sqrt{n}$, we have
$$
P\Big(\big\|\hat\beta^{(m)}-\beta^{(m)*}\big\| \le \frac{C(p_m+1)^{3/2}}{\sqrt{n}}\Big)
\ge 1-\kappa - P\Big(\sup_{\beta^{(m)}\in\Theta_m}\big\|\varepsilon_n^{(m)T}V_n(\beta^{(m)})e\big\| > C_0C\sqrt{n}\,(p_m+1)^{3/2}\Big)
\ge 1-\kappa - \frac{Ac^2(C_2+1)^2C_2C_1}{C_0^2C^2}
= 1-\kappa - \frac{C'}{C_0^2C^2},
$$
where $C' = Ac^2(C_2+1)^2C_2C_1$ and the second inequality is derived from (A4). As a result, for any $\kappa>0$, we can select $C = \sqrt{C'/\kappa}/C_0$ such that
$$
P\Big(\big\|\hat\beta^{(m)}-\beta^{(m)*}\big\| > \frac{C(p_m+1)^{3/2}}{\sqrt{n}}\Big) < 2\kappa
$$
for sufficiently large $n$; thus
$$\big\|\hat\beta^{(m)}-\beta^{(m)*}\big\|^2 = O_p\big(p_m^3/n\big). \qquad \square$$
Lemma A4.
Under Conditions 1–4 and 6, we have
$$\sup_{w\in H_n}\Big|\frac{L_n(w)}{R_n(w)} - 1\Big| = o_p(1). \qquad (A8)$$
Proof of Lemma A4.
Write $\Delta_i(w) = g(\eta_i) - g\big(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\beta^{(m)*}\big)$. From the definition of $L_n(w)$, we have
$$
L_n(w) = \sum_{i=1}^{n}\Big(g(\eta_i) - g\Big(\sum_{m=1}^{M}w_m\hat\eta_{i,p_m}\Big)\Big)^2
= \sum_{i=1}^{n}\Big(g(\eta_i) - g\Big(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\beta^{(m)*}\Big) + g\Big(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\beta^{(m)*}\Big) - g\Big(\sum_{m=1}^{M}w_m\hat\eta_{i,p_m}\Big)\Big)^2
$$
$$
= \sum_{i=1}^{n}\Delta_i^2(w)
+ \sum_{i=1}^{n}\Big(g\Big(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\beta^{(m)*}\Big) - g\Big(\sum_{m=1}^{M}w_m\hat\eta_{i,p_m}\Big)\Big)^2
+ 2\sum_{i=1}^{n}\Delta_i(w)\Big(g\Big(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\beta^{(m)*}\Big) - g\Big(\sum_{m=1}^{M}w_m\hat\eta_{i,p_m}\Big)\Big)
\equiv R_n(w) + L_n^{(2)}(w) + L_n^{(3)}(w).
$$
We note that
$$
\Big|\frac{L_n(w) - R_n(w)}{R_n(w)}\Big| = \frac{\big|L_n^{(2)}(w) + L_n^{(3)}(w)\big|}{R_n(w)}
\le \sup_{w\in H_n}\Bigg\{\frac{L_n^{(2)}(w)}{R_n(w)} + 2\Big(\frac{L_n^{(2)}(w)}{R_n(w)}\Big)^{1/2}\Bigg\},
$$
since $|L_n^{(3)}(w)| \le 2\big(R_n(w)L_n^{(2)}(w)\big)^{1/2}$ by the Cauchy–Schwarz inequality. Then, (A8) is valid if
$$
\sup_{w\in H_n}\frac{L_n^{(2)}(w)}{R_n(w)} \stackrel{p}{\to} 0. \qquad (A9)
$$
Let $\eta_i^{*m}$ be a point between $\varepsilon(i,p_m)^T\beta^{(m)*}$ and $\hat\eta_{i,p_m}$. For fixed $M$,
$$
L_n^{(2)}(w) = \sum_{i=1}^{n}\Big(g\Big(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\beta^{(m)*}\Big) - g\Big(\sum_{m=1}^{M}w_m\hat\eta_{i,p_m}\Big)\Big)^2
= \sum_{i=1}^{n}\Big[g'\Big(\sum_{m=1}^{M}w_m\eta_i^{*m}\Big)\Big]^2\Big(\sum_{m=1}^{M}w_m\big(\varepsilon(i,p_m)^T\beta^{(m)*} - \hat\eta_{i,p_m}\big)\Big)^2
\le \sum_{i=1}^{n}\Big[g'\Big(\sum_{m=1}^{M}w_m\eta_i^{*m}\Big)\Big]^2\sum_{m=1}^{M}\big(\varepsilon(i,p_m)^T\beta^{(m)*} - \hat\eta_{i,p_m}\big)^2
\le c^2\sum_{i=1}^{n}\sum_{m=1}^{M}\big(\varepsilon(i,p_m)^T(\beta^{(m)*} - \hat\beta^{(m)})\big)^2.
$$
Then, by Lemma A3 and Condition 3, we have
$$
\sup_{w\in H_n} L_n^{(2)}(w) \le c^2\sum_{i=1}^{n}\sum_{m=1}^{M}\big(\varepsilon(i,p_m)^T(\beta^{(m)*} - \hat\beta^{(m)})\big)^2 = O_p\Big(\sum_{m=1}^{M}p_m^4\Big),
$$
which, together with Condition 4, leads to (A9).  □

Appendix B. Proof of Theorem 1

Let $\tilde\mu(w) = \big(\tilde\mu_1(w), \tilde\mu_2(w), \ldots, \tilde\mu_n(w)\big)^T$, and
$$\tilde L_n(w) = \|\mu - \tilde\mu(w)\|^2.$$
As in Li (1987) and Ando and Li (2014), we know that
$$
CV(w) = \|e\|^2 + \tilde L_n(w) + 2\big\langle e, \mu - \tilde\mu(w)\big\rangle
= \|e\|^2 + L_n(w)\Big\{\frac{\tilde L_n(w)}{L_n(w)} + \frac{2\langle e, \mu - \tilde\mu(w)\rangle}{L_n(w)}\Big\}. \qquad (A10)
$$
As $\hat w$ minimizes $CV(w)$ over $w\in H_n$, it also minimizes $CV(w) - \|e\|^2$ over $w\in H_n$. Therefore, the claim
$$\frac{L_n(\hat w)}{\inf_{w\in H_n}L_n(w)} \stackrel{p}{\to} 1$$
is valid if
$$\sup_{w\in H_n}\Big|\frac{\tilde L_n(w)}{L_n(w)} - 1\Big| \stackrel{p}{\to} 0 \qquad (A11)$$
and
$$\sup_{w\in H_n}\frac{\big|\langle e, \mu - \tilde\mu(w)\rangle\big|}{L_n(w)} \stackrel{p}{\to} 0 \qquad (A12)$$
hold. In fact, if we denote $w^* = \arg\min_{w\in H_n}L_n(w)$, then
$$\frac{L_n(\hat w)}{\inf_{w\in H_n}L_n(w)} = \frac{L_n(\hat w)}{L_n(w^*)} \ge 1,$$
so we only need to prove
$$\frac{L_n(\hat w)}{L_n(w^*)} \le 1 + \delta_n,$$
where $\delta_n \ge 0$ for $n = 1, 2, \ldots$, and $\delta_n \stackrel{p}{\to} 0$. According to the definition of $\hat w$, we have $CV(\hat w) \le CV(w^*)$. Then, by (A10), we obtain
$$
\|e\|^2 + L_n(\hat w)\Big\{\frac{\tilde L_n(\hat w)}{L_n(\hat w)} + \frac{2\langle e,\mu-\tilde\mu(\hat w)\rangle}{L_n(\hat w)}\Big\}
\le \|e\|^2 + L_n(w^*)\Big\{\frac{\tilde L_n(w^*)}{L_n(w^*)} + \frac{2\langle e,\mu-\tilde\mu(w^*)\rangle}{L_n(w^*)}\Big\},
$$
which is equivalent to
$$
\frac{L_n(\hat w)}{L_n(w^*)}
\le \Big\{\frac{\tilde L_n(w^*)}{L_n(w^*)} + \frac{2\langle e,\mu-\tilde\mu(w^*)\rangle}{L_n(w^*)}\Big\}
\Big/ \Big\{\frac{\tilde L_n(\hat w)}{L_n(\hat w)} + \frac{2\langle e,\mu-\tilde\mu(\hat w)\rangle}{L_n(\hat w)}\Big\}.
$$
From (A11) and (A12), we have
$$
\frac{\tilde L_n(w^*)}{L_n(w^*)} + \frac{2\langle e,\mu-\tilde\mu(w^*)\rangle}{L_n(w^*)}
\le \sup_{w\in H_n}\Big\{\frac{\tilde L_n(w)}{L_n(w)} + \frac{2|\langle e,\mu-\tilde\mu(w)\rangle|}{L_n(w)}\Big\}
\le \sup_{w\in H_n}\Big|\frac{\tilde L_n(w)}{L_n(w)} - 1\Big| + 1 + \sup_{w\in H_n}\frac{2|\langle e,\mu-\tilde\mu(w)\rangle|}{L_n(w)},
$$
and
$$
\frac{\tilde L_n(\hat w)}{L_n(\hat w)} + \frac{2\langle e,\mu-\tilde\mu(\hat w)\rangle}{L_n(\hat w)}
\ge 1 - \sup_{w\in H_n}\Big|\frac{\tilde L_n(w)}{L_n(w)} - 1\Big| - \sup_{w\in H_n}\frac{2|\langle e,\mu-\tilde\mu(w)\rangle|}{L_n(w)}.
$$
Therefore,
$$
1 \le \frac{L_n(\hat w)}{L_n(w^*)} \le \frac{1+\delta_n}{1-\delta_n},
$$
with
$$
\delta_n = \sup_{w\in H_n}\Big|\frac{\tilde L_n(w)}{L_n(w)} - 1\Big| + \sup_{w\in H_n}\frac{2|\langle e,\mu-\tilde\mu(w)\rangle|}{L_n(w)}.
$$
Thus, we obtain
$$\frac{L_n(\hat w)}{L_n(w^*)} \stackrel{p}{\to} 1.$$
In the following, we prove (A11) and (A12).

Appendix B.1. Proof of (A11)

Notice that
$$
\tilde L_n(w) - L_n(w)
= \sum_{i=1}^{n}\tilde\mu_i^2(w) - \sum_{i=1}^{n}\hat\mu_i^2(w) + 2\sum_{i=1}^{n}\mu_i\big(\hat\mu_i(w) - \tilde\mu_i(w)\big)
= \sum_{i=1}^{n}\big(\tilde\mu_i(w)-\hat\mu_i(w)\big)^2 - 2\sum_{i=1}^{n}\hat\mu_i^2(w) + 2\sum_{i=1}^{n}\tilde\mu_i(w)\hat\mu_i(w) + 2\sum_{i=1}^{n}\mu_i\big(\hat\mu_i(w)-\tilde\mu_i(w)\big)
$$
$$
= \sum_{i=1}^{n}\big(\tilde\mu_i(w)-\hat\mu_i(w)\big)^2 + 2\sum_{i=1}^{n}\big(\mu_i-\hat\mu_i(w)\big)\big(\hat\mu_i(w)-\tilde\mu_i(w)\big)
= \|\hat\mu(w)-\tilde\mu(w)\|^2 + 2\big\langle \mu-\hat\mu(w),\, \hat\mu(w)-\tilde\mu(w)\big\rangle
\le \|\hat\mu(w)-\tilde\mu(w)\|^2 + 2\sqrt{L_n(w)}\,\|\hat\mu(w)-\tilde\mu(w)\|.
$$
So,
$$
\Big|\frac{\tilde L_n(w)}{L_n(w)} - 1\Big| = \frac{|\tilde L_n(w) - L_n(w)|}{L_n(w)}
\le \frac{\|\hat\mu(w)-\tilde\mu(w)\|^2}{L_n(w)} + \frac{2\|\hat\mu(w)-\tilde\mu(w)\|}{\sqrt{L_n(w)}}.
$$
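The expansion above is again pure vector algebra, which can be checked numerically on arbitrary stand-in vectors (synthetic, not model output):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
mu = rng.standard_normal(n)
mu_hat = rng.standard_normal(n)    # stand-in for mu^(w)
mu_tilde = rng.standard_normal(n)  # stand-in for mu~(w)

# L~_n - L_n = ||mu^ - mu~||^2 + 2<mu - mu^, mu^ - mu~>.
lhs = np.sum((mu - mu_tilde) ** 2) - np.sum((mu - mu_hat) ** 2)
rhs = np.sum((mu_hat - mu_tilde) ** 2) + 2 * (mu - mu_hat) @ (mu_hat - mu_tilde)
assert np.isclose(lhs, rhs)
```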
Therefore, to prove (A11), it suffices to verify
$$\sup_{w\in H_n}\frac{\|\hat\mu(w)-\tilde\mu(w)\|^2}{L_n(w)} \stackrel{p}{\to} 0.$$
By Lemma A4, we need only to show
$$\sup_{w\in H_n}\frac{\|\hat\mu(w)-\tilde\mu(w)\|^2}{R_n(w)} \stackrel{p}{\to} 0. \qquad (A13)$$
Let $\eta^*_{i,p_m}$ be a point between $\tilde\eta_{i,p_m}$ and $\hat\eta_{i,p_m}$. Then, for any $\delta>0$, we have
$$
P\Big(\sup_{w\in H_n}\frac{\|\hat\mu(w)-\tilde\mu(w)\|^2}{R_n(w)} > \delta\Big)
\le P\Big(\sup_{w\in H_n}\|\hat\mu(w)-\tilde\mu(w)\|^2 > \delta\xi_n\Big)
= P\Big(\sup_{w\in H_n}\sum_{i=1}^{n}\big(\tilde\mu_i(w)-\hat\mu_i(w)\big)^2 > \delta\xi_n\Big)
= P\Big(\sup_{w\in H_n}\sum_{i=1}^{n}\Big(g\Big(\sum_{m=1}^{M}w_m\tilde\eta_{i,p_m}\Big) - g\Big(\sum_{m=1}^{M}w_m\hat\eta_{i,p_m}\Big)\Big)^2 > \delta\xi_n\Big)
$$
$$
= P\Big(\sup_{w\in H_n}\sum_{i=1}^{n}\big[g'(\eta^*_{i,p_m})\big]^2\Big(\sum_{m=1}^{M}w_m\big(\tilde\eta_{i,p_m}-\hat\eta_{i,p_m}\big)\Big)^2 > \delta\xi_n\Big)
\le P\Big(\max_{1\le i\le n,\,w\in H_n}\big[g'(\eta^*_{i,p_m})\big]^2\,\sup_{w\in H_n}\sum_{i=1}^{n}\sum_{m=1}^{M}\big(\tilde\eta_{i,p_m}-\hat\eta_{i,p_m}\big)^2 > \delta\xi_n\Big)
= P\Big(\max_{1\le i\le n,\,w\in H_n}\big[g'(\eta^*_{i,p_m})\big]^2\,\sum_{i=1}^{n}\sum_{m=1}^{M}\big(\tilde\eta_{i,p_m}-\hat\eta_{i,p_m}\big)^2 > \delta\xi_n\Big),
$$
which, together with the assumption that $g(\cdot)$ is a twice continuously differentiable function with bounded derivatives (implying $\max_{1\le i\le n,\,w\in H_n}|g'(\eta^*_{i,p_m})|^2 \le c^2 < \infty$), leads to
$$
P\Big(\sup_{w\in H_n}\frac{\|\hat\mu(w)-\tilde\mu(w)\|^2}{R_n(w)} > \delta\Big)
\le P\Big(c^2\sum_{i=1}^{n}\sum_{m=1}^{M}\big(\tilde\eta_{i,p_m}-\hat\eta_{i,p_m}\big)^2/\xi_n > \delta\Big).
$$
Thus, to prove (A13), it suffices to show
$$\sum_{i=1}^{n}\sum_{m=1}^{M}\big(\tilde\eta_{i,p_m}-\hat\eta_{i,p_m}\big)^2/\xi_n = o_p(1). \qquad (A14)$$
By Condition 5, for fixed $M$, we obtain
$$\sum_{i=1}^{n}\sum_{m=1}^{M}\big(\tilde\eta_{i,p_m}-\hat\eta_{i,p_m}\big)^2 = O_p\Big(\sum_{m=1}^{M}p_m^4\Big),$$
which, together with Condition 4, leads to (A14), and thus (A13) holds.

Appendix B.2. Proof of (A12)

As
$$\big\langle e, \mu-\tilde\mu(w)\big\rangle = \sum_{i=1}^{n} e_i\Big(g(\eta_i) - g\Big(\sum_{m=1}^{M}w_m\tilde\eta_{i,p_m}\Big)\Big),$$
it is sufficient to show
$$\sup_{w\in H_n}\Big|\sum_{i=1}^{n} e_i\Big(g(\eta_i) - g\Big(\sum_{m=1}^{M}w_m\tilde\eta_{i,p_m}\Big)\Big)\Big|\Big/R_n(w) \stackrel{p}{\to} 0.$$
It is readily seen that
$$
\sup_{w\in H_n}\frac{\big|\sum_{i=1}^{n} e_i\big(g(\eta_i) - g(\sum_{m=1}^{M}w_m\tilde\eta_{i,p_m})\big)\big|}{R_n(w)}
\le \sup_{w\in H_n}\frac{\big|\sum_{i=1}^{n} e_i\big(g(\eta_i) - g(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\beta^{(m)*})\big)\big|}{R_n(w)}
+ \sup_{w\in H_n}\frac{\big|\sum_{i=1}^{n} e_i\big(g(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\beta^{(m)*}) - g(\sum_{m=1}^{M}w_m\hat\eta_{i,p_m})\big)\big|}{R_n(w)}
+ \sup_{w\in H_n}\frac{\big|\sum_{i=1}^{n} e_i\big(g(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\hat\beta^{(m)}) - g(\sum_{m=1}^{M}w_m\tilde\eta_{i,p_m})\big)\big|}{R_n(w)}
\equiv \sup_{w\in H_n}A_n^{(1)}(w) + \sup_{w\in H_n}A_n^{(2)}(w) + \sup_{w\in H_n}A_n^{(3)}(w).
$$
Thus, we need only to prove
$$\sup_{w\in H_n}A_n^{(1)}(w) \stackrel{p}{\to} 0, \qquad (A15)$$
$$\sup_{w\in H_n}A_n^{(2)}(w) \stackrel{p}{\to} 0, \qquad (A16)$$
and
$$\sup_{w\in H_n}A_n^{(3)}(w) \stackrel{p}{\to} 0. \qquad (A17)$$
The proof of (A15) is similar to that of Wu (1981). Define the Euclidean metric
$$\rho(w, w') = \|w - w'\|$$
on $H_n$, so that $(H_n, \rho)$ is a compact metric space. Then $C(H_n)$ is the Banach space of real-valued continuous functions on $H_n$ with the supremum norm
$$\|\Delta\| = \sup_{w\in H_n}|\Delta(w)|.$$
Let $N(H_n,\rho,\varepsilon)$ denote the minimal number of $\rho$-balls of radius less than or equal to $\varepsilon$ which cover $H_n$, and set
$$H(H_n,\rho,\varepsilon) = \log N(H_n,\rho,\varepsilon).$$
We let
$$\mathrm{Lip}(\rho) = \Big\{\Delta\in C(H_n): \Lambda_\Delta = \sup_{w\neq w'\in H_n}\frac{|\Delta(w)-\Delta(w')|}{\rho(w,w')} < \infty\Big\},$$
and for $\Delta\in\mathrm{Lip}(\rho)$, we define
$$\|\Delta\|_\rho = \Lambda_\Delta + |\Delta(w^*)|,$$
where $w^*$ is some fixed point in $H_n$.
Recalling that $\Delta_i(w) = g(\eta_i) - g\big(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\beta^{(m)*}\big)$, we have
$$
\Lambda_{\Delta_i/\sqrt{p+1}}
= \sup_{w\neq w'\in H_n}\frac{|\Delta_i(w)-\Delta_i(w')|}{\sqrt{p+1}\,\rho(w,w')}
= \sup_{w\neq w'\in H_n}\frac{\big|g'(\gamma_{0,i})\big|\,\big|\sum_{m=1}^{M}(w_m-w'_m)\,\varepsilon(i,p_m)^T\beta^{(m)*}\big|}{\sqrt{p+1}\,\rho(w,w')}
\le c\,\sup_{w\neq w'\in H_n}\frac{\big|\sum_{m=1}^{M}(w_m-w'_m)\,\varepsilon(i,p_m)^T\beta^{(m)*}\big|}{\sqrt{p+1}\,\rho(w,w')}
\le \frac{c\big(\sum_{m=1}^{M}\big(\varepsilon(i,p_m)^T\beta^{(m)*}\big)^2\big)^{1/2}}{(p+1)^{1/2}},
$$
where $\gamma_{0,i}$ is a point between $\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\beta^{(m)*}$ and $\sum_{m=1}^{M}w'_m\varepsilon(i,p_m)^T\beta^{(m)*}$, and the last inequality is by the Cauchy–Schwarz inequality. From the assumption $\|\beta^{(m)*}\|^2/(p_m+1) \le C_b < \infty$ and Condition 3, we obtain
$$\sup_i \Lambda_{\Delta_i/\sqrt{p+1}} \le C_g < \infty. \qquad (A18)$$
As for $|\Delta_i(w^*)|/\sqrt{p+1}$, using the Lagrange mean-value theorem, we have
$$
\frac{|\Delta_i(w^*)|}{\sqrt{p+1}}
= \frac{1}{\sqrt{p+1}}\,\big|g'(\zeta_i)\big|\,\Big|\eta_i - \sum_{m=1}^{M}w^*_m\,\varepsilon(i,p_m)^T\beta^{(m)*}\Big|
\le \frac{c\big(\sum_{m=1}^{M}\big(\eta_i - \varepsilon(i,p_m)^T\beta^{(m)*}\big)^2\big)^{1/2}}{(p+1)^{1/2}},
$$
where $\zeta_i$ is a point between $\eta_i$ and $\sum_{m=1}^{M}w^*_m\,\varepsilon(i,p_m)^T\beta^{(m)*}$. Again, by Condition 3, $\|\beta^{(m)*}\|^2/(p_m+1)\le C_b<\infty$, and the assumption $\sup_i|\eta_i| \le C_\eta < \infty$, we obtain
$$\sup_i \frac{|\Delta_i(w^*)|}{\sqrt{p+1}} \le \tilde C < \infty. \qquad (A19)$$
For (A15), we have
$$
P\Big(\sup_{w\in H_n}A_n^{(1)}(w) > \delta\Big)
\le P\Big(\sup_{w\in H_n}\Big|\sum_{i=1}^{n} e_i\Big(g(\eta_i) - g\Big(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\beta^{(m)*}\Big)\Big)\Big| > \delta\xi_n\Big)
= P\Big(\sup_{w\in H_n}\Big|\sum_{i=1}^{n}\frac{e_i\Delta_i(w)}{\sqrt{p+1}}\Big| > \frac{\delta\xi_n}{\sqrt{p+1}}\Big)
\le \frac{(p+1)\,E\big(\sup_{w\in H_n}\big|\sum_{i=1}^{n}e_i\Delta_i(w)/\sqrt{p+1}\big|\big)^2}{\delta^2\xi_n^2}
= \frac{(p+1)\,E\big\|\sum_{i=1}^{n}e_i\Delta_i/\sqrt{p+1}\big\|^2}{\delta^2\xi_n^2},
$$
where $\delta>0$ is an arbitrary constant. Since $H_n$ is a compact subset of $\mathbb{R}^M$, and $\rho(w,w')$ is the Euclidean metric on $\mathbb{R}^M$, (A1) is satisfied. Therefore, by Lemma A1, we see that there is a constant $A<\infty$ such that for all $n$,
$$
E\Big\|\sum_{i=1}^{n}\frac{e_i\Delta_i}{\sqrt{p+1}}\Big\|^2
\le A\sum_{i=1}^{n}E\Big\|\frac{e_i\Delta_i}{\sqrt{p+1}}\Big\|_\rho^2
\le A\,\sup_j Ee_j^2\,\sum_{i=1}^{n}\Big(\Lambda_{\Delta_i/\sqrt{p+1}} + \frac{|\Delta_i(w^*)|}{\sqrt{p+1}}\Big)^2
\le 2A\,\sup_j Ee_j^2\,\sum_{i=1}^{n}\Big(\Lambda^2_{\Delta_i/\sqrt{p+1}} + \frac{|\Delta_i(w^*)|^2}{p+1}\Big)
= O(n),
$$
where the last equality follows from (A18), (A19), and $\sup_j Ee_j^2 < \infty$. Therefore,
$$
P\Big(\sup_{w\in H_n}A_n^{(1)}(w) > \delta\Big) = O\Big(\frac{(p+1)\,n}{\xi_n^2}\Big) \to 0,
$$
and (A15) holds.
Denote $\tilde\Delta_i(w) = g\big(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\beta^{(m)*}\big) - g\big(\sum_{m=1}^{M}w_m\hat\eta_{i,p_m}\big)$. For (A16), we have
$$
P\Big(\sup_{w\in H_n}A_n^{(2)}(w) > \delta\Big)
\le P\Big(\sup_{w\in H_n}\Big|\sum_{i=1}^{n} e_i\tilde\Delta_i(w)\Big|^2 > \delta^2\xi_n^2\Big)
\le P\Big(\sup_{w\in H_n}\Big(\sum_{i=1}^{n}e_i^2\Big)\Big(\sum_{i=1}^{n}\tilde\Delta_i^2(w)\Big) > \delta^2\xi_n^2\Big)
\le P\Big(\sup_{w\in H_n}\sum_{i=1}^{n}\tilde\Delta_i^2(w) > \delta^2\xi_n p^2/n\Big) + P\Big(\sum_{i=1}^{n}e_i^2 > \xi_n n/p^2\Big)
$$
$$
= P\Big(\sup_{w\in H_n}\sum_{i=1}^{n}\Big[g'\Big(\sum_{m=1}^{M}w_m\tilde\eta_i^{*m}\Big)\Big]^2\Big(\sum_{m=1}^{M}w_m\big(\varepsilon(i,p_m)^T\beta^{(m)*} - \hat\eta_{i,p_m}\big)\Big)^2 > \delta^2\xi_n p^2/n\Big) + P\Big(\sum_{i=1}^{n}e_i^2 > \xi_n n/p^2\Big)
\le P\Big(c^2\sum_{i=1}^{n}\sum_{m=1}^{M}\big(\varepsilon(i,p_m)^T\beta^{(m)*} - \hat\eta_{i,p_m}\big)^2 > \delta^2\xi_n p^2/n\Big) + \frac{p^2\sum_{i=1}^{n}Ee_i^2}{\xi_n n},
$$
where the second inequality is by the Cauchy–Schwarz inequality and the last step applies the Markov inequality to the second term. From Lemma A3 and Condition 3, we see that
$$
\sum_{i=1}^{n}\sum_{m=1}^{M}\big(\varepsilon(i,p_m)^T\beta^{(m)*} - \hat\eta_{i,p_m}\big)^2
= \sum_{i=1}^{n}\sum_{m=1}^{M}\big(\varepsilon(i,p_m)^T(\beta^{(m)*} - \hat\beta^{(m)})\big)^2 = O_p\Big(\sum_{m=1}^{M}p_m^4\Big).
$$
Therefore, $\lim_{n\to+\infty}P\big(\sup_{w\in H_n}A_n^{(2)}(w) > \delta\big) = 0$; that is, (A16) is valid.
Write $\bar\Delta_i(w) = g\big(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\hat\beta^{(m)}\big) - g\big(\sum_{m=1}^{M}w_m\tilde\eta_{i,p_m}\big)$. For (A17), we have
$$
P\Big(\sup_{w\in H_n}A_n^{(3)}(w) > \delta\Big)
\le P\Big(\sup_{w\in H_n}\Big|\sum_{i=1}^{n} e_i\bar\Delta_i(w)\Big|^2 > \delta^2\xi_n^2\Big)
\le P\Big(\sup_{w\in H_n}\Big(\sum_{i=1}^{n}e_i^2\Big)\Big(\sum_{i=1}^{n}\bar\Delta_i^2(w)\Big) > \delta^2\xi_n^2\Big)
\le P\Big(\sup_{w\in H_n}\sum_{i=1}^{n}\bar\Delta_i^2(w) > \delta^2\xi_n p^2/n\Big) + P\Big(\sum_{i=1}^{n}e_i^2 > \xi_n n/p^2\Big)
$$
$$
= P\Big(\sup_{w\in H_n}\sum_{i=1}^{n}\Big[g'\Big(\sum_{m=1}^{M}w_m\eta_{i,p_m}^*\Big)\Big]^2\Big(\sum_{m=1}^{M}w_m\big(\varepsilon(i,p_m)^T\hat\beta^{(m)} - \tilde\eta_{i,p_m}\big)\Big)^2 > \delta^2\xi_n p^2/n\Big) + P\Big(\sum_{i=1}^{n}e_i^2 > \xi_n n/p^2\Big)
\le P\Big(c^2\sum_{i=1}^{n}\sum_{m=1}^{M}\big(\varepsilon(i,p_m)^T\hat\beta^{(m)} - \tilde\eta_{i,p_m}\big)^2 > \delta^2\xi_n p^2/n\Big) + \frac{p^2\sum_{i=1}^{n}Ee_i^2}{\xi_n n}.
$$
From Condition 5, we see that
$$\sum_{i=1}^{n}\sum_{m=1}^{M}\big(\varepsilon(i,p_m)^T\hat\beta^{(m)} - \tilde\eta_{i,p_m}\big)^2 = O_p\Big(\sum_{m=1}^{M}p_m^4\Big).$$
Therefore, $\lim_{n\to+\infty}P\big(\sup_{w\in H_n}A_n^{(3)}(w) > \delta\big) = 0$; that is, (A17) is valid.  □

Appendix C. Simulation Results in Section 4.1

Table A1. Prediction errors with n = 60 in Case 1.

| R | | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|--------|-------|-------|-------|-------|-------|-------|-------|
| 1 | Mean | 0.432 | 0.408 | 0.404 | 0.433 | 0.408 | 0.394 | 0.393 |
| | Median | 0.417 | 0.417 | 0.417 | 0.417 | 0.417 | 0.375 | 0.417 |
| | Var | 0.023 | 0.023 | 0.020 | 0.023 | 0.024 | 0.023 | 0.021 |
| 2 | Mean | 0.312 | 0.294 | 0.249 | 0.311 | 0.292 | 0.225 | 0.226 |
| | Median | 0.333 | 0.333 | 0.250 | 0.333 | 0.333 | 0.250 | 0.250 |
| | Var | 0.013 | 0.013 | 0.016 | 0.013 | 0.013 | 0.013 | 0.013 |
| 3 | Mean | 0.273 | 0.262 | 0.226 | 0.273 | 0.260 | 0.188 | 0.189 |
| | Median | 0.250 | 0.250 | 0.250 | 0.250 | 0.250 | 0.167 | 0.167 |
| | Var | 0.017 | 0.017 | 0.015 | 0.017 | 0.017 | 0.016 | 0.015 |
| 4 | Mean | 0.256 | 0.243 | 0.183 | 0.256 | 0.247 | 0.162 | 0.163 |
| | Median | 0.250 | 0.250 | 0.167 | 0.250 | 0.250 | 0.167 | 0.167 |
| | Var | 0.018 | 0.017 | 0.011 | 0.018 | 0.017 | 0.013 | 0.013 |
| 5 | Mean | 0.203 | 0.196 | 0.148 | 0.203 | 0.193 | 0.133 | 0.134 |
| | Median | 0.167 | 0.167 | 0.167 | 0.167 | 0.167 | 0.083 | 0.083 |
| | Var | 0.014 | 0.014 | 0.011 | 0.014 | 0.013 | 0.009 | 0.009 |
| 6 | Mean | 0.234 | 0.233 | 0.135 | 0.234 | 0.233 | 0.117 | 0.115 |
| | Median | 0.250 | 0.250 | 0.125 | 0.250 | 0.250 | 0.083 | 0.083 |
| | Var | 0.016 | 0.016 | 0.010 | 0.016 | 0.016 | 0.010 | 0.010 |
| 7 | Mean | 0.214 | 0.213 | 0.149 | 0.214 | 0.214 | 0.118 | 0.117 |
| | Median | 0.208 | 0.208 | 0.167 | 0.208 | 0.250 | 0.083 | 0.083 |
| | Var | 0.014 | 0.015 | 0.010 | 0.014 | 0.015 | 0.009 | 0.008 |
| 8 | Mean | 0.213 | 0.209 | 0.134 | 0.213 | 0.210 | 0.104 | 0.103 |
| | Median | 0.250 | 0.167 | 0.125 | 0.250 | 0.167 | 0.083 | 0.083 |
| | Var | 0.012 | 0.012 | 0.009 | 0.012 | 0.012 | 0.008 | 0.008 |
| 9 | Mean | 0.196 | 0.196 | 0.128 | 0.196 | 0.196 | 0.096 | 0.099 |
| | Median | 0.167 | 0.167 | 0.083 | 0.167 | 0.167 | 0.083 | 0.083 |
| | Var | 0.014 | 0.014 | 0.012 | 0.014 | 0.015 | 0.008 | 0.008 |
| 10 | Mean | 0.209 | 0.208 | 0.126 | 0.209 | 0.206 | 0.088 | 0.087 |
| | Median | 0.167 | 0.167 | 0.083 | 0.167 | 0.167 | 0.083 | 0.083 |
| | Var | 0.016 | 0.016 | 0.009 | 0.016 | 0.016 | 0.006 | 0.006 |
Table A2. Prediction errors with n = 200 in Case 1.

| R | | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|--------|-------|-------|-------|-------|-------|-------|-------|
| 1 | Mean | 0.355 | 0.350 | 0.329 | 0.355 | 0.349 | 0.322 | 0.322 |
| | Median | 0.350 | 0.350 | 0.325 | 0.350 | 0.350 | 0.325 | 0.313 |
| | Var | 0.006 | 0.007 | 0.007 | 0.006 | 0.007 | 0.006 | 0.006 |
| 2 | Mean | 0.262 | 0.262 | 0.234 | 0.262 | 0.262 | 0.227 | 0.227 |
| | Median | 0.275 | 0.275 | 0.225 | 0.275 | 0.275 | 0.225 | 0.225 |
| | Var | 0.005 | 0.005 | 0.004 | 0.005 | 0.005 | 0.004 | 0.004 |
| 3 | Mean | 0.205 | 0.205 | 0.184 | 0.205 | 0.205 | 0.174 | 0.174 |
| | Median | 0.200 | 0.200 | 0.175 | 0.200 | 0.200 | 0.175 | 0.175 |
| | Var | 0.005 | 0.005 | 0.004 | 0.005 | 0.005 | 0.003 | 0.003 |
| 4 | Mean | 0.163 | 0.163 | 0.134 | 0.163 | 0.163 | 0.128 | 0.128 |
| | Median | 0.150 | 0.150 | 0.125 | 0.150 | 0.150 | 0.125 | 0.125 |
| | Var | 0.004 | 0.004 | 0.003 | 0.004 | 0.004 | 0.003 | 0.003 |
| 5 | Mean | 0.139 | 0.139 | 0.113 | 0.139 | 0.139 | 0.110 | 0.110 |
| | Median | 0.125 | 0.125 | 0.113 | 0.125 | 0.125 | 0.100 | 0.100 |
| | Var | 0.003 | 0.003 | 0.003 | 0.003 | 0.003 | 0.002 | 0.002 |
| 6 | Mean | 0.136 | 0.136 | 0.101 | 0.136 | 0.136 | 0.094 | 0.094 |
| | Median | 0.125 | 0.125 | 0.100 | 0.125 | 0.125 | 0.100 | 0.100 |
| | Var | 0.003 | 0.003 | 0.002 | 0.003 | 0.003 | 0.002 | 0.002 |
| 7 | Mean | 0.129 | 0.129 | 0.099 | 0.129 | 0.129 | 0.086 | 0.086 |
| | Median | 0.125 | 0.125 | 0.100 | 0.125 | 0.125 | 0.075 | 0.075 |
| | Var | 0.003 | 0.003 | 0.003 | 0.003 | 0.003 | 0.002 | 0.002 |
| 8 | Mean | 0.121 | 0.121 | 0.091 | 0.121 | 0.121 | 0.083 | 0.082 |
| | Median | 0.113 | 0.113 | 0.075 | 0.113 | 0.113 | 0.075 | 0.075 |
| | Var | 0.003 | 0.003 | 0.002 | 0.003 | 0.003 | 0.002 | 0.002 |
| 9 | Mean | 0.127 | 0.127 | 0.090 | 0.127 | 0.127 | 0.084 | 0.083 |
| | Median | 0.125 | 0.125 | 0.100 | 0.125 | 0.125 | 0.075 | 0.075 |
| | Var | 0.003 | 0.003 | 0.002 | 0.003 | 0.003 | 0.002 | 0.002 |
| 10 | Mean | 0.121 | 0.121 | 0.088 | 0.121 | 0.121 | 0.069 | 0.069 |
| | Median | 0.125 | 0.125 | 0.075 | 0.125 | 0.125 | 0.075 | 0.075 |
| | Var | 0.003 | 0.003 | 0.002 | 0.003 | 0.003 | 0.002 | 0.002 |
Table A3. Prediction errors with n = 500 in Case 1.

| R | | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|--------|-------|-------|-------|-------|-------|-------|-------|
| 1 | Mean | 0.349 | 0.349 | 0.332 | 0.349 | 0.349 | 0.330 | 0.330 |
| | Median | 0.345 | 0.345 | 0.330 | 0.345 | 0.345 | 0.330 | 0.330 |
| | Var | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 |
| 2 | Mean | 0.240 | 0.240 | 0.232 | 0.240 | 0.240 | 0.228 | 0.228 |
| | Median | 0.240 | 0.240 | 0.230 | 0.240 | 0.240 | 0.230 | 0.230 |
| | Var | 0.001 | 0.001 | 0.002 | 0.001 | 0.001 | 0.002 | 0.002 |
| 3 | Mean | 0.176 | 0.176 | 0.174 | 0.176 | 0.176 | 0.168 | 0.168 |
| | Median | 0.170 | 0.170 | 0.170 | 0.170 | 0.170 | 0.160 | 0.160 |
| | Var | 0.002 | 0.002 | 0.001 | 0.002 | 0.002 | 0.001 | 0.001 |
| 4 | Mean | 0.143 | 0.143 | 0.133 | 0.143 | 0.143 | 0.135 | 0.134 |
| | Median | 0.140 | 0.140 | 0.130 | 0.140 | 0.140 | 0.130 | 0.130 |
| | Var | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| 5 | Mean | 0.126 | 0.126 | 0.114 | 0.126 | 0.126 | 0.115 | 0.115 |
| | Median | 0.120 | 0.120 | 0.110 | 0.120 | 0.120 | 0.110 | 0.110 |
| | Var | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| 6 | Mean | 0.109 | 0.109 | 0.097 | 0.109 | 0.109 | 0.095 | 0.096 |
| | Median | 0.110 | 0.110 | 0.090 | 0.110 | 0.110 | 0.090 | 0.090 |
| | Var | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| 7 | Mean | 0.106 | 0.106 | 0.090 | 0.106 | 0.106 | 0.089 | 0.089 |
| | Median | 0.110 | 0.110 | 0.090 | 0.110 | 0.110 | 0.090 | 0.090 |
| | Var | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| 8 | Mean | 0.096 | 0.096 | 0.081 | 0.096 | 0.096 | 0.084 | 0.084 |
| | Median | 0.090 | 0.090 | 0.080 | 0.090 | 0.090 | 0.080 | 0.080 |
| | Var | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| 9 | Mean | 0.090 | 0.090 | 0.075 | 0.090 | 0.090 | 0.070 | 0.070 |
| | Median | 0.085 | 0.085 | 0.070 | 0.085 | 0.085 | 0.065 | 0.065 |
| | Var | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| 10 | Mean | 0.091 | 0.091 | 0.075 | 0.091 | 0.091 | 0.069 | 0.068 |
| | Median | 0.090 | 0.090 | 0.070 | 0.090 | 0.090 | 0.065 | 0.065 |
| | Var | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
Table A4. Prediction errors with n = 60 in Case 2.

| R | | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|--------|-------|-------|-------|-------|-------|-------|-------|
| 1 | Mean | 0.362 | 0.346 | 0.359 | 0.359 | 0.342 | 0.351 | 0.354 |
| | Median | 0.333 | 0.333 | 0.333 | 0.333 | 0.333 | 0.333 | 0.333 |
| | Var | 0.021 | 0.021 | 0.021 | 0.021 | 0.021 | 0.021 | 0.022 |
| 2 | Mean | 0.315 | 0.251 | 0.262 | 0.300 | 0.245 | 0.245 | 0.248 |
| | Median | 0.333 | 0.250 | 0.250 | 0.250 | 0.250 | 0.250 | 0.250 |
| | Var | 0.020 | 0.016 | 0.016 | 0.019 | 0.015 | 0.015 | 0.016 |
| 3 | Mean | 0.269 | 0.193 | 0.208 | 0.257 | 0.188 | 0.185 | 0.184 |
| | Median | 0.250 | 0.167 | 0.167 | 0.250 | 0.167 | 0.167 | 0.167 |
| | Var | 0.016 | 0.014 | 0.014 | 0.015 | 0.013 | 0.012 | 0.013 |
| 4 | Mean | 0.258 | 0.174 | 0.176 | 0.252 | 0.167 | 0.163 | 0.164 |
| | Median | 0.250 | 0.167 | 0.167 | 0.250 | 0.167 | 0.167 | 0.167 |
| | Var | 0.018 | 0.013 | 0.012 | 0.017 | 0.013 | 0.012 | 0.012 |
| 5 | Mean | 0.244 | 0.145 | 0.169 | 0.239 | 0.137 | 0.138 | 0.135 |
| | Median | 0.250 | 0.167 | 0.167 | 0.250 | 0.167 | 0.083 | 0.083 |
| | Var | 0.017 | 0.010 | 0.013 | 0.017 | 0.010 | 0.011 | 0.011 |
| 6 | Mean | 0.234 | 0.142 | 0.150 | 0.227 | 0.131 | 0.122 | 0.119 |
| | Median | 0.250 | 0.167 | 0.167 | 0.250 | 0.083 | 0.083 | 0.083 |
| | Var | 0.018 | 0.010 | 0.012 | 0.017 | 0.010 | 0.009 | 0.009 |
| 7 | Mean | 0.214 | 0.127 | 0.142 | 0.205 | 0.118 | 0.113 | 0.110 |
| | Median | 0.167 | 0.083 | 0.167 | 0.167 | 0.083 | 0.083 | 0.083 |
| | Var | 0.016 | 0.011 | 0.012 | 0.016 | 0.010 | 0.009 | 0.009 |
| 8 | Mean | 0.230 | 0.120 | 0.156 | 0.223 | 0.110 | 0.105 | 0.107 |
| | Median | 0.250 | 0.083 | 0.167 | 0.167 | 0.083 | 0.083 | 0.083 |
| | Var | 0.018 | 0.010 | 0.014 | 0.017 | 0.009 | 0.009 | 0.010 |
| 9 | Mean | 0.204 | 0.121 | 0.160 | 0.192 | 0.108 | 0.100 | 0.099 |
| | Median | 0.167 | 0.083 | 0.167 | 0.167 | 0.083 | 0.083 | 0.083 |
| | Var | 0.017 | 0.009 | 0.016 | 0.016 | 0.009 | 0.008 | 0.008 |
| 10 | Mean | 0.201 | 0.114 | 0.178 | 0.182 | 0.101 | 0.096 | 0.096 |
| | Median | 0.167 | 0.083 | 0.167 | 0.167 | 0.083 | 0.083 | 0.083 |
| | Var | 0.019 | 0.010 | 0.017 | 0.019 | 0.009 | 0.008 | 0.008 |
Table A5. Prediction errors with n = 200 in Case 2.

| R | | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|--------|-------|-------|-------|-------|-------|-------|-------|
| 1 | Mean | 0.369 | 0.336 | 0.349 | 0.369 | 0.336 | 0.342 | 0.341 |
| | Median | 0.375 | 0.325 | 0.350 | 0.375 | 0.325 | 0.350 | 0.338 |
| | Var | 0.007 | 0.007 | 0.006 | 0.006 | 0.007 | 0.006 | 0.006 |
| 2 | Mean | 0.265 | 0.253 | 0.239 | 0.265 | 0.248 | 0.233 | 0.233 |
| | Median | 0.275 | 0.250 | 0.250 | 0.275 | 0.250 | 0.225 | 0.225 |
| | Var | 0.006 | 0.005 | 0.005 | 0.006 | 0.005 | 0.005 | 0.005 |
| 3 | Mean | 0.204 | 0.204 | 0.184 | 0.204 | 0.203 | 0.175 | 0.175 |
| | Median | 0.200 | 0.200 | 0.175 | 0.200 | 0.200 | 0.175 | 0.175 |
| | Var | 0.004 | 0.004 | 0.003 | 0.004 | 0.004 | 0.003 | 0.003 |
| 4 | Mean | 0.175 | 0.175 | 0.147 | 0.175 | 0.175 | 0.143 | 0.142 |
| | Median | 0.175 | 0.175 | 0.150 | 0.175 | 0.175 | 0.150 | 0.125 |
| | Var | 0.004 | 0.004 | 0.004 | 0.004 | 0.004 | 0.004 | 0.003 |
| 5 | Mean | 0.157 | 0.157 | 0.130 | 0.157 | 0.157 | 0.118 | 0.118 |
| | Median | 0.150 | 0.150 | 0.125 | 0.150 | 0.150 | 0.125 | 0.125 |
| | Var | 0.004 | 0.004 | 0.003 | 0.004 | 0.004 | 0.003 | 0.003 |
| 6 | Mean | 0.148 | 0.148 | 0.120 | 0.148 | 0.148 | 0.108 | 0.107 |
| | Median | 0.150 | 0.150 | 0.125 | 0.150 | 0.150 | 0.100 | 0.100 |
| | Var | 0.004 | 0.004 | 0.003 | 0.004 | 0.004 | 0.002 | 0.002 |
| 7 | Mean | 0.150 | 0.150 | 0.116 | 0.150 | 0.150 | 0.092 | 0.091 |
| | Median | 0.150 | 0.150 | 0.113 | 0.150 | 0.150 | 0.100 | 0.100 |
| | Var | 0.003 | 0.003 | 0.003 | 0.003 | 0.004 | 0.002 | 0.002 |
| 8 | Mean | 0.162 | 0.161 | 0.125 | 0.162 | 0.161 | 0.091 | 0.092 |
| | Median | 0.150 | 0.150 | 0.125 | 0.150 | 0.150 | 0.088 | 0.100 |
| | Var | 0.005 | 0.005 | 0.004 | 0.005 | 0.005 | 0.002 | 0.002 |
| 9 | Mean | 0.173 | 0.167 | 0.130 | 0.173 | 0.165 | 0.086 | 0.087 |
| | Median | 0.175 | 0.175 | 0.125 | 0.175 | 0.150 | 0.075 | 0.075 |
| | Var | 0.004 | 0.004 | 0.004 | 0.004 | 0.004 | 0.002 | 0.002 |
| 10 | Mean | 0.192 | 0.172 | 0.147 | 0.192 | 0.167 | 0.088 | 0.090 |
| | Median | 0.200 | 0.175 | 0.150 | 0.200 | 0.150 | 0.075 | 0.075 |
| | Var | 0.006 | 0.005 | 0.005 | 0.006 | 0.005 | 0.002 | 0.002 |
Table A6. Prediction errors with n = 500 in Case 2.

| R | | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|--------|-------|-------|-------|-------|-------|-------|-------|
| 1 | Mean | 0.345 | 0.338 | 0.332 | 0.345 | 0.336 | 0.330 | 0.330 |
| | Median | 0.350 | 0.340 | 0.330 | 0.350 | 0.340 | 0.330 | 0.330 |
| | Var | 0.003 | 0.002 | 0.002 | 0.003 | 0.002 | 0.002 | 0.002 |
| 2 | Mean | 0.239 | 0.239 | 0.227 | 0.239 | 0.239 | 0.225 | 0.225 |
| | Median | 0.240 | 0.240 | 0.230 | 0.240 | 0.240 | 0.225 | 0.220 |
| | Var | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 |
| 3 | Mean | 0.182 | 0.182 | 0.170 | 0.182 | 0.182 | 0.168 | 0.168 |
| | Median | 0.180 | 0.180 | 0.170 | 0.180 | 0.180 | 0.170 | 0.170 |
| | Var | 0.002 | 0.002 | 0.001 | 0.002 | 0.002 | 0.001 | 0.001 |
| 4 | Mean | 0.152 | 0.152 | 0.141 | 0.152 | 0.152 | 0.136 | 0.136 |
| | Median | 0.150 | 0.150 | 0.140 | 0.150 | 0.150 | 0.140 | 0.140 |
| | Var | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| 5 | Mean | 0.135 | 0.135 | 0.120 | 0.135 | 0.135 | 0.114 | 0.114 |
| | Median | 0.130 | 0.130 | 0.120 | 0.130 | 0.130 | 0.110 | 0.110 |
| | Var | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| 6 | Mean | 0.129 | 0.129 | 0.110 | 0.129 | 0.129 | 0.100 | 0.101 |
| | Median | 0.130 | 0.130 | 0.110 | 0.130 | 0.130 | 0.100 | 0.100 |
| | Var | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| 7 | Mean | 0.128 | 0.128 | 0.107 | 0.128 | 0.128 | 0.092 | 0.093 |
| | Median | 0.130 | 0.130 | 0.100 | 0.130 | 0.130 | 0.090 | 0.090 |
| | Var | 0.002 | 0.002 | 0.001 | 0.002 | 0.002 | 0.001 | 0.001 |
| 8 | Mean | 0.134 | 0.134 | 0.109 | 0.134 | 0.134 | 0.086 | 0.087 |
| | Median | 0.130 | 0.130 | 0.110 | 0.130 | 0.130 | 0.080 | 0.080 |
| | Var | 0.002 | 0.002 | 0.001 | 0.002 | 0.002 | 0.001 | 0.001 |
| 9 | Mean | 0.147 | 0.147 | 0.117 | 0.147 | 0.147 | 0.086 | 0.088 |
| | Median | 0.140 | 0.140 | 0.110 | 0.140 | 0.140 | 0.090 | 0.090 |
| | Var | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.001 | 0.001 |
| 10 | Mean | 0.171 | 0.171 | 0.135 | 0.171 | 0.171 | 0.093 | 0.096 |
| | Median | 0.170 | 0.170 | 0.135 | 0.170 | 0.170 | 0.090 | 0.090 |
| | Var | 0.003 | 0.003 | 0.002 | 0.003 | 0.003 | 0.001 | 0.001 |
Table A7. Prediction errors with n = 60 in Case 3.

| R | | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|--------|-------|-------|-------|-------|-------|-------|-------|
| 1 | Mean | 0.494 | 0.482 | 0.430 | 0.490 | 0.483 | 0.405 | 0.413 |
| | Median | 0.500 | 0.500 | 0.417 | 0.500 | 0.500 | 0.417 | 0.417 |
| | Var | 0.026 | 0.029 | 0.026 | 0.027 | 0.029 | 0.022 | 0.022 |
| 2 | Mean | 0.428 | 0.412 | 0.317 | 0.427 | 0.412 | 0.318 | 0.303 |
| | Median | 0.417 | 0.417 | 0.333 | 0.417 | 0.417 | 0.333 | 0.333 |
| | Var | 0.021 | 0.023 | 0.028 | 0.021 | 0.023 | 0.018 | 0.018 |
| 3 | Mean | 0.416 | 0.401 | 0.317 | 0.419 | 0.403 | 0.313 | 0.302 |
| | Median | 0.417 | 0.417 | 0.292 | 0.417 | 0.417 | 0.292 | 0.250 |
| | Var | 0.028 | 0.031 | 0.037 | 0.027 | 0.030 | 0.032 | 0.031 |
| 4 | Mean | 0.424 | 0.387 | 0.393 | 0.420 | 0.382 | 0.357 | 0.344 |
| | Median | 0.500 | 0.417 | 0.417 | 0.458 | 0.417 | 0.333 | 0.333 |
| | Var | 0.047 | 0.048 | 0.056 | 0.044 | 0.046 | 0.046 | 0.043 |
| 5 | Mean | 0.372 | 0.362 | 0.493 | 0.398 | 0.365 | 0.380 | 0.355 |
| | Median | 0.333 | 0.333 | 0.583 | 0.417 | 0.333 | 0.417 | 0.333 |
| | Var | 0.052 | 0.054 | 0.067 | 0.049 | 0.052 | 0.053 | 0.048 |
| 6 | Mean | 0.400 | 0.383 | 0.608 | 0.427 | 0.390 | 0.446 | 0.430 |
| | Median | 0.417 | 0.375 | 0.667 | 0.417 | 0.375 | 0.500 | 0.417 |
| | Var | 0.072 | 0.075 | 0.060 | 0.066 | 0.075 | 0.067 | 0.066 |
| 7 | Mean | 0.374 | 0.378 | 0.628 | 0.428 | 0.388 | 0.481 | 0.468 |
| | Median | 0.333 | 0.333 | 0.667 | 0.417 | 0.417 | 0.500 | 0.500 |
| | Var | 0.072 | 0.075 | 0.052 | 0.063 | 0.072 | 0.067 | 0.070 |
| 8 | Mean | 0.457 | 0.457 | 0.673 | 0.527 | 0.474 | 0.615 | 0.593 |
| | Median | 0.417 | 0.417 | 0.750 | 0.583 | 0.500 | 0.667 | 0.667 |
| | Var | 0.098 | 0.098 | 0.053 | 0.071 | 0.091 | 0.073 | 0.075 |
| 9 | Mean | 0.565 | 0.565 | 0.738 | 0.642 | 0.583 | 0.652 | 0.659 |
| | Median | 0.583 | 0.583 | 0.750 | 0.750 | 0.667 | 0.750 | 0.750 |
| | Var | 0.099 | 0.099 | 0.040 | 0.079 | 0.087 | 0.072 | 0.074 |
| 10 | Mean | 0.565 | 0.565 | 0.744 | 0.662 | 0.613 | 0.698 | 0.694 |
| | Median | 0.583 | 0.583 | 0.750 | 0.667 | 0.667 | 0.750 | 0.750 |
| | Var | 0.096 | 0.096 | 0.037 | 0.063 | 0.080 | 0.057 | 0.065 |
Table A8. Prediction errors with n = 200 in Case 3.

| R | | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|--------|-------|-------|-------|-------|-------|-------|-------|
| 1 | Mean | 0.406 | 0.403 | 0.366 | 0.406 | 0.401 | 0.341 | 0.342 |
| | Median | 0.400 | 0.400 | 0.350 | 0.400 | 0.400 | 0.325 | 0.325 |
| | Var | 0.006 | 0.007 | 0.007 | 0.006 | 0.007 | 0.007 | 0.007 |
| 2 | Mean | 0.378 | 0.377 | 0.310 | 0.378 | 0.378 | 0.272 | 0.271 |
| | Median | 0.375 | 0.375 | 0.300 | 0.375 | 0.375 | 0.250 | 0.250 |
| | Var | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.008 | 0.007 |
| 3 | Mean | 0.428 | 0.428 | 0.324 | 0.428 | 0.427 | 0.253 | 0.251 |
| | Median | 0.463 | 0.463 | 0.300 | 0.463 | 0.450 | 0.225 | 0.225 |
| | Var | 0.018 | 0.018 | 0.016 | 0.018 | 0.018 | 0.010 | 0.009 |
| 4 | Mean | 0.465 | 0.427 | 0.370 | 0.470 | 0.430 | 0.259 | 0.254 |
| | Median | 0.500 | 0.475 | 0.350 | 0.500 | 0.475 | 0.225 | 0.225 |
| | Var | 0.031 | 0.035 | 0.029 | 0.030 | 0.034 | 0.021 | 0.020 |
| 5 | Mean | 0.281 | 0.231 | 0.507 | 0.310 | 0.228 | 0.282 | 0.276 |
| | Median | 0.200 | 0.175 | 0.500 | 0.225 | 0.175 | 0.238 | 0.225 |
| | Var | 0.035 | 0.021 | 0.041 | 0.034 | 0.020 | 0.030 | 0.029 |
| 6 | Mean | 0.242 | 0.242 | 0.612 | 0.289 | 0.242 | 0.325 | 0.321 |
| | Median | 0.175 | 0.175 | 0.675 | 0.225 | 0.175 | 0.238 | 0.238 |
| | Var | 0.040 | 0.040 | 0.036 | 0.039 | 0.037 | 0.050 | 0.049 |
| 7 | Mean | 0.298 | 0.298 | 0.712 | 0.363 | 0.294 | 0.368 | 0.362 |
| | Median | 0.200 | 0.200 | 0.725 | 0.313 | 0.200 | 0.300 | 0.288 |
| | Var | 0.059 | 0.059 | 0.014 | 0.056 | 0.056 | 0.064 | 0.063 |
| 8 | Mean | 0.476 | 0.476 | 0.749 | 0.553 | 0.473 | 0.498 | 0.495 |
| | Median | 0.513 | 0.513 | 0.763 | 0.588 | 0.500 | 0.588 | 0.575 |
| | Var | 0.086 | 0.086 | 0.009 | 0.068 | 0.084 | 0.076 | 0.076 |
| 9 | Mean | 0.497 | 0.497 | 0.785 | 0.625 | 0.500 | 0.592 | 0.586 |
| | Median | 0.525 | 0.525 | 0.800 | 0.700 | 0.538 | 0.663 | 0.650 |
| | Var | 0.104 | 0.104 | 0.005 | 0.057 | 0.099 | 0.062 | 0.064 |
| 10 | Mean | 0.606 | 0.606 | 0.807 | 0.746 | 0.627 | 0.662 | 0.661 |
| | Median | 0.750 | 0.750 | 0.825 | 0.825 | 0.800 | 0.763 | 0.750 |
| | Var | 0.105 | 0.105 | 0.004 | 0.042 | 0.101 | 0.053 | 0.054 |
Table A9. Prediction errors with n = 500 in Case 3.

| R | | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|--------|-------|-------|-------|-------|-------|-------|-------|
| 1 | Mean | 0.394 | 0.394 | 0.360 | 0.394 | 0.394 | 0.338 | 0.338 |
| | Median | 0.390 | 0.390 | 0.355 | 0.390 | 0.390 | 0.340 | 0.340 |
| | Var | 0.004 | 0.004 | 0.003 | 0.004 | 0.004 | 0.003 | 0.003 |
| 2 | Mean | 0.345 | 0.345 | 0.280 | 0.345 | 0.345 | 0.241 | 0.244 |
| | Median | 0.340 | 0.340 | 0.275 | 0.340 | 0.340 | 0.240 | 0.240 |
| | Var | 0.005 | 0.005 | 0.003 | 0.005 | 0.005 | 0.002 | 0.002 |
| 3 | Mean | 0.426 | 0.426 | 0.286 | 0.426 | 0.426 | 0.190 | 0.200 |
| | Median | 0.430 | 0.430 | 0.270 | 0.430 | 0.430 | 0.190 | 0.200 |
| | Var | 0.008 | 0.008 | 0.008 | 0.008 | 0.008 | 0.002 | 0.002 |
| 4 | Mean | 0.524 | 0.490 | 0.390 | 0.526 | 0.490 | 0.170 | 0.190 |
| | Median | 0.550 | 0.540 | 0.400 | 0.550 | 0.540 | 0.160 | 0.180 |
| | Var | 0.017 | 0.025 | 0.018 | 0.015 | 0.025 | 0.002 | 0.003 |
| 5 | Mean | 0.225 | 0.199 | 0.535 | 0.241 | 0.198 | 0.168 | 0.170 |
| | Median | 0.160 | 0.160 | 0.560 | 0.180 | 0.160 | 0.150 | 0.160 |
| | Var | 0.028 | 0.018 | 0.017 | 0.027 | 0.018 | 0.006 | 0.006 |
| 6 | Mean | 0.186 | 0.183 | 0.665 | 0.225 | 0.184 | 0.183 | 0.183 |
| | Median | 0.140 | 0.140 | 0.680 | 0.180 | 0.140 | 0.140 | 0.150 |
| | Var | 0.014 | 0.013 | 0.009 | 0.014 | 0.012 | 0.013 | 0.011 |
| 7 | Mean | 0.251 | 0.251 | 0.735 | 0.322 | 0.252 | 0.251 | 0.253 |
| | Median | 0.170 | 0.170 | 0.740 | 0.260 | 0.170 | 0.170 | 0.190 |
| | Var | 0.033 | 0.033 | 0.004 | 0.028 | 0.031 | 0.033 | 0.028 |
| 8 | Mean | 0.376 | 0.376 | 0.776 | 0.511 | 0.379 | 0.376 | 0.383 |
| | Median | 0.335 | 0.335 | 0.780 | 0.520 | 0.335 | 0.335 | 0.385 |
| | Var | 0.065 | 0.065 | 0.002 | 0.048 | 0.062 | 0.065 | 0.057 |
| 9 | Mean | 0.467 | 0.467 | 0.797 | 0.650 | 0.476 | 0.467 | 0.491 |
| | Median | 0.475 | 0.475 | 0.800 | 0.700 | 0.480 | 0.475 | 0.510 |
| | Var | 0.087 | 0.087 | 0.002 | 0.039 | 0.082 | 0.087 | 0.076 |
| 10 | Mean | 0.652 | 0.652 | 0.822 | 0.820 | 0.675 | 0.652 | 0.713 |
| | Median | 0.780 | 0.780 | 0.820 | 0.840 | 0.790 | 0.780 | 0.800 |
| | Var | 0.071 | 0.071 | 0.002 | 0.012 | 0.062 | 0.071 | 0.048 |
Table A10. Prediction errors with n = 60 in Case 4.

| R | Statistic | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|-----------|-----|-----|------|-------|-------|-----|-----|
| 1 | Mean | 0.389 | 0.378 | 0.417 | 0.396 | 0.381 | 0.381 | 0.387 |
|   | Median | 0.417 | 0.333 | 0.417 | 0.417 | 0.417 | 0.417 | 0.417 |
|   | Var | 0.024 | 0.024 | 0.023 | 0.023 | 0.023 | 0.022 | 0.024 |
| 2 | Mean | 0.286 | 0.268 | 0.363 | 0.299 | 0.269 | 0.268 | 0.268 |
|   | Median | 0.250 | 0.250 | 0.333 | 0.250 | 0.250 | 0.250 | 0.250 |
|   | Var | 0.022 | 0.021 | 0.029 | 0.022 | 0.022 | 0.021 | 0.021 |
| 3 | Mean | 0.230 | 0.219 | 0.382 | 0.259 | 0.228 | 0.219 | 0.219 |
|   | Median | 0.167 | 0.167 | 0.333 | 0.250 | 0.167 | 0.167 | 0.167 |
|   | Var | 0.024 | 0.023 | 0.040 | 0.024 | 0.023 | 0.023 | 0.023 |
| 4 | Mean | 0.186 | 0.181 | 0.460 | 0.242 | 0.199 | 0.181 | 0.181 |
|   | Median | 0.167 | 0.167 | 0.417 | 0.167 | 0.167 | 0.167 | 0.167 |
|   | Var | 0.022 | 0.022 | 0.048 | 0.031 | 0.024 | 0.022 | 0.022 |
| 5 | Mean | 0.195 | 0.194 | 0.545 | 0.284 | 0.216 | 0.194 | 0.194 |
|   | Median | 0.167 | 0.167 | 0.583 | 0.250 | 0.167 | 0.167 | 0.167 |
|   | Var | 0.029 | 0.030 | 0.054 | 0.046 | 0.034 | 0.030 | 0.030 |
| 6 | Mean | 0.213 | 0.211 | 0.642 | 0.374 | 0.256 | 0.211 | 0.211 |
|   | Median | 0.167 | 0.167 | 0.667 | 0.333 | 0.167 | 0.167 | 0.167 |
|   | Var | 0.042 | 0.042 | 0.045 | 0.062 | 0.049 | 0.042 | 0.042 |
| 7 | Mean | 0.208 | 0.210 | 0.680 | 0.424 | 0.268 | 0.210 | 0.210 |
|   | Median | 0.167 | 0.167 | 0.750 | 0.417 | 0.167 | 0.167 | 0.167 |
|   | Var | 0.052 | 0.053 | 0.037 | 0.068 | 0.060 | 0.053 | 0.053 |
| 8 | Mean | 0.228 | 0.228 | 0.727 | 0.513 | 0.310 | 0.228 | 0.228 |
|   | Median | 0.167 | 0.167 | 0.750 | 0.500 | 0.250 | 0.167 | 0.167 |
|   | Var | 0.059 | 0.059 | 0.025 | 0.067 | 0.071 | 0.059 | 0.059 |
| 9 | Mean | 0.259 | 0.258 | 0.730 | 0.572 | 0.366 | 0.258 | 0.258 |
|   | Median | 0.167 | 0.167 | 0.750 | 0.583 | 0.250 | 0.167 | 0.167 |
|   | Var | 0.084 | 0.084 | 0.030 | 0.069 | 0.091 | 0.084 | 0.084 |
| 10 | Mean | 0.303 | 0.303 | 0.761 | 0.665 | 0.455 | 0.303 | 0.303 |
|   | Median | 0.167 | 0.167 | 0.750 | 0.750 | 0.417 | 0.167 | 0.167 |
|   | Var | 0.099 | 0.099 | 0.020 | 0.047 | 0.096 | 0.099 | 0.099 |
Table A11. Prediction errors with n = 200 in Case 4.

| R | Statistic | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|-----------|-----|-----|------|-------|-------|-----|-----|
| 1 | Mean | 0.378 | 0.348 | 0.387 | 0.380 | 0.350 | 0.354 | 0.353 |
|   | Median | 0.375 | 0.350 | 0.375 | 0.375 | 0.350 | 0.350 | 0.350 |
|   | Var | 0.008 | 0.008 | 0.008 | 0.008 | 0.007 | 0.007 | 0.007 |
| 2 | Mean | 0.277 | 0.251 | 0.330 | 0.287 | 0.253 | 0.258 | 0.258 |
|   | Median | 0.275 | 0.250 | 0.325 | 0.275 | 0.250 | 0.250 | 0.250 |
|   | Var | 0.007 | 0.006 | 0.010 | 0.007 | 0.006 | 0.007 | 0.006 |
| 3 | Mean | 0.193 | 0.183 | 0.374 | 0.216 | 0.186 | 0.205 | 0.205 |
|   | Median | 0.175 | 0.175 | 0.375 | 0.200 | 0.175 | 0.200 | 0.200 |
|   | Var | 0.006 | 0.005 | 0.020 | 0.007 | 0.005 | 0.006 | 0.006 |
| 4 | Mean | 0.168 | 0.167 | 0.512 | 0.219 | 0.171 | 0.217 | 0.216 |
|   | Median | 0.150 | 0.150 | 0.550 | 0.200 | 0.150 | 0.200 | 0.200 |
|   | Var | 0.008 | 0.008 | 0.022 | 0.012 | 0.008 | 0.011 | 0.011 |
| 5 | Mean | 0.141 | 0.141 | 0.613 | 0.237 | 0.152 | 0.237 | 0.237 |
|   | Median | 0.125 | 0.125 | 0.650 | 0.200 | 0.125 | 0.200 | 0.200 |
|   | Var | 0.008 | 0.008 | 0.020 | 0.019 | 0.009 | 0.018 | 0.018 |
| 6 | Mean | 0.132 | 0.132 | 0.700 | 0.294 | 0.146 | 0.292 | 0.291 |
|   | Median | 0.100 | 0.100 | 0.700 | 0.250 | 0.125 | 0.250 | 0.250 |
|   | Var | 0.011 | 0.011 | 0.010 | 0.030 | 0.013 | 0.029 | 0.029 |
| 7 | Mean | 0.138 | 0.138 | 0.742 | 0.392 | 0.161 | 0.381 | 0.377 |
|   | Median | 0.100 | 0.100 | 0.750 | 0.375 | 0.125 | 0.375 | 0.375 |
|   | Var | 0.014 | 0.014 | 0.007 | 0.039 | 0.017 | 0.033 | 0.033 |
| 8 | Mean | 0.154 | 0.154 | 0.769 | 0.512 | 0.193 | 0.490 | 0.487 |
|   | Median | 0.100 | 0.100 | 0.775 | 0.550 | 0.125 | 0.500 | 0.500 |
|   | Var | 0.023 | 0.023 | 0.004 | 0.042 | 0.028 | 0.039 | 0.039 |
| 9 | Mean | 0.175 | 0.175 | 0.788 | 0.624 | 0.232 | 0.583 | 0.580 |
|   | Median | 0.100 | 0.100 | 0.800 | 0.675 | 0.125 | 0.625 | 0.625 |
|   | Var | 0.038 | 0.038 | 0.005 | 0.032 | 0.046 | 0.035 | 0.035 |
| 10 | Mean | 0.192 | 0.192 | 0.800 | 0.695 | 0.282 | 0.654 | 0.653 |
|   | Median | 0.100 | 0.100 | 0.800 | 0.725 | 0.175 | 0.688 | 0.675 |
|   | Var | 0.049 | 0.049 | 0.004 | 0.024 | 0.063 | 0.029 | 0.029 |
Table A12. Prediction errors with n = 500 in Case 4.

| R | Statistic | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|-----------|-----|-----|------|-------|-------|-----|-----|
| 1 | Mean | 0.380 | 0.339 | 0.367 | 0.380 | 0.340 | 0.338 | 0.339 |
|   | Median | 0.380 | 0.340 | 0.360 | 0.380 | 0.340 | 0.340 | 0.340 |
|   | Var | 0.003 | 0.003 | 0.004 | 0.003 | 0.003 | 0.003 | 0.003 |
| 2 | Mean | 0.278 | 0.242 | 0.310 | 0.284 | 0.242 | 0.228 | 0.229 |
|   | Median | 0.270 | 0.240 | 0.300 | 0.280 | 0.240 | 0.230 | 0.230 |
|   | Var | 0.005 | 0.003 | 0.005 | 0.004 | 0.003 | 0.002 | 0.002 |
| 3 | Mean | 0.180 | 0.177 | 0.385 | 0.198 | 0.179 | 0.176 | 0.179 |
|   | Median | 0.170 | 0.170 | 0.380 | 0.190 | 0.170 | 0.170 | 0.180 |
|   | Var | 0.002 | 0.002 | 0.009 | 0.003 | 0.002 | 0.002 | 0.002 |
| 4 | Mean | 0.141 | 0.141 | 0.527 | 0.184 | 0.143 | 0.141 | 0.146 |
|   | Median | 0.140 | 0.140 | 0.540 | 0.170 | 0.140 | 0.140 | 0.140 |
|   | Var | 0.001 | 0.001 | 0.010 | 0.003 | 0.001 | 0.001 | 0.001 |
| 5 | Mean | 0.122 | 0.122 | 0.649 | 0.203 | 0.126 | 0.122 | 0.130 |
|   | Median | 0.120 | 0.120 | 0.660 | 0.185 | 0.120 | 0.120 | 0.120 |
|   | Var | 0.002 | 0.002 | 0.004 | 0.007 | 0.002 | 0.002 | 0.002 |
| 6 | Mean | 0.109 | 0.109 | 0.716 | 0.266 | 0.116 | 0.109 | 0.125 |
|   | Median | 0.100 | 0.100 | 0.720 | 0.240 | 0.110 | 0.100 | 0.110 |
|   | Var | 0.002 | 0.002 | 0.003 | 0.013 | 0.002 | 0.002 | 0.003 |
| 7 | Mean | 0.103 | 0.103 | 0.754 | 0.371 | 0.115 | 0.103 | 0.129 |
|   | Median | 0.090 | 0.090 | 0.750 | 0.360 | 0.100 | 0.090 | 0.120 |
|   | Var | 0.003 | 0.003 | 0.002 | 0.020 | 0.004 | 0.003 | 0.005 |
| 8 | Mean | 0.102 | 0.102 | 0.775 | 0.490 | 0.119 | 0.102 | 0.141 |
|   | Median | 0.090 | 0.090 | 0.780 | 0.500 | 0.100 | 0.090 | 0.120 |
|   | Var | 0.005 | 0.005 | 0.002 | 0.023 | 0.006 | 0.005 | 0.007 |
| 9 | Mean | 0.112 | 0.112 | 0.791 | 0.629 | 0.143 | 0.112 | 0.184 |
|   | Median | 0.090 | 0.090 | 0.790 | 0.650 | 0.110 | 0.090 | 0.140 |
|   | Var | 0.009 | 0.009 | 0.002 | 0.015 | 0.012 | 0.009 | 0.017 |
| 10 | Mean | 0.114 | 0.114 | 0.802 | 0.707 | 0.155 | 0.114 | 0.211 |
|   | Median | 0.080 | 0.080 | 0.800 | 0.720 | 0.110 | 0.080 | 0.160 |
|   | Var | 0.014 | 0.014 | 0.002 | 0.007 | 0.019 | 0.014 | 0.025 |

Appendix D. Simulation Results in Section 4.2

Table A13. Prediction errors with R = 1.

| N | Statistic | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|-----------|-----|-----|------|-------|-------|-----|-----|
| 200 | Mean | 0.329 | 0.325 | 0.312 | 0.323 | 0.322 | 0.313 | 0.313 |
|   | Median | 0.325 | 0.325 | 0.300 | 0.325 | 0.325 | 0.325 | 0.325 |
|   | Var | 0.006 | 0.006 | 0.005 | 0.006 | 0.006 | 0.006 | 0.006 |
| 400 | Mean | 0.330 | 0.319 | 0.305 | 0.327 | 0.314 | 0.304 | 0.304 |
|   | Median | 0.325 | 0.313 | 0.300 | 0.325 | 0.313 | 0.300 | 0.300 |
|   | Var | 0.003 | 0.003 | 0.003 | 0.003 | 0.003 | 0.003 | 0.003 |
| 1000 | Mean | 0.332 | 0.326 | 0.304 | 0.330 | 0.326 | 0.305 | 0.304 |
|   | Median | 0.330 | 0.320 | 0.303 | 0.330 | 0.320 | 0.303 | 0.300 |
|   | Var | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
Table A14. Prediction errors with R = 3.

| N | Statistic | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|-----------|-----|-----|------|-------|-------|-----|-----|
| 200 | Mean | 0.173 | 0.173 | 0.168 | 0.173 | 0.173 | 0.162 | 0.162 |
|   | Median | 0.175 | 0.175 | 0.175 | 0.175 | 0.175 | 0.175 | 0.163 |
|   | Var | 0.003 | 0.003 | 0.002 | 0.003 | 0.003 | 0.002 | 0.002 |
| 400 | Mean | 0.172 | 0.172 | 0.163 | 0.172 | 0.171 | 0.167 | 0.169 |
|   | Median | 0.175 | 0.175 | 0.163 | 0.175 | 0.175 | 0.175 | 0.175 |
|   | Var | 0.001 | 0.001 | 0.002 | 0.001 | 0.001 | 0.001 | 0.002 |
| 1000 | Mean | 0.175 | 0.193 | 0.156 | 0.175 | 0.189 | 0.149 | 0.148 |
|   | Median | 0.180 | 0.198 | 0.160 | 0.180 | 0.190 | 0.145 | 0.145 |
|   | Var | 0.001 | 0.001 | 0.000 | 0.001 | 0.001 | 0.001 | 0.001 |
Table A15. Prediction errors with R = 7.

| N | Statistic | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|-----------|-----|-----|------|-------|-------|-----|-----|
| 200 | Mean | 0.104 | 0.104 | 0.109 | 0.104 | 0.104 | 0.095 | 0.097 |
|   | Median | 0.100 | 0.100 | 0.100 | 0.100 | 0.100 | 0.100 | 0.088 |
|   | Var | 0.003 | 0.003 | 0.003 | 0.003 | 0.003 | 0.003 | 0.002 |
| 400 | Mean | 0.106 | 0.106 | 0.101 | 0.106 | 0.106 | 0.087 | 0.087 |
|   | Median | 0.100 | 0.100 | 0.100 | 0.100 | 0.100 | 0.088 | 0.088 |
|   | Var | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| 1000 | Mean | 0.109 | 0.109 | 0.103 | 0.109 | 0.109 | 0.084 | 0.083 |
|   | Median | 0.110 | 0.110 | 0.105 | 0.110 | 0.110 | 0.085 | 0.080 |
|   | Var | 0.001 | 0.001 | 0.000 | 0.001 | 0.001 | 0.000 | 0.000 |
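The Mean, Median, and Var rows reported in the tables above summarize per-replication prediction errors across simulation runs. A minimal sketch of how such summaries can be computed is given below; the error values are hypothetical, and whether Var denotes the population or sample variance is an assumption here (population variance is used):

```python
from statistics import mean, median, pvariance

def summarize(errors):
    """Summarize per-replication prediction errors as (Mean, Median, Var),
    each rounded to three decimals, mirroring the table rows.
    Assumption: Var is taken as the population variance."""
    return (round(mean(errors), 3),
            round(median(errors), 3),
            round(pvariance(errors), 3))

# Hypothetical prediction errors for one method over five replications
errs = [0.301, 0.292, 0.290, 0.280, 0.276]
print(summarize(errs))  # prints (0.288, 0.29, 0.0)
```

In the actual simulations the averages are taken over many more replications, but the summary computation is the same.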

Figure 1. Predictor trajectories, corresponding to slightly smoothed monthly price curves. The low-rise residential areas are in the upper left (a). The high-rise residential areas are in the upper right (b). Randomly selected profiles from the panels above are shown in the lower panels (c,d) for 20 districts.
Table 1. Error of prediction.

| Rounds | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|--------|-----|-----|------|-------|-------|-----|-----|
| 1 | 0.301 | 0.301 | 0.275 | 0.301 | 0.301 | 0.221 | 0.221 |
| 2 | 0.292 | 0.292 | 0.247 | 0.292 | 0.292 | 0.178 | 0.176 |
| 3 | 0.290 | 0.290 | 0.242 | 0.290 | 0.290 | 0.187 | 0.187 |
| 4 | 0.280 | 0.280 | 0.233 | 0.280 | 0.280 | 0.176 | 0.174 |
| 5 | 0.276 | 0.276 | 0.233 | 0.276 | 0.276 | 0.147 | 0.149 |
| 6 | 0.316 | 0.316 | 0.233 | 0.316 | 0.316 | 0.188 | 0.188 |
| 7 | 0.269 | 0.269 | 0.244 | 0.269 | 0.269 | 0.164 | 0.164 |
| 8 | 0.294 | 0.294 | 0.225 | 0.294 | 0.294 | 0.174 | 0.174 |
| 9 | 0.316 | 0.316 | 0.235 | 0.316 | 0.316 | 0.187 | 0.187 |
| 10 | 0.282 | 0.282 | 0.242 | 0.282 | 0.282 | 0.174 | 0.173 |
| 11 | 0.292 | 0.292 | 0.240 | 0.292 | 0.292 | 0.162 | 0.162 |
| 12 | 0.285 | 0.285 | 0.261 | 0.285 | 0.285 | 0.188 | 0.188 |
| 13 | 0.282 | 0.282 | 0.219 | 0.282 | 0.282 | 0.150 | 0.149 |
| 14 | 0.264 | 0.264 | 0.280 | 0.264 | 0.264 | 0.188 | 0.188 |
| 15 | 0.282 | 0.282 | 0.247 | 0.282 | 0.282 | 0.187 | 0.187 |
| 16 | 0.295 | 0.295 | 0.269 | 0.295 | 0.295 | 0.185 | 0.185 |
| 17 | 0.328 | 0.328 | 0.252 | 0.328 | 0.328 | 0.204 | 0.202 |
| 18 | 0.301 | 0.301 | 0.245 | 0.301 | 0.301 | 0.187 | 0.187 |
| 19 | 0.278 | 0.278 | 0.209 | 0.278 | 0.278 | 0.150 | 0.150 |
| 20 | 0.311 | 0.311 | 0.249 | 0.311 | 0.311 | 0.183 | 0.183 |
Table 2. Error of fitting.

| Rounds | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|--------|-----|-----|------|-------|-------|-----|-----|
| 1 | 0.287 | 0.287 | 0.235 | 0.287 | 0.287 | 0.166 | 0.165 |
| 2 | 0.289 | 0.289 | 0.244 | 0.289 | 0.289 | 0.181 | 0.180 |
| 3 | 0.290 | 0.290 | 0.246 | 0.290 | 0.290 | 0.174 | 0.173 |
| 4 | 0.293 | 0.293 | 0.249 | 0.293 | 0.293 | 0.182 | 0.182 |
| 5 | 0.296 | 0.296 | 0.249 | 0.296 | 0.296 | 0.190 | 0.190 |
| 6 | 0.285 | 0.285 | 0.249 | 0.285 | 0.285 | 0.175 | 0.175 |
| 7 | 0.297 | 0.297 | 0.246 | 0.297 | 0.297 | 0.184 | 0.183 |
| 8 | 0.292 | 0.292 | 0.252 | 0.292 | 0.292 | 0.179 | 0.179 |
| 9 | 0.283 | 0.283 | 0.248 | 0.283 | 0.283 | 0.174 | 0.173 |
| 10 | 0.291 | 0.291 | 0.246 | 0.291 | 0.291 | 0.182 | 0.181 |
| 11 | 0.291 | 0.291 | 0.247 | 0.291 | 0.291 | 0.184 | 0.186 |
| 12 | 0.294 | 0.294 | 0.240 | 0.294 | 0.294 | 0.175 | 0.175 |
| 13 | 0.293 | 0.293 | 0.254 | 0.293 | 0.293 | 0.190 | 0.187 |
| 14 | 0.295 | 0.295 | 0.233 | 0.295 | 0.295 | 0.175 | 0.175 |
| 15 | 0.293 | 0.293 | 0.244 | 0.293 | 0.293 | 0.176 | 0.177 |
| 16 | 0.288 | 0.288 | 0.237 | 0.288 | 0.288 | 0.179 | 0.178 |
| 17 | 0.282 | 0.282 | 0.243 | 0.282 | 0.282 | 0.173 | 0.173 |
| 18 | 0.290 | 0.290 | 0.245 | 0.290 | 0.290 | 0.178 | 0.177 |
| 19 | 0.294 | 0.294 | 0.257 | 0.294 | 0.294 | 0.186 | 0.187 |
| 20 | 0.285 | 0.285 | 0.244 | 0.285 | 0.285 | 0.179 | 0.179 |