Cross-Validation Model Averaging for Generalized Functional Linear Model

Functional data are a common and important data type in econometrics and have become increasingly easy to collect in the big data era. To improve estimation accuracy and reduce forecast risks with functional data, in this paper we propose a novel cross-validation model averaging method for the generalized functional linear model, in which the scalar response variable is related to a random function predictor by a link function. We establish an asymptotic theoretical result on the optimality of the weights selected by our method when the true model is not in the candidate model set. Our simulations show that the proposed method often performs better than commonly used model selection and averaging methods. We also apply the proposed method to Beijing second-hand house price data.


Introduction
In recent years, functional data have become increasingly popular in many scientific areas. A common question for functional data is how to quantify the relationship between functional covariates and scalar responses. The functional linear model (FLM) and the generalized functional linear model (GFLM) can account for associations between the response and different points in the domain of the functional covariates, and are therefore two useful tools in many functional data studies. These two models have been widely used to solve practical problems, such as exploring the relationship between growth and age in the life sciences, analyzing weather data in different areas, recognizing handwriting data, and conducting diffusion tensor imaging studies. Functional data analysis usually represents functional covariates and coefficient functions by linear combinations of a set of basis functions, such as a prespecified basis system like B-splines, Fourier, and wavelet bases (James 2002), or data-adaptive basis functions from functional principal component analysis (FPCA) (Yao et al. 2005). We are concerned with the GFLM because it can estimate flexible and nonlinear relationships between the functional covariates and scalar responses for many types of data, such as binary responses, Poisson responses, and multivariate discrete responses. See, for example, James (2002), who extended generalized linear models to generalized functional linear models with the functional principal component methodology and demonstrated that this approach works for linear, logistic, and censored regressions in simulations and real data analysis.
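To make the FPCA-based representation above concrete, the following Python sketch extracts data-adaptive basis functions (eigenfunctions of the sample covariance) from discretized curves. The simulated curves, grid sizes, and two-component structure are our own illustrative choices, not the paper's data.

```python
import numpy as np

# Minimal FPCA sketch on discretized curves (illustrative, simulated data).
rng = np.random.default_rng(0)
n, T = 100, 50                      # n curves observed on a grid of T points
t = np.linspace(0.0, 1.0, T)

# Simulate curves X_i(t) lying in the span of two known functions.
phi1 = np.sqrt(2) * np.sin(np.pi * t)
phi2 = np.sqrt(2) * np.sin(2 * np.pi * t)
scores = rng.normal(size=(n, 2)) * np.array([2.0, 1.0])
X = scores[:, [0]] * phi1 + scores[:, [1]] * phi2

# FPCA: eigendecomposition of the (discretized) sample covariance operator.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / n                 # T x T covariance matrix on the grid
eigval, eigvec = np.linalg.eigh(cov)
order = np.argsort(eigval)[::-1]    # sort eigenvalues in decreasing order
eigval, eigvec = eigval[order], eigvec[:, order]

# Principal component scores: projections of curves on the eigenfunctions.
pc_scores = Xc @ eigvec[:, :2]
explained = eigval[:2].sum() / eigval.sum()
```

Since the simulated curves lie in a two-dimensional span, the first two eigenvalues capture essentially all of the variance; the `explained` ratio mirrors the cumulative-contribution rule used later for model selection by FPCA.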
In econometrics, the relationship between a time series and a scalar response is often of interest. We can use the GFLM instead of the generalized linear model to handle the case where a time series with dependence at different time points serves as the explanatory variable, with dimension tending to infinity. On the other hand, prediction is often the main goal in econometric data analysis. Several approaches have been proposed to select important principal components in FPCA, such as AIC, BIC, and leave-one-out cross-validation (Müller and Stadtmüller 2005). However, as we will demonstrate, model selection alone, such as by AIC, is not an optimal approach for the purpose of estimation and prediction. Relying on one model selected by AIC or BIC may lead to the loss of information from other models. Different models often capture different data characteristics, and therefore model averaging generally achieves higher estimation or prediction accuracy, which has received extensive attention in recent years.
Model averaging has two research directions: Bayesian Model Averaging (BMA) and Frequentist Model Averaging (FMA). We will focus on the latter in this paper. A key problem with FMA is the choice of weights assigned to different models. In this regard, various approaches have been developed. See, for example, smoothed AIC and smoothed BIC (Buckland et al. 1997), smoothed FIC (Hjort and Claeskens 2003; Claeskens and Carroll 2007; Zhang and Liang 2011; Zhang et al. 2012; Xu et al. 2014), the adaptive method (Yang 2001), the MMA method (Hansen 2007; Wan et al. 2010), the OPT method, the JMA method (Hansen and Racine 2012; Zhang et al. 2013), and the leave-subject-out cross-validation method (Gao et al. 2016), which apply to independent, time series, or longitudinal data.
For functional data, some model averaging methods have been studied. Zhu et al. (2018) proposed a model averaging estimator based on Mallows' criterion for partial functional linear models, whose response is a scalar and whose predictors are a random vector and some functional variables. Zhang et al. (2018) proposed a jackknife model averaging method for fully functional linear models, whose response and predictor are both functional processes. For the generalized functional linear model, designed for the case where the scalar response depends nonlinearly on functional explanatory variables, model averaging is a good alternative to model selection, which may suffer from instability in variable selection or coefficient estimation caused by the randomness of data collection.
In this article, we consider model averaging methods for the GFLM to capture the nonlinear characteristics hidden in the data and to reduce prediction errors and risks. The contributions of this article are threefold. First, we adopt FPCA to reduce the dimension, as it provides a parsimonious representation of functional data, and present a novel model averaging procedure based on a leave-one-out cross-validation (CV) criterion. Second, we prove the consistency of the parameter estimator under a misspecified model with some mild conditions; the dimension of the parameter can be divergent. Third, we establish the asymptotic optimality of our method in the squared loss sense for the generalized linear model with a diverging number of parameters. Our work relaxes the condition that the expectations of the estimators need to exist.
The rest of the article is organized as follows. In Section 2, we introduce our proposed model averaging method for GFLM. We then establish the asymptotic property of the proposed method in Section 3. Simulation studies and a real data example of second-hand house price in Beijing are presented in Section 4. Section 5 concludes. Proofs of theoretical results are provided in Appendix A and B.

The Generalized Functional Linear Model
The data collected for the ith subject or experimental unit are ({X_i(t), t ∈ T}, y_i), i = 1, . . . , n. We assume these data are generated independently. The predictor variable X(t) (t ∈ T) is a random curve corresponding to a square integrable stochastic process on a real interval T. The response variable y is a real-valued random variable that may be continuous or discrete. For example, in a binary regression, one would have y ∈ {0, 1}.
Suppose that the given link function g(·) is a strictly monotone and twice continuously differentiable function with bounded derivatives and is thus invertible. This assumption is common in generalized linear models. See, for example, Chen et al. (1999), Müller and Stadtmüller (2005), and Ando and Li (2017). Moreover, we assume a variance function σ²(·), which is strictly positive and bounded, defined on the range of the link function. The generalized functional linear model, or functional quasi-likelihood model, is determined by a parameter function β(·), which is square integrable on its domain T, in addition to the link function g(·) and the variance function σ²(·).
Given a real measure dω on T, we define the linear predictors and conditional means

η_i = α + ∫_T X_i(t)β(t) dω(t),  μ_i = E(y_i | X_i) = g(η_i).

In a generalized functional linear model, the distribution of y_i would be specified within the exponential family. Thus, we consider the functional quasi-likelihood model

y_i = g(α + ∫_T X_i(t)β(t) dω(t)) + e_i,  i = 1, . . . , n. (1)

Note that α is a constant, and the inclusion of an intercept allows us to require E(X_i(t)) = 0 for all t. We assume the errors e_i are independent with the same variance. It is easy to obtain E(e_i) = 0 and Var(e_i | X_i) = σ²(μ_i). Following Müller and Stadtmüller (2005), we choose an orthonormal basis ρ_j, j = 1, 2, . . . , of the function space L²(dω). Then, we can expand the predictor process X(t) and the parameter function β(t) as

X(t) = ∑_{j=1}^∞ ε_j ρ_j(t) and β(t) = ∑_{j=1}^∞ β_j ρ_j(t)

[in the L²(dω) sense] with random variables ε_j and coefficients β_j given by ε_j = ∫_T X(t)ρ_j(t) dω(t) and β_j = ∫_T β(t)ρ_j(t) dω(t), respectively. By the previous assumptions that X(t) and β(t) are square integrable, we get ∑_{j=1}^∞ β_j² < ∞ and ∑_{j=1}^∞ Eε_j² < ∞. From the orthonormality of the basis functions ρ_j, it follows immediately that ∫_T X(t)β(t) dω(t) = ∑_{j=1}^∞ β_j ε_j. It will be convenient to work with standardized errors e_i for which E(e_i | X_i(t)) = 0, E(e_i) = 0, and E(e_i²) = 1. Then, it is sufficient to consider the following model

y_i = g(α + ∑_{j=1}^∞ β_j ε_{i,j}) + σ(μ_i)e_i, (2)

where the function g(·) is known. The number of parameters in model (2) is infinite. We address the difficulty caused by the infinite dimensionality of the predictors by approximating model (2) with a series of models in which the number of predictors is truncated at p = p_n, where the dimension p_n can be a constant as large as possible with p_n < n. A heuristic truncation strategy is as follows. For the ith sample, define the p-truncated linear predictor η_{i,p} = α + ∑_{j=1}^p β_j ε_{i,j}. The approximating model we use is

y_i = g(η_{i,p}) + e_i′. (3)

Now, we consider the estimation for the generalized functional linear model. First, we use FPCA to obtain a set of orthogonal eigenfunctions as the basis functions in the space L²(dω). Then, we consider a series of candidate models. The number of candidate models is M.
For the mth candidate model, we adopt the first p_m functional principal components to build the approximating model

y_i = g(η_{i,p_m}) + e_i′, with η_{i,p_m} = ∑_{j=0}^{p_m} β_j ε_{i,j},

where ε_{i,0} = 1 accounts for the intercept, β^(m) = (β_0, β_1, . . . , β_{p_m})^T, and ε_{(i,p_m)} = (ε_{i,0}, . . . , ε_{i,p_m})^T. We assume that p_1 < p_2 < · · · < p_M, that is, the candidate models are nested. We estimate β^(m) by solving the following estimating or score equation

U_{n,m}(β^(m)) = ∑_{i=1}^n [y_i − g(ε_{(i,p_m)}^T β^(m))] g′(ε_{(i,p_m)}^T β^(m)) σ^{−2}(g(ε_{(i,p_m)}^T β^(m))) ε_{(i,p_m)} = 0. (4)

Let β̂^(m) be the solution of the score equation U_{n,m}(β^(m)) = 0.
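As a concrete special case, with the canonical logistic link g(η) = exp(η)/(1 + exp(η)) we have g′(η) = μ(1 − μ) = σ²(μ), so the quasi-score reduces to the ordinary logistic-regression score ∑_i (y_i − μ_i)ε_{(i,p_m)}. The following Python sketch (simulated scores and illustrative sizes, not the paper's code) solves it by Newton-Raphson, the idea behind the weighted iterated least squares algorithm mentioned later:

```python
import numpy as np

# Solve the score equation for one candidate model with the logistic link.
rng = np.random.default_rng(1)
n, p = 500, 3
Z = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + scores
beta_true = np.array([0.3, 1.0, -0.5, 0.25])                # illustrative values
prob = 1.0 / (1.0 + np.exp(-Z @ beta_true))
y = rng.binomial(1, prob)

beta = np.zeros(p + 1)
for _ in range(25):                         # Newton-Raphson iterations
    mu = 1.0 / (1.0 + np.exp(-np.clip(Z @ beta, -30, 30)))
    score = Z.T @ (y - mu)                  # U_{n,m}(beta) for the logistic link
    W = mu * (1.0 - mu)                     # variance function mu(1 - mu)
    hessian = Z.T @ (Z * W[:, None]) + 1e-8 * np.eye(p + 1)
    step = np.linalg.solve(hessian, score)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:        # stop when the update is negligible
        break
```

At convergence the score is numerically zero, and with n = 500 the estimate is close to the data-generating coefficients.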

Model Averaging Estimation
For each candidate model, we get the estimator β̂^(m) of the unknown parameter vector by (4). Let w = (w_1, . . . , w_M)^T be a weight vector in the set H_n = {w ∈ [0, 1]^M : ∑_{m=1}^M w_m = 1}. Then we obtain the model averaging estimator of η_i:

η̂_i(w) = ∑_{m=1}^M w_m η̂_{i,p_m}, where η̂_{i,p_m} = ∑_{j=0}^{p_m} β̂_j^(m) ε_{i,j}.

Thus, a model averaging estimator of the conditional mean μ_i is given by

μ̂_i(w) = g(η̂_i(w)). (7)

Let β̂^(m)_{−j} be the estimator of β^(m) from (4) without the jth observation, that is, the solution of U_{n,m}(β^(m)) = 0 with the jth term deleted from the sum. For the observation j, the leave-one-out truncated linear estimator of η_j under the mth model is η̃_{j,p_m} = ε_{(j,p_m)}^T β̂^(m)_{−j}, and the leave-one-out model averaging estimator of μ_j is μ̃_j(w) = g(∑_{m=1}^M w_m η̃_{j,p_m}). Thus, we propose the following leave-one-out criterion for choosing weights in the model averaging estimator given by (7):

CV(w) = (1/n) ∑_{j=1}^n (y_j − μ̃_j(w))².

Let ŵ = argmin_{w∈H_n} CV(w) be the weight vector from the CV(w) criterion. Then, plugging ŵ into (7), we obtain the final model averaging estimator μ̂_i(ŵ), i = 1, 2, . . . , n.
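The leave-one-out weight choice can be sketched for a toy case with M = 2 nested logistic candidate models. Everything below (data, model sizes, the grid search replacing a constrained optimizer) is an illustrative assumption; for M = 2 the weight simplex is a segment, so a one-dimensional grid suffices:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 80
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
prob = 1.0 / (1.0 + np.exp(-(0.5 * Z[:, 1] - 0.8 * Z[:, 2] + 0.3 * Z[:, 3])))
y = rng.binomial(1, prob)

def fit_logistic(Zs, ys, iters=30):
    """Newton-Raphson for the logistic score equation (small ridge for stability)."""
    b = np.zeros(Zs.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-np.clip(Zs @ b, -30, 30)))
        H = Zs.T @ (Zs * (mu * (1 - mu))[:, None]) + 1e-6 * np.eye(Zs.shape[1])
        b = b + np.linalg.solve(H, Zs.T @ (ys - mu))
    return b

sizes = [2, 4]                        # candidate models use the first columns of Z
eta_loo = np.zeros((n, len(sizes)))   # leave-one-out linear predictors per model
for m, p in enumerate(sizes):
    for j in range(n):
        keep = np.arange(n) != j
        b = fit_logistic(Z[keep][:, :p], y[keep])   # refit without observation j
        eta_loo[j, m] = Z[j, :p] @ b

# CV(w) = (1/n) sum_j (y_j - g(w * eta1 + (1 - w) * eta2))^2, logistic g.
grid = np.linspace(0.0, 1.0, 201)
cv = np.array([np.mean((y - 1.0 / (1.0 + np.exp(-(w * eta_loo[:, 0]
                        + (1 - w) * eta_loo[:, 1])))) ** 2) for w in grid])
w_hat = grid[int(np.argmin(cv))]
```

For larger M one would replace the grid by a constrained optimizer over the simplex, which is the role played by Matlab's `fmincon` in the paper's simulations.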

Asymptotic Property for Model Averaging Estimator
In this section, we establish the optimality property of cross-validation model averaging for the generalized functional linear model. We allow the dimension of each candidate model to diverge as n tends to ∞.

Notations and Conditions
We denote the first and second derivatives of the function g(·) by g′(·) and g″(·), respectively, the diagonal matrix A with diagonal elements a_1, a_2, . . . , a_q by A = diag(a_1, a_2, . . . , a_q), and the minimum singular value of a matrix A by λ_min{A}. We assume |g′(·)| ≤ c < ∞ and |g″(·)| ≤ c_1 < ∞, and that σ²(·) is strictly positive with bounds 0 < d_1 ≤ σ²(·) ≤ d_2 < ∞ and |σ²′(·)| ≤ d_3 < ∞. Consider the squared loss function

L_n(w) = ||μ̂(w) − μ||²,

where μ = (μ_1, μ_2, . . . , μ_n)^T and μ̂(w) = (μ̂_1(w), μ̂_2(w), . . . , μ̂_n(w))^T are two n × 1 vectors, and ||·|| is the Euclidean norm. Denote by β^(m)* the pseudo true parameter, which, like Flynn et al. (2013) and Lv and Liu (2014), is defined as the solution to the expected score equation and is a theoretical target under the mth candidate model with misspecification. We assume that such a solution exists and that ||β^(m)*||²/(p_m + 1) ≤ C_b < ∞. The quantity ξ_n represents the minimal bias between the true model and the final model generated by model averaging, and serves as an alternative to the risk based on L_n(w). In this work, we do not require the expectation of L_n(w) to exist, which is more relaxed than the common requirement in jackknife model averaging methods for the generalized linear model. See, for example, Zhang et al. (2016) and Ando and Li (2017). In the following conditions, we assume that X_i(t), i = 1, 2, . . . , n, are non-random with sup_i |η_i| ≤ C_η < ∞ and that Ee_i = 0.
Condition 1 is a requirement for the generalized model to guarantee the existence of solutions to (4). In general, the existence and consistency of the roots obtained by solving (4) have to be checked, so we list Condition 1. A similar condition can be found in Balan and Schiopu-Kratina (2005). In the special case where the link function is g(x) = x, the solution of (4) is a generalized least squares estimator of β^(m), and Condition 1 is easy to satisfy.
Condition 2 is common for generalized linear model. See, for instance, Chen et al. (1999) and Ando and Li (2017). The least squares estimator for linear regression models is strongly consistent under Condition 2. This condition is less restrictive than (A1) of Ando and Li (2017) for proving the optimality of the weight selection procedure.
Condition 3 is similar to (2.3) of Theorem 1 in Chen et al. (1999) and is due to the nonlinearity. A counterexample is given in Chen et al. (1999) to show that β̂^(m) may not be consistent when Condition 3(i) is dropped.
Condition 4 means that the speed at which ξ_n tends to ∞ should be faster than that of √n p². This condition also implies that the true model is not in the candidate model set, which is a condition commonly used for optimal model averaging. It is easy to satisfy when the true model is infinite dimensional. This condition is an alternative to Condition C.3 of Zhang et al. (2016) and (A3) of Ando and Li (2017).
By Lemma A3 in Appendix A and Condition 3, together with the following standard condition for the application of cross-validation—which says that, as n gets large, the difference between the ordinary and leave-one-out estimators of η_i under the mth candidate model becomes small—it can be seen that Condition 5 is reasonable. For the one-parameter natural exponential family models, Ando and Li (2017) showed under some regularity conditions that ∑_{i=1}^n (η̂_{i,p_m} − η̃_{i,p_m})² = O_p(p_m²/n), satisfying our Condition 5. For the linear models where g(x) = x and σ²(·) = 1, the condition reduces to the familiar leverage-type bound of order p_m/n, which is commonly used to ensure the asymptotic optimality of cross-validation. See, for example, Condition (5.2) of Li (1987), Condition (5.2) of Andrews (1991), Condition (A.9) of Hansen and Racine (2012), Condition (C.2) of Zhang (2015), and Condition (C.3) of Zhao et al. (2018). In general, our Condition 5 is more relaxed than those in the literature for complex candidate models.
Condition 6 is to ensure that the pseudo true parameter β^(m)* is unique. The consistency of the estimator of β^(m)* can also be derived from this condition; see Lemma A3 in Appendix A. In addition, the one-parameter natural exponential family considered in Theorem 1 of Ando and Li (2017) is an example: by the commonly used assumption that λ_{n,m} ≥ c_0 > 0 for some constant c_0 < ∞, together with assumption (4.3) in Ando and Li (2017), this example satisfies Condition 6.
Theorem 1. Assume that Conditions 1-6 hold. Then ŵ is asymptotically optimal in the sense that

L_n(ŵ) / inf_{w∈H_n} L_n(w) →_p 1,

where →_p means convergence in probability.

Proof. See Appendix B.
Remark 1. When the dimensions of the candidate models are fixed, Condition 4 can be relaxed to n/ξ_n² → 0.

Remark 2.
It is easy to see that if we do not require the weights to sum to one, then we can use M instead of 1 as the upper bound of ∑_{m=1}^M w_m² in our proof. Thus, all the proofs remain valid for fixed M. This implies that Theorem 1 remains true if we remove the constraint that the weights sum to one. In addition, as the candidate models are not required to be nested in the proof, the theorem still holds for non-nested candidate models.

Simulation I: Fixed Number of Candidate Models
In this section, we conduct simulation experiments to compare the finite sample performance of our model averaging methods and some commonly used model selection and model averaging methods. For model selection, we consider three methods: AIC, BIC, and FPCA. FPCA is an efficient and common method in functional data analysis, which determines the final model by the cumulative contributions of the functional principal components. For model averaging, we consider the following methods, S-AIC (smoothed AIC), S-BIC (smoothed BIC), and our cross-validation model averaging, which is denoted as CV1 if we restrict the sum of weights to be 1 as before, and CV2 if no constraint on the sum of weights is imposed.
The data generating process is as follows: the predictor variable is X_i(t) = ∑_{j=1}^J ε_{i,j} ρ_j(t) and the parameter function is β(t) = ∑_{j=1}^J β_j ρ_j(t), where ρ_j(t) is a basis function with t ∈ [0, 1] and j ≥ 1, and J is the number of basis functions. Here, we use B-spline and Fourier bases. For the B-spline basis, we choose the order of the basis functions to be 2 and the number of basis functions to be 20. For the Fourier basis, we choose the number of basis functions to be 21, with the first basis function a constant. In our simulation, the following four cases are considered.
We set the term ε_{i,j} to be independently generated from N(0, R²/j²), where R = 1, 2, . . . , 10. The response variable y_i is generated from the binomial distribution Binomial(1, p(X_i(t))) with the probability p(X_i(t)) = g(∫_0^1 X_i(t)β(t) dt). We consider three types of link function g(·): the logistic link function exp(·)/(1 + exp(·)), the probit link function, and the Poisson model. For the Poisson model, we only consider the simulations with R = 1 for Cases 1-4.
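The data generating process above can be sketched in Python. The Fourier basis and the score distribution follow the description in the text, while the coefficients β_j below are placeholder values for illustration only, since the specific settings of Cases 1-4 are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(3)
n, J, R = 200, 21, 2
t = np.linspace(0.0, 1.0, 401)

# Fourier basis on [0, 1]: constant first, then sine/cosine pairs.
basis = [np.ones_like(t)]
for k in range(1, (J - 1) // 2 + 1):
    basis.append(np.sqrt(2) * np.sin(2 * np.pi * k * t))
    basis.append(np.sqrt(2) * np.cos(2 * np.pi * k * t))
basis = np.array(basis[:J])                     # J basis functions on the grid

j = np.arange(1, J + 1)
scores = rng.normal(size=(n, J)) * (R / j)      # eps_ij ~ N(0, R^2 / j^2)
X = scores @ basis                              # curves X_i(t), one per row

beta_j = 1.0 / j ** 2                           # placeholder coefficients
# Orthonormality turns the integral into a sum: int X_i(t)beta(t)dt = sum_j eps_ij beta_j
eta = scores @ beta_j
p = 1.0 / (1.0 + np.exp(-eta))                  # logistic link
y = rng.binomial(1, p)
```

The probit and Poisson variants only change the last two lines (the link and the response distribution).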
In the simulation, we use FPCA to obtain the nested candidate models. Each candidate model contains the first p_m principal components. The number of candidate models is 18 for Cases 1-2 and 19 for Cases 3-4. Then we adopt the weighted iterated least squares algorithm, which is a common approach for generalized linear models, to obtain the estimates for each model. For the weights, we use the 'fmincon' function in Matlab to solve the CV criterion.
The sample size is set as n = 60, 200, 500. We use 80% of the data as the training data {Y_1, X_1} with size n_1, and the remaining data as the test data {Y_2, X_2} with size n_2. Then, we compare the prediction errors. We calculate the prediction accuracy (||Ŷ_2 − Y_2||²/n_2), fitting accuracy (||Ŷ_1 − Y_1||²/n_1), predictor coefficient prediction accuracy (||η̂^(2) − η^(2)||²/n_2), and predictor coefficient fitting accuracy (||η̂^(1) − η^(1)||²/n_1). We repeat this process 1000 times, and then obtain the mean, median, and variance of these prediction errors for each method. To save space, we present only the results on the prediction accuracy. The results on the other accuracy measures are available from the authors upon request. We only report the results for the logistic link function due to space limitations. Results for the other link functions are also available from the authors.
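The four accuracy measures can be computed as below. The predictions and linear predictors here are synthetic stand-ins (the real ones come from the fitted candidate models); only the computations themselves follow the text:

```python
import numpy as np

# Sketch of the four accuracy measures on an 80/20 split (synthetic values).
rng = np.random.default_rng(5)
n = 200
y = rng.binomial(1, 0.5, size=n)
mu_hat = np.clip(rng.normal(loc=y, scale=0.3), 0.0, 1.0)   # stand-in fitted means
eta_true = rng.normal(size=n)
eta_hat = eta_true + rng.normal(scale=0.1, size=n)          # stand-in predictors

n1 = int(0.8 * n)                                 # training size; rest is test
train, test = np.arange(n1), np.arange(n1, n)

pred_acc = np.sum((mu_hat[test] - y[test]) ** 2) / test.size     # ||Y2_hat - Y2||^2 / n2
fit_acc = np.sum((mu_hat[train] - y[train]) ** 2) / train.size   # ||Y1_hat - Y1||^2 / n1
eta_pred_acc = np.sum((eta_hat[test] - eta_true[test]) ** 2) / test.size
eta_fit_acc = np.sum((eta_hat[train] - eta_true[train]) ** 2) / train.size
```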
For Case 1, the prediction errors are summarized in Tables A1-A3. From Table A1, it is seen that as R varies from 1 to 10, the prediction errors decrease, because the difference in probability between the two groups (one group whose response is 1 and the other whose response is 0) becomes larger. Our methods (CV1 and CV2 in the tables) always obtain the minimum error means (Mean in the tables), medians (Median in the tables), and variances (Var in the tables). However, there is no clear tendency between CV1 and CV2, which perform similarly in most situations. When R is small, BIC is always better than AIC, and S-BIC is always better than S-AIC. This may be because fewer parameters are useful for smaller R values, in which case a bigger penalty on the number of parameters in the model is preferred. Moreover, when the candidate models differ significantly, AIC or BIC performs similarly to S-AIC or S-BIC, respectively. As R becomes larger, the difference between AIC and BIC or S-AIC and S-BIC becomes smaller. FPCA is always superior to AIC, BIC, S-AIC, and S-BIC, and their differences become larger as R increases. Now, we turn to Tables A2 and A3. With the sample size n increasing from 60 to 200 and 500, we can see that the prediction errors decrease for each fixed R. The median and variance of the prediction errors also become smaller. AIC and BIC behave increasingly similarly. CV1 and CV2 are still the best among all the methods, followed by FPCA.
For Case 2, the prediction errors are given in Tables A4-A6. As shown earlier, CV1 and CV2 perform the best, followed by FPCA. Likewise, S-AIC or S-BIC is better than AIC or BIC, respectively. In Table A4, as R varies from 1 to 10, the prediction errors decrease, except for the FPCA method, which attains its minimum at R = 7 with a small fluctuation. CV1 and CV2 perform equally well for different R values and sample sizes. The difference between AIC and BIC becomes small as the sample size increases. A similar phenomenon is observed for S-AIC and S-BIC.
For Case 3, the prediction errors are provided in Tables A7-A9. For n = 60 (Table A7), CV1 or CV2 is the best when R is between 1 and 5. However, when R is between 6 and 10, the two model selection methods, AIC and BIC, are the best. Similar conclusions can be found in Table A8 with n = 200 and Table A9 with n = 500, although in the latter case, CV1 actually performs the best for all R values. The error rates of all methods become smaller as R increases from 1 to 6, and then larger as R varies from 7 to 10.
For Case 4, the prediction errors are presented in Tables A10-A12. For n = 60 in Table A10, CV1, CV2, and BIC are the best, followed by AIC. In this design, S-AIC or S-BIC is not better than AIC or BIC. For n = 200 in Table A11, BIC is the best, followed by AIC. For n = 500 in Table A12, CV1 always performs the best, followed by BIC.
In summary, for out-of-sample prediction, our methods CV1 and CV2 perform the best in most cases and have smaller variances and medians of errors. Furthermore, CV1 and CV2 often perform equally well. This indicates that removing the restriction on the sum of weights may not lead to a better model averaging estimator.

Simulation II: Divergent Number of Candidate Models
We consider the situations where the number of candidate models tends to ∞ as the sample size increases. We set the sample size n to be 200, 400, and 1000, and the number of candidate models to be 9n/100 (so M = 18, 36, and 90 for the three sample sizes). The data generating process is as before, with the predictor variable generated as in Section 4.1. We set the term ε_{i,j} to be independently generated from N(0, R²/j^{1/2}), where R = 1, 3, 7. The response variable y_i is generated from the binomial distribution with the logistic link.
The candidate models are nested. The algorithms used in the calculations are the same as those described in Section 4.1. For the simulation results, we report the errors of the seven methods considered in Section 4.1. From Tables A13-A15, our methods, CV1 and CV2, perform the best in most cases, followed by FPCA and S-AIC. The difference between AIC and BIC, or S-AIC and S-BIC, decreases as R increases.

Application: Beijing Second-Hand House Price Data
We apply our method to the Beijing second-hand housing transaction price data, which were collected from the internet by the Guoxinda Group Corporation. Most of the data have passed a manual check. The data include the second-hand housing prices and surrounding environment variables of 2318 residential areas in Beijing. The second-hand housing price data are monthly data from January 2015 to December 2017 for each residential area.
Our aim is to predict the level of house price increase in the next year. We are concerned with the relationship between the price rising level and the past housing price curves. We use the median of online listing prices of houses in a residential area as the house price for that residential area. We use the price curve of each residential area from January 2015 to December 2016 as a predictor variable. The response variable is a binary variable, which takes the value 1 if the rising ratio is high and 0 otherwise. Here, we define the rising ratio for each residential area as the ratio of the average monthly price in 2017 to the average monthly price in 2016. The 25%, 50%, and 75% quantile ratios are 1.31, 1.37, and 1.44, respectively. We focus on the residential areas whose housing prices are rising rapidly, so if the ratio for a residential area is higher than the 75% quantile ratio of all residential areas, its response variable takes the value 1, and 0 otherwise. Of the n = 2318 residential areas, 568 are rising fast, and 1750 are not.
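The construction of the binary response can be sketched as below. The ratios here are synthetic (drawn roughly around the quantiles reported in the text); the real ratios come from the Guoxinda data:

```python
import numpy as np

# Label residential areas whose rising ratio exceeds the 75% quantile ratio.
rng = np.random.default_rng(4)
n_areas = 2318
avg_2016 = rng.lognormal(mean=10.5, sigma=0.3, size=n_areas)   # synthetic prices
ratio = rng.normal(loc=1.37, scale=0.1, size=n_areas)          # synthetic ratios
avg_2017 = avg_2016 * ratio                                    # implied 2017 prices

threshold = np.quantile(ratio, 0.75)       # 75% quantile of all rising ratios
y = (ratio > threshold).astype(int)        # 1 = rising fast, 0 = otherwise
```

By construction, about a quarter of the areas are labeled 1, matching the roughly 568-out-of-2318 split reported for the real data.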
For simplicity, we standardize all the price data. For each group, we plot the housing price trajectories in Figure 1. A failure to visually detect differences between the groups could result from overcrowding of the plots with too many curves, but when fewer curves are displayed (lower panels of Figure 1), the same phenomenon remains. With a few exceptions, no clear visual differences between the two groups can be discerned. On the whole, the trajectories from 2015 to 2016 are not very different between the groups. Therefore, the discrimination task at hand is difficult.
We randomly select 75% of all residential areas as the training set with size 1739, and the rest as the test data with size 579. We use the logistic link and B-spline functions to fit the house price curves. The number of basis functions is 6, and the order of the B-spline basis functions is 2. Then, we adopt functional principal component analysis (Yao et al. 2005) to build the data-adaptive basis functions, reduce the dimension, and deal with the correlations in the house price time series. We compare the out-of-sample prediction errors of the seven methods in Section 4. We repeat each method 20 times. The results are summarized in Tables 1 and 2. It can be observed from the tables that the error of the CV1 or CV2 method is on average about 10% lower than those of the other methods, and overall, CV1 and CV2 behave similarly. As in the simulations above, this indicates that the constraint that the sum of weights equals 1 is sensible in practical cases. AIC and BIC perform equally well, as both choose the largest model in most cases. We also find that FPCA is better than AIC or BIC. FPCA always selects the smallest model because the cumulative contribution of the first principal component is 98%. Further, it is clear that the fitting error and prediction error of FPCA are similar. For the other methods, the fitting errors are always a little smaller than the prediction errors.

Concluding Remarks
In this paper, we proposed a model averaging approach under the framework of the generalized functional linear model. We showed that the weight chosen by the leave-one-out cross-validation method is asymptotically optimal in the sense of achieving the lowest possible squared error in a class of model averaging estimators. It can be seen from the theoretical proof that our method is also valid for non-nested candidate model sets. Numerical analysis shows that for the generalized functional linear model, cross-validation model averaging is a powerful tool for estimation and prediction. Further work is to develop model averaging inference procedures based on the generalized functional linear model. In addition, how to combine other covariates into the generalized functional linear model is also an interesting problem.

Acknowledgments: The authors thank the two referees for their constructive comments and suggestions that have substantially improved the original manuscript. The Beijing second-hand house price data were collected by the Guoxinda Group Corporation. This project was partially supported by the National Natural Science Foundation of China (Grant No. 71571180).

Conflicts of Interest:
The authors declare no conflicts of interest.
Appendix A

Let ε_1, ε_2, . . . be independent random variables such that P(ε_i = 1) = P(ε_i = −1) = 1/2, where a.s. means converges almost surely. A Banach space G is said to be of type 2 if the identity map on G is of type 2.
Let (S, d) be a compact metric space and C(S) be the Banach space of real-valued continuous functions on S with the supremum norm ||ν||_∞ = sup_{s∈S} |ν(s)| for any ν ∈ C(S). Denote by ρ a d-continuous metric on S. Let N(S, ρ, ε) denote the minimal number of ρ-balls of radius less than or equal to ε that cover S, and set H(S, ρ, ε) = log N(S, ρ, ε).

We let
and for ν ∈ Lip(ρ), we define where s* is some fixed point in S. In addition, assume that {ν_j : j ≥ 1} ⊆ Lip(ρ) and that {e_j : j ≥ 1} are independent real-valued random variables. Then, ν_j e_j are independent Lip(ρ)-valued random variables.
Then there exists A < ∞ such that for all n, where X_1, X_2, . . . , X_n are independent Lip(ρ)-valued random variables with mean zero.
Under Condition 3, we have Proof of Lemma A2. First note that for any l ∈ [0, p_m], we have where the last step is by the mean-value theorem and γ_i is a point between the two arguments. From the assumptions that g(·) is a twice continuously differentiable function with bounded derivatives |g′(·)| ≤ c < ∞ and |g″(·)| ≤ c_1 < ∞, and that σ²(·) is strictly positive with bounds 0 < d_1 ≤ σ²(·) ≤ d_2 < ∞ and |σ²′(·)| ≤ d_3 < ∞, we see that there is a constant c > 0 such that |v_i(·)| ≤ c < ∞, and where the second inequality is by the Cauchy-Schwarz inequality. Therefore, we obtain As Θ_m is a compact subset of R^{p_m+1}, and ρ(β_1, β_2) is the Euclidean metric in R^{p_m+1}, (A1) is satisfied. Thus, by Lemma A1, there is a constant A > 0 uniformly for all l such that for any C > 0, we have Therefore, for any ε > 0, letting C = Ac_2√(C_2 + 1)²C_2C_1/ε, we obtain which implies (A3).
Proof of Lemma A3. By the definition of β^(m)* and Condition 6, we have where β̄^(m) is a point between β̂^(m) and β^(m)*. By Condition 1, for any κ > 0, there is an N_1 such that for all n > N_1, we have From (A7), it can be seen that Then for any C > 0 and n > N_1, letting δ = C(p_m + 1)³/n, we have where C′ = Ac_2√(C_2 + 1)²C_2C_1 and the second inequality is derived from (A4). As a result, for sufficiently large n, Lemma A3 holds.

Lemma A4. Under Conditions 1-4 and 6, n (w).
We note that Let η m i * be the point between ε T (i,p m ) β (m) andη i,p m , for fixed M, Then, by Lemma A3 and Condition 3, we have which, together with Condition 4, leads to (A9).
Appendix B.1. Proof of (A11) Notice that Therefore, to prove (A11), it suffices to verify By Lemma A4, we need only to show Let η * i,p m be the point betweenη i,p m andη i,p m . Then, for any δ > 0, we have which, together with the assumption that g(·) is a twice continuously differentiable function with bounded derivatives implying max 1 i n,w∈H n |g (η Thus, to prove (A13), it suffices to show By Condition 5, for fixed M, we obtain which, together with Condition 4, leads to (A14), and thus (A13) holds.
Appendix B.2. Proof of (A12) It is readily seen that Thus, we need only to prove and sup The proof of (A15) is similar to that of Wu (1981). We denote by ρ a metric on H_n. Let (H_n, ρ) be a compact metric space. Then C(H_n) is the Banach space of real-valued continuous functions on H_n with the supremum norm ||∆||_∞ = sup_{w∈H_n} |∆(w)|.
Let N(H_n, ρ, ε) denote the minimal number of ρ-balls of radius less than or equal to ε that cover H_n, and set H(H_n, ρ, ε) = log N(H_n, ρ, ε).
We let where w * is some fixed point in H n .
As for p+1 , using Lagrange theorem, we have where δ > 0 is an arbitrary constant. Since H n is a compact subset of R M , and ρ(w, w ) is the Euclidean metric in R M , (A1) is satisfied. Therefore, by Lemma A1, we see that there is a constant A < ∞ such that for all n, where the last equality is because of (A18), (A19) and sup j Ee 2 j < ∞. Therefore, and (A15) holds.
Appendix C. Simulation Results in Section 4.1

Table A1. Prediction errors with n = 60 in Case 1.