
Cross-Validation Model Averaging for Generalized Functional Linear Model

by Haili Zhang 1,2 and Guohua Zou 3,*
1 University of Chinese Academy of Sciences, Beijing 100049, China
2 Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
3 School of Mathematical Sciences, Capital Normal University, Beijing 100048, China
* Author to whom correspondence should be addressed.
Econometrics 2020, 8(1), 7; https://doi.org/10.3390/econometrics8010007
Received: 2 September 2019 / Revised: 6 February 2020 / Accepted: 18 February 2020 / Published: 24 February 2020
(This article belongs to the Special Issue Bayesian and Frequentist Model Averaging)

Abstract

Functional data are a common and important data type in econometrics and have become increasingly easy to collect in the big data era. To improve estimation accuracy and reduce forecast risk with functional data, in this paper we propose a novel cross-validation model averaging method for the generalized functional linear model, in which a scalar response is related to a random functional predictor through a link function. We establish an asymptotic optimality result for the weights selected by our method when the true model is not in the candidate model set. Our simulations show that the proposed method often performs better than commonly used model selection and averaging methods. We also apply the proposed method to Beijing second-hand house price data.
Keywords: generalized functional linear model; cross-validation; model averaging; asymptotic optimality

1. Introduction

In recent years, functional data have become increasingly popular in many scientific areas. A common question for functional data is how to quantify the relationship between functional covariates and scalar responses. The functional linear model (FLM) and the generalized functional linear model (GFLM) can take into account associations between the response and different points in the domain of the functional covariates, and are therefore two useful tools in many studies of functional data. These two models have been widely used to solve practical problems, such as exploring the relationship between growth and age in the life sciences, analyzing weather data in different areas, recognizing handwriting data, and conducting diffusion tensor imaging studies. Functional data analysis usually represents functional covariates and coefficient functions by linear combinations of a set of basis functions, such as a prespecified basis system like B-splines, Fourier and wavelet bases (James 2002), or data-adaptive basis functions from functional principal component analysis (FPCA) (Yao et al. 2005). We are concerned with the GFLM because it can estimate flexible and nonlinear relationships between functional covariates and scalar responses for many types of data, such as binary responses, Poisson responses, and multivariate discrete responses. See, for example, James (2002), who extended generalized linear models to generalized functional linear models with the functional principal component methodology and demonstrated that this approach can be applied to linear, logistic and censored regressions in simulations and real data analysis.
In econometrics, the relationship between a time series and a scalar response is often of interest. We can use the GFLM instead of the generalized linear model to handle the case where a time series with dependence across time points serves as the explanatory variable, with dimension tending to infinity. On the other hand, prediction is often the main goal in econometric data analysis. Several approaches have been proposed to select the important principal components in FPCA, such as AIC, BIC, and leave-one-out cross-validation (Müller and Stadtmüller 2005). However, as we will demonstrate, model selection alone, such as by AIC, is not an optimal approach for the purpose of estimation and prediction: relying on the single model selected by AIC or BIC may lose information contained in the other models. Different models often capture different data characteristics, so model averaging generally attains higher estimation or prediction accuracy; it has received extensive attention in recent years.
Model averaging has two research directions: Bayesian Model Averaging (BMA) and Frequentist Model Averaging (FMA). We will focus on the latter in this paper. A key problem with the FMA is the choice of weights assigned to different models. In this regard, various approaches have been developed. See, for example, smoothed AIC, smoothed BIC (Buckland et al. 1997), smoothed FIC (Hjort and Claeskens 2003; Claeskens and Carroll 2007; Zhang and Liang 2011; Zhang et al. 2012; Xu et al. 2014), Adaptive method (Yang 2001), MMA method (Hansen 2007; Wan et al. 2010), OPT method (Liang et al. 2011), JMA method (Hansen and Racine 2012; Zhang et al. 2013), and leave-subject-out cross-validation method (Gao et al. 2016), which apply to independent, or time series, or longitudinal data.
For functional data, some model averaging methods have been studied. Zhu et al. (2018) proposed a model averaging estimator based on Mallows' criterion for partial functional linear models, whose response is a scalar and whose predictors are a random vector together with some functional variables. Zhang et al. (2018) proposed a jackknife model averaging method for fully functional linear models, whose response and predictor are both functional processes. For the generalized functional linear model, designed for the case where the scalar response depends nonlinearly on functional explanatory variables, model averaging is a good alternative to model selection, which may suffer from instability in variable selection or coefficient estimation caused by the randomness of the data collection.
In this article, we consider model averaging methods for GFLM to capture the nonlinear characteristics hidden in the data and to reduce the prediction errors and risks. The contributions of this article are threefold: We first adopt FPCA to reduce the dimensions as it provides a parsimonious representation of functional data, and then present a novel model averaging procedure based on leave-one-out cross-validation criterion (CV). Second, we prove the consistency of parameter estimator under the misspecified model with some mild conditions. The dimension of the parameter can be divergent. Third, we establish the asymptotic optimality of our method in the squared loss sense for generalized linear model with a diverging number of parameters. Our work relaxes the condition that the expectations of estimators need to exist.
The rest of the article is organized as follows. In Section 2, we introduce our proposed model averaging method for GFLM. We then establish the asymptotic property of the proposed method in Section 3. Simulation studies and a real data example of second-hand house price in Beijing are presented in Section 4. Section 5 concludes. Proofs of theoretical results are provided in Appendix A and Appendix B.

2. Model Averaging for Generalized Functional Linear Model

2.1. The Generalized Functional Linear Model

The data collected for the $i$th subject or experimental unit are $\{X_i(t), t \in T; y_i\}$, $i = 1, \dots, n$. We assume these data are generated independently. The predictor variable $\{X(t), t \in T\}$ is a random curve corresponding to a square integrable stochastic process on a real interval $T$. The response variable $y$ is a real-valued random variable that may be continuous or discrete. For example, in a binary regression, one would have $y \in \{0, 1\}$.
Suppose that the given link function $g(\cdot)$ is a strictly monotone and twice continuously differentiable function with bounded derivatives and is thus invertible. This assumption is common in generalized linear models; see, for example, Chen et al. (1999), Müller and Stadtmüller (2005), and Ando and Li (2017). Moreover, we assume a variance function $\sigma^2(\cdot)$, defined on the range of the link function, that is strictly positive and bounded above. The generalized functional linear model, or functional quasi-likelihood model, is determined by a parameter function $\beta(\cdot)$, which is square integrable on its domain $T$, in addition to the link function $g(\cdot)$ and the variance function $\sigma^2(\cdot)$.
Given a real measure $d\omega$ on $T$, we define linear predictors
$$\eta_i = \alpha + \int_T \beta(t) X_i(t)\, d\omega(t), \quad i = 1, \dots, n,$$
and conditional means $\mu_i = g(\eta_i)$, where $E[y_i \mid X_i(t), t \in T] = \mu_i$ and $\mathrm{Var}[y_i \mid X_i(t), t \in T] = \sigma^2(\mu_i) = \tilde\sigma^2(\eta_i)$ with the function $\tilde\sigma^2(\eta_i) = \sigma^2(g(\eta_i))$. In a generalized functional linear model, the distribution of $y_i$ would be specified within the exponential family. Thus, we consider a functional quasi-likelihood model
$$y_i = g\Big( \alpha + \int_T \beta(t) X_i(t)\, d\omega(t) \Big) + e_i, \quad i = 1, \dots, n,$$
where $E[e_i \mid X_i(t), t \in T] = 0$ and $\mathrm{Var}[e_i \mid X_i(t), t \in T] = \sigma^2(\mu_i) = \tilde\sigma^2(\eta_i)$. Note that $\alpha$ is a constant, and the inclusion of an intercept allows us to require $E X_i(t) = 0$ for all $t$. We assume the errors $e_i$ are independent with the same variance. It is easy to obtain $E(e_i) = 0$ and
$$\mathrm{Var}(e_i) = \mathrm{Var}\{ E(e_i \mid X_i(t), t \in T) \} + E\{ \mathrm{Var}(e_i \mid X_i(t), t \in T) \} = E\, \tilde\sigma^2(\eta_i) = \sigma^2.$$
Following Müller and Stadtmüller (2005), we choose an orthonormal basis $\{\rho_j\}$, $j = 1, 2, \dots$, of the function space $L^2(d\omega)$, that is, $\int_T \rho_j(t)\rho_k(t)\, d\omega(t) = \delta_{jk}$, where $\delta_{jk} = 0$ for $j \neq k$ and $\delta_{jk} = 1$ for $j = k$. Then, we can expand the predictor process $X(t)$ and the parameter function $\beta(t)$ as
$$X(t) = \sum_{j=1}^{\infty} \varepsilon_j \rho_j(t)$$
and
$$\beta(t) = \sum_{j=1}^{\infty} \beta_j \rho_j(t)$$
in the $L^2(d\omega)$ sense, with random variables $\varepsilon_j$ and coefficients $\beta_j$ given by $\varepsilon_j = \int X(t)\rho_j(t)\, d\omega(t)$ and $\beta_j = \int \beta(t)\rho_j(t)\, d\omega(t)$, respectively. By the previous assumptions that $X(t)$ and $\beta(t)$ are square integrable, we have $\sum_{j=1}^{\infty} \beta_j^2 < \infty$ and $\sum_{j=1}^{\infty} E\varepsilon_j^2 < \infty$.
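As an illustration of these expansions, the coefficients $\varepsilon_j$ can be recovered numerically by integrating $X(t)$ against each basis function. The following sketch is ours, not the authors' code; the grid size, basis count, and choice of an orthonormal Fourier basis on $[0,1]$ with $d\omega$ the Lebesgue measure are all illustrative assumptions:

```python
import numpy as np

# Sketch: expand a signal in an orthonormal Fourier basis on [0, 1] and
# recover its coefficients by numerical integration, mirroring
# eps_j = ∫ X(t) rho_j(t) dω(t) with dω the Lebesgue measure.
t = np.linspace(0.0, 1.0, 2001)

def rho(j, t):
    # Orthonormal Fourier basis: rho_1 = 1, then sqrt(2)*sin/cos pairs.
    if j == 1:
        return np.ones_like(t)
    k = j // 2
    trig = np.sin if j % 2 == 0 else np.cos
    return np.sqrt(2.0) * trig(2.0 * np.pi * k * t)

true_coef = np.array([1.0, 0.5, -0.3, 0.2, 0.0, 0.1, -0.05])
X = sum(c * rho(j + 1, t) for j, c in enumerate(true_coef))

# Recover eps_j = ∫ X(t) rho_j(t) dt via the trapezoidal rule.
recovered = np.array([np.trapz(X * rho(j + 1, t), t)
                      for j in range(len(true_coef))])
print(np.max(np.abs(recovered - true_coef)))  # prints a value near zero
```

Orthonormality of the basis is what makes each coefficient recoverable by a single inner product, independently of the others.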
From the orthonormality of the basis functions $\rho_j$ and setting
$$\varepsilon_{i,j} = \int X_i(t)\rho_j(t)\, d\omega(t),$$
it follows immediately that
$$\eta_i = \alpha + \int \beta(t) X_i(t)\, d\omega(t) = \alpha + \sum_{j=1}^{\infty} \beta_j \varepsilon_{i,j}.$$
It will be convenient to work with standardized errors
$$e_i' = e_i / \sigma(\mu_i) = e_i / \tilde\sigma(\eta_i),$$
for which $E(e_i' \mid X_i(t)) = 0$, $E(e_i') = 0$, and $E(e_i'^2) = 1$. Then, it will be sufficient to consider the following model,
$$y_i = g\Big( \alpha + \sum_{j=1}^{\infty} \beta_j \varepsilon_{i,j} \Big) + e_i'\, \tilde\sigma\Big( \alpha + \sum_{j=1}^{\infty} \beta_j \varepsilon_{i,j} \Big), \quad i = 1, \dots, n, \tag{2}$$
where the function g ( · ) is known.
The number of parameters in model (2) is infinite. We address the difficulty caused by the infinite dimensionality of the predictors by approximating model (2) with a series of models in which the number of predictors is truncated at $p = p_n$, where the dimension $p_n$ can be as large as desired subject to $p_n < n$. A heuristic truncation strategy is as follows. For the $i$th sample, the $p$-truncated linear predictor $\eta_{i,p}$ is
$$\eta_{i,p} = \alpha + \sum_{j=1}^{p} \beta_j \varepsilon_{i,j}.$$
The approximating model we use is
$$y_i = g\Big( \alpha + \sum_{j=1}^{p} \beta_j \varepsilon_{i,j} \Big) + e_i'\, \tilde\sigma\Big( \alpha + \sum_{j=1}^{p} \beta_j \varepsilon_{i,j} \Big), \quad i = 1, \dots, n.$$
Now, we consider estimation for the generalized functional linear model. First, we use FPCA to obtain a set of orthogonal eigenfunctions as the basis functions of the space $L^2(d\omega)$. Then, we consider a series of $M$ candidate models. For the $m$th candidate model, we adopt the first $p_m$ functional principal components to build the approximating model,
$$y_i = g\Big( \alpha^{(m)} + \sum_{j=1}^{p_m} \beta_j^{(m)} \varepsilon_{i,j} \Big) + e_i'\, \tilde\sigma\Big( \alpha^{(m)} + \sum_{j=1}^{p_m} \beta_j^{(m)} \varepsilon_{i,j} \Big), \quad i = 1, \dots, n.$$
We assume that $p_1 < p_2 < \cdots < p_M$; that is, the candidate models are nested. Denote $\varepsilon_{i,0} = 1$ and $\beta_0^{(m)} = \alpha^{(m)}$; then we estimate the unknown parameter vector $\boldsymbol\beta^{(m)} = (\beta_0^{(m)}, \beta_1^{(m)}, \dots, \beta_{p_m}^{(m)})^T$ by solving the following estimating or score equation
$$U_{n,m}(\boldsymbol\beta^{(m)}) = \frac{1}{n} \sum_{i=1}^{n} \frac{\big[ y_i - g(\eta_{i,p_m}) \big]\, g'(\eta_{i,p_m})}{\sigma^2(\mu_{i,p_m})}\, \varepsilon_{(i,p_m)} = 0,$$
where $\eta_{i,p_m} = \sum_{j=0}^{p_m} \beta_j^{(m)} \varepsilon_{i,j}$ and $\varepsilon_{(i,p_m)} = (\varepsilon_{i,0}, \dots, \varepsilon_{i,p_m})^T$. Let $\hat{\boldsymbol\beta}^{(m)}$ be the solution of the score equation $U_{n,m}(\boldsymbol\beta^{(m)}) = 0$, i.e.,
$$U_{n,m}(\hat{\boldsymbol\beta}^{(m)}) = \frac{1}{n} \sum_{i=1}^{n} \frac{\big[ y_i - g(\hat\eta_{i,p_m}) \big]\, g'(\hat\eta_{i,p_m})}{\sigma^2\big( g(\hat\eta_{i,p_m}) \big)}\, \varepsilon_{(i,p_m)} = 0. \tag{4}$$
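For a concrete instance of the score equation (4), note that under the logistic link with binomial variance $\sigma^2(\mu) = \mu(1-\mu)$, the quasi-score reduces to the ordinary logistic likelihood score, which can be solved by Fisher scoring (iteratively reweighted least squares). The sketch below is our own illustration, not the authors' code; the FPCA scores are replaced by synthetic Gaussian scores, and the function name is ours:

```python
import numpy as np

# Illustrative sketch (not the authors' code): for the logistic link with
# binomial variance, the quasi-score (4) is the logistic likelihood score,
# solved here by Fisher scoring / iteratively reweighted least squares.
# Z plays the role of (eps_{i,0}, ..., eps_{i,p_m}) with eps_{i,0} = 1.
def fit_logistic_score(Z, y, n_iter=50, tol=1e-10):
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(Z @ beta)))     # mu_i = g(eta_i)
        W = mu * (1.0 - mu)                        # binomial variance sigma^2(mu)
        # Fisher scoring step: beta <- beta + (Z' W Z)^{-1} Z'(y - mu)
        step = np.linalg.solve(Z.T @ (W[:, None] * Z), Z.T @ (y - mu))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(0)
n = 500
scores = rng.normal(size=(n, 3))                   # stand-in for FPCA scores
Z = np.column_stack([np.ones(n), scores])          # prepend the intercept column
beta_true = np.array([0.2, 1.0, -0.5, 0.3])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(Z @ beta_true))))
beta_hat = fit_logistic_score(Z, y)                # close to beta_true for large n
```

For other links and variance functions, the same Newton-type iteration applies with the corresponding $g'$ and $\sigma^2$ plugged into the score.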

2.2. Model Averaging Estimation

For each candidate model, we obtain the estimator of the unknown parameter vector by (4). Let
$$w \in H_n = \Big\{ w \in [0,1]^M : \sum_{m=1}^{M} w_m = 1 \Big\};$$
then we obtain the model averaging estimator of $\eta_i$:
$$\hat\eta_i(w) = \sum_{m=1}^{M} w_m \hat\eta_{i,p_m},$$
where $\hat\eta_{i,p_m} = \sum_{j=0}^{p_m} \hat\beta_j^{(m)} \varepsilon_{i,j}$. Thus, a model averaging estimator of the conditional mean $\mu_i$ is given by
$$\hat\mu_i(w) = g\Big( \sum_{m=1}^{M} w_m \hat\eta_{i,p_m} \Big). \tag{7}$$
Let $\tilde{\boldsymbol\beta}_j^{(m)}$ be the estimator of $\boldsymbol\beta^{(m)}$ from (4) without the $j$th observation, that is, the solution of
$$U_{n,m,-j}(\boldsymbol\beta^{(m)}) = \frac{1}{n-1} \sum_{i=1, i \neq j}^{n} \frac{\big[ y_i - g(\eta_{i,p_m}) \big]\, g'(\eta_{i,p_m})}{\sigma^2(\mu_{i,p_m})}\, \varepsilon_{(i,p_m)} = 0.$$
For observation $j$, the leave-one-out truncated linear estimator of $\eta_j$ under the $m$th model is
$$\tilde\eta_{j,p_m} = \tilde{\boldsymbol\beta}_j^{(m)T} \varepsilon_{(j,p_m)},$$
and the leave-one-out model averaging estimator of $\mu_j$ is
$$\tilde\mu_j(w) = g\Big( \sum_{m=1}^{M} w_m \tilde\eta_{j,p_m} \Big).$$
Thus, we propose the following leave-one-out criterion for choosing the weights in the model averaging estimator (7):
$$CV(w) = \sum_{i=1}^{n} \big\{ y_i - \tilde\mu_i(w) \big\}^2 = \sum_{i=1}^{n} \Big\{ y_i - g\Big( \sum_{m=1}^{M} w_m \tilde\eta_{i,p_m} \Big) \Big\}^2.$$
Let
$$\hat w = \arg\min_{w \in H_n} CV(w)$$
be the weight vector chosen by the $CV(w)$ criterion. Then, plugging $\hat w$ into (7), we obtain the final model averaging estimator $\hat\mu_i(\hat w)$, $i = 1, 2, \dots, n$.
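The weight search itself is a low-dimensional constrained optimization over the simplex $H_n$. The sketch below uses SciPy's SLSQP solver as an analogue of the Matlab `fmincon` call mentioned in Section 4; the leave-one-out predictors are synthetic stand-ins rather than fitted values, and all names are ours:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of the CV(w) weight search (our analogue of the Matlab 'fmincon'
# call mentioned in Section 4): given leave-one-out linear predictors
# eta_tilde[i, m] for M candidate models, minimize
#   CV(w) = sum_i ( y_i - g(sum_m w_m * eta_tilde[i, m]) )^2
# over the simplex H_n = {w in [0,1]^M : sum_m w_m = 1}.
# The predictors below are synthetic stand-ins, not fitted values.
def g(eta):
    return 1.0 / (1.0 + np.exp(-eta))              # logistic link

rng = np.random.default_rng(1)
n, M = 200, 4
eta_tilde = rng.normal(size=(n, M))
y = rng.binomial(1, g(eta_tilde[:, 0]))            # model 1 closest to the truth

def cv(w):
    return np.sum((y - g(eta_tilde @ w)) ** 2)

w0 = np.full(M, 1.0 / M)                           # start from uniform weights
res = minimize(cv, w0, method="SLSQP",
               bounds=[(0.0, 1.0)] * M,
               constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
w_hat = res.x                                      # lies on the simplex
```

Because $g$ is nonlinear, $CV(w)$ is not a quadratic form in $w$, so a general nonlinear programming routine is used rather than the quadratic programming solvers common in Mallows-type model averaging.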

3. Asymptotic Property for Model Averaging Estimator

In this section, we establish the optimality property of cross-validation model averaging for the generalized functional linear model. We allow the dimension of each candidate model to diverge as $n$ tends to $\infty$.

Notations and Conditions

We denote the first and second derivatives of the function $g(\cdot)$ by $g'(\cdot)$ and $g''(\cdot)$, respectively; the diagonal matrix $A$ with diagonal elements $a_1, a_2, \dots, a_q$ by $A = \mathrm{diag}(a_1, a_2, \dots, a_q)$; and the minimum singular value of a matrix $A$ by $\lambda_{\min}(A)$. Let
$$\lambda_{n,m} = \lambda_{\min}\Big( \frac{\varepsilon_n^{(m)T} \varepsilon_n^{(m)}}{n} \Big),$$
with
$$\varepsilon_n^{(m)} = \big( \varepsilon_{(1,p_m)}, \varepsilon_{(2,p_m)}, \dots, \varepsilon_{(n,p_m)} \big)^T.$$
For any $\boldsymbol\beta^{(m)} \in \mathbb{R}^{p_m+1}$ and $n \in \mathbb{N}^+$, define
$$\bar U_{n,m}(\boldsymbol\beta^{(m)}) = \frac{1}{n} \sum_{i=1}^{n} \frac{\big[ g(\boldsymbol\beta^{(m)T}\varepsilon_{(i,p_m)}) - \mu_i \big]\, g'(\boldsymbol\beta^{(m)T}\varepsilon_{(i,p_m)})}{\sigma^2\big( g(\boldsymbol\beta^{(m)T}\varepsilon_{(i,p_m)}) \big)}\, \varepsilon_{(i,p_m)}.$$
We assume $|g'(\cdot)| \le c < \infty$ and $|g''(\cdot)| \le c_1 < \infty$, and that $\sigma^2(\cdot)$ is strictly positive and bounded, $0 < d_1 \le \sigma^2(\cdot) \le d_2 < \infty$, with $|(\sigma^2)'(\cdot)| \le d_3 < \infty$.
Consider the squared loss function
$$L_n(w) = \| \mu - \hat\mu(w) \|^2,$$
where $\mu = (\mu_1, \mu_2, \dots, \mu_n)^T$ and $\hat\mu(w) = (\hat\mu_1(w), \hat\mu_2(w), \dots, \hat\mu_n(w))^T$ are $n \times 1$ vectors and $\|\cdot\|$ is the Euclidean norm. Denote
$$R_n(w) = \sum_{i=1}^{n} \Big\{ g(\eta_i) - g\Big( \sum_{m=1}^{M} w_m \varepsilon_{(i,p_m)}^T \boldsymbol\beta^{(m)*} \Big) \Big\}^2$$
and
$$\xi_n = \inf_{w \in H_n} R_n(w),$$
where $\boldsymbol\beta^{(m)*}$ is the pseudo-true parameter, which, as in Flynn et al. (2013) and Lv and Liu (2014), is defined as the solution of the score equation
$$\bar U_{n,m}(\boldsymbol\beta^{(m)*}) = 0,$$
and is a theoretical target under the $m$th (misspecified) candidate model. We assume that such a solution exists and that $\|\boldsymbol\beta^{(m)*}\|^2 / (p_m+1) \le C_b < \infty$. $\xi_n$ represents the minimal bias between the true model and the final model obtained by model averaging, and is an alternative to the risk based on $L_n(w)$. In this work, we do not require the expectation of $L_n(w)$ to exist, which is more relaxed than the common requirement in jackknife model averaging methods for generalized linear models; see, for example, Zhang et al. (2016) and Ando and Li (2017). In the following, we assume that $X_i(t)$, $i = 1, 2, \dots, n$, are non-random with $\sup_i |\eta_i| \le C_\eta < \infty$.
Condition 1.
For some compact set $\Theta_m$ in $\mathbb{R}^{p_m+1}$,
$$\lim_{n \to +\infty} P\big( 0 \in U_{n,m}(\Theta_m) \big) = 1$$
holds.
Condition 2.
(i) $\{e_i\}$, $i = 1, \dots, n$, are mutually independent.
(ii) $E e_i = 0$.
(iii) $C_1 = \sup_i E e_i^2 < \infty$.
Condition 3.
$\sup_i \| \varepsilon_{(i,p_m)} \|^2 / (p_m+1) \le C_2 < \infty$.
Condition 4.
$n p^2 / \xi_n \to 0$ with $p = \max_m p_m$, and $p^4 / n = o(1)$.
Condition 5.
$\sum_{i=1}^{n} \big( \tilde\eta_{i,p_m} - \hat\eta_{i,p_m} \big)^2 = O_p(p_m^4)$.
Condition 6.
$\lambda_{\min}\big( \partial \bar U_{n,m}(\boldsymbol\beta^{(m)}) / \partial \boldsymbol\beta^{(m)} \big) \ge C_0 > 0$.
Condition 1 is a requirement for the generalized model to guarantee the existence of solutions to (4). In general, the existence and consistency of the roots obtained by solving (4) have to be checked, so we list Condition 1. A similar condition can be found in Balan and Schiopu-Kratina (2005). In the special case where the link function is $g(x) = x$, the solution of (4) is a generalized least squares estimator of $\boldsymbol\beta^{(m)}$ and Condition 1 is easy to satisfy.
Condition 2 is common for generalized linear models; see, for instance, Chen et al. (1999) and Ando and Li (2017). The least squares estimator for linear regression models is strongly consistent under Condition 2. This condition is less restrictive than (A1) of Ando and Li (2017) for proving the optimality of the weight selection procedure.
Condition 3 is similar to (2.3) of Theorem 1 in Chen et al. (1999) and is due to the nonlinearity. A counterexample is given in Chen et al. (1999) to show that $\hat{\boldsymbol\beta}^{(m)}$ may not be consistent when this condition is dropped.
Condition 4 means that the speed of $\xi_n$ tending to $\infty$ should be faster than that of $n p^2$. This condition also implies that the true model is not in the candidate model set, which is a condition commonly used for optimal model averaging. It is easy to satisfy when the true model is infinite dimensional. This condition is an alternative to Condition C.3 of Zhang et al. (2016) and (A3) of Ando and Li (2017).
Condition 5 implies $n^{-1} \sum_{i=1}^{n} ( \tilde\eta_{i,p_m} - \hat\eta_{i,p_m} )^2 = o_p(1)$ when $p_m^4 / n = o(1)$. By Lemma A3 in Appendix A and Condition 3, we have
$$\sum_{i=1}^{n} \big( \hat\eta_{i,p_m} - \varepsilon_{(i,p_m)}^T \boldsymbol\beta^{(m)*} \big)^2 \le \sum_{i=1}^{n} \| \varepsilon_{(i,p_m)} \|^2\, \| \hat{\boldsymbol\beta}^{(m)} - \boldsymbol\beta^{(m)*} \|^2 = O_p(p_m^4).$$
Then, with the following standard condition for the application of cross-validation,
$$\frac{\sum_{i=1}^{n} \big( \tilde\eta_{i,p_m} - \varepsilon_{(i,p_m)}^T \boldsymbol\beta^{(m)*} \big)^2}{\sum_{i=1}^{n} \big( \hat\eta_{i,p_m} - \varepsilon_{(i,p_m)}^T \boldsymbol\beta^{(m)*} \big)^2} - 1 = o_p(1),$$
which says that as $n$ gets large, the difference between the ordinary and leave-one-out estimators of $\eta_i$ under the $m$th candidate model becomes small, it can be seen that
$$\sum_{i=1}^{n} \big( \tilde\eta_{i,p_m} - \hat\eta_{i,p_m} \big)^2 \le 2 \sum_{i=1}^{n} \big( \tilde\eta_{i,p_m} - \varepsilon_{(i,p_m)}^T \boldsymbol\beta^{(m)*} \big)^2 + 2 \sum_{i=1}^{n} \big( \hat\eta_{i,p_m} - \varepsilon_{(i,p_m)}^T \boldsymbol\beta^{(m)*} \big)^2 = O_p(p_m^4),$$
which means Condition 5 is reasonable. For one-parameter natural exponential family models, Ando and Li (2017) showed, under some regularity conditions, that $\sum_{i=1}^{n} ( \tilde\eta_{i,p_m} - \hat\eta_{i,p_m} )^2 = O_p(p_m^2/n)$, satisfying our Condition 5. For linear models where $g(x) = x$ and $\sigma^2(\cdot) = 1$, $\sum_{i=1}^{n} ( \tilde\eta_{i,p_m} - \hat\eta_{i,p_m} )^2 = O_p(p_m^2/n)$ under the assumption that $\varepsilon_{(i,p_m)}^T \big( \varepsilon_n^{(m)T} \varepsilon_n^{(m)} \big)^{-1} \varepsilon_{(i,p_m)} \le c\, p_m / n$ for some constant $c < \infty$, which is commonly used to ensure the asymptotic optimality of cross-validation. See, for example, Condition (5.2) of Li (1987), Condition (5.2) of Andrews (1991), Condition (A.9) of Hansen and Racine (2012), Condition (C.2) of Zhang (2015), and Condition (C.3) of Zhao et al. (2018). In general, our Condition 5 is more relaxed than those in the literature for complex candidate models.
Condition 6 ensures that the pseudo-true parameter $\boldsymbol\beta^{(m)*}$ is unique. The consistency of the estimator of $\boldsymbol\beta^{(m)*}$ can also be derived from this condition; see Lemma A3 in Appendix A. In addition, the one-parameter natural exponential family considered in Theorem 1 of Ando and Li (2017) is an example with
$$\lambda_{\min}\Big( \frac{\partial}{\partial \boldsymbol\beta^{(m)}} \bar U_{n,m}(\boldsymbol\beta^{(m)}) \Big) = \lambda_{\min}\Big( \frac{1}{n}\, \varepsilon_n^{(m)T}\, \Gamma(\boldsymbol\beta^{(m)})\, \varepsilon_n^{(m)} \Big),$$
where
$$\Gamma(\boldsymbol\beta^{(m)}) = \mathrm{diag}\big( g'(\varepsilon_{(1,p_m)}^T \boldsymbol\beta^{(m)}), g'(\varepsilon_{(2,p_m)}^T \boldsymbol\beta^{(m)}), \dots, g'(\varepsilon_{(n,p_m)}^T \boldsymbol\beta^{(m)}) \big).$$
By the commonly used assumption that $\lambda_{n,m} \ge c_0 > 0$ for some constant $c_0$, together with assumption (4.3) in Ando and Li (2017), this example satisfies Condition 6.
Theorem 1.
Assume that Conditions 1–6 hold. Then $\hat w$ is asymptotically optimal in the sense that
$$\frac{L_n(\hat w)}{\inf_{w \in H_n} L_n(w)} \stackrel{p}{\longrightarrow} 1,$$
where $\stackrel{p}{\longrightarrow}$ denotes convergence in probability.
Proof. 
See the Appendix B.  □
Remark 1.
When the dimensions of the candidate models are fixed, Condition 4 can be relaxed to $n / \xi_n^2 \to 0$.
Remark 2.
It is easy to see that if we do not require the weights to sum to one, then we can use $M$ instead of 1 as the upper bound of $\sum_{m=1}^{M} w_m^2$ in our proof. Thus, all the proofs remain valid for fixed $M$. This implies that Theorem 1 remains true if we remove the constraint that the weights sum to one. In addition, as the candidate models are not necessarily nested in the proof, the theorem still holds when the candidate models are non-nested.

4. Numerical Examples

4.1. Simulation I: Fixed Number of Candidate Models

In this section, we conduct simulation experiments to compare the finite sample performance of our model averaging method with some commonly used model selection and model averaging methods. For model selection, we consider three methods: AIC, BIC, and FPCA. FPCA is an efficient and common method in functional data analysis, which determines the final model by the cumulative contributions of the functional principal components. For model averaging, we consider S-AIC (smoothed AIC), S-BIC (smoothed BIC), and our cross-validation model averaging method, denoted CV1 when we restrict the weights to sum to 1 as before, and CV2 when no constraint on the sum of weights is imposed.
The data generating process is as follows. The predictor variable is
$$X_i(t) = \sum_{j=1}^{J} \varepsilon_{i,j} \rho_j(t),$$
and the parameter function is
$$\beta(t) = \sum_{j=1}^{J} \beta_j \rho_j(t),$$
where $\rho_j(t)$, $t \in [0,1]$, $j \ge 1$, are basis functions and $J$ is the number of basis functions. Here, we use B-spline and Fourier bases. For the B-spline basis, we choose the order of the basis functions to be 2 and the number of basis functions to be 20. For the Fourier basis, we choose the number of basis functions as 21, with the first basis function a constant.
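Such an order-2 B-spline basis can be constructed, for example, with SciPy. The sketch below is ours; the knot placement (equally spaced breakpoints, clamped ends) is our assumption, since the paper does not specify it:

```python
import numpy as np
from scipy.interpolate import BSpline

# Sketch: an order-2 (piecewise linear) B-spline basis with 20 functions on
# [0, 1], as in the B-spline designs.  Knot placement (equally spaced,
# clamped ends) is our assumption; the paper does not specify it.
order, n_basis = 2, 20
degree = order - 1
# Clamped knot vector: extra endpoint copies plus equally spaced breakpoints.
breakpoints = np.linspace(0.0, 1.0, n_basis - degree + 1)
knots = np.concatenate([np.zeros(degree), breakpoints, np.ones(degree)])
t = np.linspace(0.0, 1.0, 501)
eye = np.eye(n_basis)
basis = np.column_stack([BSpline(knots, eye[j], degree)(t)
                         for j in range(n_basis)])  # shape (501, 20)

# A random curve X_i(t) = sum_j eps_{ij} rho_j(t) is then scores @ basis.T.
```

Note that B-spline bases are nonnegative and form a partition of unity, but they are not orthonormal; in the simulations the FPCA step subsequently supplies the orthogonal basis used by the candidate models.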
In our simulation, the following four cases are considered.
Case 1
For $1 \le j \le 10$, $\beta_j$ are generated from the standard normal distribution $N(0,1)$; for $10 < j \le 20$, $\beta_j = 0$. The basis functions $\rho_j(t)$, $t \in [0,1]$, $1 \le j \le 20$, are B-spline functions with the parameters mentioned above.
Case 2
For $1 \le j \le 20$, $\beta_j = j^{-2}$. The basis functions $\rho_j(t)$, $t \in [0,1]$, $1 \le j \le 20$, are B-spline functions with the parameters mentioned above.
Case 3
For $1 \le j \le 11$, $\beta_j$ are generated from the standard normal distribution $N(0,1)$; for $11 < j \le 21$, $\beta_j = 0$. The basis functions $\rho_j(t)$, $t \in [0,1]$, $1 \le j \le 21$, are Fourier functions with the parameters mentioned above.
Case 4
For $1 \le j \le 21$, $\beta_j = j^{-2}$. The basis functions $\rho_j(t)$, $t \in [0,1]$, $1 \le j \le 21$, are Fourier functions with the parameters mentioned above.
We set $\varepsilon_{i,j}$ to be independently generated from $N(0, R^2/j^2)$, where $R = 1, 2, \dots, 10$. The response variable $y_i$ is generated from the binomial distribution $\mathrm{Binomial}(p(X_i(t)), 1)$ with probability $p(X_i(t)) = g\big( \int_0^1 X_i(t)\beta(t)\, dt \big)$. We consider three types of link function $g(\cdot)$: the logistic link $\exp(\cdot)/(1+\exp(\cdot))$, the probit link, and the Poisson link. For the Poisson model, we only consider simulations with $R = 1$ for Cases 1–4.
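Because the basis is orthonormal, $\int_0^1 X_i(t)\beta(t)\,dt = \sum_j \varepsilon_{i,j}\beta_j$, so the responses can be generated directly from the scores. Below is a hedged sketch of the logistic-link design in the spirit of Case 3; the seed, sample size, and value of $R$ are our own choices:

```python
import numpy as np

# Hedged sketch of the response generation (Case 3 flavor, logistic link):
# with an orthonormal basis, ∫ X_i(t) beta(t) dt = sum_j eps_{ij} beta_j,
# so y_i can be simulated from the scores alone.  Seed, n, and R are ours.
rng = np.random.default_rng(2023)
n, J, R = 200, 21, 2
beta = np.concatenate([rng.standard_normal(11), np.zeros(J - 11)])  # Case 3
sd = R / np.arange(1, J + 1)                 # eps_{ij} ~ N(0, R^2 / j^2)
eps = rng.normal(scale=sd, size=(n, J))
eta = eps @ beta                             # eta_i = ∫ X_i(t) beta(t) dt
prob = 1.0 / (1.0 + np.exp(-eta))            # logistic link
y = rng.binomial(1, prob)                    # Binomial(p(X_i), 1) responses
```

The decaying score variances $R^2/j^2$ mimic the decreasing eigenvalues of a functional principal component decomposition, so later components carry progressively less signal.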
In the simulation, we use FPCA to obtain the nested candidate models. Each candidate model contains the first $p_m$ principal components. The number of candidate models is 18 for Cases 1–2 and 19 for Cases 3–4. We then adopt the iteratively reweighted least squares algorithm, a common approach for generalized linear models, to obtain the estimates for each model. For the weights, we use the 'fmincon' function in Matlab to solve the CV criterion.
The sample size is set as $n = 60, 200, 500$. We use 80% of the data as the training data $(Y_1, X_1)$ with size $n_1$, and the remaining data as the test data $(Y_2, X_2)$ with size $n_2$. Then, we compare the prediction errors. We calculate the prediction accuracy ($\|\hat Y_2 - Y_2\|^2/n_2$), fitting accuracy ($\|\hat Y_1 - Y_1\|^2/n_1$), predictor coefficient prediction accuracy ($\|\hat\eta^{(2)} - \eta^{(2)}\|^2/n_2$), and predictor coefficient fitting accuracy ($\|\hat\eta^{(1)} - \eta^{(1)}\|^2/n_1$). We repeat this process 1000 times and then obtain the mean, median, and variance of these prediction errors for each method. To save space, we present only the results on prediction accuracy; the results on the other types of accuracy are available from the authors upon request. We also report only the results for the logistic link function due to space limitations; the results for the other link functions are likewise available from the authors.
For Case 1, the prediction errors are summarized in Table A1, Table A2 and Table A3. From Table A1, it is seen that as R varies from 1 to 10, the prediction errors decrease, because the difference in probability between the two groups (one group whose response is 1 and the other whose response is 0) becomes larger. Our methods (CV1 and CV2 in the tables) always attain the minimum error means (Mean in the tables), medians (Median), and variances (Var). However, there is no clear tendency between CV1 and CV2, which perform similarly in most situations. When R is small, BIC is always better than AIC, and S-BIC is always better than S-AIC. This may be because fewer parameters are useful for smaller R values, in which case a bigger penalty on the number of parameters in the model is preferred. Moreover, when the candidate models differ significantly, AIC or BIC performs similarly to S-AIC or S-BIC, respectively. As R becomes larger, the difference between AIC and BIC, or between S-AIC and S-BIC, becomes smaller. FPCA is always superior to AIC, BIC, S-AIC, and S-BIC, and their differences become larger as R increases. Turning to Table A2 and Table A3, with the sample size n increasing from 60 to 200 and 500, the prediction errors decrease for each fixed R. The median and variance of the prediction errors also become smaller. AIC and BIC behave increasingly similarly. CV1 and CV2 remain the best among all the methods, followed by FPCA.
For Case 2, the prediction errors are given in Table A4, Table A5 and Table A6. As before, CV1 and CV2 perform the best, followed by FPCA. Likewise, S-AIC and S-BIC are better than AIC and BIC, respectively. In Table A4, with R varying from 1 to 10, the prediction errors decrease except for the FPCA method, which reaches its minimum at R = 7 with a small fluctuation. CV1 and CV2 perform equally well for different R values and sample sizes. The difference between AIC and BIC becomes small as the sample size increases; a similar phenomenon is observed for S-AIC and S-BIC.
For Case 3, the prediction errors are provided in Table A7, Table A8 and Table A9. For n = 60 (Table A7), CV1 or CV2 is the best when R is between 1 and 5. However, when R is between 6 and 10, the two model selection methods, AIC and BIC, are the best. Similar conclusions can be drawn from Table A8 with n = 200 and Table A9 with n = 500, although in the latter case CV1 actually performs the best for all R values. The error rates of all methods become smaller as R increases from 1 to 6 and then larger as R varies from 7 to 10.
For Case 4, the prediction errors are presented in Table A10, Table A11 and Table A12. For n = 60 in Table A10, CV1, CV2, and BIC are the best, followed by AIC. In this design, S-AIC and S-BIC are not better than AIC and BIC. For n = 200 in Table A11, BIC is the best, followed by AIC. For n = 500 in Table A12, CV1 always performs the best, followed by BIC.
In summary, for out-of-sample prediction, our methods CV1 and CV2 perform the best in most cases and have smaller variances and medians of errors. Furthermore, CV1 and CV2 often perform equally well, which indicates that removing the restriction on the sum of weights may not lead to a better model averaging estimator.

4.2. Simulation II: Divergent Number of Candidate Models

We consider situations where the number of candidate models tends to $\infty$ as the sample size increases. We set the sample size $n$ to 200, 400, and 1000, and the number of candidate models to $9n/100$ (so $M = 18, 36$, and 90 for the three sample sizes). The data generating process is as before: the predictor variable is $X_i(t) = \sum_{j=1}^{J} \varepsilon_{i,j}\rho_j(t)$ and the parameter function is $\beta(t) = \sum_{j=1}^{J} \beta_j \rho_j(t)$, where $\rho_j(t)$ is an order-2 B-spline basis function, $t \in [0,1]$, $j \ge 1$, and $J = n/10$. For $1 \le j \le J$, $\beta_j = j^{-1/2}$. We set $\varepsilon_{i,j}$ to be independently generated from $N(0, R^2/j^{1/2})$, where $R = 1, 3, 7$. The response variable $y_i$ is generated from the binomial distribution with the logistic link.
The candidate models are nested. The algorithms used in the calculations are the same as those described in Section 4.1. We report the errors of the seven methods considered in Section 4.1. From Table A13, Table A14 and Table A15, our methods, CV1 and CV2, perform the best in most cases, followed by FPCA and S-AIC. The difference between AIC and BIC, or between S-AIC and S-BIC, decreases with increasing R.

4.3. Application: Beijing Second-Hand House Price Data

We apply our method to Beijing second-hand housing transaction price data, which were collected from the internet by the Guoxinda Group Corporation. Most of the data have passed a manual check. The data include the second-hand housing prices and the surrounding environment variables of 2318 residential areas in Beijing. The second-hand housing prices are monthly data from January 2015 to December 2017 for each residential area.
Our aim is to predict the level of increase in house prices in the next year. We are concerned with the relationship between a high price rise and the past housing price curves. We use the median of the online listing prices of houses in a residential area as the house price for that residential area, and the price curve of each residential area from January 2015 to December 2016 as the predictor variable. The response variable is a binary variable, which takes the value 1 if the rising ratio is high and 0 otherwise. Here, we define the rising ratio for each residential area as the ratio of the average monthly price in 2017 to the average monthly price in 2016. The 25%, 50%, and 75% quantiles of the ratios are 1.31, 1.37, and 1.44, respectively. We focus on the residential areas whose housing prices are rising rapidly, so if the ratio is higher than the 75% quantile over all residential areas, the response variable of that residential area takes the value 1, and 0 otherwise. Of the n = 2318 residential areas, 568 are rising fast and 1750 are not.
For simplicity, we standardize all the price data. For each group, we plot the housing price trajectories in Figure 1. A failure to visually detect differences between the groups could result from overcrowding of these plots with too many curves, but when fewer curves are displayed (lower panels of Figure 1), the same phenomenon remains. With a few exceptions, no clear visual differences between the two groups can be discerned. On the whole, the yearly trajectories from 2015 to 2016 are not very different. Therefore, the discrimination task at hand is difficult.
We randomly select 75% of all residential areas as the training set, with size 1739, and the rest as the test data, with size 579. We use the logistic link and B-spline functions to fit the house price curves; the number of basis functions is 6, and the order of the B-spline basis functions is 2. We then adopt functional principal component analysis (Yao et al. 2005) to build data-adaptive basis functions to reduce the dimension and deal with the correlations in the house price time series.
We compare the out-of-sample prediction errors of the seven methods in Section 4, repeating every method 20 times. The results are summarized in Table 1 and Table 2. It can be observed from the tables that the errors of the CV1 and CV2 methods are about 10% lower on average than those of the other methods, and overall, CV1 and CV2 behave similarly. As in the simulations above, this indicates that the constraint that the weights sum to 1 makes sense in practical cases. AIC and BIC perform equally well, as both choose the largest model in most cases. We also find that FPCA is better than AIC or BIC; FPCA always selects the smallest model because the cumulative reliability of the first principal component is about 98%. Further, it is clear that the fitting error and prediction error of FPCA are similar, whereas for the other methods, the fitting errors are always a little smaller than the prediction errors.
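The weight-selection step behind the cross-validation averaging can be sketched as follows. This is a simplified stand-in, not the paper's implementation: the leave-one-out fitted values are synthetic placeholders, and only the simplex-constrained variant is shown (the paper's CV1/CV2 distinction concerns the weight constraint).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, M = 200, 4
# mu_tilde[:, m]: leave-one-out fitted probabilities from candidate model m
# (synthetic here); model 0 is constructed to be closest to the truth.
mu_tilde = rng.uniform(0.2, 0.8, size=(n, M))
y = rng.binomial(1, mu_tilde[:, 0])

def cv(w):
    """CV criterion: squared error of the weighted combination of LOO fits."""
    return np.sum((y - mu_tilde @ w) ** 2)

# Weights restricted to the simplex: non-negative and summing to one.
cons = ({"type": "eq", "fun": lambda w: w.sum() - 1.0},)
w0 = np.full(M, 1.0 / M)
res = minimize(cv, w0, bounds=[(0.0, 1.0)] * M, constraints=cons)
w_hat = res.x
```

Dropping the equality constraint while keeping the bounds gives an unconstrained-sum variant; in the house price analysis above, the two versions behave similarly.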

5. Concluding Remarks

In this paper, we proposed a model averaging approach under the framework of the generalized functional linear model. We showed that the weight vector chosen by the leave-one-out cross-validation method is asymptotically optimal in the sense of achieving the lowest possible squared error within a class of model averaging estimators. It can be seen from the theoretical proof that our method is also valid for non-nested candidate model sets. The numerical analysis shows that, for the generalized functional linear model, cross-validation model averaging is a powerful tool for estimation and prediction. Further work includes developing model averaging inference procedures based on the generalized functional linear model. In addition, how to incorporate other covariates into the generalized functional linear model is also an interesting problem.

Author Contributions

H.Z. wrote the original draft. G.Z. reviewed and revised the whole paper. All authors have read and agreed to the published version of the manuscript.

Funding

Zou’s work was partially supported by the Ministry of Science and Technology of China (Grant No. 2016YFB0502301) and the National Natural Science Foundation of China (Grant Nos. 11971323 and 11529101).

Acknowledgments

The authors thank the two referees for their constructive comments and suggestions, which have substantially improved the original manuscript. The Beijing second-hand house price data were collected by the Guoxinda Group Corporation. This project was partially supported by the National Natural Science Foundation of China (Grant No. 71571180).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Lemmas and Proofs

The following Definition A1 and Lemma A1 can be found in Kahane (1968); Hoffmann (1974); Hoffmann and Pisier (1976); Zinn (1977); and Wu (1981).
Definition A1.
A linear map $\nu: D \to F$ is of type 2 if $\sum_{i=1}^{n}\varepsilon_i\nu(s_i)$ converges in $F$ a.s. for all sequences $\{s_i\}\subset D$ such that $\sum_{i=1}^{\infty}\|s_i\|_D^2<\infty$, where $D$ and $F$ are Banach spaces, $\{\varepsilon_i\}_{i=1}^{\infty}$ are independent random variables such that $P(\varepsilon_i=1)=P(\varepsilon_i=-1)=1/2$, and a.s. means almost surely. A Banach space $G$ is said to be of type 2 if the identity map on $G$ is of type 2.
Let $(S,d)$ be a compact metric space and $C(S)$ be the Banach space of real-valued continuous functions on $S$ with the supremum norm
$$\|\nu\| = \sup_{s\in S}|\nu(s)|,$$
for any $\nu\in C(S)$. Let $\rho$ be a $d$-continuous metric on $S$. Let $N(S,\rho,\varepsilon)$ denote the minimal number of $\rho$-balls of radius less than or equal to $\varepsilon$ which cover $S$, and set
$$H(S,\rho,\varepsilon) = \log N(S,\rho,\varepsilon).$$
We let
$$\mathrm{Lip}(\rho) = \Big\{\nu\in C(S): \Lambda_\nu = \sup_{s_1\neq s_2\in S}\frac{|\nu(s_1)-\nu(s_2)|}{\rho(s_1,s_2)} < \infty\Big\},$$
and for $\nu\in\mathrm{Lip}(\rho)$, we define
$$\|\nu\|_\rho = \Lambda_\nu + |\nu(s^*)|,$$
where $s^*$ is some fixed point in $S$. In addition, assume that $\{\nu_j: j\ge 1\}\subset\mathrm{Lip}(\rho)$ and that $\{e_j: j\ge 1\}$ are independent real-valued random variables. Then, $\{\nu_j e_j\}$ are independent $\mathrm{Lip}(\rho)$-valued random variables.
Lemma A1.
Let $(S,d)$ denote a compact metric space. Suppose that $\rho$ is a $d$-continuous metric on $S$ with
$$\int_0^{\delta} H^{1/2}(S,\rho,u)\,du < \infty \quad \text{for some } \delta > 0. \qquad (A1)$$
Then there exists $A<\infty$ such that for all $n$,
$$E\|X_1 + X_2 + \cdots + X_n\|^2 \le A\sum_{j=1}^{n} E\|X_j\|_\rho^2, \qquad (A2)$$
where $X_1, X_2, \ldots, X_n$ are independent $\mathrm{Lip}(\rho)$-valued random variables with mean zero.
Lemma A2.
For any $\beta^{(m)}\in\Theta_m$, define
$$v_i\big(\beta^{(m)}\big) = \frac{g'\big(\varepsilon(i,p_m)^T\beta^{(m)}\big)}{\sigma^2\big(g(\varepsilon(i,p_m)^T\beta^{(m)})\big)}.$$
Under Condition 3, we have
$$\sup_{\beta^{(m)}\in\Theta_m}\Big\|\sum_{i=1}^{n}\frac{v_i(\beta^{(m)})}{\sqrt{p_m+1}}\,\varepsilon(i,p_m)\,e_i\Big\| = O_p\big(\sqrt{n\,p_m}\big). \qquad (A3)$$
Proof of Lemma A2.
First note that for any $l\in\{0,\ldots,p_m\}$, we have
$$
\Lambda_{v_i\varepsilon_{i,l}/\sqrt{p_m+1}}
= \sup_{\beta_1^{(m)}\neq\beta_2^{(m)}\in\Theta_m}\frac{\big|v_i(\beta_1^{(m)})-v_i(\beta_2^{(m)})\big|\,|\varepsilon_{i,l}|}{\sqrt{p_m+1}\,\rho(\beta_1^{(m)},\beta_2^{(m)})}
= \sup_{\beta_1^{(m)}\neq\beta_2^{(m)}\in\Theta_m}\frac{\big|v_i(\beta_1^{(m)})-v_i(\beta_2^{(m)})\big|}{\big|\varepsilon(i,p_m)^T\beta_1^{(m)}-\varepsilon(i,p_m)^T\beta_2^{(m)}\big|}\cdot\frac{\big|\varepsilon(i,p_m)^T\beta_1^{(m)}-\varepsilon(i,p_m)^T\beta_2^{(m)}\big|}{\sqrt{p_m+1}\,\rho(\beta_1^{(m)},\beta_2^{(m)})}\,|\varepsilon_{i,l}|
$$
$$
= \sup_{\beta_1^{(m)}\neq\beta_2^{(m)}\in\Theta_m}\left|\frac{g''(\gamma_i)\,\sigma^2(g(\gamma_i))-[g'(\gamma_i)]^2\,(\sigma^2)'(g(\gamma_i))}{\sigma^4(g(\gamma_i))}\right|\cdot\frac{\big|\varepsilon(i,p_m)^T\beta_1^{(m)}-\varepsilon(i,p_m)^T\beta_2^{(m)}\big|}{\sqrt{p_m+1}\,\rho(\beta_1^{(m)},\beta_2^{(m)})}\,|\varepsilon_{i,l}|,
$$
where the last step is by the mean-value theorem and $\gamma_i$ is a point between $\varepsilon(i,p_m)^T\beta_1^{(m)}$ and $\varepsilon(i,p_m)^T\beta_2^{(m)}$. From the assumptions that $g(\cdot)$ is a twice continuously differentiable function with bounded derivatives $|g'(\cdot)|\le c<\infty$ and $|g''(\cdot)|\le c_1<\infty$, and that $\sigma^2(\cdot)$ is strictly positive with bounds $0<d_1\le\sigma^2(\cdot)\le d_2<\infty$ and $|(\sigma^2)'(\cdot)|\le d_3<\infty$, we see that there is a constant $c>0$ such that $|v_i'(\cdot)|\le c<\infty$, and
$$
\Lambda_{v_i\varepsilon_{i,l}/\sqrt{p_m+1}}
\le \sup_{\beta_1^{(m)}\neq\beta_2^{(m)}\in\Theta_m} c\,\frac{\big|\varepsilon(i,p_m)^T(\beta_1^{(m)}-\beta_2^{(m)})\big|}{\sqrt{p_m+1}\,\rho(\beta_1^{(m)},\beta_2^{(m)})}\,|\varepsilon_{i,l}|
\le \frac{c\,\|\varepsilon(i,p_m)\|\,|\varepsilon_{i,l}|}{\sqrt{p_m+1}},
$$
where the second inequality is by the Cauchy–Schwarz inequality. Therefore, we obtain
$$
\Big\|\frac{v_i\varepsilon_{i,l}}{\sqrt{p_m+1}}\Big\|_\rho
= \Lambda_{v_i\varepsilon_{i,l}/\sqrt{p_m+1}} + \frac{|v_i(\beta^{(m)*})\,\varepsilon_{i,l}|}{\sqrt{p_m+1}}
\le \frac{c\,\|\varepsilon(i,p_m)\|\,|\varepsilon_{i,l}|}{\sqrt{p_m+1}} + \frac{c_1\,|\varepsilon_{i,l}|}{\sqrt{p_m+1}}
\le \frac{c'\big(\|\varepsilon(i,p_m)\|+1\big)|\varepsilon_{i,l}|}{\sqrt{p_m+1}} < \infty,
$$
with $c' = \max(c, c_1)$. As $\Theta_m$ is a compact subset of $\mathbb{R}^{p_m+1}$ and $\rho(\beta_1^{(m)},\beta_2^{(m)})$ is the Euclidean metric on $\mathbb{R}^{p_m+1}$, (A1) is satisfied. Thus, by Lemma A1, there is a constant $A>0$, uniform in $l$, such that for any $C>0$, we have
$$
P\Big(\sup_{\beta^{(m)}\in\Theta_m}\Big|\sum_{i=1}^{n}\frac{v_i(\beta^{(m)})}{\sqrt{p_m+1}}\,\varepsilon_{i,l}\,e_i\Big|^2 > Cn\Big)
= P\Big(\Big\|\sum_{i=1}^{n}\frac{v_i\varepsilon_{i,l}}{\sqrt{p_m+1}}\,e_i\Big\|^2 > Cn\Big)
\le \frac{1}{Cn}\,E\Big\|\sum_{i=1}^{n}\frac{v_i\varepsilon_{i,l}}{\sqrt{p_m+1}}\,e_i\Big\|^2
\le \frac{A}{Cn}\sum_{i=1}^{n}\Big\|\frac{v_i\varepsilon_{i,l}}{\sqrt{p_m+1}}\Big\|_\rho^2\,\sup_i Ee_i^2. \qquad (A4)
$$
Notice that
$$
\sup_{\beta^{(m)}\in\Theta_m}\Big\|\sum_{i=1}^{n}\frac{v_i(\beta^{(m)})}{\sqrt{p_m+1}}\,\varepsilon(i,p_m)\,e_i\Big\|^2
\le \sum_{l=0}^{p_m}\sup_{\beta^{(m)}\in\Theta_m}\Big|\sum_{i=1}^{n}\frac{v_i(\beta^{(m)})}{\sqrt{p_m+1}}\,\varepsilon_{i,l}\,e_i\Big|^2.
$$
Therefore, for any $\epsilon>0$, letting $C = Ac^2(C_2+1)^2C_2C_1/\epsilon$, we obtain
$$
P\Big(\sup_{\beta^{(m)}\in\Theta_m}\Big\|\sum_{i=1}^{n}\frac{v_i(\beta^{(m)})}{\sqrt{p_m+1}}\,\varepsilon(i,p_m)\,e_i\Big\|^2 > Cn(p_m+1)\Big)
\le P\Big(\sum_{l=0}^{p_m}\sup_{\beta^{(m)}\in\Theta_m}\Big|\sum_{i=1}^{n}\frac{v_i(\beta^{(m)})}{\sqrt{p_m+1}}\,\varepsilon_{i,l}\,e_i\Big|^2 > Cn(p_m+1)\Big)
\le \sum_{l=0}^{p_m}P\Big(\sup_{\beta^{(m)}\in\Theta_m}\Big|\sum_{i=1}^{n}\frac{v_i(\beta^{(m)})}{\sqrt{p_m+1}}\,\varepsilon_{i,l}\,e_i\Big|^2 > Cn\Big)
$$
$$
\le \sum_{l=0}^{p_m}\frac{A}{Cn}\sum_{i=1}^{n}\Big\|\frac{v_i\varepsilon_{i,l}}{\sqrt{p_m+1}}\Big\|_\rho^2\,\sup_i Ee_i^2
\le \frac{Ac^2}{Cn(p_m+1)}\sum_{i=1}^{n}\big(\|\varepsilon(i,p_m)\|+1\big)^2\|\varepsilon(i,p_m)\|^2\,\sup_i Ee_i^2
\le \frac{Ac^2(C_2+1)^2C_2C_1}{C} = \epsilon,
$$
which implies (A3).  □
Lemma A3.
Under Conditions 1–3 and 6, we have
$$\big\|\hat\beta^{(m)} - \beta^{(m)*}\big\|^2 = O_p\big(p_m^3/n\big), \qquad (A5)$$
where $\hat\beta^{(m)}$, belonging to $\Theta_m$, is the root of (4).
Proof of Lemma A3.
By the definition of $\beta^{(m)*}$ and Condition 6, we have
$$
\big\|\bar U_{n,m}(\beta^{(m)})\big\|^2
= \big\|\bar U_{n,m}(\beta^{(m)}) - \bar U_{n,m}(\beta^{(m)*})\big\|^2
= \Big\|\frac{\partial \bar U_{n,m}(\beta^{(m)})}{\partial\beta^{(m)}}\Big|_{\beta^{(m)}=\bar\beta^{(m)}}\big(\beta^{(m)}-\beta^{(m)*}\big)\Big\|^2
\ge C_0^2\big\|\beta^{(m)}-\beta^{(m)*}\big\|^2, \qquad (A6)
$$
where $\bar\beta^{(m)}$ is a point between $\beta^{(m)}$ and $\beta^{(m)*}$. Recalling that
$$
U_{n,m}\big(\hat\beta^{(m)}\big)
= \frac{1}{n}\sum_{i=1}^{n}\big(y_i - g(\hat\eta_{i,p_m})\big)\frac{g'(\hat\eta_{i,p_m})}{\sigma^2\big(g(\hat\eta_{i,p_m})\big)}\,\varepsilon(i,p_m)
= \frac{1}{n}\sum_{i=1}^{n}\big(\mu_i + e_i - g(\hat\eta_{i,p_m})\big)\frac{g'(\hat\eta_{i,p_m})}{\sigma^2\big(g(\hat\eta_{i,p_m})\big)}\,\varepsilon(i,p_m)
= \bar U_{n,m}\big(\hat\beta^{(m)}\big) + \frac{1}{n}\sum_{i=1}^{n} e_i\,\frac{g'(\hat\eta_{i,p_m})}{\sigma^2\big(g(\hat\eta_{i,p_m})\big)}\,\varepsilon(i,p_m) = 0,
$$
we obtain
$$
\bar U_{n,m}\big(\hat\beta^{(m)}\big) = -\frac{1}{n}\sum_{i=1}^{n} e_i\,\frac{g'(\hat\eta_{i,p_m})}{\sigma^2\big(g(\hat\eta_{i,p_m})\big)}\,\varepsilon(i,p_m)
= -\frac{1}{n}\,\varepsilon_n^{(m)T}V_n\big(\hat\beta^{(m)}\big)e,
$$
where $V_n(\hat\beta^{(m)}) = \mathrm{diag}\big\{g'(\hat\eta_{i,p_m})/\sigma^2(g(\hat\eta_{i,p_m}))\big\}_{1\le i\le n}$. From (A6), we get
$$
\big\|\bar U_{n,m}(\beta^{(m)})\big\| \ge C_0\delta \quad \text{whenever } \big\|\beta^{(m)}-\beta^{(m)*}\big\| \ge \delta, \quad \forall\,\delta>0. \qquad (A7)
$$
By Condition 1, for any $\kappa>0$, there is an $N_1$ such that for all $n>N_1$, we have
$$P\big(0\in U_{n,m}(\Theta_m)\big) > 1-\kappa.$$
From (A7), it can be seen that
$$
\big\{0\in U_{n,m}(\Theta_m)\big\}
= \Big\{\text{there is a }\hat\beta^{(m)}\in\Theta_m\text{ such that }\frac{1}{n}\varepsilon_n^{(m)T}V_n(\hat\beta^{(m)})e = -\bar U_{n,m}(\hat\beta^{(m)})\Big\}
\subseteq \Big\{\sup_{\beta^{(m)}\in\Theta_m}\big\|\varepsilon_n^{(m)T}V_n(\beta^{(m)})e\big\| \ge C_0 n\delta\Big\} \cup \Big\{\big\|\hat\beta^{(m)}-\beta^{(m)*}\big\| < \delta\Big\}.
$$
Then for any $C>0$ and $n>N_1$, letting $\delta = C(p_m+1)^{3/2}/\sqrt{n}$, we have
$$
P\Big(\big\|\hat\beta^{(m)}-\beta^{(m)*}\big\| \le \frac{C(p_m+1)^{3/2}}{\sqrt{n}}\Big)
\ge 1-\kappa - P\Big(\sup_{\beta^{(m)}\in\Theta_m}\big\|\varepsilon_n^{(m)T}V_n(\beta^{(m)})e\big\| > C_0C\sqrt{n}\,(p_m+1)^{3/2}\Big)
\ge 1-\kappa - \frac{Ac^2(C_2+1)^2C_2C_1}{C_0^2C^2}
= 1-\kappa - \frac{C'}{C_0^2C^2},
$$
where $C' = Ac^2(C_2+1)^2C_2C_1$ and the second inequality is derived from (A4). As a result, for any $\kappa>0$, we can select $C = \sqrt{C'/\kappa}/C_0$ such that
$$
P\Big(\big\|\hat\beta^{(m)}-\beta^{(m)*}\big\| > \frac{C(p_m+1)^{3/2}}{\sqrt{n}}\Big) < 2\kappa
$$
for sufficiently large $n$; thus
$$\big\|\hat\beta^{(m)}-\beta^{(m)*}\big\|^2 = O_p\big(p_m^3/n\big). \qquad \square$$
Lemma A4.
Under Conditions 1–4 and 6, we have
$$\sup_{w\in H_n}\Big|\frac{L_n(w)}{R_n(w)} - 1\Big| = o_p(1). \qquad (A8)$$
Proof of Lemma A4.
Write $\Delta_i(w) = g(\eta_i) - g\big(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\beta^{(m)*}\big)$. From the definition of $L_n(w)$, we have
$$
L_n(w) = \sum_{i=1}^{n}\Big(g(\eta_i) - g\Big(\sum_{m=1}^{M}w_m\hat\eta_{i,p_m}\Big)\Big)^2
= \sum_{i=1}^{n}\Big(g(\eta_i) - g\Big(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\beta^{(m)*}\Big) + g\Big(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\beta^{(m)*}\Big) - g\Big(\sum_{m=1}^{M}w_m\hat\eta_{i,p_m}\Big)\Big)^2
$$
$$
= \sum_{i=1}^{n}\Delta_i^2(w)
+ \sum_{i=1}^{n}\Big(g\Big(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\beta^{(m)*}\Big) - g\Big(\sum_{m=1}^{M}w_m\hat\eta_{i,p_m}\Big)\Big)^2
+ 2\sum_{i=1}^{n}\Delta_i(w)\Big(g\Big(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\beta^{(m)*}\Big) - g\Big(\sum_{m=1}^{M}w_m\hat\eta_{i,p_m}\Big)\Big)
\equiv R_n(w) + L_n^{(2)}(w) + L_n^{(3)}(w).
$$
We note that
$$
\Big|\frac{L_n(w) - R_n(w)}{R_n(w)}\Big| = \frac{\big|L_n^{(2)}(w) + L_n^{(3)}(w)\big|}{R_n(w)}
\le \sup_{w\in H_n}\Bigg\{\frac{L_n^{(2)}(w)}{R_n(w)} + 2\Big(\frac{L_n^{(2)}(w)}{R_n(w)}\Big)^{1/2}\Bigg\},
$$
since $|L_n^{(3)}(w)| \le 2\big(R_n(w)L_n^{(2)}(w)\big)^{1/2}$ by the Cauchy–Schwarz inequality. Then, (A8) is valid if
$$
\sup_{w\in H_n}\frac{L_n^{(2)}(w)}{R_n(w)} \stackrel{p}{\to} 0. \qquad (A9)
$$
Let $\eta_i^{*m}$ be a point between $\varepsilon(i,p_m)^T\beta^{(m)*}$ and $\hat\eta_{i,p_m}$. For fixed $M$,
$$
L_n^{(2)}(w) = \sum_{i=1}^{n}\Big(g\Big(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\beta^{(m)*}\Big) - g\Big(\sum_{m=1}^{M}w_m\hat\eta_{i,p_m}\Big)\Big)^2
= \sum_{i=1}^{n}\Big[g'\Big(\sum_{m=1}^{M}w_m\eta_i^{*m}\Big)\Big]^2\Big(\sum_{m=1}^{M}w_m\big(\varepsilon(i,p_m)^T\beta^{(m)*} - \hat\eta_{i,p_m}\big)\Big)^2
\le \sum_{i=1}^{n}\Big[g'\Big(\sum_{m=1}^{M}w_m\eta_i^{*m}\Big)\Big]^2\sum_{m=1}^{M}\big(\varepsilon(i,p_m)^T\beta^{(m)*} - \hat\eta_{i,p_m}\big)^2
\le c^2\sum_{i=1}^{n}\sum_{m=1}^{M}\big(\varepsilon(i,p_m)^T(\beta^{(m)*} - \hat\beta^{(m)})\big)^2.
$$
Then, by Lemma A3 and Condition 3, we have
$$
\sup_{w\in H_n} L_n^{(2)}(w) \le c^2\sum_{i=1}^{n}\sum_{m=1}^{M}\big(\varepsilon(i,p_m)^T(\beta^{(m)*} - \hat\beta^{(m)})\big)^2 = O_p\Big(\sum_{m=1}^{M}p_m^4\Big),
$$
which, together with Condition 4, leads to (A9).  □

Appendix B. Proof of Theorem 1

Let $\tilde\mu(w) = \big(\tilde\mu_1(w), \tilde\mu_2(w), \ldots, \tilde\mu_n(w)\big)^T$, and
$$\tilde L_n(w) = \|\mu - \tilde\mu(w)\|^2.$$
As in Li (1987) and Ando and Li (2014), we know that
$$
CV(w) = \|e\|^2 + \tilde L_n(w) + 2\big\langle e, \mu - \tilde\mu(w)\big\rangle
= \|e\|^2 + L_n(w)\Big\{\frac{\tilde L_n(w)}{L_n(w)} + \frac{2\langle e, \mu - \tilde\mu(w)\rangle}{L_n(w)}\Big\}. \qquad (A10)
$$
As $\hat w$ minimizes $CV(w)$ over $w\in H_n$, it also minimizes $CV(w) - \|e\|^2$ over $w\in H_n$. Therefore, the claim
$$\frac{L_n(\hat w)}{\inf_{w\in H_n}L_n(w)} \stackrel{p}{\to} 1$$
is valid if
$$\sup_{w\in H_n}\Big|\frac{\tilde L_n(w)}{L_n(w)} - 1\Big| \stackrel{p}{\to} 0 \qquad (A11)$$
and
$$\sup_{w\in H_n}\frac{\big|\langle e, \mu - \tilde\mu(w)\rangle\big|}{L_n(w)} \stackrel{p}{\to} 0 \qquad (A12)$$
hold. In fact, if we denote $w^* = \arg\min_{w\in H_n}L_n(w)$, then
$$\frac{L_n(\hat w)}{\inf_{w\in H_n}L_n(w)} = \frac{L_n(\hat w)}{L_n(w^*)} \ge 1,$$
so we only need to prove
$$\frac{L_n(\hat w)}{L_n(w^*)} \le 1 + \delta_n,$$
where $\delta_n \ge 0$ for $n = 1, 2, \ldots$, and $\delta_n \stackrel{p}{\to} 0$. According to the definition of $\hat w$, we have $CV(\hat w) \le CV(w^*)$. Then, by (A10), we obtain
$$
\|e\|^2 + L_n(\hat w)\Big\{\frac{\tilde L_n(\hat w)}{L_n(\hat w)} + \frac{2\langle e,\mu-\tilde\mu(\hat w)\rangle}{L_n(\hat w)}\Big\}
\le \|e\|^2 + L_n(w^*)\Big\{\frac{\tilde L_n(w^*)}{L_n(w^*)} + \frac{2\langle e,\mu-\tilde\mu(w^*)\rangle}{L_n(w^*)}\Big\},
$$
which is equivalent to
$$
\frac{L_n(\hat w)}{L_n(w^*)}
\le \Big\{\frac{\tilde L_n(w^*)}{L_n(w^*)} + \frac{2\langle e,\mu-\tilde\mu(w^*)\rangle}{L_n(w^*)}\Big\}
\Big/ \Big\{\frac{\tilde L_n(\hat w)}{L_n(\hat w)} + \frac{2\langle e,\mu-\tilde\mu(\hat w)\rangle}{L_n(\hat w)}\Big\}.
$$
From (A11) and (A12), we have
$$
\frac{\tilde L_n(w^*)}{L_n(w^*)} + \frac{2\langle e,\mu-\tilde\mu(w^*)\rangle}{L_n(w^*)}
\le \sup_{w\in H_n}\Big\{\frac{\tilde L_n(w)}{L_n(w)} + \frac{2|\langle e,\mu-\tilde\mu(w)\rangle|}{L_n(w)}\Big\}
\le \sup_{w\in H_n}\Big|\frac{\tilde L_n(w)}{L_n(w)} - 1\Big| + 1 + \sup_{w\in H_n}\frac{2|\langle e,\mu-\tilde\mu(w)\rangle|}{L_n(w)},
$$
and
$$
\frac{\tilde L_n(\hat w)}{L_n(\hat w)} + \frac{2\langle e,\mu-\tilde\mu(\hat w)\rangle}{L_n(\hat w)}
\ge 1 - \sup_{w\in H_n}\Big|\frac{\tilde L_n(w)}{L_n(w)} - 1\Big| - \sup_{w\in H_n}\frac{2|\langle e,\mu-\tilde\mu(w)\rangle|}{L_n(w)}.
$$
Therefore,
$$
1 \le \frac{L_n(\hat w)}{L_n(w^*)} \le \frac{1+\delta_n}{1-\delta_n},
$$
with
$$
\delta_n = \sup_{w\in H_n}\Big|\frac{\tilde L_n(w)}{L_n(w)} - 1\Big| + \sup_{w\in H_n}\frac{2|\langle e,\mu-\tilde\mu(w)\rangle|}{L_n(w)}.
$$
Thus, we obtain
$$\frac{L_n(\hat w)}{L_n(w^*)} \stackrel{p}{\to} 1.$$
In the following, we prove (A11) and (A12).

Appendix B.1. Proof of (A11)

Notice that
$$
\tilde L_n(w) - L_n(w)
= \sum_{i=1}^{n}\tilde\mu_i^2(w) - \sum_{i=1}^{n}\hat\mu_i^2(w) + 2\sum_{i=1}^{n}\mu_i\big(\hat\mu_i(w) - \tilde\mu_i(w)\big)
= \sum_{i=1}^{n}\big(\tilde\mu_i(w)-\hat\mu_i(w)\big)^2 - 2\sum_{i=1}^{n}\hat\mu_i^2(w) + 2\sum_{i=1}^{n}\tilde\mu_i(w)\hat\mu_i(w) + 2\sum_{i=1}^{n}\mu_i\big(\hat\mu_i(w)-\tilde\mu_i(w)\big)
$$
$$
= \sum_{i=1}^{n}\big(\tilde\mu_i(w)-\hat\mu_i(w)\big)^2 + 2\sum_{i=1}^{n}\big(\mu_i-\hat\mu_i(w)\big)\big(\hat\mu_i(w)-\tilde\mu_i(w)\big)
= \|\hat\mu(w)-\tilde\mu(w)\|^2 + 2\big\langle \mu-\hat\mu(w),\, \hat\mu(w)-\tilde\mu(w)\big\rangle
\le \|\hat\mu(w)-\tilde\mu(w)\|^2 + 2\sqrt{L_n(w)}\,\|\hat\mu(w)-\tilde\mu(w)\|.
$$
So,
$$
\Big|\frac{\tilde L_n(w)}{L_n(w)} - 1\Big| = \frac{|\tilde L_n(w) - L_n(w)|}{L_n(w)}
\le \frac{\|\hat\mu(w)-\tilde\mu(w)\|^2}{L_n(w)} + \frac{2\|\hat\mu(w)-\tilde\mu(w)\|}{\sqrt{L_n(w)}}.
$$
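The expansion above is again pure vector algebra, which can be checked numerically on arbitrary stand-in vectors (synthetic, not model output):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
mu = rng.standard_normal(n)
mu_hat = rng.standard_normal(n)    # stand-in for mu^(w)
mu_tilde = rng.standard_normal(n)  # stand-in for mu~(w)

# L~_n - L_n = ||mu^ - mu~||^2 + 2<mu - mu^, mu^ - mu~>.
lhs = np.sum((mu - mu_tilde) ** 2) - np.sum((mu - mu_hat) ** 2)
rhs = np.sum((mu_hat - mu_tilde) ** 2) + 2 * (mu - mu_hat) @ (mu_hat - mu_tilde)
assert np.isclose(lhs, rhs)
```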
Therefore, to prove (A11), it suffices to verify
$$\sup_{w\in H_n}\frac{\|\hat\mu(w)-\tilde\mu(w)\|^2}{L_n(w)} \stackrel{p}{\to} 0.$$
By Lemma A4, we need only to show
$$\sup_{w\in H_n}\frac{\|\hat\mu(w)-\tilde\mu(w)\|^2}{R_n(w)} \stackrel{p}{\to} 0. \qquad (A13)$$
Let $\eta^*_{i,p_m}$ be a point between $\tilde\eta_{i,p_m}$ and $\hat\eta_{i,p_m}$. Then, for any $\delta>0$, we have
$$
P\Big(\sup_{w\in H_n}\frac{\|\hat\mu(w)-\tilde\mu(w)\|^2}{R_n(w)} > \delta\Big)
\le P\Big(\sup_{w\in H_n}\|\hat\mu(w)-\tilde\mu(w)\|^2 > \delta\xi_n\Big)
= P\Big(\sup_{w\in H_n}\sum_{i=1}^{n}\big(\tilde\mu_i(w)-\hat\mu_i(w)\big)^2 > \delta\xi_n\Big)
= P\Big(\sup_{w\in H_n}\sum_{i=1}^{n}\Big(g\Big(\sum_{m=1}^{M}w_m\tilde\eta_{i,p_m}\Big) - g\Big(\sum_{m=1}^{M}w_m\hat\eta_{i,p_m}\Big)\Big)^2 > \delta\xi_n\Big)
$$
$$
= P\Big(\sup_{w\in H_n}\sum_{i=1}^{n}\big[g'(\eta^*_{i,p_m})\big]^2\Big(\sum_{m=1}^{M}w_m\big(\tilde\eta_{i,p_m}-\hat\eta_{i,p_m}\big)\Big)^2 > \delta\xi_n\Big)
\le P\Big(\max_{1\le i\le n,\,w\in H_n}\big[g'(\eta^*_{i,p_m})\big]^2\,\sup_{w\in H_n}\sum_{i=1}^{n}\sum_{m=1}^{M}\big(\tilde\eta_{i,p_m}-\hat\eta_{i,p_m}\big)^2 > \delta\xi_n\Big)
= P\Big(\max_{1\le i\le n,\,w\in H_n}\big[g'(\eta^*_{i,p_m})\big]^2\,\sum_{i=1}^{n}\sum_{m=1}^{M}\big(\tilde\eta_{i,p_m}-\hat\eta_{i,p_m}\big)^2 > \delta\xi_n\Big),
$$
which, together with the assumption that $g(\cdot)$ is a twice continuously differentiable function with bounded derivatives (implying $\max_{1\le i\le n,\,w\in H_n}|g'(\eta^*_{i,p_m})|^2 \le c^2 < \infty$), leads to
$$
P\Big(\sup_{w\in H_n}\frac{\|\hat\mu(w)-\tilde\mu(w)\|^2}{R_n(w)} > \delta\Big)
\le P\Big(c^2\sum_{i=1}^{n}\sum_{m=1}^{M}\big(\tilde\eta_{i,p_m}-\hat\eta_{i,p_m}\big)^2/\xi_n > \delta\Big).
$$
Thus, to prove (A13), it suffices to show
$$\sum_{i=1}^{n}\sum_{m=1}^{M}\big(\tilde\eta_{i,p_m}-\hat\eta_{i,p_m}\big)^2/\xi_n = o_p(1). \qquad (A14)$$
By Condition 5, for fixed $M$, we obtain
$$\sum_{i=1}^{n}\sum_{m=1}^{M}\big(\tilde\eta_{i,p_m}-\hat\eta_{i,p_m}\big)^2 = O_p\Big(\sum_{m=1}^{M}p_m^4\Big),$$
which, together with Condition 4, leads to (A14), and thus (A13) holds.

Appendix B.2. Proof of (A12)

As
$$\big\langle e, \mu-\tilde\mu(w)\big\rangle = \sum_{i=1}^{n} e_i\Big(g(\eta_i) - g\Big(\sum_{m=1}^{M}w_m\tilde\eta_{i,p_m}\Big)\Big),$$
it is sufficient to show
$$\sup_{w\in H_n}\Big|\sum_{i=1}^{n} e_i\Big(g(\eta_i) - g\Big(\sum_{m=1}^{M}w_m\tilde\eta_{i,p_m}\Big)\Big)\Big|\Big/R_n(w) \stackrel{p}{\to} 0.$$
It is readily seen that
$$
\sup_{w\in H_n}\frac{\big|\sum_{i=1}^{n} e_i\big(g(\eta_i) - g(\sum_{m=1}^{M}w_m\tilde\eta_{i,p_m})\big)\big|}{R_n(w)}
\le \sup_{w\in H_n}\frac{\big|\sum_{i=1}^{n} e_i\big(g(\eta_i) - g(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\beta^{(m)*})\big)\big|}{R_n(w)}
+ \sup_{w\in H_n}\frac{\big|\sum_{i=1}^{n} e_i\big(g(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\beta^{(m)*}) - g(\sum_{m=1}^{M}w_m\hat\eta_{i,p_m})\big)\big|}{R_n(w)}
+ \sup_{w\in H_n}\frac{\big|\sum_{i=1}^{n} e_i\big(g(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\hat\beta^{(m)}) - g(\sum_{m=1}^{M}w_m\tilde\eta_{i,p_m})\big)\big|}{R_n(w)}
\equiv \sup_{w\in H_n}A_n^{(1)}(w) + \sup_{w\in H_n}A_n^{(2)}(w) + \sup_{w\in H_n}A_n^{(3)}(w).
$$
Thus, we need only to prove
$$\sup_{w\in H_n}A_n^{(1)}(w) \stackrel{p}{\to} 0, \qquad (A15)$$
$$\sup_{w\in H_n}A_n^{(2)}(w) \stackrel{p}{\to} 0, \qquad (A16)$$
and
$$\sup_{w\in H_n}A_n^{(3)}(w) \stackrel{p}{\to} 0. \qquad (A17)$$
The proof of (A15) is similar to that of Wu (1981). Define the Euclidean metric
$$\rho(w, w') = \|w - w'\|$$
on $H_n$, so that $(H_n, \rho)$ is a compact metric space. Then $C(H_n)$ is the Banach space of real-valued continuous functions on $H_n$ with the supremum norm
$$\|\Delta\| = \sup_{w\in H_n}|\Delta(w)|.$$
Let $N(H_n,\rho,\varepsilon)$ denote the minimal number of $\rho$-balls of radius less than or equal to $\varepsilon$ which cover $H_n$, and set
$$H(H_n,\rho,\varepsilon) = \log N(H_n,\rho,\varepsilon).$$
We let
$$\mathrm{Lip}(\rho) = \Big\{\Delta\in C(H_n): \Lambda_\Delta = \sup_{w\neq w'\in H_n}\frac{|\Delta(w)-\Delta(w')|}{\rho(w,w')} < \infty\Big\},$$
and for $\Delta\in\mathrm{Lip}(\rho)$, we define
$$\|\Delta\|_\rho = \Lambda_\Delta + |\Delta(w^*)|,$$
where $w^*$ is some fixed point in $H_n$.
Recalling that $\Delta_i(w) = g(\eta_i) - g\big(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\beta^{(m)*}\big)$, we have
$$
\Lambda_{\Delta_i/\sqrt{p+1}}
= \sup_{w\neq w'\in H_n}\frac{|\Delta_i(w)-\Delta_i(w')|}{\sqrt{p+1}\,\rho(w,w')}
= \sup_{w\neq w'\in H_n}\frac{\big|g'(\gamma_{0,i})\big|\,\big|\sum_{m=1}^{M}(w_m-w'_m)\,\varepsilon(i,p_m)^T\beta^{(m)*}\big|}{\sqrt{p+1}\,\rho(w,w')}
\le c\,\sup_{w\neq w'\in H_n}\frac{\big|\sum_{m=1}^{M}(w_m-w'_m)\,\varepsilon(i,p_m)^T\beta^{(m)*}\big|}{\sqrt{p+1}\,\rho(w,w')}
\le \frac{c\big(\sum_{m=1}^{M}\big(\varepsilon(i,p_m)^T\beta^{(m)*}\big)^2\big)^{1/2}}{(p+1)^{1/2}},
$$
where $\gamma_{0,i}$ is a point between $\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\beta^{(m)*}$ and $\sum_{m=1}^{M}w'_m\varepsilon(i,p_m)^T\beta^{(m)*}$, and the last inequality is by the Cauchy–Schwarz inequality. From the assumption $\|\beta^{(m)*}\|^2/(p_m+1) \le C_b < \infty$ and Condition 3, we obtain
$$\sup_i \Lambda_{\Delta_i/\sqrt{p+1}} \le C_g < \infty. \qquad (A18)$$
As for $|\Delta_i(w^*)|/\sqrt{p+1}$, using the Lagrange mean-value theorem, we have
$$
\frac{|\Delta_i(w^*)|}{\sqrt{p+1}}
= \frac{1}{\sqrt{p+1}}\,\big|g'(\zeta_i)\big|\,\Big|\eta_i - \sum_{m=1}^{M}w^*_m\,\varepsilon(i,p_m)^T\beta^{(m)*}\Big|
\le \frac{c\big(\sum_{m=1}^{M}\big(\eta_i - \varepsilon(i,p_m)^T\beta^{(m)*}\big)^2\big)^{1/2}}{(p+1)^{1/2}},
$$
where $\zeta_i$ is a point between $\eta_i$ and $\sum_{m=1}^{M}w^*_m\,\varepsilon(i,p_m)^T\beta^{(m)*}$. Again, by Condition 3, $\|\beta^{(m)*}\|^2/(p_m+1)\le C_b<\infty$, and the assumption $\sup_i|\eta_i| \le C_\eta < \infty$, we obtain
$$\sup_i \frac{|\Delta_i(w^*)|}{\sqrt{p+1}} \le \tilde C < \infty. \qquad (A19)$$
For (A15), we have
$$
P\Big(\sup_{w\in H_n}A_n^{(1)}(w) > \delta\Big)
\le P\Big(\sup_{w\in H_n}\Big|\sum_{i=1}^{n} e_i\Big(g(\eta_i) - g\Big(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\beta^{(m)*}\Big)\Big)\Big| > \delta\xi_n\Big)
= P\Big(\sup_{w\in H_n}\Big|\sum_{i=1}^{n}\frac{e_i\Delta_i(w)}{\sqrt{p+1}}\Big| > \frac{\delta\xi_n}{\sqrt{p+1}}\Big)
\le \frac{(p+1)\,E\big(\sup_{w\in H_n}\big|\sum_{i=1}^{n}e_i\Delta_i(w)/\sqrt{p+1}\big|\big)^2}{\delta^2\xi_n^2}
= \frac{(p+1)\,E\big\|\sum_{i=1}^{n}e_i\Delta_i/\sqrt{p+1}\big\|^2}{\delta^2\xi_n^2},
$$
where $\delta>0$ is an arbitrary constant. Since $H_n$ is a compact subset of $\mathbb{R}^M$, and $\rho(w,w')$ is the Euclidean metric on $\mathbb{R}^M$, (A1) is satisfied. Therefore, by Lemma A1, we see that there is a constant $A<\infty$ such that for all $n$,
$$
E\Big\|\sum_{i=1}^{n}\frac{e_i\Delta_i}{\sqrt{p+1}}\Big\|^2
\le A\sum_{i=1}^{n}E\Big\|\frac{e_i\Delta_i}{\sqrt{p+1}}\Big\|_\rho^2
\le A\,\sup_j Ee_j^2\,\sum_{i=1}^{n}\Big(\Lambda_{\Delta_i/\sqrt{p+1}} + \frac{|\Delta_i(w^*)|}{\sqrt{p+1}}\Big)^2
\le 2A\,\sup_j Ee_j^2\,\sum_{i=1}^{n}\Big(\Lambda^2_{\Delta_i/\sqrt{p+1}} + \frac{|\Delta_i(w^*)|^2}{p+1}\Big)
= O(n),
$$
where the last equality follows from (A18), (A19), and $\sup_j Ee_j^2 < \infty$. Therefore,
$$
P\Big(\sup_{w\in H_n}A_n^{(1)}(w) > \delta\Big) = O\Big(\frac{(p+1)\,n}{\xi_n^2}\Big) \to 0,
$$
and (A15) holds.
Denote $\tilde\Delta_i(w) = g\big(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\beta^{(m)*}\big) - g\big(\sum_{m=1}^{M}w_m\hat\eta_{i,p_m}\big)$. For (A16), we have
$$
P\Big(\sup_{w\in H_n}A_n^{(2)}(w) > \delta\Big)
\le P\Big(\sup_{w\in H_n}\Big|\sum_{i=1}^{n} e_i\tilde\Delta_i(w)\Big|^2 > \delta^2\xi_n^2\Big)
\le P\Big(\sup_{w\in H_n}\Big(\sum_{i=1}^{n}e_i^2\Big)\Big(\sum_{i=1}^{n}\tilde\Delta_i^2(w)\Big) > \delta^2\xi_n^2\Big)
\le P\Big(\sup_{w\in H_n}\sum_{i=1}^{n}\tilde\Delta_i^2(w) > \delta^2\xi_n p^2/n\Big) + P\Big(\sum_{i=1}^{n}e_i^2 > \xi_n n/p^2\Big)
$$
$$
= P\Big(\sup_{w\in H_n}\sum_{i=1}^{n}\Big[g'\Big(\sum_{m=1}^{M}w_m\tilde\eta_i^{*m}\Big)\Big]^2\Big(\sum_{m=1}^{M}w_m\big(\varepsilon(i,p_m)^T\beta^{(m)*} - \hat\eta_{i,p_m}\big)\Big)^2 > \delta^2\xi_n p^2/n\Big) + P\Big(\sum_{i=1}^{n}e_i^2 > \xi_n n/p^2\Big)
\le P\Big(c^2\sum_{i=1}^{n}\sum_{m=1}^{M}\big(\varepsilon(i,p_m)^T\beta^{(m)*} - \hat\eta_{i,p_m}\big)^2 > \delta^2\xi_n p^2/n\Big) + \frac{p^2\sum_{i=1}^{n}Ee_i^2}{\xi_n n},
$$
where the second inequality is by the Cauchy–Schwarz inequality and the last step applies the Markov inequality to the second term. From Lemma A3 and Condition 3, we see that
$$
\sum_{i=1}^{n}\sum_{m=1}^{M}\big(\varepsilon(i,p_m)^T\beta^{(m)*} - \hat\eta_{i,p_m}\big)^2
= \sum_{i=1}^{n}\sum_{m=1}^{M}\big(\varepsilon(i,p_m)^T(\beta^{(m)*} - \hat\beta^{(m)})\big)^2 = O_p\Big(\sum_{m=1}^{M}p_m^4\Big).
$$
Therefore, $\lim_{n\to+\infty}P\big(\sup_{w\in H_n}A_n^{(2)}(w) > \delta\big) = 0$; that is, (A16) is valid.
Write $\bar\Delta_i(w) = g\big(\sum_{m=1}^{M}w_m\varepsilon(i,p_m)^T\hat\beta^{(m)}\big) - g\big(\sum_{m=1}^{M}w_m\tilde\eta_{i,p_m}\big)$. For (A17), we have
$$
P\Big(\sup_{w\in H_n}A_n^{(3)}(w) > \delta\Big)
\le P\Big(\sup_{w\in H_n}\Big|\sum_{i=1}^{n} e_i\bar\Delta_i(w)\Big|^2 > \delta^2\xi_n^2\Big)
\le P\Big(\sup_{w\in H_n}\Big(\sum_{i=1}^{n}e_i^2\Big)\Big(\sum_{i=1}^{n}\bar\Delta_i^2(w)\Big) > \delta^2\xi_n^2\Big)
\le P\Big(\sup_{w\in H_n}\sum_{i=1}^{n}\bar\Delta_i^2(w) > \delta^2\xi_n p^2/n\Big) + P\Big(\sum_{i=1}^{n}e_i^2 > \xi_n n/p^2\Big)
$$
$$
= P\Big(\sup_{w\in H_n}\sum_{i=1}^{n}\Big[g'\Big(\sum_{m=1}^{M}w_m\eta_{i,p_m}^*\Big)\Big]^2\Big(\sum_{m=1}^{M}w_m\big(\varepsilon(i,p_m)^T\hat\beta^{(m)} - \tilde\eta_{i,p_m}\big)\Big)^2 > \delta^2\xi_n p^2/n\Big) + P\Big(\sum_{i=1}^{n}e_i^2 > \xi_n n/p^2\Big)
\le P\Big(c^2\sum_{i=1}^{n}\sum_{m=1}^{M}\big(\varepsilon(i,p_m)^T\hat\beta^{(m)} - \tilde\eta_{i,p_m}\big)^2 > \delta^2\xi_n p^2/n\Big) + \frac{p^2\sum_{i=1}^{n}Ee_i^2}{\xi_n n}.
$$
From Condition 5, we see that
$$\sum_{i=1}^{n}\sum_{m=1}^{M}\big(\varepsilon(i,p_m)^T\hat\beta^{(m)} - \tilde\eta_{i,p_m}\big)^2 = O_p\Big(\sum_{m=1}^{M}p_m^4\Big).$$
Therefore, $\lim_{n\to+\infty}P\big(\sup_{w\in H_n}A_n^{(3)}(w) > \delta\big) = 0$; that is, (A17) is valid.  □

Appendix C. Simulation Results in Section 4.1

Table A1. Prediction errors with n = 60 in Case 1.

| R | | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|--------|-------|-------|-------|-------|-------|-------|-------|
| 1 | Mean | 0.432 | 0.408 | 0.404 | 0.433 | 0.408 | 0.394 | 0.393 |
| | Median | 0.417 | 0.417 | 0.417 | 0.417 | 0.417 | 0.375 | 0.417 |
| | Var | 0.023 | 0.023 | 0.020 | 0.023 | 0.024 | 0.023 | 0.021 |
| 2 | Mean | 0.312 | 0.294 | 0.249 | 0.311 | 0.292 | 0.225 | 0.226 |
| | Median | 0.333 | 0.333 | 0.250 | 0.333 | 0.333 | 0.250 | 0.250 |
| | Var | 0.013 | 0.013 | 0.016 | 0.013 | 0.013 | 0.013 | 0.013 |
| 3 | Mean | 0.273 | 0.262 | 0.226 | 0.273 | 0.260 | 0.188 | 0.189 |
| | Median | 0.250 | 0.250 | 0.250 | 0.250 | 0.250 | 0.167 | 0.167 |
| | Var | 0.017 | 0.017 | 0.015 | 0.017 | 0.017 | 0.016 | 0.015 |
| 4 | Mean | 0.256 | 0.243 | 0.183 | 0.256 | 0.247 | 0.162 | 0.163 |
| | Median | 0.250 | 0.250 | 0.167 | 0.250 | 0.250 | 0.167 | 0.167 |
| | Var | 0.018 | 0.017 | 0.011 | 0.018 | 0.017 | 0.013 | 0.013 |
| 5 | Mean | 0.203 | 0.196 | 0.148 | 0.203 | 0.193 | 0.133 | 0.134 |
| | Median | 0.167 | 0.167 | 0.167 | 0.167 | 0.167 | 0.083 | 0.083 |
| | Var | 0.014 | 0.014 | 0.011 | 0.014 | 0.013 | 0.009 | 0.009 |
| 6 | Mean | 0.234 | 0.233 | 0.135 | 0.234 | 0.233 | 0.117 | 0.115 |
| | Median | 0.250 | 0.250 | 0.125 | 0.250 | 0.250 | 0.083 | 0.083 |
| | Var | 0.016 | 0.016 | 0.010 | 0.016 | 0.016 | 0.010 | 0.010 |
| 7 | Mean | 0.214 | 0.213 | 0.149 | 0.214 | 0.214 | 0.118 | 0.117 |
| | Median | 0.208 | 0.208 | 0.167 | 0.208 | 0.250 | 0.083 | 0.083 |
| | Var | 0.014 | 0.015 | 0.010 | 0.014 | 0.015 | 0.009 | 0.008 |
| 8 | Mean | 0.213 | 0.209 | 0.134 | 0.213 | 0.210 | 0.104 | 0.103 |
| | Median | 0.250 | 0.167 | 0.125 | 0.250 | 0.167 | 0.083 | 0.083 |
| | Var | 0.012 | 0.012 | 0.009 | 0.012 | 0.012 | 0.008 | 0.008 |
| 9 | Mean | 0.196 | 0.196 | 0.128 | 0.196 | 0.196 | 0.096 | 0.099 |
| | Median | 0.167 | 0.167 | 0.083 | 0.167 | 0.167 | 0.083 | 0.083 |
| | Var | 0.014 | 0.014 | 0.012 | 0.014 | 0.015 | 0.008 | 0.008 |
| 10 | Mean | 0.209 | 0.208 | 0.126 | 0.209 | 0.206 | 0.088 | 0.087 |
| | Median | 0.167 | 0.167 | 0.083 | 0.167 | 0.167 | 0.083 | 0.083 |
| | Var | 0.016 | 0.016 | 0.009 | 0.016 | 0.016 | 0.006 | 0.006 |
Table A2. Prediction errors with n = 200 in Case 1.

| R | | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|--------|-------|-------|-------|-------|-------|-------|-------|
| 1 | Mean | 0.355 | 0.350 | 0.329 | 0.355 | 0.349 | 0.322 | 0.322 |
| | Median | 0.350 | 0.350 | 0.325 | 0.350 | 0.350 | 0.325 | 0.313 |
| | Var | 0.006 | 0.007 | 0.007 | 0.006 | 0.007 | 0.006 | 0.006 |
| 2 | Mean | 0.262 | 0.262 | 0.234 | 0.262 | 0.262 | 0.227 | 0.227 |
| | Median | 0.275 | 0.275 | 0.225 | 0.275 | 0.275 | 0.225 | 0.225 |
| | Var | 0.005 | 0.005 | 0.004 | 0.005 | 0.005 | 0.004 | 0.004 |
| 3 | Mean | 0.205 | 0.205 | 0.184 | 0.205 | 0.205 | 0.174 | 0.174 |
| | Median | 0.200 | 0.200 | 0.175 | 0.200 | 0.200 | 0.175 | 0.175 |
| | Var | 0.005 | 0.005 | 0.004 | 0.005 | 0.005 | 0.003 | 0.003 |
| 4 | Mean | 0.163 | 0.163 | 0.134 | 0.163 | 0.163 | 0.128 | 0.128 |
| | Median | 0.150 | 0.150 | 0.125 | 0.150 | 0.150 | 0.125 | 0.125 |
| | Var | 0.004 | 0.004 | 0.003 | 0.004 | 0.004 | 0.003 | 0.003 |
| 5 | Mean | 0.139 | 0.139 | 0.113 | 0.139 | 0.139 | 0.110 | 0.110 |
| | Median | 0.125 | 0.125 | 0.113 | 0.125 | 0.125 | 0.100 | 0.100 |
| | Var | 0.003 | 0.003 | 0.003 | 0.003 | 0.003 | 0.002 | 0.002 |
| 6 | Mean | 0.136 | 0.136 | 0.101 | 0.136 | 0.136 | 0.094 | 0.094 |
| | Median | 0.125 | 0.125 | 0.100 | 0.125 | 0.125 | 0.100 | 0.100 |
| | Var | 0.003 | 0.003 | 0.002 | 0.003 | 0.003 | 0.002 | 0.002 |
| 7 | Mean | 0.129 | 0.129 | 0.099 | 0.129 | 0.129 | 0.086 | 0.086 |
| | Median | 0.125 | 0.125 | 0.100 | 0.125 | 0.125 | 0.075 | 0.075 |
| | Var | 0.003 | 0.003 | 0.003 | 0.003 | 0.003 | 0.002 | 0.002 |
| 8 | Mean | 0.121 | 0.121 | 0.091 | 0.121 | 0.121 | 0.083 | 0.082 |
| | Median | 0.113 | 0.113 | 0.075 | 0.113 | 0.113 | 0.075 | 0.075 |
| | Var | 0.003 | 0.003 | 0.002 | 0.003 | 0.003 | 0.002 | 0.002 |
| 9 | Mean | 0.127 | 0.127 | 0.090 | 0.127 | 0.127 | 0.084 | 0.083 |
| | Median | 0.125 | 0.125 | 0.100 | 0.125 | 0.125 | 0.075 | 0.075 |
| | Var | 0.003 | 0.003 | 0.002 | 0.003 | 0.003 | 0.002 | 0.002 |
| 10 | Mean | 0.121 | 0.121 | 0.088 | 0.121 | 0.121 | 0.069 | 0.069 |
| | Median | 0.125 | 0.125 | 0.075 | 0.125 | 0.125 | 0.075 | 0.075 |
| | Var | 0.003 | 0.003 | 0.002 | 0.003 | 0.003 | 0.002 | 0.002 |
Table A3. Prediction errors with n = 500 in Case 1.

| R | | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|--------|-------|-------|-------|-------|-------|-------|-------|
| 1 | Mean | 0.349 | 0.349 | 0.332 | 0.349 | 0.349 | 0.330 | 0.330 |
| | Median | 0.345 | 0.345 | 0.330 | 0.345 | 0.345 | 0.330 | 0.330 |
| | Var | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 |
| 2 | Mean | 0.240 | 0.240 | 0.232 | 0.240 | 0.240 | 0.228 | 0.228 |
| | Median | 0.240 | 0.240 | 0.230 | 0.240 | 0.240 | 0.230 | 0.230 |
| | Var | 0.001 | 0.001 | 0.002 | 0.001 | 0.001 | 0.002 | 0.002 |
| 3 | Mean | 0.176 | 0.176 | 0.174 | 0.176 | 0.176 | 0.168 | 0.168 |
| | Median | 0.170 | 0.170 | 0.170 | 0.170 | 0.170 | 0.160 | 0.160 |
| | Var | 0.002 | 0.002 | 0.001 | 0.002 | 0.002 | 0.001 | 0.001 |
| 4 | Mean | 0.143 | 0.143 | 0.133 | 0.143 | 0.143 | 0.135 | 0.134 |
| | Median | 0.140 | 0.140 | 0.130 | 0.140 | 0.140 | 0.130 | 0.130 |
| | Var | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| 5 | Mean | 0.126 | 0.126 | 0.114 | 0.126 | 0.126 | 0.115 | 0.115 |
| | Median | 0.120 | 0.120 | 0.110 | 0.120 | 0.120 | 0.110 | 0.110 |
| | Var | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| 6 | Mean | 0.109 | 0.109 | 0.097 | 0.109 | 0.109 | 0.095 | 0.096 |
| | Median | 0.110 | 0.110 | 0.090 | 0.110 | 0.110 | 0.090 | 0.090 |
| | Var | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| 7 | Mean | 0.106 | 0.106 | 0.090 | 0.106 | 0.106 | 0.089 | 0.089 |
| | Median | 0.110 | 0.110 | 0.090 | 0.110 | 0.110 | 0.090 | 0.090 |
| | Var | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| 8 | Mean | 0.096 | 0.096 | 0.081 | 0.096 | 0.096 | 0.084 | 0.084 |
| | Median | 0.090 | 0.090 | 0.080 | 0.090 | 0.090 | 0.080 | 0.080 |
| | Var | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| 9 | Mean | 0.090 | 0.090 | 0.075 | 0.090 | 0.090 | 0.070 | 0.070 |
| | Median | 0.085 | 0.085 | 0.070 | 0.085 | 0.085 | 0.065 | 0.065 |
| | Var | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| 10 | Mean | 0.091 | 0.091 | 0.075 | 0.091 | 0.091 | 0.069 | 0.068 |
| | Median | 0.090 | 0.090 | 0.070 | 0.090 | 0.090 | 0.065 | 0.065 |
| | Var | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
Table A4. Prediction errors with n = 60 in Case 2.

| R | | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|--------|-------|-------|-------|-------|-------|-------|-------|
| 1 | Mean | 0.362 | 0.346 | 0.359 | 0.359 | 0.342 | 0.351 | 0.354 |
| | Median | 0.333 | 0.333 | 0.333 | 0.333 | 0.333 | 0.333 | 0.333 |
| | Var | 0.021 | 0.021 | 0.021 | 0.021 | 0.021 | 0.021 | 0.022 |
| 2 | Mean | 0.315 | 0.251 | 0.262 | 0.300 | 0.245 | 0.245 | 0.248 |
| | Median | 0.333 | 0.250 | 0.250 | 0.250 | 0.250 | 0.250 | 0.250 |
| | Var | 0.020 | 0.016 | 0.016 | 0.019 | 0.015 | 0.015 | 0.016 |
| 3 | Mean | 0.269 | 0.193 | 0.208 | 0.257 | 0.188 | 0.185 | 0.184 |
| | Median | 0.250 | 0.167 | 0.167 | 0.250 | 0.167 | 0.167 | 0.167 |
| | Var | 0.016 | 0.014 | 0.014 | 0.015 | 0.013 | 0.012 | 0.013 |
| 4 | Mean | 0.258 | 0.174 | 0.176 | 0.252 | 0.167 | 0.163 | 0.164 |
| | Median | 0.250 | 0.167 | 0.167 | 0.250 | 0.167 | 0.167 | 0.167 |
| | Var | 0.018 | 0.013 | 0.012 | 0.017 | 0.013 | 0.012 | 0.012 |
| 5 | Mean | 0.244 | 0.145 | 0.169 | 0.239 | 0.137 | 0.138 | 0.135 |
| | Median | 0.250 | 0.167 | 0.167 | 0.250 | 0.167 | 0.083 | 0.083 |
| | Var | 0.017 | 0.010 | 0.013 | 0.017 | 0.010 | 0.011 | 0.011 |
| 6 | Mean | 0.234 | 0.142 | 0.150 | 0.227 | 0.131 | 0.122 | 0.119 |
| | Median | 0.250 | 0.167 | 0.167 | 0.250 | 0.083 | 0.083 | 0.083 |
| | Var | 0.018 | 0.010 | 0.012 | 0.017 | 0.010 | 0.009 | 0.009 |
| 7 | Mean | 0.214 | 0.127 | 0.142 | 0.205 | 0.118 | 0.113 | 0.110 |
| | Median | 0.167 | 0.083 | 0.167 | 0.167 | 0.083 | 0.083 | 0.083 |
| | Var | 0.016 | 0.011 | 0.012 | 0.016 | 0.010 | 0.009 | 0.009 |
| 8 | Mean | 0.230 | 0.120 | 0.156 | 0.223 | 0.110 | 0.105 | 0.107 |
| | Median | 0.250 | 0.083 | 0.167 | 0.167 | 0.083 | 0.083 | 0.083 |
| | Var | 0.018 | 0.010 | 0.014 | 0.017 | 0.009 | 0.009 | 0.010 |
| 9 | Mean | 0.204 | 0.121 | 0.160 | 0.192 | 0.108 | 0.100 | 0.099 |
| | Median | 0.167 | 0.083 | 0.167 | 0.167 | 0.083 | 0.083 | 0.083 |
| | Var | 0.017 | 0.009 | 0.016 | 0.016 | 0.009 | 0.008 | 0.008 |
| 10 | Mean | 0.201 | 0.114 | 0.178 | 0.182 | 0.101 | 0.096 | 0.096 |
| | Median | 0.167 | 0.083 | 0.167 | 0.167 | 0.083 | 0.083 | 0.083 |
| | Var | 0.019 | 0.010 | 0.017 | 0.019 | 0.009 | 0.008 | 0.008 |
Table A5. Prediction errors with n = 200 in Case 2.

| R | | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|--------|-------|-------|-------|-------|-------|-------|-------|
| 1 | Mean | 0.369 | 0.336 | 0.349 | 0.369 | 0.336 | 0.342 | 0.341 |
| | Median | 0.375 | 0.325 | 0.350 | 0.375 | 0.325 | 0.350 | 0.338 |
| | Var | 0.007 | 0.007 | 0.006 | 0.006 | 0.007 | 0.006 | 0.006 |
| 2 | Mean | 0.265 | 0.253 | 0.239 | 0.265 | 0.248 | 0.233 | 0.233 |
| | Median | 0.275 | 0.250 | 0.250 | 0.275 | 0.250 | 0.225 | 0.225 |
| | Var | 0.006 | 0.005 | 0.005 | 0.006 | 0.005 | 0.005 | 0.005 |
| 3 | Mean | 0.204 | 0.204 | 0.184 | 0.204 | 0.203 | 0.175 | 0.175 |
| | Median | 0.200 | 0.200 | 0.175 | 0.200 | 0.200 | 0.175 | 0.175 |
| | Var | 0.004 | 0.004 | 0.003 | 0.004 | 0.004 | 0.003 | 0.003 |
| 4 | Mean | 0.175 | 0.175 | 0.147 | 0.175 | 0.175 | 0.143 | 0.142 |
| | Median | 0.175 | 0.175 | 0.150 | 0.175 | 0.175 | 0.150 | 0.125 |
| | Var | 0.004 | 0.004 | 0.004 | 0.004 | 0.004 | 0.004 | 0.003 |
| 5 | Mean | 0.157 | 0.157 | 0.130 | 0.157 | 0.157 | 0.118 | 0.118 |
| | Median | 0.150 | 0.150 | 0.125 | 0.150 | 0.150 | 0.125 | 0.125 |
| | Var | 0.004 | 0.004 | 0.003 | 0.004 | 0.004 | 0.003 | 0.003 |
| 6 | Mean | 0.148 | 0.148 | 0.120 | 0.148 | 0.148 | 0.108 | 0.107 |
| | Median | 0.150 | 0.150 | 0.125 | 0.150 | 0.150 | 0.100 | 0.100 |
| | Var | 0.004 | 0.004 | 0.003 | 0.004 | 0.004 | 0.002 | 0.002 |
| 7 | Mean | 0.150 | 0.150 | 0.116 | 0.150 | 0.150 | 0.092 | 0.091 |
| | Median | 0.150 | 0.150 | 0.113 | 0.150 | 0.150 | 0.100 | 0.100 |
| | Var | 0.003 | 0.003 | 0.003 | 0.003 | 0.004 | 0.002 | 0.002 |
| 8 | Mean | 0.162 | 0.161 | 0.125 | 0.162 | 0.161 | 0.091 | 0.092 |
| | Median | 0.150 | 0.150 | 0.125 | 0.150 | 0.150 | 0.088 | 0.100 |
| | Var | 0.005 | 0.005 | 0.004 | 0.005 | 0.005 | 0.002 | 0.002 |
| 9 | Mean | 0.173 | 0.167 | 0.130 | 0.173 | 0.165 | 0.086 | 0.087 |
| | Median | 0.175 | 0.175 | 0.125 | 0.175 | 0.150 | 0.075 | 0.075 |
| | Var | 0.004 | 0.004 | 0.004 | 0.004 | 0.004 | 0.002 | 0.002 |
| 10 | Mean | 0.192 | 0.172 | 0.147 | 0.192 | 0.167 | 0.088 | 0.090 |
| | Median | 0.200 | 0.175 | 0.150 | 0.200 | 0.150 | 0.075 | 0.075 |
| | Var | 0.006 | 0.005 | 0.005 | 0.006 | 0.005 | 0.002 | 0.002 |
Table A6. Prediction errors with n = 500 in Case 2.

| R | | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|--------|-------|-------|-------|-------|-------|-------|-------|
| 1 | Mean | 0.345 | 0.338 | 0.332 | 0.345 | 0.336 | 0.330 | 0.330 |
| | Median | 0.350 | 0.340 | 0.330 | 0.350 | 0.340 | 0.330 | 0.330 |
| | Var | 0.003 | 0.002 | 0.002 | 0.003 | 0.002 | 0.002 | 0.002 |
| 2 | Mean | 0.239 | 0.239 | 0.227 | 0.239 | 0.239 | 0.225 | 0.225 |
| | Median | 0.240 | 0.240 | 0.230 | 0.240 | 0.240 | 0.225 | 0.220 |
| | Var | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 |
| 3 | Mean | 0.182 | 0.182 | 0.170 | 0.182 | 0.182 | 0.168 | 0.168 |
| | Median | 0.180 | 0.180 | 0.170 | 0.180 | 0.180 | 0.170 | 0.170 |
| | Var | 0.002 | 0.002 | 0.001 | 0.002 | 0.002 | 0.001 | 0.001 |
| 4 | Mean | 0.152 | 0.152 | 0.141 | 0.152 | 0.152 | 0.136 | 0.136 |
| | Median | 0.150 | 0.150 | 0.140 | 0.150 | 0.150 | 0.140 | 0.140 |
| | Var | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| 5 | Mean | 0.135 | 0.135 | 0.120 | 0.135 | 0.135 | 0.114 | 0.114 |
| | Median | 0.130 | 0.130 | 0.120 | 0.130 | 0.130 | 0.110 | 0.110 |
| | Var | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| 6 | Mean | 0.129 | 0.129 | 0.110 | 0.129 | 0.129 | 0.100 | 0.101 |
| | Median | 0.130 | 0.130 | 0.110 | 0.130 | 0.130 | 0.100 | 0.100 |
| | Var | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| 7 | Mean | 0.128 | 0.128 | 0.107 | 0.128 | 0.128 | 0.092 | 0.093 |
| | Median | 0.130 | 0.130 | 0.100 | 0.130 | 0.130 | 0.090 | 0.090 |
| | Var | 0.002 | 0.002 | 0.001 | 0.002 | 0.002 | 0.001 | 0.001 |
| 8 | Mean | 0.134 | 0.134 | 0.109 | 0.134 | 0.134 | 0.086 | 0.087 |
| | Median | 0.130 | 0.130 | 0.110 | 0.130 | 0.130 | 0.080 | 0.080 |
| | Var | 0.002 | 0.002 | 0.001 | 0.002 | 0.002 | 0.001 | 0.001 |
| 9 | Mean | 0.147 | 0.147 | 0.117 | 0.147 | 0.147 | 0.086 | 0.088 |
| | Median | 0.140 | 0.140 | 0.110 | 0.140 | 0.140 | 0.090 | 0.090 |
| | Var | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.001 | 0.001 |
| 10 | Mean | 0.171 | 0.171 | 0.135 | 0.171 | 0.171 | 0.093 | 0.096 |
| | Median | 0.170 | 0.170 | 0.135 | 0.170 | 0.170 | 0.090 | 0.090 |
| | Var | 0.003 | 0.003 | 0.002 | 0.003 | 0.003 | 0.001 | 0.001 |
Table A7. Prediction errors with n = 60 in Case 3.

| R | | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|--------|-------|-------|-------|-------|-------|-------|-------|
| 1 | Mean | 0.494 | 0.482 | 0.430 | 0.490 | 0.483 | 0.405 | 0.413 |
| | Median | 0.500 | 0.500 | 0.417 | 0.500 | 0.500 | 0.417 | 0.417 |
| | Var | 0.026 | 0.029 | 0.026 | 0.027 | 0.029 | 0.022 | 0.022 |
| 2 | Mean | 0.428 | 0.412 | 0.317 | 0.427 | 0.412 | 0.318 | 0.303 |
| | Median | 0.417 | 0.417 | 0.333 | 0.417 | 0.417 | 0.333 | 0.333 |
| | Var | 0.021 | 0.023 | 0.028 | 0.021 | 0.023 | 0.018 | 0.018 |
| 3 | Mean | 0.416 | 0.401 | 0.317 | 0.419 | 0.403 | 0.313 | 0.302 |
| | Median | 0.417 | 0.417 | 0.292 | 0.417 | 0.417 | 0.292 | 0.250 |
| | Var | 0.028 | 0.031 | 0.037 | 0.027 | 0.030 | 0.032 | 0.031 |
| 4 | Mean | 0.424 | 0.387 | 0.393 | 0.420 | 0.382 | 0.357 | 0.344 |
| | Median | 0.500 | 0.417 | 0.417 | 0.458 | 0.417 | 0.333 | 0.333 |
| | Var | 0.047 | 0.048 | 0.056 | 0.044 | 0.046 | 0.046 | 0.043 |
| 5 | Mean | 0.372 | 0.362 | 0.493 | 0.398 | 0.365 | 0.380 | 0.355 |
| | Median | 0.333 | 0.333 | 0.583 | 0.417 | 0.333 | 0.417 | 0.333 |
| | Var | 0.052 | 0.054 | 0.067 | 0.049 | 0.052 | 0.053 | 0.048 |
| 6 | Mean | 0.400 | 0.383 | 0.608 | 0.427 | 0.390 | 0.446 | 0.430 |
| | Median | 0.417 | 0.375 | 0.667 | 0.417 | 0.375 | 0.500 | 0.417 |
| | Var | 0.072 | 0.075 | 0.060 | 0.066 | 0.075 | 0.067 | 0.066 |
| 7 | Mean | 0.374 | 0.378 | 0.628 | 0.428 | 0.388 | 0.481 | 0.468 |
| | Median | 0.333 | 0.333 | 0.667 | 0.417 | 0.417 | 0.500 | 0.500 |
| | Var | 0.072 | 0.075 | 0.052 | 0.063 | 0.072 | 0.067 | 0.070 |
| 8 | Mean | 0.457 | 0.457 | 0.673 | 0.527 | 0.474 | 0.615 | 0.593 |
| | Median | 0.417 | 0.417 | 0.750 | 0.583 | 0.500 | 0.667 | 0.667 |
| | Var | 0.098 | 0.098 | 0.053 | 0.071 | 0.091 | 0.073 | 0.075 |
| 9 | Mean | 0.565 | 0.565 | 0.738 | 0.642 | 0.583 | 0.652 | 0.659 |
| | Median | 0.583 | 0.583 | 0.750 | 0.750 | 0.667 | 0.750 | 0.750 |
| | Var | 0.099 | 0.099 | 0.040 | 0.079 | 0.087 | 0.072 | 0.074 |
| 10 | Mean | 0.565 | 0.565 | 0.744 | 0.662 | 0.613 | 0.698 | 0.694 |
| | Median | 0.583 | 0.583 | 0.750 | 0.667 | 0.667 | 0.750 | 0.750 |
| | Var | 0.096 | 0.096 | 0.037 | 0.063 | 0.080 | 0.057 | 0.065 |
Table A8. Prediction errors with n = 200 in Case 3.

| R | | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|--------|-------|-------|-------|-------|-------|-------|-------|
| 1 | Mean | 0.406 | 0.403 | 0.366 | 0.406 | 0.401 | 0.341 | 0.342 |
| | Median | 0.400 | 0.400 | 0.350 | 0.400 | 0.400 | 0.325 | 0.325 |
| | Var | 0.006 | 0.007 | 0.007 | 0.006 | 0.007 | 0.007 | 0.007 |
| 2 | Mean | 0.378 | 0.377 | 0.310 | 0.378 | 0.378 | 0.272 | 0.271 |
| | Median | 0.375 | 0.375 | 0.300 | 0.375 | 0.375 | 0.250 | 0.250 |
| | Var | 0.010 | 0.010 | 0.010 | 0.010 | 0.010 | 0.008 | 0.007 |
| 3 | Mean | 0.428 | 0.428 | 0.324 | 0.428 | 0.427 | 0.253 | 0.251 |
| | Median | 0.463 | 0.463 | 0.300 | 0.463 | 0.450 | 0.225 | 0.225 |
| | Var | 0.018 | 0.018 | 0.016 | 0.018 | 0.018 | 0.010 | 0.009 |
| 4 | Mean | 0.465 | 0.427 | 0.370 | 0.470 | 0.430 | 0.259 | 0.254 |
| | Median | 0.500 | 0.475 | 0.350 | 0.500 | 0.475 | 0.225 | 0.225 |
| | Var | 0.031 | 0.035 | 0.029 | 0.030 | 0.034 | 0.021 | 0.020 |
| 5 | Mean | 0.281 | 0.231 | 0.507 | 0.310 | 0.228 | 0.282 | 0.276 |
| | Median | 0.200 | 0.175 | 0.500 | 0.225 | 0.175 | 0.238 | 0.225 |
| | Var | 0.035 | 0.021 | 0.041 | 0.034 | 0.020 | 0.030 | 0.029 |
| 6 | Mean | 0.242 | 0.242 | 0.612 | 0.289 | 0.242 | 0.325 | 0.321 |
| | Median | 0.175 | 0.175 | 0.675 | 0.225 | 0.175 | 0.238 | 0.238 |
| | Var | 0.040 | 0.040 | 0.036 | 0.039 | 0.037 | 0.050 | 0.049 |
| 7 | Mean | 0.298 | 0.298 | 0.712 | 0.363 | 0.294 | 0.368 | 0.362 |
| | Median | 0.200 | 0.200 | 0.725 | 0.313 | 0.200 | 0.300 | 0.288 |
| | Var | 0.059 | 0.059 | 0.014 | 0.056 | 0.056 | 0.064 | 0.063 |
| 8 | Mean | 0.476 | 0.476 | 0.749 | 0.553 | 0.473 | 0.498 | 0.495 |
| | Median | 0.513 | 0.513 | 0.763 | 0.588 | 0.500 | 0.588 | 0.575 |
| | Var | 0.086 | 0.086 | 0.009 | 0.068 | 0.084 | 0.076 | 0.076 |
| 9 | Mean | 0.497 | 0.497 | 0.785 | 0.625 | 0.500 | 0.592 | 0.586 |
| | Median | 0.525 | 0.525 | 0.800 | 0.700 | 0.538 | 0.663 | 0.650 |
| | Var | 0.104 | 0.104 | 0.005 | 0.057 | 0.099 | 0.062 | 0.064 |
| 10 | Mean | 0.606 | 0.606 | 0.807 | 0.746 | 0.627 | 0.662 | 0.661 |
| | Median | 0.750 | 0.750 | 0.825 | 0.825 | 0.800 | 0.763 | 0.750 |
| | Var | 0.105 | 0.105 | 0.004 | 0.042 | 0.101 | 0.053 | 0.054 |
Table A9. Prediction errors with n = 500 in Case 3.

| R | | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|--------|-------|-------|-------|-------|-------|-------|-------|
| 1 | Mean | 0.394 | 0.394 | 0.360 | 0.394 | 0.394 | 0.338 | 0.338 |
| | Median | 0.390 | 0.390 | 0.355 | 0.390 | 0.390 | 0.340 | 0.340 |
| | Var | 0.004 | 0.004 | 0.003 | 0.004 | 0.004 | 0.003 | 0.003 |
| 2 | Mean | 0.345 | 0.345 | 0.280 | 0.345 | 0.345 | 0.241 | 0.244 |
| | Median | 0.340 | 0.340 | 0.275 | 0.340 | 0.340 | 0.240 | 0.240 |
| | Var | 0.005 | 0.005 | 0.003 | 0.005 | 0.005 | 0.002 | 0.002 |
| 3 | Mean | 0.426 | 0.426 | 0.286 | 0.426 | 0.426 | 0.190 | 0.200 |
| | Median | 0.430 | 0.430 | 0.270 | 0.430 | 0.430 | 0.190 | 0.200 |
| | Var | 0.008 | 0.008 | 0.008 | 0.008 | 0.008 | 0.002 | 0.002 |
| 4 | Mean | 0.524 | 0.490 | 0.390 | 0.526 | 0.490 | 0.170 | 0.190 |
| | Median | 0.550 | 0.540 | 0.400 | 0.550 | 0.540 | 0.160 | 0.180 |
| | Var | 0.017 | 0.025 | 0.018 | 0.015 | 0.025 | 0.002 | 0.003 |
| 5 | Mean | 0.225 | 0.199 | 0.535 | 0.241 | 0.198 | 0.168 | 0.170 |
| | Median | 0.160 | 0.160 | 0.560 | 0.180 | 0.160 | 0.150 | 0.160 |
| | Var | 0.028 | 0.018 | 0.017 | 0.027 | 0.018 | 0.006 | 0.006 |
| 6 | Mean | 0.186 | 0.183 | 0.665 | 0.225 | 0.184 | 0.183 | 0.183 |
| | Median | 0.140 | 0.140 | 0.680 | 0.180 | 0.140 | 0.140 | 0.150 |
| | Var | 0.014 | 0.013 | 0.009 | 0.014 | 0.012 | 0.013 | 0.011 |
| 7 | Mean | 0.251 | 0.251 | 0.735 | 0.322 | 0.252 | 0.251 | 0.253 |
| | Median | 0.170 | 0.170 | 0.740 | 0.260 | 0.170 | 0.170 | 0.190 |
| | Var | 0.033 | 0.033 | 0.004 | 0.028 | 0.031 | 0.033 | 0.028 |
| 8 | Mean | 0.376 | 0.376 | 0.776 | 0.511 | 0.379 | 0.376 | 0.383 |
| | Median | 0.335 | 0.335 | 0.780 | 0.520 | 0.335 | 0.335 | 0.385 |
| | Var | 0.065 | 0.065 | 0.002 | 0.048 | 0.062 | 0.065 | 0.057 |
| 9 | Mean | 0.467 | 0.467 | 0.797 | 0.650 | 0.476 | 0.467 | 0.491 |
| | Median | 0.475 | 0.475 | 0.800 | 0.700 | 0.480 | 0.475 | 0.510 |
| | Var | 0.087 | 0.087 | 0.002 | 0.039 | 0.082 | 0.087 | 0.076 |
| 10 | Mean | 0.652 | 0.652 | 0.822 | 0.820 | 0.675 | 0.652 | 0.713 |
| | Median | 0.780 | 0.780 | 0.820 | 0.840 | 0.790 | 0.780 | 0.800 |
| | Var | 0.071 | 0.071 | 0.002 | 0.012 | 0.062 | 0.071 | 0.048 |
Table A10. Prediction errors with n = 60 in Case 4.

| R | Statistic | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|-----------|-----|-----|------|-------|-------|-----|-----|
| 1 | Mean | 0.389 | 0.378 | 0.417 | 0.396 | 0.381 | 0.381 | 0.387 |
|   | Median | 0.417 | 0.333 | 0.417 | 0.417 | 0.417 | 0.417 | 0.417 |
|   | Var | 0.024 | 0.024 | 0.023 | 0.023 | 0.023 | 0.022 | 0.024 |
| 2 | Mean | 0.286 | 0.268 | 0.363 | 0.299 | 0.269 | 0.268 | 0.268 |
|   | Median | 0.250 | 0.250 | 0.333 | 0.250 | 0.250 | 0.250 | 0.250 |
|   | Var | 0.022 | 0.021 | 0.029 | 0.022 | 0.022 | 0.021 | 0.021 |
| 3 | Mean | 0.230 | 0.219 | 0.382 | 0.259 | 0.228 | 0.219 | 0.219 |
|   | Median | 0.167 | 0.167 | 0.333 | 0.250 | 0.167 | 0.167 | 0.167 |
|   | Var | 0.024 | 0.023 | 0.040 | 0.024 | 0.023 | 0.023 | 0.023 |
| 4 | Mean | 0.186 | 0.181 | 0.460 | 0.242 | 0.199 | 0.181 | 0.181 |
|   | Median | 0.167 | 0.167 | 0.417 | 0.167 | 0.167 | 0.167 | 0.167 |
|   | Var | 0.022 | 0.022 | 0.048 | 0.031 | 0.024 | 0.022 | 0.022 |
| 5 | Mean | 0.195 | 0.194 | 0.545 | 0.284 | 0.216 | 0.194 | 0.194 |
|   | Median | 0.167 | 0.167 | 0.583 | 0.250 | 0.167 | 0.167 | 0.167 |
|   | Var | 0.029 | 0.030 | 0.054 | 0.046 | 0.034 | 0.030 | 0.030 |
| 6 | Mean | 0.213 | 0.211 | 0.642 | 0.374 | 0.256 | 0.211 | 0.211 |
|   | Median | 0.167 | 0.167 | 0.667 | 0.333 | 0.167 | 0.167 | 0.167 |
|   | Var | 0.042 | 0.042 | 0.045 | 0.062 | 0.049 | 0.042 | 0.042 |
| 7 | Mean | 0.208 | 0.210 | 0.680 | 0.424 | 0.268 | 0.210 | 0.210 |
|   | Median | 0.167 | 0.167 | 0.750 | 0.417 | 0.167 | 0.167 | 0.167 |
|   | Var | 0.052 | 0.053 | 0.037 | 0.068 | 0.060 | 0.053 | 0.053 |
| 8 | Mean | 0.228 | 0.228 | 0.727 | 0.513 | 0.310 | 0.228 | 0.228 |
|   | Median | 0.167 | 0.167 | 0.750 | 0.500 | 0.250 | 0.167 | 0.167 |
|   | Var | 0.059 | 0.059 | 0.025 | 0.067 | 0.071 | 0.059 | 0.059 |
| 9 | Mean | 0.259 | 0.258 | 0.730 | 0.572 | 0.366 | 0.258 | 0.258 |
|   | Median | 0.167 | 0.167 | 0.750 | 0.583 | 0.250 | 0.167 | 0.167 |
|   | Var | 0.084 | 0.084 | 0.030 | 0.069 | 0.091 | 0.084 | 0.084 |
| 10 | Mean | 0.303 | 0.303 | 0.761 | 0.665 | 0.455 | 0.303 | 0.303 |
|   | Median | 0.167 | 0.167 | 0.750 | 0.750 | 0.417 | 0.167 | 0.167 |
|   | Var | 0.099 | 0.099 | 0.020 | 0.047 | 0.096 | 0.099 | 0.099 |
Table A11. Prediction errors with n = 200 in Case 4.

| R | Statistic | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|-----------|-----|-----|------|-------|-------|-----|-----|
| 1 | Mean | 0.378 | 0.348 | 0.387 | 0.380 | 0.350 | 0.354 | 0.353 |
|   | Median | 0.375 | 0.350 | 0.375 | 0.375 | 0.350 | 0.350 | 0.350 |
|   | Var | 0.008 | 0.008 | 0.008 | 0.008 | 0.007 | 0.007 | 0.007 |
| 2 | Mean | 0.277 | 0.251 | 0.330 | 0.287 | 0.253 | 0.258 | 0.258 |
|   | Median | 0.275 | 0.250 | 0.325 | 0.275 | 0.250 | 0.250 | 0.250 |
|   | Var | 0.007 | 0.006 | 0.010 | 0.007 | 0.006 | 0.007 | 0.006 |
| 3 | Mean | 0.193 | 0.183 | 0.374 | 0.216 | 0.186 | 0.205 | 0.205 |
|   | Median | 0.175 | 0.175 | 0.375 | 0.200 | 0.175 | 0.200 | 0.200 |
|   | Var | 0.006 | 0.005 | 0.020 | 0.007 | 0.005 | 0.006 | 0.006 |
| 4 | Mean | 0.168 | 0.167 | 0.512 | 0.219 | 0.171 | 0.217 | 0.216 |
|   | Median | 0.150 | 0.150 | 0.550 | 0.200 | 0.150 | 0.200 | 0.200 |
|   | Var | 0.008 | 0.008 | 0.022 | 0.012 | 0.008 | 0.011 | 0.011 |
| 5 | Mean | 0.141 | 0.141 | 0.613 | 0.237 | 0.152 | 0.237 | 0.237 |
|   | Median | 0.125 | 0.125 | 0.650 | 0.200 | 0.125 | 0.200 | 0.200 |
|   | Var | 0.008 | 0.008 | 0.020 | 0.019 | 0.009 | 0.018 | 0.018 |
| 6 | Mean | 0.132 | 0.132 | 0.700 | 0.294 | 0.146 | 0.292 | 0.291 |
|   | Median | 0.100 | 0.100 | 0.700 | 0.250 | 0.125 | 0.250 | 0.250 |
|   | Var | 0.011 | 0.011 | 0.010 | 0.030 | 0.013 | 0.029 | 0.029 |
| 7 | Mean | 0.138 | 0.138 | 0.742 | 0.392 | 0.161 | 0.381 | 0.377 |
|   | Median | 0.100 | 0.100 | 0.750 | 0.375 | 0.125 | 0.375 | 0.375 |
|   | Var | 0.014 | 0.014 | 0.007 | 0.039 | 0.017 | 0.033 | 0.033 |
| 8 | Mean | 0.154 | 0.154 | 0.769 | 0.512 | 0.193 | 0.490 | 0.487 |
|   | Median | 0.100 | 0.100 | 0.775 | 0.550 | 0.125 | 0.500 | 0.500 |
|   | Var | 0.023 | 0.023 | 0.004 | 0.042 | 0.028 | 0.039 | 0.039 |
| 9 | Mean | 0.175 | 0.175 | 0.788 | 0.624 | 0.232 | 0.583 | 0.580 |
|   | Median | 0.100 | 0.100 | 0.800 | 0.675 | 0.125 | 0.625 | 0.625 |
|   | Var | 0.038 | 0.038 | 0.005 | 0.032 | 0.046 | 0.035 | 0.035 |
| 10 | Mean | 0.192 | 0.192 | 0.800 | 0.695 | 0.282 | 0.654 | 0.653 |
|   | Median | 0.100 | 0.100 | 0.800 | 0.725 | 0.175 | 0.688 | 0.675 |
|   | Var | 0.049 | 0.049 | 0.004 | 0.024 | 0.063 | 0.029 | 0.029 |
Table A12. Prediction errors with n = 500 in Case 4.

| R | Statistic | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|-----------|-----|-----|------|-------|-------|-----|-----|
| 1 | Mean | 0.380 | 0.339 | 0.367 | 0.380 | 0.340 | 0.338 | 0.339 |
|   | Median | 0.380 | 0.340 | 0.360 | 0.380 | 0.340 | 0.340 | 0.340 |
|   | Var | 0.003 | 0.003 | 0.004 | 0.003 | 0.003 | 0.003 | 0.003 |
| 2 | Mean | 0.278 | 0.242 | 0.310 | 0.284 | 0.242 | 0.228 | 0.229 |
|   | Median | 0.270 | 0.240 | 0.300 | 0.280 | 0.240 | 0.230 | 0.230 |
|   | Var | 0.005 | 0.003 | 0.005 | 0.004 | 0.003 | 0.002 | 0.002 |
| 3 | Mean | 0.180 | 0.177 | 0.385 | 0.198 | 0.179 | 0.176 | 0.179 |
|   | Median | 0.170 | 0.170 | 0.380 | 0.190 | 0.170 | 0.170 | 0.180 |
|   | Var | 0.002 | 0.002 | 0.009 | 0.003 | 0.002 | 0.002 | 0.002 |
| 4 | Mean | 0.141 | 0.141 | 0.527 | 0.184 | 0.143 | 0.141 | 0.146 |
|   | Median | 0.140 | 0.140 | 0.540 | 0.170 | 0.140 | 0.140 | 0.140 |
|   | Var | 0.001 | 0.001 | 0.010 | 0.003 | 0.001 | 0.001 | 0.001 |
| 5 | Mean | 0.122 | 0.122 | 0.649 | 0.203 | 0.126 | 0.122 | 0.130 |
|   | Median | 0.120 | 0.120 | 0.660 | 0.185 | 0.120 | 0.120 | 0.120 |
|   | Var | 0.002 | 0.002 | 0.004 | 0.007 | 0.002 | 0.002 | 0.002 |
| 6 | Mean | 0.109 | 0.109 | 0.716 | 0.266 | 0.116 | 0.109 | 0.125 |
|   | Median | 0.100 | 0.100 | 0.720 | 0.240 | 0.110 | 0.100 | 0.110 |
|   | Var | 0.002 | 0.002 | 0.003 | 0.013 | 0.002 | 0.002 | 0.003 |
| 7 | Mean | 0.103 | 0.103 | 0.754 | 0.371 | 0.115 | 0.103 | 0.129 |
|   | Median | 0.090 | 0.090 | 0.750 | 0.360 | 0.100 | 0.090 | 0.120 |
|   | Var | 0.003 | 0.003 | 0.002 | 0.020 | 0.004 | 0.003 | 0.005 |
| 8 | Mean | 0.102 | 0.102 | 0.775 | 0.490 | 0.119 | 0.102 | 0.141 |
|   | Median | 0.090 | 0.090 | 0.780 | 0.500 | 0.100 | 0.090 | 0.120 |
|   | Var | 0.005 | 0.005 | 0.002 | 0.023 | 0.006 | 0.005 | 0.007 |
| 9 | Mean | 0.112 | 0.112 | 0.791 | 0.629 | 0.143 | 0.112 | 0.184 |
|   | Median | 0.090 | 0.090 | 0.790 | 0.650 | 0.110 | 0.090 | 0.140 |
|   | Var | 0.009 | 0.009 | 0.002 | 0.015 | 0.012 | 0.009 | 0.017 |
| 10 | Mean | 0.114 | 0.114 | 0.802 | 0.707 | 0.155 | 0.114 | 0.211 |
|   | Median | 0.080 | 0.080 | 0.800 | 0.720 | 0.110 | 0.080 | 0.160 |
|   | Var | 0.014 | 0.014 | 0.002 | 0.007 | 0.019 | 0.014 | 0.025 |

Appendix D. Simulation Results in Section 4.2

Table A13. Prediction errors with R = 1.

| N | Statistic | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|-----------|-----|-----|------|-------|-------|-----|-----|
| 200 | Mean | 0.329 | 0.325 | 0.312 | 0.323 | 0.322 | 0.313 | 0.313 |
|   | Median | 0.325 | 0.325 | 0.300 | 0.325 | 0.325 | 0.325 | 0.325 |
|   | Var | 0.006 | 0.006 | 0.005 | 0.006 | 0.006 | 0.006 | 0.006 |
| 400 | Mean | 0.330 | 0.319 | 0.305 | 0.327 | 0.314 | 0.304 | 0.304 |
|   | Median | 0.325 | 0.313 | 0.300 | 0.325 | 0.313 | 0.300 | 0.300 |
|   | Var | 0.003 | 0.003 | 0.003 | 0.003 | 0.003 | 0.003 | 0.003 |
| 1000 | Mean | 0.332 | 0.326 | 0.304 | 0.330 | 0.326 | 0.305 | 0.304 |
|   | Median | 0.330 | 0.320 | 0.303 | 0.330 | 0.320 | 0.303 | 0.300 |
|   | Var | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
Table A14. Prediction errors with R = 3.

| N | Statistic | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|-----------|-----|-----|------|-------|-------|-----|-----|
| 200 | Mean | 0.173 | 0.173 | 0.168 | 0.173 | 0.173 | 0.162 | 0.162 |
|   | Median | 0.175 | 0.175 | 0.175 | 0.175 | 0.175 | 0.175 | 0.163 |
|   | Var | 0.003 | 0.003 | 0.002 | 0.003 | 0.003 | 0.002 | 0.002 |
| 400 | Mean | 0.172 | 0.172 | 0.163 | 0.172 | 0.171 | 0.167 | 0.169 |
|   | Median | 0.175 | 0.175 | 0.163 | 0.175 | 0.175 | 0.175 | 0.175 |
|   | Var | 0.001 | 0.001 | 0.002 | 0.001 | 0.001 | 0.001 | 0.002 |
| 1000 | Mean | 0.175 | 0.193 | 0.156 | 0.175 | 0.189 | 0.149 | 0.148 |
|   | Median | 0.180 | 0.198 | 0.160 | 0.180 | 0.190 | 0.145 | 0.145 |
|   | Var | 0.001 | 0.001 | 0.000 | 0.001 | 0.001 | 0.001 | 0.001 |
Table A15. Prediction errors with R = 7.

| N | Statistic | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|---|-----------|-----|-----|------|-------|-------|-----|-----|
| 200 | Mean | 0.104 | 0.104 | 0.109 | 0.104 | 0.104 | 0.095 | 0.097 |
|   | Median | 0.100 | 0.100 | 0.100 | 0.100 | 0.100 | 0.100 | 0.088 |
|   | Var | 0.003 | 0.003 | 0.003 | 0.003 | 0.003 | 0.003 | 0.002 |
| 400 | Mean | 0.106 | 0.106 | 0.101 | 0.106 | 0.106 | 0.087 | 0.087 |
|   | Median | 0.100 | 0.100 | 0.100 | 0.100 | 0.100 | 0.088 | 0.088 |
|   | Var | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| 1000 | Mean | 0.109 | 0.109 | 0.103 | 0.109 | 0.109 | 0.084 | 0.083 |
|   | Median | 0.110 | 0.110 | 0.105 | 0.110 | 0.110 | 0.085 | 0.080 |
|   | Var | 0.001 | 0.001 | 0.000 | 0.001 | 0.001 | 0.000 | 0.000 |
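The Mean, Median, and Var rows reported in the tables above summarize per-replication prediction errors across simulation runs. A minimal sketch of how such summaries can be computed is given below; the error values are hypothetical, and whether Var denotes the population or sample variance is an assumption here (population variance is used):

```python
from statistics import mean, median, pvariance

def summarize(errors):
    """Summarize per-replication prediction errors as (Mean, Median, Var),
    each rounded to three decimals, mirroring the table rows.
    Assumption: Var is taken as the population variance."""
    return (round(mean(errors), 3),
            round(median(errors), 3),
            round(pvariance(errors), 3))

# Hypothetical prediction errors for one method over five replications
errs = [0.301, 0.292, 0.290, 0.280, 0.276]
print(summarize(errs))  # prints (0.288, 0.29, 0.0)
```

In the actual simulations the averages are taken over many more replications, but the summary computation is the same.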

Figure 1. Predictor trajectories, corresponding to slightly smoothed monthly price curves. The low-rise residential areas are in the upper left (a). The high-rise residential areas are in the upper right (b). Randomly selected profiles from the panels above are shown in the lower panels (c,d) for 20 districts.
Table 1. Error of prediction.

| Rounds | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|--------|-----|-----|------|-------|-------|-----|-----|
| 1 | 0.301 | 0.301 | 0.275 | 0.301 | 0.301 | 0.221 | 0.221 |
| 2 | 0.292 | 0.292 | 0.247 | 0.292 | 0.292 | 0.178 | 0.176 |
| 3 | 0.290 | 0.290 | 0.242 | 0.290 | 0.290 | 0.187 | 0.187 |
| 4 | 0.280 | 0.280 | 0.233 | 0.280 | 0.280 | 0.176 | 0.174 |
| 5 | 0.276 | 0.276 | 0.233 | 0.276 | 0.276 | 0.147 | 0.149 |
| 6 | 0.316 | 0.316 | 0.233 | 0.316 | 0.316 | 0.188 | 0.188 |
| 7 | 0.269 | 0.269 | 0.244 | 0.269 | 0.269 | 0.164 | 0.164 |
| 8 | 0.294 | 0.294 | 0.225 | 0.294 | 0.294 | 0.174 | 0.174 |
| 9 | 0.316 | 0.316 | 0.235 | 0.316 | 0.316 | 0.187 | 0.187 |
| 10 | 0.282 | 0.282 | 0.242 | 0.282 | 0.282 | 0.174 | 0.173 |
| 11 | 0.292 | 0.292 | 0.240 | 0.292 | 0.292 | 0.162 | 0.162 |
| 12 | 0.285 | 0.285 | 0.261 | 0.285 | 0.285 | 0.188 | 0.188 |
| 13 | 0.282 | 0.282 | 0.219 | 0.282 | 0.282 | 0.150 | 0.149 |
| 14 | 0.264 | 0.264 | 0.280 | 0.264 | 0.264 | 0.188 | 0.188 |
| 15 | 0.282 | 0.282 | 0.247 | 0.282 | 0.282 | 0.187 | 0.187 |
| 16 | 0.295 | 0.295 | 0.269 | 0.295 | 0.295 | 0.185 | 0.185 |
| 17 | 0.328 | 0.328 | 0.252 | 0.328 | 0.328 | 0.204 | 0.202 |
| 18 | 0.301 | 0.301 | 0.245 | 0.301 | 0.301 | 0.187 | 0.187 |
| 19 | 0.278 | 0.278 | 0.209 | 0.278 | 0.278 | 0.150 | 0.150 |
| 20 | 0.311 | 0.311 | 0.249 | 0.311 | 0.311 | 0.183 | 0.183 |
Table 2. Error of fitting.

| Rounds | AIC | BIC | FPCA | S-AIC | S-BIC | CV1 | CV2 |
|--------|-----|-----|------|-------|-------|-----|-----|
| 1 | 0.287 | 0.287 | 0.235 | 0.287 | 0.287 | 0.166 | 0.165 |
| 2 | 0.289 | 0.289 | 0.244 | 0.289 | 0.289 | 0.181 | 0.180 |
| 3 | 0.290 | 0.290 | 0.246 | 0.290 | 0.290 | 0.174 | 0.173 |
| 4 | 0.293 | 0.293 | 0.249 | 0.293 | 0.293 | 0.182 | 0.182 |
| 5 | 0.296 | 0.296 | 0.249 | 0.296 | 0.296 | 0.190 | 0.190 |
| 6 | 0.285 | 0.285 | 0.249 | 0.285 | 0.285 | 0.175 | 0.175 |
| 7 | 0.297 | 0.297 | 0.246 | 0.297 | 0.297 | 0.184 | 0.183 |
| 8 | 0.292 | 0.292 | 0.252 | 0.292 | 0.292 | 0.179 | 0.179 |
| 9 | 0.283 | 0.283 | 0.248 | 0.283 | 0.283 | 0.174 | 0.173 |
| 10 | 0.291 | 0.291 | 0.246 | 0.291 | 0.291 | 0.182 | 0.181 |
| 11 | 0.291 | 0.291 | 0.247 | 0.291 | 0.291 | 0.184 | 0.186 |
| 12 | 0.294 | 0.294 | 0.240 | 0.294 | 0.294 | 0.175 | 0.175 |
| 13 | 0.293 | 0.293 | 0.254 | 0.293 | 0.293 | 0.190 | 0.187 |
| 14 | 0.295 | 0.295 | 0.233 | 0.295 | 0.295 | 0.175 | 0.175 |
| 15 | 0.293 | 0.293 | 0.244 | 0.293 | 0.293 | 0.176 | 0.177 |
| 16 | 0.288 | 0.288 | 0.237 | 0.288 | 0.288 | 0.179 | 0.178 |
| 17 | 0.282 | 0.282 | 0.243 | 0.282 | 0.282 | 0.173 | 0.173 |
| 18 | 0.290 | 0.290 | 0.245 | 0.290 | 0.290 | 0.178 | 0.177 |
| 19 | 0.294 | 0.294 | 0.257 | 0.294 | 0.294 | 0.186 | 0.187 |
| 20 | 0.285 | 0.285 | 0.244 | 0.285 | 0.285 | 0.179 | 0.179 |