Estimation for Varying Coefficient Models with Hierarchical Structure

The varying coefficient (VC) model is a generalization of the ordinary linear model that retains strong interpretability while enjoying the flexibility of a nonparametric model. In this paper, we investigate a VC model with hierarchical structure. A unified variable selection method for the VC model is proposed, which simultaneously selects the nonzero effects and estimates the unknown coefficient functions. Meanwhile, the selected model enforces the hierarchical structure; that is, interaction terms can enter the model only if the corresponding main effects are in the model. The kernel method is employed to estimate the varying coefficient functions, and a combined overlapped group Lasso regularization is introduced to carry out variable selection while preserving the hierarchical structure. It is proved that the proposed penalized estimators have oracle properties; that is, the coefficients are estimated as well as if the true model were known in advance. Simulation studies and a real data analysis are carried out to examine the performance of the proposed method in the finite-sample case.


Introduction
The varying coefficient model [1] is defined as

Y = Σ_{j=1}^d β_j(U) X_j + ε,    (1)

where Y is the response variable, (X_1, X_2, · · · , X_d, U) are its associated covariates, ε is the error term, and the β_j(U) are unknown smooth coefficient functions of an observable continuous covariate U. As in the linear model, the predictors X_j (j = 1, 2, · · · , d) affect the response variable Y linearly, but their coefficients are allowed to change smoothly with another covariate U. Thus, each value of U is associated with a different linear model, which allows one to examine the extent to which the association between the response Y and the covariates X_j varies over the covariate U (see [1,2]). In environmental data analysis [2], one objective is to investigate the association between the level of the pollutants and the number of daily total hospital admissions for circulatory and respiratory problems, as well as to examine how the association varies over time, where U = t = time. Besides, being free of both a parametric specification and a fully multivariate nonparametric structure, the VC model can significantly reduce modelling bias and avoid the "curse of dimensionality". Owing to the good interpretability and flexibility of the varying coefficient model, a considerable amount of literature has been published on estimation and hypothesis testing for the VC model since it was introduced (see, e.g., [2][3][4][5][6]). Recently, variable selection and model detection for the varying coefficient model have gained much attention. For instance, Wang and Xia [7] combined the ideas of local polynomial smoothing and the Lasso (least absolute shrinkage and selection operator [8]) to estimate the coefficients and select variables simultaneously; Zhao and Xue [9] employed basis function approximations and the SCAD (smoothly clipped absolute deviation [10]) penalty for the semiparametric varying coefficient partially linear model; Tang et al.
[11] developed a unified variable selection approach for varying coefficient models; Li et al. [12] studied the model selection and structure specification for the generalized semi-varying coefficient models; and He et al. [13] introduced a dimensionality reduction and variable selection method for multivariate varying-coefficient models with a large number of covariates, and so on.
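To make the model above concrete, here is a minimal numerical sketch (ours, not the paper's R implementation) of a two-predictor VC model fitted pointwise by kernel-weighted least squares; the coefficient functions sin(2πu) and (1 − u)², the bandwidth, and the noise level are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
U = rng.uniform(0.0, 1.0, n)                 # index covariate
X = rng.normal(size=(n, 2))                  # predictors X_1, X_2

def beta(u):
    """True coefficient functions beta_1(u), beta_2(u) (illustrative choices)."""
    return np.column_stack([np.sin(2 * np.pi * u), (1.0 - u) ** 2])

Y = np.sum(beta(U) * X, axis=1) + 0.1 * rng.normal(size=n)

def local_ls(u0, h=0.05):
    """Estimate (beta_1(u0), beta_2(u0)) by kernel-weighted least squares."""
    w = np.exp(-0.5 * ((U - u0) / h) ** 2)   # Gaussian kernel weights
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ Y)

est = local_ls(0.25)
print(est)   # close to (sin(pi/2), 0.75**2) = (1, 0.5625), up to smoothing bias
```

Each evaluation point u0 solves its own weighted least squares problem, which is exactly the "different linear model at each value of U" reading of the VC model.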
In many complex situations, main effects combined with interaction effects may be sufficient to characterize the relationship between the response and predictors. In social, political, and economic problems and genome-wide association studies, it is useful to identify nontrivial interactions between covariates when modeling election results, product sales, social networks, stock market changes, and disease risk. Recent years have seen a surge of interest in interaction identification in the high dimensional setting. For instance, Hall and Xue [14] proposed a recursive approach to identify important interactions among covariates; Niu et al. [15] proposed a forward selection based screening method for identifying interactions for ultra-high dimensional data; Kong et al. [16] suggested a two-stage interaction identification method, called interaction pursuit via distance correlation, in high dimensional multi-response regression; and Radchenko and James [17] investigated variable selection for nonlinear additive regression models with interaction structures by a group regularization method. Specifically, for the high dimensional linear model with interaction terms, some important works include but are not limited to the following: Choi et al. [18] reparameterized the coefficients for the interaction terms; Bien et al. [19] added a set of convex constraints to the Lasso to produce sparse interaction terms; Zhao et al. [20] introduced the composite absolute penalties family, which defines groups with particular overlapping patterns to express the relationships between the predictors; and Lim and Hastie [21] developed a method for learning pairwise interactions via hierarchical group-Lasso regularization. A key feature of these models is their hierarchical structure: since the interaction effects are derived from the main effects, the interaction terms exist only if the main terms are significant in the model.
This is also referred to as the marginality principle in generalized linear models [22] or strong heredity in the analysis of designed experiments [23]. Compared to the parametric model (2), the VC model with interaction terms is the direct extension to the nonparametric case, where the coefficients are unknown smooth functions of some covariates. However, the estimation methods for model (2) cannot be directly extended to the VC model with interaction terms. Variable selection for the VC model with interaction effects is also important, since ignoring important predictors can lead to biased results, while including irrelevant predictors may lead to efficiency loss. Moreover, variable selection for the VC model with interaction terms is even more complex, since nonzero functional coefficients rather than nonzero parameters need to be identified. In this paper, we aim to develop a unified variable selection method for the VC model with hierarchical structure, which not only identifies the significant variables with nonzero functional coefficients but also ensures that the selected model keeps the hierarchical structure; that is, interaction terms can be selected into the model only if the corresponding main effects are in the model. Firstly, the kernel smoothing method is employed to obtain initial estimates of the varying coefficient functions. Secondly, local penalized least squares estimates with an overlapped group Lasso penalty are proposed to achieve variable selection and coefficient estimation simultaneously, and the estimators enforce the hierarchical structure. Thirdly, it is proved that the proposed estimators have the oracle properties; that is, the functional coefficients are estimated as well as if the true model were known in advance.
The rest of the paper is organized as follows. In Section 2, we propose the local penalized least squares estimator with a group Lasso penalty, which enforces the hierarchical structure. The asymptotic properties of the estimators are investigated in Section 3. In Section 4, we conduct simulations and the Boston housing data analysis to assess the finite-sample performance of the new estimators. Conclusions and future work are discussed in Section 5. Proofs of the theorems are postponed to Appendix A.

Modeling and Estimation
The varying coefficient model with hierarchical structure is defined as

Y_i = Σ_{j=1}^d β_j(U_i) X_ij + Σ_{1≤j<k≤d} φ_jk(U_i) X_ij X_ik + ε_i,  i = 1, 2, · · · , n,    (3)

where Y_i is the response variable, the X_ij, j = 1, 2, · · · , d, are the predictive variables, U_i is the index covariate, ε_i is the random error with E(ε_i) = 0, β_j(U_i) is the unknown smooth coefficient function of the main effect X_ij, and φ_jk(U_i) is the coefficient function associated with the interaction term X_ij X_ik. The hierarchical structure means that the interaction term X_ij X_ik enters the model if and only if both X_ij and X_ik do; namely, ∫ φ_jk²(u) du ≠ 0 holds if and only if both ∫ β_j²(u) du ≠ 0 and ∫ β_k²(u) du ≠ 0 hold, for 1 ≤ j < k ≤ d. Model (3) is a useful extension of the linear model with hierarchical structure (2) (see, e.g., [18,19,[24][25][26][27]), which maintains the good interpretability of parametric models while retaining the flexibility of nonparametric models.
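As a concrete illustration of the model's structure (our own sketch, not code from the paper), the following builds the full design with all pairwise interaction columns and checks whether a candidate sparsity pattern respects strong heredity.

```python
import itertools
import numpy as np

def build_design(X):
    """Augment main-effect columns with all pairwise interaction columns X_j * X_k, j < k."""
    n, d = X.shape
    pairs = list(itertools.combinations(range(d), 2))
    Z = np.column_stack([X[:, j] * X[:, k] for j, k in pairs])
    return np.hstack([X, Z]), pairs

def obeys_strong_heredity(main_active, inter_active):
    """An interaction (j, k) may be active only if main effects j and k are both active."""
    return all(j in main_active and k in main_active for j, k in inter_active)

X = np.arange(6.0).reshape(2, 3)                 # toy data: 2 observations, d = 3 main effects
W, pairs = build_design(X)
print(W.shape, pairs)                            # (2, 6) [(0, 1), (0, 2), (1, 2)]
print(obeys_strong_heredity({0, 1}, {(0, 1)}))   # True: both parents present
print(obeys_strong_heredity({0}, {(0, 1)}))      # False: main effect 1 is missing
```

The selection method of the paper is designed so that every selected pattern passes a check of this kind by construction.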
To simplify the representation of model (3), define β_j = (β_j(U_1), · · · , β_j(U_n))^τ, φ_jk = (φ_jk(U_1), · · · , φ_jk(U_n))^τ, and W_i = (X_i1, · · · , X_id, X_i1 X_i2, · · · , X_i(d−1) X_id)^τ, where the superscript "τ" denotes transposition. Then, model (3) can be reformulated as Y_i = Λ_i W_i + ε_i, where Λ_i denotes the i-th row of the coefficient matrix Λ = (β_1, β_2, · · · , β_d, φ_12, φ_13, · · · , φ_(d−1)d). Let the support set of the important main effects be M = {j : ||β_j|| > 0} and the support set of the important interaction effects be I = {(j, k) : ||φ_jk|| > 0, 1 ≤ j < k ≤ d}, where ||·|| denotes the Euclidean norm. To obtain the initial estimate of the coefficient matrix Λ, we minimize the local least squares objective

Q(Λ) = Σ_{t=1}^n Σ_{i=1}^n (Y_i − Λ_t W_i)² K_h(U_t − U_i),    (4)

where K_h(·) = (1/h) K(·/h), K(·) is a kernel function satisfying Condition C5, and h → 0 is the bandwidth. Denote the solution of (4) by Λ̃; its t-th row, t = 1, 2, · · · , n, has the closed form

Λ̃_t^τ = ( Σ_{i=1}^n W_i W_i^τ K_h(U_t − U_i) )^{−1} Σ_{i=1}^n W_i Y_i K_h(U_t − U_i).

Under the sparsity assumption, only the columns of Λ indexed by the support sets M and I are nonzero, so the main task of variable selection is to identify the sparse columns of Λ efficiently. Meanwhile, to maintain the hierarchical structure of the model, we apply the idea of the group Lasso proposed by Yuan and Lin [28] and propose the following local penalized least squares estimation:

Q_λ(Λ) = Q(Λ) + Σ_{j=1}^d λ_1j √( ||β_j||² + Σ_{k: k≠j} ||φ_jk||² ) + Σ_{1≤j<k≤d} λ_2jk ||φ_jk||,    (5)

where we set φ_jk = φ_kj, as is common in models with hierarchical structure; this assumption means that the interaction effects do not depend on the order of the two covariates, and it also significantly reduces the computational burden by lessening the number of functional coefficients from the order of O(d²) to O(d²/2) (see [18,19,25,27]). The λ_1j and λ_2jk are tuning parameters. The first penalty term groups each main effect β_j with all interactions φ_jk involving it, so that φ_jk can be nonzero only when both the j-th and the k-th groups are active, which enforces the hierarchy; the second term induces additional sparsity among the interactions. For simplicity of computation, we use the local quadratic approximation (see [10,29,30]) in each step of the iteration.
Take Λ̃ as the initial estimator, that is, Λ̂_λ^(0) = Λ̃. For the (m + 1)-th step, the objective function is approximated by replacing each penalty norm ||θ|| with its local quadratic surrogate ||θ||² / (2||θ̂^(m)||) + ||θ̂^(m)||/2 around the current estimate Λ̂_λ^(m); minimizing this approximate objective yields a closed-form, ridge-type update (7). Regard the limit Λ̂_λ as the Kernel Lasso (KLasso) estimator. The implementation proceeds as follows: (1) take the non-penalized estimator Λ̃ as the initial estimator Λ̂_λ^(0); and (2) iterate the update (7) until convergence; specifically, the iteration stops when the maximum absolute change of the estimated coefficients between two successive steps is less than 10^{−3}. For model sparsity, β̂_j and φ̂_jk are set to 0 when their norms are less than a small constant c_r (in our simulations, we choose c_r = 0.5).
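The iteration just described, namely a local quadratic approximation of each group-norm penalty that turns every step into a ridge-type solve, can be sketched numerically. The following is a minimal, illustrative Python version for an ordinary (non-kernel) regression with a group-lasso penalty; the function name, the penalty levels, and the hard-threshold value 0.05 (a stand-in for the paper's c_r) are all our own choices, not the paper's implementation (which was run in R).

```python
import numpy as np

def lqa_group_lasso(W, y, groups, lams, n_iter=100, eps=1e-6, tol=1e-4):
    """Group-lasso fit via local quadratic approximation (LQA).

    Each step replaces every penalty term lam_g * ||theta_g|| by a quadratic
    surrogate, so the update is a weighted ridge regression.  `groups` is a
    list of index arrays and may overlap, as in the hierarchical penalty.
    """
    p = W.shape[1]
    theta = np.linalg.solve(W.T @ W + eps * np.eye(p), W.T @ y)  # unpenalized start
    for _ in range(n_iter):
        D = np.zeros(p)
        for g, lam_g in zip(groups, lams):
            D[g] += lam_g / max(np.linalg.norm(theta[g]), eps)   # surrogate curvature
        new = np.linalg.solve(W.T @ W + np.diag(D), W.T @ y)
        done = np.max(np.abs(new - theta)) < tol
        theta = new
        if done:
            break
    theta[np.abs(theta) < 0.05] = 0.0   # hard-threshold tiny values (analogue of c_r)
    return theta

rng = np.random.default_rng(3)
n = 500
W = rng.normal(size=(n, 6))
theta_true = np.array([1.0, 2.0, 0.0, 0.0, 0.5, -0.5])
y = W @ theta_true + 0.5 * rng.normal(size=n)
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
theta_hat = lqa_group_lasso(W, y, groups, lams=[20.0, 20.0, 20.0])
print(theta_hat.round(2))   # middle group shrunk to exactly zero
```

Because the LQA surrogate never sets a coefficient exactly to zero on its own, the final thresholding step plays the role of the c_r truncation in the text.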
Selection of the bandwidth h and the tuning parameters λ_1j and λ_2jk (1 ≤ j < k ≤ d) is another important issue for the VC model. There are two types of parameters to consider, and a common approach is grid searching, such as cross validation (CV) or generalized cross validation (GCV) [31]. However, a joint grid search would be too expensive computationally; thus, we choose h in the kernel estimation of the coefficient functions by the CV criterion. For the tuning parameters, we apply the idea of Zou and Li [32] that large coefficients should receive small penalties while small coefficients should receive large penalties, and choose

λ_1j = λ_0 / √( ||β̃_j||² + Σ_{k: k≠j} ||φ̃_jk||² ),   λ_2jk = λ_0 / ||φ̃_jk||,

where β̃_j and φ̃_jk are the initial kernel estimates. Then, only one parameter, λ_0, needs to be considered, which is selected according to the BIC criterion

BIC_λ = log(RSS_λ / n) + d_λ log(n) / n,

where RSS_λ is the residual sum of squares of the fitted model and d_λ is the number of nonzero coefficient functions determined by Λ̂_λ. Then, λ_0 is obtained by minimizing BIC_λ.
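The one-parameter tuning idea can be sketched as follows: divide a single λ_0 by the magnitude of a pilot estimate so that large coefficients are penalized lightly, then pick λ_0 by a BIC-type score. The sketch below uses an ordinary adaptive Lasso fitted by coordinate descent and the generic criterion log(RSS/n) + df·log(n)/n; all names and settings are illustrative, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.5 * rng.normal(size=n)

beta_init = np.linalg.lstsq(X, y, rcond=None)[0]   # pilot (unpenalized) estimate
weights = 1.0 / np.abs(beta_init)                  # large pilot coef -> small penalty

def fit(lam0, n_sweeps=100):
    """Adaptive-lasso fit by cyclic coordinate descent; penalty on b_j is lam0 * weights[j] * n."""
    b = beta_init.copy()
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]         # partial residual excluding predictor j
            z = X[:, j] @ r
            lam = lam0 * weights[j] * n
            b[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_ss[j]
    return b

def bic(b):
    """Generic BIC-type criterion: log(RSS/n) + df * log(n) / n."""
    rss = float(np.sum((y - X @ b) ** 2))
    return np.log(rss / n) + np.count_nonzero(b) * np.log(n) / n

best = min((fit(l) for l in (0.005, 0.02, 0.1, 0.5)), key=bic)
print(np.nonzero(best)[0])   # support of the selected model
```

The point of the construction is that only the scalar λ_0 is searched over, while the per-coefficient penalties adapt automatically through the pilot estimate.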

Theoretical Properties
In this section, we establish asymptotic properties of the proposed estimator, including model selection consistency and the oracle properties. First, we introduce some notation and list the regularity conditions. Define a_n = max{λ_1j, λ_2kl : j ∈ M, (k, l) ∈ I} and b_n = min{λ_1j, λ_2kl : j ∈ M^c, (k, l) ∈ I^c}. Let X_M denote the matrix generated from X with columns indexed by M, and Z_I the matrix generated from Z with columns indexed by I. The orders o_p(·) and O_p(·) are defined as follows for random variables ξ and η: ξ = o_p(η) means that, for all ε > 0, P(|ξ/η| > ε) → 0 as n → ∞; and ξ = O_p(η) means that, for all ε > 0, there exists c > 0 such that P(|ξ/η| > c) < ε for n sufficiently large. The following traditional conditions (see [3]) are also needed.
C1. For 1 ≤ i ≤ n, the covariate X_i is independent of the error ε_i.
C2. The covariate W_i has a finite p-th order moment, i.e., E||W_i||^p < ∞, where p ≥ 2.
C3. The density function f(U) of U is continuous and has a second-order derivative.
C4. β_j(U) and φ_jk(U) are all bounded and have second-order continuous derivatives for j ∈ M and (j, k) ∈ I.
Theorem 1. Suppose the regularity conditions hold and the tuning parameters satisfy appropriate rate conditions on a_n and b_n; then P(M̂ = M) → 1 and P(Î = I) → 1 as n → ∞.
Theorem 1 shows that the KLasso estimator can select the true model consistently. Next, we discuss the oracle properties of the KLasso estimator in Theorem 2.
Note that the optimal convergence rate of the oracle estimator is O_p(n^{−2/5}). We also observe that the difference in convergence rate between the KLasso estimator and the oracle estimator is negligible over the univariate index set. Thus, we conclude that the KLasso estimator shares the same asymptotic properties as the oracle estimator. Proofs of these two theorems are given in Appendix A.
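The O_p(n^{−2/5}) rate is the classical optimal rate for estimating a twice-differentiable function, and a standard bias–variance sketch (not specific to this paper's proof) shows where it comes from:

```latex
\operatorname{MSE}\{\hat\beta_j(u)\}
  \;=\; \underbrace{O(h^4)}_{\text{squared bias}}
  \;+\; \underbrace{O\{(nh)^{-1}\}}_{\text{variance}},
\qquad
h_{\mathrm{opt}} \propto n^{-1/5}
\;\Longrightarrow\;
\operatorname{MSE} = O(n^{-4/5}),
\quad
\hat\beta_j(u) - \beta_j(u) = O_p(n^{-2/5}).
```

Balancing the h^4 squared-bias term against the (nh)^{-1} variance term gives the optimal bandwidth order n^{-1/5}, and taking a square root of the resulting MSE yields the quoted rate.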

Simulation Study
In this section, three examples are used to assess the proposed procedures in terms of coefficient function estimation and variable selection. Data with sample sizes n = 100, 200, 500 are independently generated from three models (Models 1–3) whose functional coefficients mainly include trigonometric functions with different periods, an exponential function, and power functions with different locations and powers. The random vector (X_i1, · · · , X_i10)^τ follows the multivariate normal distribution with zero means and Cov(X_ij, X_ik) = 0.5^{|k−j|} for 1 ≤ j < k ≤ 10. The index variable U_i ∼ Uniform[0, 1], and the random error ε_i follows a standard normal distribution. In the estimation procedures, the Gaussian kernel K(u) = (1/√(2π)) exp(−u²/2) is employed. The initial estimated coefficient matrix Λ̃ is obtained by minimizing (4), and the optimal bandwidth is selected via the CV criterion; the selected bandwidth is also used in the KLasso estimation procedure.
To assess the estimation accuracy of the KLasso estimator, the empirical integrated squared error

ISE = (1/n) Σ_{t=1}^n Σ_{j=1}^d { β̂_j(U_t) − β_j(U_t) }² + (1/n) Σ_{t=1}^n Σ_{1≤j<k≤d} { φ̂_jk(U_t) − φ_jk(U_t) }²

is computed. As a benchmark for comparison, the mean empirical integrated squared error (MISE) and its standard error (in parentheses) of the oracle estimators and the proposed estimators for the three models are reported in Tables 1–3, respectively. All empirical results were computed in R [33] based on 1000 replications. From Tables 1–3, we can see that: (1) the MISEs decrease quickly as the sample size increases for both the oracle estimator and the KLasso estimator; (2) the proposed estimator performs comparably with the oracle estimator in terms of coefficient estimation for moderate sample sizes; and (3) the estimators for Model 1 have smaller MISEs than those of the other two models, so the number and complexity of the nonzero coefficient functions to be estimated may affect the performance of the proposed procedure in finite samples.
Let CM denote the frequency of the nonzero coefficients being correctly estimated as nonzero, CZ the frequency of the zero coefficients being correctly estimated as zero, and CS the frequency of the correctly selected model, meaning that only nonzero coefficients are estimated as nonzero. CM, CZ, and CS are summarized in Table 4. For all three models, as the sample size increases, CM, CZ, and CS all approach 100%, which implies that the proposed method identifies the model well. In addition, Model 1 has the highest probability of being correctly selected among the three models.
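The MISE computation across Monte Carlo replications can be sketched as follows (an illustrative Python version with mock replication curves; the paper's simulations were run in R, and its exact ISE formula may differ in normalization).

```python
import numpy as np

def ise(est_fn, true_fn, grid):
    """Empirical integrated squared error of one estimated coefficient function on a grid."""
    return float(np.mean((est_fn(grid) - true_fn(grid)) ** 2))

def mise(replicate_fns, true_fn, grid):
    """Mean empirical ISE across Monte Carlo replications, with its standard error."""
    vals = np.array([ise(f, true_fn, grid) for f in replicate_fns])
    return vals.mean(), vals.std(ddof=1) / np.sqrt(len(vals))

grid = np.linspace(0.05, 0.95, 46)           # the fixed series U_1, ..., U_46 from the text
true_fn = lambda u: np.sin(2 * np.pi * u)
# three mock "replication" estimates: the truth shifted by -0.1, 0, +0.1
reps = [lambda u, d=d: np.sin(2 * np.pi * u) + d for d in (-0.1, 0.0, 0.1)]
m, se = mise(reps, true_fn, grid)
print(m, se)
```

In the paper the replication estimates come from refitting the model on 1000 independently generated datasets rather than from the artificial shifts used here.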
Besides, we depict the quantile curves of the estimated coefficient functions at the fixed series U_1, U_2, · · · , U_46, where U_i = 0.05 + (i − 1) × 0.02, i = 1, 2, · · · , 46. Figures 1–3 show the quantile curves for Models 1–3 with sample size 200, respectively. From these figures, we can see that the main effects and interaction effects are correctly selected and consistently estimated. Meanwhile, the estimated curves tend to underestimate at the peaks and overestimate at the valleys of the true curves. In summary, the proposed method for VC models with hierarchical structure works well.

The Boston Housing Data Analysis
To further investigate the performance of our method, we apply it to the Boston housing data, which concern the median value of owner-occupied homes (MV) for 506 census tracts in 1970. The dataset "Boston" is available in the R package "MASS" [34]. Following the basic housing value equation of Harrison and Rubinfeld [35] and the studies of Fan and Huang [3] and Wang and Xia [7], we consider seven covariates: CRIM (per capita crime rate by town), RM (average number of rooms per dwelling), NOX (nitric oxides concentration), PTRATIO (pupil–teacher ratio by town), TAX (full-value property-tax rate per $10,000), AGE (proportion of owner-occupied units built prior to 1940), and LSTAT (percentage of lower-status population). A log transformation of MV and power transformations of RM, NOX, and LSTAT are employed to fit the Boston data well. For simplicity, CRIM, RM², TAX, NOX², PTRATIO, and AGE are denoted by X_1, · · · , X_6, respectively, and they are all scaled to mean zero and standard deviation 1. Meanwhile, we take X_0 ≡ 1 as the intercept term (INT), log(MV) as the response Y, and LSTAT^{1/2} as the index variable U. Consequently, the varying coefficient model with hierarchical structure is fitted to the Boston housing data.
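The preprocessing described above can be sketched as follows (an illustrative Python version operating on a synthetic stand-in for the data; the real dataset ships with R's MASS package, and the function name is ours).

```python
import numpy as np

def preprocess(boston):
    """Build Y = log(MV), index U = LSTAT^(1/2), and the scaled design with intercept.

    Predictors, in order: X0 = 1 (INT), then CRIM, RM^2, TAX, NOX^2, PTRATIO, AGE,
    each standardized to mean zero and standard deviation one, as in the text.
    """
    y = np.log(boston["MV"])
    u = np.sqrt(boston["LSTAT"])
    cols = [boston["CRIM"], boston["RM"] ** 2, boston["TAX"],
            boston["NOX"] ** 2, boston["PTRATIO"], boston["AGE"]]
    X = np.column_stack([(c - c.mean()) / c.std() for c in cols])
    X = np.hstack([np.ones((len(y), 1)), X])
    return y, u, X

# tiny synthetic stand-in with the same column names as the real data
rng = np.random.default_rng(2)
fake = {k: rng.uniform(1.0, 10.0, 8)
        for k in ("MV", "LSTAT", "CRIM", "RM", "TAX", "NOX", "PTRATIO", "AGE")}
y, u, X = preprocess(fake)
print(X.shape)   # (8, 7): intercept plus six standardized predictors
```

Note that RM and NOX are squared before standardization, matching the order of operations implied by the text.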
The data are split into training and testing sets with sample sizes 405 and 101, respectively, by random sampling. Estimates are based on the training data, and the performance of the proposed procedures is evaluated on the testing data. The CV criterion suggests a bandwidth h = 0.31, and the optimal tuning parameter selected by the BIC criterion is λ_0 = 1.8. In the implementation, variables whose estimated functional coefficients have norm less than 0.1 are regarded as insignificant. The results show that INT, CRIM, RM², TAX, PTRATIO, AGE, and RM²×TAX are significant, while the others are not. The estimated coefficient function curves for these relevant variables are depicted in Figure 4. On the testing data, the multiple R² for the proposed procedure is 0.8314, compared with 0.8021 for the VC model without interaction terms. Thus, the proposed VC model with hierarchical structure fits the testing data slightly better.

Conclusions and Future Works
In this paper, VC models with hierarchical structure are investigated, and a unified variable selection procedure is proposed, which simultaneously selects the nonzero effects and estimates the unknown coefficient functions while the selected model enforces the hierarchical structure. It is proved that the proposed penalized estimators have the oracle properties; that is, the coefficients are estimated as well as if the true model were known in advance. Simulation studies and the Boston housing data analysis are carried out to examine the performance of the proposed method in the finite-sample case.
However, this paper focuses on a fixed dimensionality of the predictive covariates. In future work, we will investigate the VC model with hierarchical structure with a diverging number of predictors. Estimation and variable selection for the generalized varying coefficient model with hierarchical structure, as well as estimation for the semiparametric varying coefficient partially linear model with hierarchical structure, are also interesting topics that deserve study.
Appendix A. Proofs of the Theorems

We first prove that P(M̂^c = M^c) → 1 as n → ∞; that is, for any j ∈ M^c, P(β̂_j(U_t) = 0) → 1 for 1 ≤ t ≤ n. If this were not true, β̂_j(U_t) ≠ 0 would have to solve the normal equation ∂Q_λ(Λ̂)/∂β_j(U_t) = 0. By Conditions C1–C6 and Lemma 2, we know that Σ_{t=1}^n Σ_{i=1}^n X_ij ε_i K_h(U_t − U_i) = O_p(n²). Meanwhile, according to Lemma 1, we have β̂_j(U_t) / √( ||β̂_j||² + Σ_{k: k≠j} ||φ̂_jk||² ) = O_p(1). In addition, for b_n = min{λ_1j, λ_2kl : j ∈ M^c, (k, l) ∈ I^c}, we get ∂Q_λ(Λ̂)/∂β_j(U_t) = O_p(n²) + O_p(n² h²) + (nh) b_n O_p(1). It clearly follows that, for j ∈ M^c, there is no solution of ∂Q_λ(Λ̂)/∂β_j(U_t) = 0 with β̂_j(U_t) ≠ 0; that is to say, any j ∈ M^c also belongs to M̂^c, so P(M̂^c = M^c) → 1 is proved, and naturally P(M̂ = M) → 1.
Next, we show that P(Î^c = I^c) → 1 as n → ∞; that is, for any (j, k) ∈ I^c, P(φ̂_jk(U_t) = 0) → 1 for 1 ≤ t ≤ n. If the claim were not true, φ̂_jk(U_t) ≠ 0 would have to solve the corresponding normal equation ∂Q_λ(Λ̂)/∂φ_jk(U_t) = 0. Referring to the definitions, we consider the following three situations for (j, k) ∈ I^c: (1) j, k ∈ M; (2) j ∈ M and k ∈ M^c; and (3) j, k ∈ M^c.