Multivariate Functional Kernel Machine Regression and Sparse Functional Feature Selection

Motivated by mobile devices that record data at a high frequency, we propose a new methodological framework for analyzing a semi-parametric regression model that allows us to study a nonlinear relationship between a scalar response and multiple functional predictors in the presence of scalar covariates. Utilizing functional principal component analysis (FPCA) and the least-squares kernel machine (LSKM) method, we substantially extend the framework of semi-parametric regression models of scalar responses on scalar predictors by allowing multiple functional predictors to enter the nonlinear model. Regularization is established for feature selection in the setting of reproducing kernel Hilbert spaces. Our method performs model fitting and variable selection on functional features simultaneously. For the implementation, we propose an effective algorithm to solve the related optimization problems, in which iterations alternate between fitting linear mixed-effects models and running a variable selection method (e.g., the sparse group lasso). We show algorithmic convergence results and theoretical guarantees for the proposed methodology. We illustrate its performance through simulation experiments and an analysis of accelerometer data.


Introduction
Data captured by mobile devices have lately received much attention in the data science community. Such data are typically recorded at a high frequency, giving rise to an ample volume of information at a very fine scale, and thus present many methodological challenges in statistical modeling and data analysis. In this paper, we utilize the strength of the classical kernel machine method, which enjoys fast computation via the linear mixed-effects model, to deal with such high-frequency data using a functional data analysis approach. The motivation for our proposed framework comes from data collected by a tri-axis accelerometer. Accelerometers, worn on the hip or wrist as a way of monitoring physical activity, are becoming more and more common [1][2][3][4]. Several different accelerometers are available, such as the ActiGraph GT3X+ (ActiGraph, Pensacola, FL, USA) and the Actical (Phillips Respironics, Bend, OR). Raw accelerometer data are often collected as high-resolution signals with a sampling frequency ranging from 30 to 100 Hz. The commercial software on these devices provides activity counts (ACs) [2,4], which are calculated from the raw accelerometer data using proprietary algorithms. As an example from our motivating dataset, Figure 1 displays a three-dimensional time series of ACs per minute, one for each axis, from one subject wearing the GT3X+ over a period of 7 days.
Oftentimes, different types of summaries of the tri-axis ACs are suggested in the literature as opposed to using all three raw functional curves [5][6][7][8]. These summary-data-based approaches may be regarded as a quick and dirty dimension reduction strategy that produces summarized data of computationally manageable volume, which would then be analyzed by existing methods and software. One concern with the use of summarized data is the loss of potential fine features that can only be captured in data of high resolution. Recently, some researchers have attempted to use the entire functional AC curve through functional data analysis techniques [6,9,10]. Further details on current methods for retrieving and interpreting accelerometer data can be found in [11]. Our contribution in this paper pertains to a new framework in which tri-axis accelerometer data are used as three-dimensional correlated functional predictors in an association analysis with a potential health outcome such as the Body Mass Index (BMI). The relationship between physical activity and childhood obesity has long been a central interest of the public health sciences, and our new scalar-on-functional regression model can provide new insights into this important scientific problem. We begin with a brief review of existing functional data models, the least-squares kernel machine model, and different variable selection techniques, which serve as a prelude to the framework of this paper.

Functional Regression
There has been much attention in recent years given to functional data analysis (FDA), where either the covariates, or the response, or both are functional as opposed to scalar in nature [12][13][14][15][16][17]. In this paper, we focus on a methodology that allows us to relate multiple functional covariates to a scalar outcome in a nonlinear way in the presence of other scalar covariates. To proceed, let us introduce some notation. Let L²(T) be the class of square-integrable functions on a compact set T. This is a separable Hilbert space with inner product ⟨f, g⟩ := ∫_T f g for f, g ∈ L²(T). Consider a probability space (Ω, F, P), where Z denotes a functional random variable that maps into L²(T), namely Z : Ω → L²(T). Define L²(Ω) := {Z : (∫_Ω ‖Z‖² dP)^{1/2} < ∞}, where P is a certain probability measure and ‖Z‖² = ⟨Z, Z⟩, and assume Z ∈ L²(Ω) in the rest of this paper. For convenience, we also assume that Z is mean centered, namely E(Z) = 0.
The class of functional linear models (FLMs) (e.g., [13][14][15]) was proposed to relate a functional covariate Z to a mean-centered scalar outcome y, which is also known as scalar-on-functional regression: y = ⟨b, Z⟩ + ε, where the error term ε is a mean-zero random variable uncorrelated with Z. An optimal solution of the unknown functional parameter b ∈ L²(T) is typically obtained by minimizing the mean-squared error: inf_{b∈L²(T)} E(y − ⟨b, Z⟩)². Moreover, the mean model for the mean-centered scalar y takes the form E(y|Z) = ∫_T Z(t)b(t)dt.
As suggested in the literature, we may obtain an optimal estimator of b by expanding the functional predictor Z under certain basis functions. In this paper, we focus on the utility of functional principal component analysis (FPCA) to perform the decomposition of the functional Z. By the Karhunen-Loève expansion (e.g., [18][19][20]), we may write

Z = Σ_{k=1}^∞ √ς_k ξ_k φ_k,

where ς_k > 0 are the eigenvalues, φ_k are the corresponding orthonormal eigenfunctions, and the loadings are given by ξ_k := (1/√ς_k)⟨Z, φ_k⟩. These coefficients satisfy (i) mean zero, E(ξ_k) = 0; (ii) variance one, E(ξ_k²) = 1; and (iii) uncorrelated, E(ξ_k ξ_j) = 0 for k ≠ j. Then, the mean model may be rewritten as follows:

E(y|Z) = Σ_{k=1}^∞ β_k ξ_k,   (1)

where the coefficients β_k = ⟨b, √ς_k φ_k⟩, k = 1, 2, · · ·, are unknown due to the unknown b. Equation (1) presents a linear projection of the scalar outcome y on the space spanned by the standardized principal components (PCs) ξ_k of the functional predictor Z. Along these lines of research, Müller and Yao (2008) proposed a class of functional additive models (FAMs) that extends Equation (1) by allowing a nonparametric form of the projection:

E(y|Z) = Σ_{k=1}^∞ f_k(ξ_k),   (2)

where f_k is a fully unspecified nonlinear smooth function to be estimated. It is obvious that Müller and Yao's extension given in (2) takes an additive form in the individual coefficient (or feature) components ξ_k. Regularization is often needed for both (1) and (2) in order to deal with these infinite-dimensional unknowns. One of the challenges concerning regularization for (2) lies in the technical treatment in the functional space. Müller and Yao (2008) [21] proposed truncation (or a hard threshold) of the eigenspace to retain only the leading components that explain the majority of the total variation in Z. Zhu, Yao, and Zhang (2014) [15] proposed another regularization for the functions f_k using the powerful COSSO method [22].
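To make the FPCA decomposition concrete, here is a minimal numerical sketch (our illustration, not the authors' implementation; the grid, sample size, and toy eigenvalues are arbitrary) that recovers the eigenvalues ς_k and standardized loadings ξ_k from discretized curves:

```python
import numpy as np

def fpc_scores(Z, n_components):
    """Estimate eigenvalues and standardized FPC loadings from curves.

    Z : (n, m) array of n mean-centered curves on an equispaced grid of m
    points over [0, 1].  Returns (eigenvalues, scores), where each score
    column mimics xi_k = <Z, phi_k> / sqrt(sigma_k), with unit variance.
    """
    n, m = Z.shape
    dt = 1.0 / m
    M = (Z.T @ Z) / n * dt                  # discretized covariance operator
    evals, evecs = np.linalg.eigh(M)
    idx = np.argsort(evals)[::-1][:n_components]
    evals = evals[idx]
    phi = evecs[:, idx] / np.sqrt(dt)       # L2-normalized eigenfunctions
    scores = Z @ phi * dt                   # inner products <Z_i, phi_k>
    return evals, scores / np.sqrt(evals)   # standardize to variance one

# toy check: curves generated from two Fourier basis functions with
# eigenvalues 4 and 1
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200, endpoint=False)
phi_true = np.vstack([np.sqrt(2) * np.sin(2 * np.pi * t),
                      np.sqrt(2) * np.cos(2 * np.pi * t)])
xi_true = rng.standard_normal((500, 2))
Z = (xi_true * np.sqrt([4.0, 1.0])) @ phi_true
evals, xi_hat = fpc_scores(Z, 2)
```

The recovered eigenvalues approximate (4, 1) up to sampling error, and the standardized scores match the generating loadings up to sign.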
One advantage of this kind of regularization method is that higher-order functional principal components are allowed to be potentially included in the fitted model if they make stronger contributions to the functional relationship than the leading functional principal components. This regularization method [15] begins with an additive model E(y|Z) = Σ_{k=1}^s f_k(ξ_k), where s represents some initial degree of truncation specifying the total number of additive components to be considered. Then, COSSO helps simultaneously regularize and select important functional components among the s functions f_k. Although the above discussion is based on a single functional predictor Z, it is appealing to extend such a framework to multiple functional predictors for a broad range of problems.
When multiple functional predictors, say Z_1, . . . , Z_p, are considered, it is not clear whether the above additive model specification remains suitable to handle the complexity, especially when a non-additive relationship (e.g., interactions) is of interest in understanding the association between a scalar outcome and multiple functional predictors. In effect, from the perspectives of both theoretical advances and application needs, relaxing the additive relationship is an important task in functional data analysis. Alternatively, there are some methods (e.g., [16,17]) in the literature that do not use the strategy of decomposing Z into its functional components. In this paper, we adopt the framework of kernel machine regression models to accommodate non-additive relationships between multiple functional predictors and the scalar outcome.

Least-Squares Kernel Machine
Liu, Lin, and Ghosh (2007) [23] proposed a semi-parametric regression model y_i = x_i′β + h(z_i) + ε_i for subjects i = 1, . . . , n, where they used the least-squares kernel machine (LSKM) to analyze multidimensional genetic pathways denoted by a vector z_i.
The key feature of this model is the nonlinear relationship between the outcome y i and a vector of gene expressions z i , which is characterized by a nonparametric smooth function h. Under the theory of smoothing splines, function h is assumed to lie in a reproducing kernel Hilbert space (RKHS), H K , generated by a positive-definite kernel function K(·, ·). For the ease of exposition, we suppress the bandwidth for the kernel K in the following discussion.
Then, both the parameter β and the function h are estimated by maximizing the scaled penalized likelihood function

J(h, β) = −(1/2) Σ_{i=1}^n (y_i − x_i′β − h(z_i))² − (λ_1/2) ‖h‖²_{H_K},   (3)

where λ_1 > 0 is the tuning parameter and ‖·‖_{H_K} is the norm of the RKHS. For a function h ∈ H_K, the representer theorem gives h(·) = Σ_{i=1}^n α_i K(·, z_i), so that the vector of evaluations h = (h(z_1), . . . , h(z_n))′ may be written as h = Kα, where K is an n × n matrix whose (i, j) entry is K(z_i, z_j) and α = (α_1, . . . , α_n)′.
It is known in the literature (e.g., [23,24]) that maximizing J(h, β) in (3) turns out to be equivalent to solving the normal equations from the following linear mixed-effects model (LMM): Y = Xβ + h + ε, where h is an n × 1 vector of random effects with distribution N(0, τK) and ε is an n-dimensional error term with ε ∼ N(0, σ²I), where τ = λ_1^{-1}σ² > 0. One remarkable advantage of solving (3) through the existing numerical procedure of the LMM, widely advocated in the literature [25], is that we can determine the smoothing parameter λ_1 as part of the estimation of the variance components of the LMM. Therefore, instead of using cross-validation or other information-based tuning methods for λ_1, we can solve simultaneously for all the model parameters in (3), as shown in [23]. Utilizing this numerical strength of the kernel machine regression model, we propose a semi-parametric regression model by incorporating functional principal components of functional predictors (i.e., the z_i) to evaluate a nonlinear relationship of a scalar outcome with multiple functional covariates in a non-additive way. Assuming that the function h belongs to an RKHS, we can use existing software packages for solving LMMs to obtain estimates of all model parameters and the smoothing parameter.
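For a fixed λ_1, the optimizer of (3) is available in closed form from the stationarity conditions in (α, β), which the sketch below illustrates on toy data. This is a plain penalized least-squares solve, not the REML-based mixed-model implementation described above; the toy model, the bandwidth, and the value of λ_1 are illustrative choices of ours:

```python
import numpy as np

def gaussian_kernel(Z, bandwidth):
    """K(u, v) = exp(-||u - v||^2 / bandwidth) over the rows of Z."""
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / bandwidth)

def lskm_fit(y, X, K, lam):
    """Penalized least-squares solution of the LSKM objective (3) for a
    fixed tuning parameter lam.  Setting the gradients in (alpha, beta)
    to zero yields, with W = (K + lam*I)^{-1},
        beta  = (X' W X)^{-1} X' W y,
        alpha = W (y - X beta),
    and the fitted nonparametric part is h = K alpha."""
    n = len(y)
    W = np.linalg.inv(K + lam * np.eye(n))
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    alpha = W @ (y - X @ beta)
    return beta, K @ alpha

# toy semi-parametric data: y = 2*x + sin(2*pi*z) + noise
rng = np.random.default_rng(1)
n = 300
x = rng.standard_normal((n, 1))
z = rng.uniform(0.0, 1.0, (n, 1))
y = 2.0 * x[:, 0] + np.sin(2 * np.pi * z[:, 0]) + 0.1 * rng.standard_normal(n)
K = gaussian_kernel(z, bandwidth=0.05)   # bandwidth held fixed for this toy
beta, h_hat = lskm_fit(y, x, K, lam=0.05)
```

The recovered β is close to the true value 2, and the fitted h tracks the sine signal at the design points.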

Feature Selection
To deal with high-dimensional functional principal components from functional covariates, we invoke a sparse regularization approach in the kernel machine regression model. Note that for both mean models (1) and (2), one needs to truncate the series from the Karhunen-Loève expansion; regularization helps reduce the infinite number of terms to a finite sum. To introduce some notation, we present here a brief review of the group lasso (GL) [26], the sparse group lasso (SGL) [27], and the non-negative garrote [28]. See also the series of work originated by COSSO [22]. Yuan and Lin [26] proposed the group lasso, which solves the convex optimization problem:

min_β (1/2)‖Y − Σ_{ℓ=1}^L X_ℓ β_ℓ‖₂² + λ Σ_{ℓ=1}^L ‖β_ℓ‖₂,

where L is the total number of groups of covariates and X_ℓ refers to the subset of covariates associated with group ℓ. Friedman, Hastie, and Tibshirani [27] extended the group lasso to allow within-group sparsity, namely the SGL, given as

min_β (1/2)‖Y − Σ_{ℓ=1}^L X_ℓ β_ℓ‖₂² + (1 − δ)λ Σ_{ℓ=1}^L ‖β_ℓ‖₂ + δλ‖β‖₁.

The additional ℓ₁-norm penalty term on β encourages individual sparsity, while the first penalty targets sparsity at the group level. It is easy to see that the group lasso is a special case of the SGL when δ = 0.
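Because the SGL penalty has a closed-form proximal map (elementwise soft-thresholding followed by groupwise shrinkage), the SGL problem can be solved by proximal gradient descent. The following is a generic sketch of ours on a toy design, not the solver used later in the paper, and it omits the per-group weights that some formulations include:

```python
import numpy as np

def sgl_prox(v, groups, t, lam, delta):
    """Proximal map of t*lam*(delta*||.||_1 + (1-delta)*sum_l ||._l||_2):
    soft-threshold each entry, then shrink each group's l2 norm."""
    u = np.sign(v) * np.maximum(np.abs(v) - t * lam * delta, 0.0)
    for g in groups:
        norm = np.linalg.norm(u[g])
        scale = max(0.0, 1.0 - t * lam * (1.0 - delta) / norm) if norm > 0 else 0.0
        u[g] = scale * u[g]
    return u

def sgl_fit(X, y, groups, lam, delta, n_iter=2000):
    """Sparse group lasso by proximal gradient descent (ISTA):
    min_b (0.5/n)*||y - X b||^2 + lam*(delta*||b||_1 + (1-delta)*sum_l ||b_l||_2)."""
    n, p = X.shape
    t = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()   # step size 1/L
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = sgl_prox(beta - t * grad, groups, t, lam, delta)
    return beta

# toy design: two groups of three covariates; only group 0 is active,
# and within it only two of the three coefficients are nonzero
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 6))
beta_true = np.array([1.5, -2.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(200)
groups = [np.arange(0, 3), np.arange(3, 6)]
beta_hat = sgl_fit(X, y, groups, lam=0.1, delta=0.5)
```

The inactive group is set exactly to zero by the proximal map, illustrating the two-level (group and within-group) sparsity the SGL is designed for.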
The non-negative garrote proposed by Breiman (1995) [28] is another useful means of variable selection. It invokes a scaled version of least-squares estimation given by:

min_d (1/2)‖Y − X̃d‖₂² + λ Σ_{j=1}^p d_j, subject to d_j ≥ 0, j = 1, . . . , p.

Here, X̃ = (x̃_1, . . . , x̃_p) is an n × p matrix with columns x̃_j = x_j β̂_j^{OLS}, with β̂_j^{OLS} being the least-squares estimates from arg min_β (1/2)‖Y − Xβ‖₂² with no constraints. Obviously, an estimated d̂_j = 0 implies that the covariate x_j is excluded from the fitted model. Breiman's formulation, which turns a variable selection problem into a parameter estimation problem, will be applied in the development of feature selection on functional principal components in this paper.

This paper is organized as follows. Section 2 introduces our proposed high-dimensional kernel machine regression. Section 3 outlines a simple step-by-step algorithm that is used to implement the sparse estimation method. Section 4 concerns asymptotic properties of our proposed sparse kernel machine regression. Section 5 provides simulation results to examine the performance of our method, with comparisons to existing methods. Section 6 illustrates the proposed method by an association analysis of the relationship between the BMI and functional accelerometer data. Section 7 includes our conclusions. Appendix A contains some key technical details, including the proofs of the theoretical results, while Appendix B presents a discussion of the model identifiability issue.
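Before proceeding, Breiman's non-negative garrote reviewed above can be sketched numerically as follows. The shrinkage factors are denoted d_j here, the nonnegativity constraint is handled by projected gradient descent, and the toy data are ours; this is an illustration of the idea, not Breiman's original algorithm:

```python
import numpy as np

def nn_garrote(X, y, lam, n_iter=5000):
    """Non-negative garrote: shrinkage factors d_j >= 0 that rescale the
    OLS fit, solving min_d (0.5/n)*||y - Xt d||^2 + lam*sum_j d_j with
    d_j >= 0, where column j of Xt is x_j times its OLS coefficient."""
    n, p = X.shape
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    Xt = X * beta_ols                                   # column-scaled design
    t = 1.0 / np.linalg.eigvalsh(Xt.T @ Xt / n).max()   # step size 1/L
    d = np.ones(p)
    for _ in range(n_iter):
        grad = Xt.T @ (Xt @ d - y) / n + lam
        d = np.maximum(d - t * grad, 0.0)               # project onto d >= 0
    return d, d * beta_ols

# toy data: the second, fourth, and fifth covariates are pure noise
# and should receive shrinkage factors d_j = 0
rng = np.random.default_rng(3)
X = rng.standard_normal((200, 5))
beta_true = np.array([2.0, 0.0, -1.5, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(200)
d, beta_garrote = nn_garrote(X, y, lam=0.05)
```

Noise covariates are dropped exactly (d_j = 0), while strong signals keep factors near one, which is how the garrote turns variable selection into parameter estimation.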

Model and Estimation
Consider a regression analysis of a scalar outcome y on p functional covariates, Z_ℓ, ℓ = 1, . . . , p. Let z_i^ℓ = (ξ_1, . . . , ξ_{s_ℓ})_i′ be the s_ℓ-element vector of functional principal component (FPC) features from the i-th observation of the ℓ-th functional covariate Z_ℓ, and let z_i = [(z_i^1)′, . . . , (z_i^p)′]′ be the grand vector of all FPC features from all p functional covariates for subject i, i = 1, . . . , n. Clearly, the set of FPC features from each functional covariate forms a group, and in total, there are p groups with s = Σ_{ℓ=1}^p s_ℓ FPC features and z_i ∈ R^s. The high dimensionality of the FPC features presents the key methodological challenge in the analysis. We consider the following functional kernel machine regression (FKMR) model:

y_i = x_i′β + h(z_i) + ε_i,   (4)

where β ∈ R^q is a set of parameters for the effects of q scalar covariates x = (x_1, . . . , x_q)′, h ∈ H_K is an s-variate smooth nonparametric function with H_K being the functional space generated by a Mercer kernel K, and the error terms ε_i are iid N(0, σ²). The FKMR model (4) allows for not only nonlinear, but also non-additive relationships between the scalar outcome y and the multiple functional covariates Z_ℓ, ℓ = 1, . . . , p, via their FPC features. The statistical task is to estimate and select the important functional covariates that are related to the outcome of interest through regularizing the FPC features within each functional covariate. To proceed, following Breiman's [28] non-negative garrote method, we introduce a new s-dimensional scaling vector γ ∈ R^s, γ = (γ_1, . . . , γ_{s_1}, . . . , γ_s)′, by which we form γ ∘ z_i = (γ_1 ξ_1^1, . . . , γ_{s_1} ξ_{s_1}^1, . . . , γ_s ξ_{s_p}^p)_i′, a new vector of FPC features weighted by γ via the Hadamard product (i.e., the elementwise product). Note that γ is grouped and denoted by γ = ((γ^1)′, . . . , (γ^p)′)′, where γ^ℓ is the s_ℓ-element vector corresponding to the FPC features z^ℓ of the ℓ-th functional covariate Z_ℓ.
When an element, say γ_j^ℓ, equals zero, the corresponding FPC feature ξ_j^ℓ is not selected into the set of important FPCs; moreover, the functional covariate Z_ℓ is excluded from the FKMR model when the entire vector γ^ℓ = 0.
We estimate the unknowns in the FKMR model (4), as well as the scaling parameters γ, by minimizing the penalized objective function J_1(h, β, γ) given on the right-hand side of the following Equation (5):

min_{h,β,γ} J_1(h, β, γ) = (1/2) Σ_{i=1}^n (y_i − x_i′β − h(γ ∘ z_i))² + (λ_1/2)‖h‖²_{H_K} + λ_2 ρ(γ; δ),   (5)

where λ_1 > 0 and λ_2 > 0 are two tuning parameters, and the penalty ρ(γ; δ) may be specified according to a certain regularization method. For the case of the sparse group lasso (SGL),

ρ(γ; δ) = (1 − δ) Σ_{ℓ=1}^p ‖γ^ℓ‖₂ + δ‖γ‖₁.

Typically, δ is predetermined and set to 0.95 or 0.05 depending on the trade-off between group and within-group sparsity, while the factor (1 − δ) controls the group sparsity relative to the individual sparsity of each functional predictor Z_ℓ. Meanwhile, a large tuning parameter λ_2 removes a certain group of FPC features from the FKMR model when all elements in the vector γ^ℓ are zero. Given h ∈ H_K, an equivalent optimization to the above (5) can be formulated as follows:

min_{α,β,γ} (1/2)‖Y − Xβ − K(γ; Z)α‖₂² + (λ_1/2) α′K(γ; Z)α + λ_2 ρ(γ; δ),   (6)

where K(γ; Z) is an n × n matrix whose (i, j) entry is K(γ ∘ z_i, γ ∘ z_j). Lemma 1 below establishes the equivalence of the optimization solutions between (5) and (6), which is crucial in our estimation procedure.
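The role of γ inside K(γ; Z) can be sketched as follows: the features are rescaled elementwise before the kernel is evaluated, so a zeroed-out group of scaling parameters removes its functional covariate from the model entirely. All sizes and values below are illustrative choices of ours:

```python
import numpy as np

def scaled_kernel(Z, gamma, bandwidth=None):
    """Entry (i, j) is K(gamma*z_i, gamma*z_j) for the Gaussian kernel
    K(u, v) = exp(-||u - v||^2 / bandwidth), with the bandwidth held
    fixed (here at the number of features) for identifiability."""
    if bandwidth is None:
        bandwidth = Z.shape[1]
    Zg = Z * gamma                                   # Hadamard scaling of rows
    sq = ((Zg[:, None, :] - Zg[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / bandwidth)

rng = np.random.default_rng(4)
Z = rng.standard_normal((50, 6))                     # e.g., two groups of 3 FPC features
gamma = np.array([1.0, 0.5, 2.0, 0.0, 0.0, 0.0])     # second group zeroed out
K1 = scaled_kernel(Z, gamma)

# the zeroed group has no influence on the kernel: replacing those three
# features with fresh noise leaves K(gamma; Z) unchanged
Z2 = Z.copy()
Z2[:, 3:] = rng.standard_normal((50, 3))
K2 = scaled_kernel(Z2, gamma)
```
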
The proof of Theorem 1 is given in Appendix A.3. Note that there may exist multiple optimal minimizers of (5); Theorem 1 ensures only the existence of optimal solutions, but provides no guarantee of uniqueness, because (5), equivalently (6), is a nonlinear and non-convex optimization problem. It is worth noting that in both (5) and (6), we fix the bandwidth of the kernel at a given value due to the identifiability issue with respect to the scaling parameters γ. Refer to Appendix B for more detailed discussion of the issue of parameter identifiability.

Implementation and Algorithm
We propose an iterative algorithm to implement our estimation procedure, which requires the differentiability of the kernel with respect to the scaling vector γ and some additional assumptions presented below in order to ensure algorithmic convergence. One part of the algorithm solves (5) under fixed γ, where the resulting minimization problem reduces to the equivalent maximization problem of the least-squares kernel machine (3) with the FPC features z_i replaced by γ ∘ z_i. As pointed out in Section 1.2, this step of the numerical calculation can be easily executed in the same fashion as the solution of the linear mixed model, including the REML estimation of the smoothing parameter λ_1. The other part of the algorithm is performed under fixed α, β, and λ_1, where we solve a nonlinear and non-convex optimization problem to update the estimates of γ. Lemma 2 below helps us solve for the scaling parameter γ.

Lemma 2.
For fixed (α, β, λ_1), minimizing (6) over γ is equivalent to minimizing over γ the objective function given in (7), expressed through a function F(γ) together with the penalty λ_2 ρ(γ; δ). The proof of Lemma 2 is given in Appendix A.2. Linearizing the function F(γ) in (7) leads to the equivalent form (8), with ∇_γ F(γ̄) being the gradient of the function F with respect to γ evaluated at γ̄ for some γ̄, and ∇_γ F^{(ℓ)}(γ̄) being the columns of ∇_γ F(γ̄) associated with the ℓ-th group γ^ℓ. This is precisely the form of a standard sparse group regularization problem; that is, (8) presents a standard sparse group regularization problem with a specific choice of penalty function ρ(γ; δ).
The convergence of the above iterative search algorithm for updating γ̂ under fixed (α, β, λ_1) can be justified by the proximal Gauss-Newton method [29]. Readers are referred to [30] for details on the proximal Gauss-Newton method. One of the key assumptions of the proximal Gauss-Newton method is the existence of a local minimizer, and this condition is satisfied for (8) because, by Theorem 1, a global minimizer exists.
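As a simplified stand-in for this γ-update (first-order proximal gradient rather than proximal Gauss-Newton, with a pure ℓ₁ penalty, i.e., δ = 1, and a made-up nonlinear toy model of ours), the following sketch uses backtracking so that every iteration satisfies the descent property:

```python
import numpy as np

def prox_l1(v, t):
    """Soft-thresholding, the proximal map of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_grad_gamma(f, grad_f, gamma0, lam, n_iter=100):
    """Update gamma for min_g f(g) + lam*||g||_1 by proximal gradient
    steps with backtracking, so the objective never increases (a
    first-order stand-in for the proximal Gauss-Newton step)."""
    g = gamma0.copy()
    obj = lambda u: f(u) + lam * np.abs(u).sum()
    t = 1.0
    for _ in range(n_iter):
        step = prox_l1(g - t * grad_f(g), t * lam)
        while obj(step) > obj(g) and t > 1e-12:   # backtracking line search
            t *= 0.5
            step = prox_l1(g - t * grad_f(g), t * lam)
        if obj(step) > obj(g):
            break                                 # no further descent possible
        g = step
    return g

# toy nonlinear model in gamma: y_i = exp(-||gamma * z_i||^2) + noise
rng = np.random.default_rng(5)
Zf = rng.standard_normal((200, 3))
gamma_true = np.array([1.0, 0.0, 0.0])
y = np.exp(-((Zf * gamma_true) ** 2).sum(axis=1)) + 0.01 * rng.standard_normal(200)

def model(g):
    return np.exp(-((Zf * g) ** 2).sum(axis=1))

def f(g):
    return 0.5 * np.mean((y - model(g)) ** 2)

def grad_f(g):
    r = y - model(g)
    J = -2.0 * g * (Zf ** 2) * model(g)[:, None]   # d model_i / d g_j
    return -(J * r[:, None]).mean(axis=0)

lam, g0 = 0.001, np.array([0.5, 0.5, 0.5])
g_hat = prox_grad_gamma(f, grad_f, g0, lam)
```

By construction the penalized objective at the returned g_hat is no larger than at the starting value, mirroring the descent property claimed for the full algorithm.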
Algorithm 1 summarizes these iterative steps, which are shown to satisfy a descent property, i.e., the objective value does not increase across the iterations r, provided the proximal Gauss-Newton algorithm in Step 2.2 converges.
Algorithm 1 An iterative algorithm for optimization in FKMR.
To speed up Algorithm 1, we propose the following operational scheme, which avoids setting up pairs of (λ_1, λ_2) and performing Step 3.1. Here are a few remarks on the two algorithms. (i) Algorithm 2 depends on good starting values in order to enjoy a fast search. (ii) The main difference between Algorithms 1 and 2 is that λ_2 is fixed in Algorithm 1, while it changes in Algorithm 2. Similar algorithms with changing tuning parameters have been proposed in the literature, for example for the single-index model [31]. (iii) There is no guarantee that either algorithm converges to a global minimizer, and the proximal Gauss-Newton method used in the implementation can only find stationary points. Numerical solvers for the optimization problem in (5) or (6) indeed remain an open problem in the field of nonlinear and non-convex optimization.
Algorithm 2 A fast operational scheme of Algorithm 1.

1. Step 2.1 of Algorithm 1 is performed by running the linear mixed model with the initial fixed γ from Step 1.2 of Algorithm 1 to obtain updated values of λ_1, β, and α.
2. Step 2.2 is performed by solving the group regularization problem (8) through the Gauss-Newton algorithm using cross-validation-based tuning (e.g., the R package oem).
3. Rerun Step 2.1 using the updated γ from Step 2.2 to obtain the estimates for β and α.

Theoretical Guarantees
Our theoretical analysis focuses on finite-sample L² error bounds for the estimators (ĥ, γ̂) obtained from (5) or (6). Consequently, we are able to establish estimation consistency. For simplicity, we set β = 0 and consider a general setting of random vectors z_1, . . . , z_n, so that the FPC features z_1, . . . , z_n correspond to a special case. Along similar lines to [15,32], estimation consistency is proven for the case of the SGL penalty function. We associate with each s-element vector γ ∈ R^s a scaling map Γ : z ↦ γ ∘ z, and let A denote the collection of all such scaling map functions, which is closed under linear combinations c_1 Γ_1 + c_2 Γ_2 for any c_1, c_2 ∈ R and Γ_1, Γ_2 ∈ A. To perform group regularization estimation, we define an SGL penalty as a norm on A for a fixed δ ∈ [0, 1], in analogy with ρ(γ; δ). Consequently, the SGL regularization estimation requires the constrained optimization given in (10). Lemma 3 below provides the essential finite-sample inequalities that lead to the estimation consistency.

Lemma 3 (Basic inequality). Let ĥ ∘ Γ̂ be the minimizer of (10), and let h_0 ∘ Γ_0 be the true function. Then, a basic inequality holds that bounds the empirical error of ĥ ∘ Γ̂ in terms of an empirical process term and the difference of the penalty terms.

We need the following notation before presenting our theoretical guarantees. Let N(δ, M, P_n) denote the minimal δ-covering number of a function set M under the empirical metric P_n based on the random vectors z_1, · · · , z_n, and write N = N(δ, M, P_n) for short. This means that there exist functions m_1, · · · , m_N (not necessarily in the set M) such that for every function m ∈ M, there exists a j ∈ {1, · · · , N} with ‖m − m_j‖_{P_n} ≤ δ. Define the δ-entropy of M under the empirical metric P_n as H(δ, M, P_n) := log N(δ, M, P_n). Consider the functional space of composite functions B := {h ∘ Γ : h ∈ H_K, Γ ∈ A}. We postulate the following assumptions.
Assumption 1. The error vector ε = (ε_1, . . . , ε_n)′ is uniformly sub-Gaussian; that is, for constants C_1 and C_2, max_{1≤i≤n} E exp(ε_i²/C_1²) ≤ C_2.

Assumption 2. ‖h_0‖_{H_K} > 0, and the entropy of the space B with respect to the empirical metric P_n is bounded as H(δ, B, P_n) ≤ C_3 δ^{−2ψ}, where C_3 is some constant and ψ ∈ (0, 1).
Theorem 2 (Consistency). Under Assumptions 1-3 above, estimation consistency holds if the tuning parameters λ_1 and λ_2 satisfy appropriate rate conditions. Theorem 2 thus implies estimation consistency under the right rates for the two tuning parameters λ_1 and λ_2. Due to the potential identifiability issues explained in detail in Appendix B, although the estimator (ĥ, Γ̂) may not be unique, the composite ĥ ∘ Γ̂ is not too far away from the true composite h_0 ∘ Γ_0.

Corollary 1. If the RKHS H_K contains differentiable functions h whose gradient norm ‖∇h(z)‖_{H_K} is uniformly bounded over all functions h ∈ H_K and z ∈ R^s, then Assumption 2 holds with a correspondingly modified entropy bound.

The proofs of Theorem 2 and Corollary 1 are given in Appendices A.4 and A.5, respectively. Often, when we are interested only in a subset of functions in the RKHS (e.g., functions with norm less than one), we can substitute the full space H_K in Corollary 1 with the subspace of interest. Refer to [15] or [32], both of which considered an RKHS (i.e., a Sobolev space) with functions of norm less than or equal to one.

Simulation Experiments
We performed extensive simulations to investigate the performance of our proposed procedure, including the performance of SGL variable selection and its overall accuracy. Due to space limitations, we include results from two simulation experiments in this section; more results may be found in the first author's Ph.D. dissertation [30].

Setup
In the evaluation of performance accuracy, following [15], we used both the quasi-R² and the adjusted quasi-R²; the latter is known to be appealing for comparisons under estimation sparsity. In addition to model accuracy, performance in variable selection is summarized in terms of stability, measured by the sensitivity and specificity of both functional and feature selection in these simulation experiments. Our algorithm uses existing R packages, including emmreml, kspm, and oem. Specifically, we designed the following two simulation settings, each handled with suitable penalty functions to address the designed sparsity; for example, in Scenario 2 we used a two-level variable selection penalty (e.g., SGL) to deal with the two types of sparsity in the true model. In all analyses, we used the Gaussian kernel K(u, v) = exp(−‖u − v‖²/p) in our estimation, where the bandwidth p was set to the number of features, which is equivalent to dividing the γ vector by √p. This bandwidth parameter may be either estimated or set to the number of features to overcome the identifiability issue according to [33], where theoretical justification was given for using the number of features as the bandwidth in the case of the Gaussian kernel.
According to [23], due to the difficulty of graphically displaying the estimated s-dimensional function h(·) of z, we summarized the goodness-of-fit by regressing the true h on the estimated ĥ, with both evaluated at the design points. From this concordance regression analysis, we may measure the goodness-of-fit of ĥ through the average intercepts, slopes, and R-squared values (also known as the coefficient of determination) obtained over the replications. Clearly, a high-quality fit is reflected by (i) an intercept close to zero, (ii) a slope close to one, and (iii) an R-squared close to one. Moreover, we graphically display the estimated function ĥ by setting all variables equal to 0.5 except the one of interest, over a grid of 100 equally spaced points on the interval [0, 1]. Such visualization of the functional estimate at each margin further facilitates the evaluation of the proposed algorithm, in addition to the results obtained from the concordance regression analyses.
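The concordance regression summary can be sketched as follows; the function and toy inputs are our illustration, not the paper's code:

```python
import numpy as np

def concordance_summary(h_true, h_hat):
    """Regress the true h on the estimated h-hat at the design points and
    return (intercept, slope, R^2); a perfect estimate gives (0, 1, 1)."""
    A = np.column_stack([np.ones_like(h_hat), h_hat])
    coef, *_ = np.linalg.lstsq(A, h_true, rcond=None)
    resid = h_true - A @ coef
    r2 = 1.0 - resid.var() / h_true.var()
    return coef[0], coef[1], r2

rng = np.random.default_rng(6)
h_true = rng.standard_normal(100)
# an estimate that is a rescaled, shifted copy of the truth: h_hat = 0.5*h - 1,
# so regressing h on h_hat gives intercept 2, slope 2, and R^2 = 1
icpt, slope, r2 = concordance_summary(h_true, 0.5 * h_true - 1.0)
```
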
In all scenarios, we generated 1000 IID functional paths, of which 750 were assigned to the training set and 250 to the test set for an external performance evaluation. It is on the test set that we report the performance accuracy. We used a one-dimensional covariate x_i to show the flexibility of our model in a semi-parametric setting, with independent copies of x_i ∼ N(0, 1). We chose the true coefficients in the kernel machine model to be similar to those given in [23].

Simulation in Scenario 1
In this simple scenario with a single functional predictor, we simulated data from a model with sparsity in its FPC features. To do so, we generated a single functional predictor based on the first 15 Fourier basis functions over the interval [0, 1]:

Z(t) = Σ_{j=1}^{15} √ς_j ξ_j φ_j(t).

That is, the functional predictor was created as a linear combination of the 15 basis functions, where φ_j(·) is the j-th Fourier basis function, ς_j is the j-th eigenvalue of Z, and ξ_j is the j-th FPC feature, simulated from a normal distribution as detailed below.
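This data-generating step can be sketched as follows; the eigenvalue sequence, grid, and sample size are arbitrary illustrations of ours, not the paper's exact settings:

```python
import numpy as np

def fourier_basis(t, n_basis):
    """First n_basis Fourier basis functions on [0, 1] (the constant,
    then paired sin/cos terms), orthonormal in L2[0, 1]."""
    B = [np.ones_like(t)]
    k = 1
    while len(B) < n_basis:
        B.append(np.sqrt(2) * np.sin(2 * np.pi * k * t))
        if len(B) < n_basis:
            B.append(np.sqrt(2) * np.cos(2 * np.pi * k * t))
        k += 1
    return np.vstack(B)                         # (n_basis, len(t))

rng = np.random.default_rng(7)
t = np.linspace(0.0, 1.0, 100, endpoint=False)
n_paths, n_basis = 1000, 15
eigenvalues = 1.0 / np.arange(1, n_basis + 1)   # illustrative decaying sigma_j
xi = rng.standard_normal((n_paths, n_basis))    # standardized FPC features
Z = (xi * np.sqrt(eigenvalues)) @ fourier_basis(t, n_basis)   # (n_paths, 100)
```

Each row of Z is one simulated functional path evaluated on the 100-point grid.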
We applied both the LASSO and MCP penalty functions in our implementation, termed FKMR_Lasso and FKMR_MCP, respectively. We compared the results of our method with the standard linear approach under both LASSO and MCP penalties, which assumes linear functional relationships, as well as with the COSSO method for functional additive regression [15] using the R package COSSO [15,34]. Since the COSSO package is built for nonparametric regression (and not partially linear models), we adopted a backfitting strategy and ran COSSO on the residuals after removing our estimated effect of x_i.
In addition, we compared our method with an oracle FKMR estimator, called FKMR_oracle, that assumes full knowledge of the true ζ_j containing the two true nonzero signals, ζ_2 and ζ_9. We also considered two oracle versions of our proposed algorithm, FKMR_oracle_Lasso and FKMR_oracle_MCP, both of which used the knowledge of the true ζ_j in order to evaluate the performance of the FPCA step. Note that once FPCA is used to obtain the ζ̂_j features, our algorithm essentially operates in a standard regression setting with sparse covariates; thus, our proposed procedure can in principle be used in simpler cases with scalar covariates that do not involve functional covariates. In Scenario 1, due to the highly nonlinear relationships between the FPC features and the outcome, the naive linear model performed poorly in terms of both model selection and model consistency, as expected. The detailed simulation results for Scenario 1 can be found in the first author's Ph.D. dissertation [30]. In brief, our proposed method worked well in all aspects. In this setting, COSSO also worked well in terms of model fit, but it tended to select noisy features more frequently than our proposed method, leading to more false positives.

Simulation in Scenario 2
Now, we generated four functional predictors of the form Z_ℓ(t) = Σ_{j=1}^9 √ς_j ξ_j^ℓ φ_j(t), ℓ = 1, . . . , 4, where φ_j, ς_j, and ξ_j^ℓ were set in the same way as in Scenario 1. It follows that z = (ζ_1^1, . . . , ζ_9^1, . . . , ζ_1^4, . . . , ζ_9^4)′, where ζ_j^ℓ is the j-th Φ-transformed feature of the ℓ-th functional covariate. Sparsity was specified as follows: the first and second functional covariates, Z_1 and Z_2, were chosen as the important signals, with the transformed FPC features {ζ_1^1, ζ_3^1, ζ_4^1, ζ_2^2, ζ_7^2} being the five important features (three from Z_1 and two from Z_2) related to the outcome, with errors ε_i iid N(0, 1). This model specifies both group sparsity (two of the four functional predictors) and within-group sparsity (three of the nine FPC features in Z_1 and two of the nine in Z_2). In addition, we specified non-additive relationships in the true model across the multiple functional covariates.
We fit the data using the proposed methods, including FKMR_oracle_GMCP, FKMR_Lasso, FKMR_GLasso, FKMR_SGL, FKMR_MCP, and FKMR_GMCP; the results based on 100 replicates are summarized in Table 1. For comparison, we also fit the simulated data with existing methods, including the linear model (denoted by LM + penalty), COSSO functional additive regression, and the oracle method using the knowledge of the true important features, as in the simulation of Scenario 1. Regarding the goodness-of-fit in Table 1, all of our FKMR estimators outperformed the standard linear estimators in terms of R²_AQ across all of our penalty functions, and they outperformed COSSO for penalties that accounted for group sparsity; COSSO performed on par with the penalties that did not account for group sparsity (LASSO and MCP). In the concordance regression analysis, all intercepts were close to zero, all slopes close to one, and all R² close to one, indicating a high goodness-of-fit of the functional estimates. It is evident that the methods using a group sparsity penalty function (SGL, GLasso, and GMCP) clearly outperformed those that did not regularize the grouping of covariates (Lasso and MCP). In addition, our FKMR estimators (except FKMR_Lasso) performed as well as the oracle estimator FKMR_oracle_GMCP, both in terms of R²_AQ and in terms of the estimate of the function h. The results also indicated little difference between using a concave penalty function (MCP or GMCP) and a convex one (GLasso or SGL).
As regards group sparsity, Table 2 indicates that all methods had a high sensitivity for detecting functional signals, while the proposed FKMR methods had better specificity than both the sparse linear models and COSSO. Concerning within-group sparsity, it is interesting to note that a bigger difference was seen in the type of penalty function used in feature selection. As shown in Tables 3 and 4, a general penalty (e.g., Lasso and MCP) that does not take the grouping structure into account tended to under-select important features within a group, whereas COSSO tended to perform well with respect to within-group sparsity. Moreover, Figure 2 shows that the FKMR method estimated the five signal components of Z_1 and Z_2 well.

Data Example
To show the usefulness of our proposed methodology, we analyzed data of 550 children recruited by the ELEMENTS study [35], who consented to wear an actigraph (ActiGraph GT3X+; ActiGraph LLC, Pensacola, FL, USA). The wearable was to be placed on the non-dominant wrist for five to seven days without interruption. The actigraph measured tri-axis accelerometer data sampled at 30 Hz, capturing a person's movement in three directions. The BMI was the outcome of interest, as it is a biomarker of obesity; sex and age were the confounding factors used in the analysis. Due to missing data, our analysis included only children who wore the device properly for 85% or more of the study period, resulting in 395 participants (189 males and 206 females). Other studies, such as [36], have excluded days of accelerometer data with more than five percent missing. The mean ± SD BMI of the study cohort was 21.5 ± 4.1, and the mean age of the study participants was 14.3 ± 2.1 y. A more detailed description of the dataset used for this paper can be found in [37]. Our primary interest was whether the BMI is associated with physical activity in the presence of the other covariates, specifically sex and age. We preprocessed the activity counts over the 7 d of wear by taking the median within each 1 min epoch across the entire 7 d of wear. For example, since all the participants started wearing the device at 3 p.m., the first data point for each individual was the median of 7 ACs (one per day) for the 1 min epoch of 3:00-3:01 p.m. This procedure of taking medians across the minutes from different days has been used in other applications such as [36]. See Figure 3 for an example of the resulting time series of medians derived from the AC data displayed in Figure 1. We applied the following five models, labeled M0-M4 for convenience, to analyze the data with the 24 h median ACs as functional predictors.
Let ξ_ij^k be the ith person's kth FPC score for functional predictor j. M0: linear model (LM) with only the fixed features: BMI_i ∼ β_0 + β_1 Age_i + β_2 Sex_i; M1: linear model with the SGL penalty (LM+SGL) using the FPCA features. For a direct application of the COSSO R package, we used the residuals res(BMI_i) = BMI_i − (β̂_0 + β̂_1 Age_i + β̂_2 Sex_i) in the COSSO model fit, with β̂_0, β̂_1, and β̂_2 being the coefficient estimates from Model M0.
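The two-stage residualization used before the COSSO fit can be sketched as follows. This is a hedged illustration on synthetic stand-in data (means and SDs borrowed from the cohort summary above), not the study data itself:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for the n = 395 participants
n = 395
age = rng.normal(14.3, 2.1, n)
sex = rng.integers(0, 2, n).astype(float)
bmi = 21.5 + 0.1 * age - 0.5 * sex + rng.normal(0.0, 4.1, n)

# Step 1: fit Model M0, the linear model BMI ~ beta0 + beta1*Age + beta2*Sex
X = np.column_stack([np.ones(n), age, sex])
beta_hat, *_ = np.linalg.lstsq(X, bmi, rcond=None)

# Step 2: residualize; res(BMI_i) is then the response passed to the COSSO fit
res_bmi = bmi - X @ beta_hat

# OLS residuals from a model with an intercept are mean zero by construction
print(bool(abs(res_bmi.mean()) < 1e-8))  # True
```

The residuals `res_bmi`, with the scalar confounders projected out, would then be supplied as the response to the COSSO R package together with the FPCA features.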
The BMI and age were mean-centered and scaled to have a standard deviation of one, so β_0 was absent from the models. Here are some key findings from the data analyses. First, in terms of goodness-of-fit, Table 5 suggests that M3, i.e., our proposed FKMR model with the SGL penalty, gave the best performance: the adjusted R² of M3 was nearly twice that of the other four models. Second, it is interesting to note that neither COSSO nor FKMR_SGL selected the FPC scores associated with the Z-axis. Third, as shown in Table 6, all of the FPC components chosen by COSSO were also chosen by FKMR_SGL. It is worth noting that the linear model with the SGL penalty selected the largest number of FPC components, yet performed the worst in terms of model fit. Table 5. Goodness-of-fit for the five models used in the data analysis.

Conclusions
In this paper, we proposed a method to model the nonlinear relationship between multiple functional predictors and a scalar outcome in the presence of other scalar confounders. We used the FPCA to decompose the functional predictors for feature extraction and the LSKM framework to model the functional relationship between the outcome and the principal components. We developed a procedure that simultaneously selects important functional predictors and important features within the selected functionals. We proposed a computationally efficient algorithm to implement our regularization method, which is easily programmed in R with the utility of multiple existing R packages. It should be noted that although we focused on functional regression in this paper, the proposed method can also be applied to non-functional predictors. In effect, by using functional principal components, we bypassed the infinite-dimensional problem and worked effectively in a non-functional framework with the FPC features. Through simulations and the analysis of the ELEMENT data, we demonstrated that the FKMR estimator outperformed existing methods in terms of both variable selection and model fit. It should be noted that the existing COSSO method did perform well in terms of variable selection, as shown in Section 5.
A technical issue pertains to identifiability limitations regarding the bandwidth parameter and the RKHS estimator. To overcome this, we suggested fixing the bandwidth parameter; see the detailed discussion in Section 3. We established key theoretical guarantees for the proposed estimator. In the case where there are multiple minimizers (so that identifiability issues arise), the theoretical properties established in Section 4 apply to any of them.
Variable selection on functional predictors presents many technical challenges, and many methodological problems remain unsolved. This paper demonstrated a possible framework to regularize estimation with a bi-level sparsity of functional group sparsity and within-group sparsity. In the LSKM paper [23], it was briefly mentioned that if the relationship between the scalar outcome and the p genetic pathways is additive, the model can be tweaked as y_i = x_iᵀβ + h_1(z_i^1) + · · · + h_p(z_i^p) + ε_i, where each h_j belongs to its own RKHS. It is easy to extend our method and algorithms to handle this case. For future research, an extension to longitudinal outcomes may be considered via a mixed-effects model y_ij = x_iᵀβ + h(z_ij) + u_ijᵀv_i + ε_ij, where u_ijᵀv_i are the random effects. Other useful extensions of the proposed paradigm would be along the lines of generalized linear models and Cox regression models.

Data Availability Statement: The data on physical activity counts, BMI, and demographic variables (sex and age) are available upon request through the formal data request procedure outlined by the ELEMENT Cohort Study. Contact the corresponding author of this paper for details.

Conflicts of Interest:
The authors declare no conflict of interest.
The equivalence of the two forms becomes clear once we rewrite (6) in matrix notation. Equation (6) can be written as follows: for fixed α, β, and λ_1, minimizing the function in (A1) with respect to γ is equivalent to the displayed subproblem. Without loss of generality, we use the penalty function for the sparse group lasso, but this proof can easily be modified for other penalty functions. Also, we fix λ_1 = λ_2 = δ = 1 and consider β ∈ R, with the design matrix X (a vector in this case) scaled to have norm 1; the case β ∈ R^q follows along similar lines of argument. Let γ ∈ D_3 with D_3 = {γ : ‖γ‖_1 ≤ (1/2n)‖Y‖_2²}. Define f(γ) = ‖K(γ; Z)‖ = η_max(K(γ; Z)) ≥ 0, where η_max(K(γ; Z)) denotes the largest eigenvalue of K(γ; Z) and the operator norm of K(γ; Z) is defined in the usual way, ‖K(γ; Z)‖ = sup{‖K(γ; Z)x‖_2 : ‖x‖_2 = 1}. Since D_3 is compact and K(γ; Z) is continuous with respect to γ, f achieves its maximum over D_3 ; thus, we define η* = sup_{γ∈D_3} f(γ). We claim that (α*, β*, γ*) is a global minimizer, which is proved below by contradiction.
Suppose that there exists (α̃, β̃, γ̃) attaining a smaller value of the objective. Let q_1, · · · , q_n be the orthonormal eigenvectors of K(γ̃; Z) with associated eigenvalues η_1 ≥ · · · ≥ η_n ≥ 0. We can write α̃, X, and Y in terms of this basis, with coefficients C̃_i^α, C_i^X, and C_i^Y, and then minimize the resulting objective function with respect to C̃_i^α and β̃. First, note that for any η_i = 0 we can set C̃_i^α = 0, as it does not affect the expression; it is therefore sufficient to consider η_i > 0. Taking the first derivative and setting it equal to zero, we obtain the score equations that the minimizers β̃ and C̃_i^α must satisfy. In this derivation we used the fact that 1 = ‖X‖_2² = ∑_{i=1}^n (C_i^X)². Plugging (A4) into (A3), we obtain the stated bound. It follows that the β minimizing J_2 for a given γ ∈ D_3 lies in D_2. Also, (A4) implies that |C̃_i^α| ≤ ‖Y‖_2 + ‖X‖_2‖β̃‖_2 ; consequently, the optimal α for the given γ̃ ∈ D_3 and β ∈ D_2 that minimizes J_2 satisfies ‖α‖_2 ≤ √n(‖Y‖_2 + b), so that α ∈ D_1. This shows that for any (α̃, β̃, γ̃) ∉ D_1 × D_2 × D_3 we can find an (α, β, γ) ∈ D_1 × D_2 × D_3 such that J_2(α̃, β̃, γ̃) ≥ J_2(α, β, γ).
Appendix A.4. Proof of Theorem 2. By Lemma 8.4 on page 129 in [32], Assumptions 1, 2, and 3 imply the stated entropy bound, where the constant c depends on C_1, C_2, C_3, C_4, and ψ. It follows that, for any h ∈ H_K and any scaling map Γ ∈ A, we obtain the corresponding inequality. For our estimators, ĥ and Γ̂, it is easy to see that the basic inequality holds, and from (A9) we obtain the inequality (A10). We require λ_1 = O_p(1)λ_2, namely that λ_1 and λ_2 go to zero at the same rate; we show at the end of the proof what happens if they are not of the same order. Therefore, without loss of generality, we set λ_1 = λ_2, denoted by λ. In what follows, we divide (A10) into two cases. Case 1: suppose the first condition holds. In this case, we have (A11), which is further discussed separately in two sub-cases, leading to (A14); both terms in (A15) have the same rates as those in (A14). Case 2: suppose the alternative condition holds. Then we obtain (A16). In order for (A14) and (A16) to have the same rates, we first equate the two terms and then solve for a common λ. Under this λ value, (A14)-(A16) take the stated form. This completes the proof of Theorem 2.
Now we discuss the situation where the tuning parameters λ_1 and λ_2 are not of the same order; as seen below, selection consistency may then not be guaranteed. Take Case 2 as an example and consider two sub-cases. Case 2a: following the same arguments as above, we obtain the first bound; Case 2b: following the same logic as before, we obtain the second. Both bounds involve terms of order O_p(λ_1/λ_2) and O_p(λ_2/λ_1), indicating that the two tuning parameters λ_1 and λ_2 should go to zero at the same rate. Moreover, we can think of our estimator ĥ ∘ Γ̂ as one operational object; see Appendix B for more details, which further explains the need for a single rate for the two penalties. We showed in the proof of Theorem 1 that the optimal γ vector is restricted to a ball whose radius depends on the norm of Y. For simplicity, we confine γ to the unit norm ball, γ ∈ G = {γ : ‖γ‖_2² ≤ 1}, and restrict the set A accordingly: A = {Γ : Γ(z) = γ ∘ z, γ ∈ G}. Since γ ∈ R^s, we can use Lemma A1 above to cover the set A with a finite number of functions in the following sense. The ball of radius 1 in R^s can be covered (under the Euclidean metric) by {γ_1, · · · , γ_{N_1}}. Since there is a one-to-one correspondence between the functions Γ and the vectors γ, take the set {Γ_1, . . . , Γ_{N_1}} and define the metric between Γ_j and Γ_k in A as d(Γ_j, Γ_k) = ‖γ_j − γ_k‖_2. Then, the set of functions {Γ_1, . . . , Γ_{N_1}} is a δ-covering of A under this metric, with entropy s log((4 + δ)/δ). For each Γ_j we have an induced RKHS, H_{K∘Γ_j} = {h ∘ Γ_j : h ∈ H_K}, with entropy no larger than that of H_K, which by assumption is ≤ Aδ^{−2ψ} for some ψ ∈ (0, 1) and A ∈ R. Therefore, the covering number N_2 = N(δ, H_{K∘Γ_j}, P_n) ≤ exp{Aδ^{−2ψ}}. This implies that for every Γ_j there exists a set {h_1^j ∘ Γ_j, · · · , h_{N_2}^j ∘ Γ_j} such that for every h ∘ Γ_j ∈ H_{K∘Γ_j} there exists an integer i ∈ {1, . . .
, N_2}. Set B is essentially the union of the different Hilbert spaces, where the vector z ∈ R^s lies in the segment between γ_j ∘ z and γ̃ ∘ z, and C(·) is an unknown function mapping R^s into R^s that allows the formula to hold. Continuing our chain of inequalities, we find that to provide a δ-cover we need N_1 × N_2 + N_2 functions. Taking the log, we see that the entropy is ≤ Ã δ^{−2ψ} + log{((C + δ)/δ)^s + 1}, which is of the same order as Ã δ^{−2ψ} (the log term is dominated by the first term). Therefore, a sufficient (but not necessary) condition for the set B to have the same entropy as the original RKHS H_K is that sup_h ⟨∇h(z), ∇h(z)⟩ be bounded. Having bounded derivatives is reasonable for an RKHS, since every RKHS satisfies a Lipschitz condition of the stated form, where the distance metric in R^s is defined by d(X, Y)² = K(X, X) − 2K(X, Y) + K(Y, Y). If we restrict the functions in the RKHS to have norm ≤ C for some constant C, then we have a universal Lipschitz constant C that ensures bounded derivatives.

Appendix B. Discussion about the FKMR Estimator
We introduced γ as a way of performing variable selection on our vector of FPC features. We now illustrate this technical device with some concrete examples and discuss identifiability issues for the resulting estimator. There are two equivalent ways of looking at the estimation of the unknown functions h_0 and Γ_0. The first is to view the feature vector z as being related to the dependent variable y through the composite function h ∘ Γ, as explained in Section 4. The second is to view the features themselves as unknown: the true features take the form γ ∘ z, where ∘ denotes the Hadamard product. We are given z and need to estimate the "true" features γ ∘ z, as well as the relationship between γ ∘ z and y, which is modeled through a function h ∈ H_K. In the first view, we estimate the composite function h_0 ∘ Γ_0, which belongs to the RKHS H_{K∘Γ}. We essentially consider many different function spaces when constructing our estimator; since the intersection of these spaces is not necessarily empty, the estimator may not be unique. We now proceed more formally. Let K : R^s × R^s → R be a positive definite function, and let Γ : R^s → R^s. Define K ∘ Γ : R^s × R^s → R by K ∘ Γ(s, t) = K(Γ(s), Γ(t)); this new function K ∘ Γ is positive definite. There is a relationship between the original RKHS H_K and the new RKHS H_{K∘Γ}, namely H_{K∘Γ} = {h ∘ Γ : h ∈ H_K}, and for any u ∈ H_{K∘Γ} we have ‖u‖_{H_{K∘Γ}} = inf{‖h‖_{H_K} : u = h ∘ Γ}. In general, H_{K∘Γ} ⊂ H_K. In (5), we take the norm with respect to the original space H_K. Our iterative procedure essentially implements the second view, in which the true features are unknown, whereas our theoretical arguments are justified through the first view. Given knowledge of the features (which amounts to fixing a γ), we are confined to just one RKHS, H_K. Take the linear kernel, K(x_1, x_2) = x_1ᵀx_2, as an example.
Suppose the truth is that y is related to a one-dimensional feature z_1 through the formulation y = h_0(z_1) + ε, where h_0 ∈ H_{K_1} and K_1 is a kernel mapping R × R → R. If we knew the feature z_1, we would proceed to optimize (6) using the standard LSKM. Suppose, however, that each y is associated with a two-dimensional vector z = (z_1, z_2), where z_2 is a "noisy" feature unrelated to y, and that a priori we do not know this. Typically we would use the model y = h(z_1, z_2) + ε, where h ∈ H_K and K is a kernel mapping R² × R² → R. In this case, we introduce the vector γ = (γ_1, γ_2) and formulate y = h(γ_1 z_1, γ_2 z_2) + ε. All functions h in the space H_K are of the form h(z) = xᵀz for some two-dimensional vector x = (x_1, x_2), and there is a one-to-one relationship between h and x. The true function h_0 has an associated real number c such that h_0(z_1) = cz_1. We can recover h_0 ∈ H_{K_1} from our estimates of h and γ by setting γ = (1, 0) and x = (c, ∗)ᵀ, where ∗ is any real number; equivalently, we can recover h_0 under γ = (1, 1) with x = (c, 0)ᵀ. Many functions may thus recover the original function in the RKHS corresponding to the linear kernel. Formulating our problem in the first way, through function composition, we can estimate Γ_0 with γ being either (1, 0) or (1, 1).
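The linear-kernel example above can be checked numerically. The sketch below (data and the coefficient c are illustrative) verifies that the two parameterizations, γ = (1, 0) with arbitrary second coordinate of x and γ = (1, 1) with x = (c, 0), define the same function of z:

```python
import numpy as np

rng = np.random.default_rng(2)
c = 2.5                                   # true coefficient: h0(z1) = c * z1

Z = rng.standard_normal((5, 2))           # columns: signal z1, noisy feature z2

def h(z, gamma, x):
    """Evaluate h(gamma o z) = x'(gamma o z) for the linear kernel K(u, v) = u'v."""
    return (gamma * z) @ x

# Parameterization 1: gamma = (1, 0); the second coordinate of x is arbitrary
f1 = h(Z, np.array([1.0, 0.0]), np.array([c, 123.0]))
# Parameterization 2: gamma = (1, 1); x = (c, 0)
f2 = h(Z, np.array([1.0, 1.0]), np.array([c, 0.0]))

print(np.allclose(f1, f2), np.allclose(f1, c * Z[:, 0]))  # True True
```

Both parameterizations reproduce h_0(z_1) = c z_1 exactly, which is precisely the non-identifiability of (γ, h) discussed above.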
We can now see that our estimate of h_0 lies in the intersection of H_{K∘Γ_1} and H_{K∘Γ_2}, where Γ_1 has associated γ_1 = (1, 0) and Γ_2 has associated γ_2 = (1, 1). In truth, for the linear-kernel RKHS there is no need to apply our method, since h_0 ∈ H_{K_1} can be estimated directly from the larger space H_K by setting h(z) = xᵀz with x = (c, 0)ᵀ. For these types of spaces we can hope for neither variable selection consistency nor identifiability of the estimator; however, from a goodness-of-fit standpoint, many types of function compositions do just as good a job. Our hope is to glean some variable selection by penalizing the γ vector with the ρ(γ; δ) term, which, going back to the above scenario, should give preference to γ = (1, 0) over γ = (1, 1). For the RKHS associated with the Gaussian kernel, the "larger-dimensional space" (a Gaussian kernel mapping from higher dimensions) does not necessarily contain the functions of a "lower-dimensional space" (a Gaussian kernel mapping from lower dimensions). However, through the introduction of the γ transformation of the features, we can recover the equivalent functions of the lower-dimensional space.
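For the Gaussian kernel, the effect of the Hadamard rescaling γ ∘ z can also be illustrated numerically. In the sketch below, `gaussian_kernel` and `scaled_kernel` are illustrative helpers (the bandwidth `rho` is held fixed, as suggested in Section 3): setting γ = (1, 0) makes the kernel matrix identical to the Gaussian kernel computed on the first feature alone, i.e., the composite kernel K ∘ Γ recovers the lower-dimensional kernel:

```python
import numpy as np

def gaussian_kernel(U, V, rho=1.0):
    """K(u, v) = exp(-||u - v||^2 / rho), a Gaussian kernel with fixed bandwidth rho."""
    d2 = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / rho)

def scaled_kernel(U, V, gamma, rho=1.0):
    """K o Gamma: evaluate K on the Hadamard-rescaled features gamma o z."""
    return gaussian_kernel(U * gamma, V * gamma, rho)

rng = np.random.default_rng(3)
Z = rng.standard_normal((6, 2))           # two features; the second will be masked

# With gamma = (1, 0), the kernel matrix depends on the first feature only
K_masked = scaled_kernel(Z, Z, np.array([1.0, 0.0]))
K_first = gaussian_kernel(Z[:, :1], Z[:, :1])
print(np.allclose(K_masked, K_first))  # True
```

This is the mechanism by which penalizing γ toward sparsity performs functional feature selection inside the kernel machine: zeroed coordinates of γ drop out of the kernel entirely.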