Multivariate Functional Kernel Machine Regression and Sparse Functional Feature Selection

Naiman, Joseph; Song, Peter Xuekun

doi:10.3390/e24020203

Open AccessArticle

Multivariate Functional Kernel Machine Regression and Sparse Functional Feature Selection

by

Joseph Naiman

and

Peter Xuekun Song

^*

Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA

^*

Author to whom correspondence should be addressed.

Entropy 2022, 24(2), 203; https://doi.org/10.3390/e24020203

Submission received: 4 January 2022 / Revised: 25 January 2022 / Accepted: 26 January 2022 / Published: 28 January 2022

(This article belongs to the Special Issue Big Data Analytics and Information Science for Business and Biomedical Applications II)

Download

Browse Figures

Versions Notes

Abstract

:

Motivated by mobile devices that record data at a high frequency, we propose a new methodological framework for analyzing a semi-parametric regression model that allow us to study a nonlinear relationship between a scalar response and multiple functional predictors in the presence of scalar covariates. Utilizing functional principal component analysis (FPCA) and the least-squares kernel machine method (LSKM), we are able to substantially extend the framework of semi-parametric regression models of scalar responses on scalar predictors by allowing multiple functional predictors to enter the nonlinear model. Regularization is established for feature selection in the setting of reproducing kernel Hilbert spaces. Our method performs simultaneously model fitting and variable selection on functional features. For the implementation, we propose an effective algorithm to solve related optimization problems in that iterations take place between both linear mixed-effects models and a variable selection method (e.g., sparse group lasso). We show algorithmic convergence results and theoretical guarantees for the proposed methodology. We illustrate its performance through simulation experiments and an analysis of accelerometer data.

Keywords:

functional principal component analysis; functional predictor; linear mixed-effects model; mobile device; sparse group regularization; wearable device data

1. Introduction

Data captured by mobile devices have lately received much attention in the data science community. Such data are typically recorded at a high frequency, giving rise to an ample volume of information at a very fine scale, and thus present many methodological challenges in statistical modeling and data analyses. In this paper, we plan to utilize the strength of the classical kernel machine method that enjoys fast computing speed via the linear mixed-effects model to deal with such high-frequency data using a functional data analysis approach. The motivation for our proposed framework come from data collected from a tri-axis accelerometer. Accelerometers, worn on the hip or wrist as a way of monitoring physical activity, are becoming more and more common [1,2,3,4]. There are several different accelerometers available such as ActiGraph GT3X+ (ActiGraph, Pensacola, FL, USA) and Actical (Phillips Respironics, Bend, OR). Raw accelerometer data are often collected in high-resolution signals with a sampling frequency ranging from 30–100 Hz. The commercial software on these devices provides activity counts (ACs) [2,4], which are calculated from the raw accelerometer data using proprietary algorithms. As an example from our motivating dataset, Figure 1 displays a three-dimensional time series of ACs per minute, each on one axis, from one subject wearing the GT3X+ over a period of 7 days (d).

Oftentimes, different types of summaries of the tri-axis ACs are suggested in the literature as opposed to the utility of all three raw functionals [5,6,7,8]. These summary-data-based approaches may be regarded as a quick and dirty dimension reduction strategy that comes up with summarized data with computationally manageable volumes, which would be then analyzed by existing methods and software. One concern with the use of summarized data would be the loss of potential fine features that can only be captured in data of high resolution. Recently, some researchers have attempted to use the entire functional AC curve through functional data analysis techniques [6,9,10]. Further details on current methods being used to retrieve and interpret accelerometer data can be found in [11]. Our contribution in this paper pertains to a new framework in that tri-axis accelerometer data are used as three-dimensional correlated functional predictors in an association analysis with a potential health outcome such as the Body Mass Index (BMI). The relationship between physical activities and childhood obesity has long been a central interest of public health sciences, and our new scalar-on-functional regression model can provide some new insights into this important scientific problem.

We begin with a brief review of existing functional data models, the least-squares kernel machine model, and different variable selection techniques, which prelude the framework for this paper.

1.1. Functional Regression

There has been much attention in recent years given to functional data analysis (FDA) where either covariates, or response, or both are functional as opposed to scalar in nature [12,13,14,15,16,17]. In this paper, we focused on the methodology that allows us to relate multiple functional covariates to a scalar outcome in a nonlinear way in the presence of other scalar covariates. To proceed, let us introduce some notation. Let

L^{2} (T)

be the class of square-integrable functions on a compact set

T

. This is a separable Hilbert space with inner product

< f

,

g > : = \int_{T} f g

for

f,

g \in L^{2} (T)

. Consider a probability space

(Ω, F, P)

, where Z denotes a functional random variable that maps into

L^{2} (T)

, namely

Z : Ω \mapsto L^{2} (T)

. Define

L^{2} (Ω) : = {Z : {(\int_{Ω} {∥Z∥}^{2} d P)}^{\frac{1}{2}} < \infty}

, where P is a certain probability measure,

{∥Z∥}^{2} =

< Z

,

Z >

, and assume

Z \in L^{2} (Ω)

in the rest of this paper. For convenience, we also assume that Z is mean centered, namely

E (Z) = 0

.

The class of functional linear models (FLM) (e.g., [13,14,15]) is proposed to relate a functional covariate Z with a mean-centered scalar outcome y, which is also known as scalar-on-functional regression:

y = < b, Z > + ϵ

, where the error term

ϵ

is a mean zero random variable uncorrelated with Z. An optimal solution of the unknown functional parameter

b \in L^{2} (T)

is typically obtained by minimizing the mean-squared error:

{inf}_{b \in L^{2} (T)} E {(y - < b, Z >)}^{2}

. Moreover, the mean model for the mean-centered scalar y takes the form

E (y | Z) = \int_{T} Z (t) b (t) d t

.

As suggested in the literature, we may obtain an optimal estimator of b by expanding functional predictor Z under certain basis functions. In this paper, we focus on the utility of functional principal component analysis (FPCA) to perform the decomposition of the functional Z. By the Karhunen–Loève expansion (e.g., [18,19,20]), we may write

Z (t) = \sum_{k = 1}^{\infty} \sqrt{ς_{k}} ξ_{k} ϕ_{k} (t)

, where

ς_{k} > 0

are the eigenvalues, and the loadings are given by

ξ_{k} : = \frac{1}{\sqrt{ς_{k}}} < Z, ϕ_{k} >

. These coefficients satisfy (i) mean zero,

E (ξ_{k}) = 0

; (ii) variance one,

E (ξ_{k}^{2}) = 1

; (iii) uncorrelated,

E (ξ_{k} ξ_{j}) = 0

for

k \neq j

. Then, the mean model may be rewritten as follows,

E (y | Z) = \sum_{k = 1}^{\infty} β_{k} ξ_{k},

(1)

where coefficients

β_{k} = < b, \sqrt{ς_{k}} ϕ_{k} >

,

k = 1, \cdot \cdot \cdot

, which are unknown due to the unknown b. Equation (1) presents a linear projection of scalar outcome y on the space spanned by the standardized principal components (PCs)

ξ_{k}

’s of functional predictor Z. On these lines of research, Müller and Yao (2008) proposed a class of functional additive models (FAMs) that extends Equation (1) by allowing a nonparametric form of the projection:

E (y | Z) = \sum_{k = 1}^{\infty} f_{k} (ξ_{k}),

(2)

where

f_{k}

is a fully unspecified nonlinear smooth function to be estimated. It is obvious that Müller and Yao’s extension given in (2) takes an additive model on individual coefficient (or feature) components

ξ_{k}

’s. Regularization is often needed for both (1) and (2) in order to deal with these infinite-dimensional unknowns. One of the challenges concerning regularization for (2) lies in the technical treatment in the functional space. Müller and Yao (2008) [21] proposed truncation (or a hard threshold) of the eigenspace to retain only the leading components that explain the majority of the total variation in Z. Zhu, Yao, and Zhang (2014) [15] proposed another regularization for the functions

f_{k}

using the powerful COSSO method [22]. One advantage for this kind of regularization method is that sums of higher-order functional principal components are allowed to be potentially included in the fit model, if they make stronger contributions to the functional relationship than the leading functional principal components. This regularization method [15] begins with an additive model

E (y | Z) = \sum_{k = 1}^{s} f_{k} (ξ_{k}),

where s represents some initial degrees of truncation to specify the total number of additive components to be considered. Then, COSSO helps simultaneously regularize and select important functional components among the s functions

f_{k}

. Although the above discussion is based on a single functional predictor Z in mind, it is appealing to extend such a framework with multiple functional predictors for a broad range of problems.

When multiple functional predictors, say

Z^{1}, \dots, Z^{p}

, are considered, it is not clear if the above additive model specification remains suitable to handle the complexity, especially a non-additive relationship (e.g., interactions) may be of interest to understand the association between a scalar outcome and multiple functional predictors. In effect, from both the perspectives of theoretical advances and application needs, relaxing the additive relationship is an important task in functional data analysis. Alternatively, there are some methods (e.g., [16,17]) in the literature that do not use the strategy of decomposing Z into its functional components. In this paper, we adopt the framework of kernel machine regression models to extend the methodologies with non-additive relationships between multiple functional predictors and the scalar outcome.

1.2. Least-Squares Kernel Machine

Liu, Lin, and Ghosh (2007) [23] proposed a semi-parametric regression model

y_{i} = x_{i}^{⊤} β + h (z_{i}) + ϵ_{i}

for subject

i = 1, \dots, n

, where they used the least-squares kernel machine (LSKM) to analyze multidimensional genetic pathways denoted by a vector

z_{i}

. The key feature of this model is the nonlinear relationship between the outcome

y_{i}

and a vector of gene expressions

z_{i}

, which is characterized by a nonparametric smooth function h. Under the theory of smoothing splines, function h is assumed to lie in a reproducing kernel Hilbert space (RKHS),

H_{K}

, generated by a positive-definite kernel function

K (\cdot, \cdot)

. For the ease of exposition, we suppress the bandwidth for the kernel

K

in the following discussion. Then, both parameter

β

and function h are estimated by maximizing the scaled penalized likelihood function:

J (h, β) = - \frac{1}{2} \sum_{i = 1}^{n} {y_{i} - x_{i}^{⊤} β - h (z_{i})}^{2} - \frac{1}{2} λ_{1} {∥h∥}_{H_{K}}^{2},

(3)

where

λ_{1} > 0

is the tuning parameter and

{∥\cdot∥}_{H_{K}}

is the norm of the RKHS. For a function

h \in L^{2} (H_{K})

, we have

h (\cdot) = \sum_{i = 1}^{n} α_{i} K (\cdot, z_{i})

. Then,

{∥h∥}_{H_{K}}^{2} = α^{⊤} K α

, where

K

is an

n \times n

matrix whose

(i, j)

entry is

K (z_{i}, z_{j})

and

α = {(α_{1}, \dots, α_{n})}^{⊤}

.

It is known in the literature (e.g., [23,24]) that maximizing

J (h, β)

in (3) turns out to be equivalent to solving the normal equations from the following linear mixed-effects model (LMM):

Y = X β + h + ϵ,

where

h

is an

n \times 1

vector of random effects with distribution

N (0, τ K)

and an n-dimensional vector error term

ϵ \sim N (0, σ^{2} I)

, with

τ = λ_{1}^{- 1} σ^{2} > 0

. One remarkable advantage of solving (3) through the existing numerical procedure of the LMM is most advocated in the literature [25], where we can determine the smoothing parameter

λ_{1}

as part of the estimation of the variance components of the LMM. Therefore, instead of using cross-validation or other information-based tuning methods on

λ_{1}

, we can solve simultaneously for all the model parameters in (3), as shown in [23]. Utilizing this numerical strength of the kernel machine regression model, we propose a semi-parametric regression model by incorporating functional principal components of functional predictors (i.e., the

z_{i}

) to evaluate a nonlinear relationship of a scalar outcome with multiple functional covariates in a non-additive way. Assuming that function h belongs to an RKHS, we can use existing software packages for solving LMMs to obtain estimates of all model parameters and the smoothing parameter.

1.3. Feature Selection

To deal with high-dimensional functional principal components from functional covariates, we invoked the sparse regularization approach in the kernel machine regression model. Note that for both mean models (1) and (2), one needs to truncate the series from the Karhunen–Loève expansion. Regularization helps reduce from an infinite number of terms to a sum of finite terms. To introduce some notations, here we present a brief review on the group lasso (GL) [26], sparse group lasso (SGL) [27], and non-negative garrote [28]. See also the series of work originated by COSSO [22]. Yuan and Lin (2007) [26] proposed the group lasso, which solves the convex optimization problem:

{min}_{β \in R^{p}} {∥Y - \sum_{ℓ = 1}^{L} X^{ℓ} β^{ℓ}∥}_{2}^{2} + λ \sum_{ℓ = 1}^{L} {∥β^{ℓ}∥}_{2}

, where L is the total number of groups of covariates and

X^{ℓ}

refers to a subset of covariates associated with group ℓ. Friedman, Hastie, and Tibshirani [27] extended the group lasso to allow within-group sparsity, namely SGL, given as

{min}_{β \in R^{p}} {∥Y - \sum_{ℓ = 1}^{L} X^{ℓ} β^{ℓ}∥}_{2}^{2} + λ (1 - δ) \sum_{ℓ = 1}^{L} {∥β^{ℓ}∥}_{2} + λ δ {∥β∥}_{1}

, where

δ \in [0, 1]

. The additional

ℓ_{1}

-norm penalty term on

β

encourages individual sparsity, while the first penalty targets sparsity at the group level. It is easy to see that group lasso is a special case of the SGL when

δ = 0

.

The non-negative garrote proposed by Breiman (1995) [28] is another useful means of variable selection. It invokes a scaled version of least-squares estimation given by:

a r g {min}_{d} \frac{1}{2} {∥Y - \tilde{X} d∥}_{2}^{2} + λ \sum_{j = 1}^{p} d_{j},

subject to

d_{j} \geq 0, j = 1, \dots, p

. Here,

\tilde{X} = ({\tilde{x}}_{1}, \dots, {\tilde{x}}_{p})

is an

n \times p

matrix with columns

{\tilde{x}}_{j} = x_{j} {\hat{β}}_{j}^{O L S}

, with

{\hat{β}}_{j}^{O L S}

being the least-squares estimates from

a r g {min}_{β} \frac{1}{2} {∥Y - X β∥}_{2}^{2}

with no constraints. Obviously, estimate

{\hat{d}}_{j} = 0

implies that covariate

x_{j}

would be excluded from the fit model. Breiman’s formulation that turns a variable selection problem into a parameter estimation problem will be applied for the development of feature selection on functional principal components in this paper.

This paper is organized as follows. Section 2 introduces our proposed high-dimensional kernel machine regression. Section 3 outlines a simple step-by-step algorithm that is used to implement the sparse estimation method. Section 4 concerns asymptotic properties for our proposed sparse kernel machine regression. Section 5 provides simulation results to examine the performance of our method, with comparisons with existing methods. Section 6 illustrates the proposed method by an association analysis of the relationship between the BMI and functional accelerometer data. Section 7 includes our conclusions. The Appendix A contains some key technical details, including the proofs of the theoretical results, while Appendix B presents a discussion on the model identifiability issue.

2. Model and Estimation

Consider a regression analysis of a scalar outcome y on p functional covariates,

Z^{ℓ}

,

ℓ = 1, \dots, p

. Let

z_{i}^{ℓ} = {(ξ_{1}^{ℓ}, \dots, ξ_{s_{ℓ}}^{ℓ})}_{i}^{⊤}

be the

s_{ℓ}

-element vector of functional principal component (FPC) features from the

i t h

observation of the ℓth functional covariate

Z^{ℓ}

, and let

{\vec{z}}_{i} = {[{(z_{i}^{1})}^{⊤}, \dots, {(z_{i}^{p})}^{⊤}]}^{⊤}

be the grand vector of all FPC features from all p functional covariates for subject i,

i = 1, \dots, n

. Clearly, the set of FPC features from each functional covariate forms a group, and in total, there are p groups with

s = \sum_{ℓ = 1}^{p} s_{ℓ}

many FPC features and

{\vec{z}}_{i} \in R^{s}

. The high dimensionality of FPC features presents the key methodological challenge in the analysis. We consider the following functional kernel machine regression (FKMR) model:

y_{i} = x_{i}^{⊤} β + h ({\vec{z}}_{i}) + ϵ_{i}, i = 1, \cdot \cdot \cdot, n,

(4)

where

β \in R^{q}

is a set of parameters for the effects of q scalar covariates

x = {(x_{1}, \dots, x_{q})}^{⊤}

,

h \in H_{K}

is an s-variate smooth nonparametric function with

H_{K}

being the functional space generated by a Mercer kernel

K

and error terms

ϵ_{i} \overset{i i d}{\sim} N (0, σ^{2})

. The FKMR model (4) allows for not only nonlinear, but also non-additive relationships with multiple functional covariates

Z^{ℓ}

via their FPC features,

ℓ = 1, \dots, p

, and a scalar outcome, y. The statistical task is to estimate and select important functional covariates that are related to the outcome of interest through regularizing the FPC features within each functional covariate. To proceed, following Beiman’s [28] non-negative garrote method, we here introduce a new s-dimensional scaling vector

γ \in R^{s}

,

γ = {(γ_{1}, \dots, γ_{s_{1}}, \dots, γ_{s})}^{⊤}

, by which we can set

γ \circ {\vec{z}}_{i} = {(γ_{1} ξ_{1}^{1}, \dots, γ_{s_{1}} ξ_{s_{1}}^{1}, \dots, γ_{s} ξ_{s_{p}}^{p})}_{i}^{⊤}

a new vector of weighted FPC features by

γ

via the Hadamard product (i.e., elementwise product). Note that

γ

is grouped and denoted by

γ = {((γ^{1})^{⊤}, \dots, {(γ^{p})}^{⊤})}^{⊤}

where

γ^{ℓ}

is an

s_{ℓ}

-element vector of FPC features

z^{ℓ}

of the

ℓ t h

functional covariate

Z^{ℓ}

. When the element, say

γ_{j}

, is equal to zero, the corresponding FPC feature

ξ_{j}

will not be selected in the set of important FPCs, and moreover, functional covariate

Z^{ℓ}

is excluded from the FKMR model when the entire vector

(γ^{ℓ})^{⊤} = 0

.

We estimate the unknowns in the FKMR model (4), as well as the scaling parameters

γ

by minimizing the penalized objective function

J_{1} (h, β, γ)

, whose expression is given on the right-hand side of the following Equation (5):

min_{h, β, γ} J_{1} (h, β, γ) = min_{h, β, γ} \frac{1}{2 n} \sum_{i = 1}^{n} {y_{i} - x_{i}^{⊤} β - h (γ \circ z_{i})}^{2} + \frac{1}{2} λ_{1} {∥h∥}_{H_{K}}^{2} + λ_{2} ρ (γ; δ),

(5)

where

λ_{1} > 0

and

λ_{2} > 0

are two tuning parameters, and penalty

ρ (γ; δ)

may be specified according to a certain regularization method. For the case of sparse group lasso (SGL), we take

p (γ; δ) = (1 - δ) \sum_{ℓ = 1}^{p} {∥γ^{ℓ}∥}_{2} + δ {∥γ∥}_{1}

,

δ \in [0, 1]

. Typically,

δ

is predetermined and set to

0.95

or

0.05

depending on the trade-off between group and within-group sparsity, while the factor (

1 - δ

) controls the relative group sparsity to individual sparsity of each functional predictor

Z^{ℓ}

. Meanwhile, a large tuning parameter for

λ_{2}

would remove a certain group of FPC features from the FKMR model when all elements in the vector

γ^{ℓ}

are zero. Given

h \in H_{K}

, an equivalent optimization to the above (5) can be formulated as follows:

\begin{matrix} min_{α, β, γ} J_{2} (α, β, γ) = min_{α, β, γ} \frac{1}{2 n} \sum_{i = 1}^{n} {\{y_{i} - x_{i}^{⊤} β - \sum_{k = 1}^{n} α_{k} K (γ \circ {\vec{z}}_{i}, γ \circ {\vec{z}}_{k})\}}^{2} \\ + \frac{1}{2} λ_{1} α^{⊤} K (γ; Z) α + λ_{2} ρ (γ; δ), \end{matrix}

(6)

where

K (γ; Z)

is an

n \times n

matrix whose

(i, k)

th element is

{[K (γ; Z)]}_{i k} = K (γ \circ {\vec{z}}_{i}, γ \circ {\vec{z}}_{k})

. Lemma 1 below establishes the equivalency of optimization solutions between (5) and (6), which is crucial in our estimation procedure.

Lemma 1.

A solution (

\hat{h}

,

\hat{β}

,

\hat{γ}

) is a minimizer of (5) if and only if (

\hat{α}

,

\hat{β}

,

\hat{γ}

) is a minimizer of (6), where

\hat{h} (\hat{γ} \circ \vec{z}) = \sum_{k = 1}^{n} {\hat{α}}_{k} K (\hat{γ} \circ \vec{z}, \hat{γ} \circ {\vec{z}}_{k})

.

The proof of Lemma 1 is given in Appendix A.1.

Theorem 1

(Existence of optimizers). If the kernel

K (\cdot, γ \circ \vec{z})

is continuous with respect to

γ \in R^{s}

, then there exists a global minimizer (

\hat{h}

,

\hat{β}

,

\hat{γ}

) for the optimization problem (5).

The proof of Theorem 1 is given in Appendix A.3. Note that there may exist multiple optimal minimizers for (5); Theorem 1 ensures only the existence of optimal solutions, but provides no guarantees for uniqueness due to the fact that (5) or (6) is a nonlinear and non-convex optimization problem. It is worth noting that in both (5) and (6), we set the bandwidth for the kernel at a fixed value due to the identifiability issue with respect to the scaling parameters

γ

. Refer to Appendix B for more detailed discussions on the issue of parameter identifiability.

3. Implementation and Algorithm

We propose an iterative algorithm to implement our proposed estimation procedure in which we require the differentiability of the kernel with respect to the scaling factor

γ

and some additional assumptions presented below in order to ensure algorithmic convergence. One part of the algorithm solving (5) is carried out under fixed

γ

, where the resulting minimization problem reduces to the equivalent maximization problem in the least-squares kernel machine (3) with the FPC features,

{\vec{z}}_{i}

, being replaced by

γ \circ {\vec{z}}_{i}

. As pointed out in Section 1.2, the step of numerical calculation can be easily executed in the same fashion as the solution from the linear mixed model, including the REML estimation of the smoothing parameter

λ_{1}

. The other part of the algorithm is performed under fixed

α

,

β

and

λ_{1}

, where we solve the nonlinear and non-convex optimization problem to update estimates of

γ

. Lemma 2 below helps us solve for the scaling parameter

γ

.

Lemma 2.

For fixed (α, β,

λ_{1}

), minimizing (6) over γ is equivalent to minimizing over γ the following objective function:

\frac{1}{2 n} {∥F (γ) - \tilde{Y}∥}_{2}^{2} + λ_{2} ρ (γ; δ), for λ_{2} > 0,

(7)

where

F (γ) = K (γ; Z) α

and

\tilde{Y} = Y - X β - \frac{n}{2} λ_{1} α

.

The proof of Lemma 2 is given in Appendix A.2. Linearizing the function

F (γ)

in (7) leads to an equivalent form:

\begin{matrix} min_{γ} \frac{1}{2 n} {∥\tilde{Y} - \sum_{ℓ = 1}^{p} \nabla_{γ} F^{(ℓ)} (\tilde{γ}) γ^{ℓ}∥}_{2}^{2} + λ_{2} ρ (γ; δ), \end{matrix}

(8)

where

\tilde{Y} = (Y - X β - \frac{n}{2} λ_{1} α) - F (\tilde{γ}) + \nabla_{γ} F (\tilde{γ}) \tilde{γ}

, with

\nabla_{γ} F (\tilde{γ})

being the gradient of the function

F

with respect to

γ

evaluated at

\tilde{γ}

for some

\tilde{γ}

, and

\nabla_{γ} F^{(ℓ)} (\tilde{γ})

being the columns of

\nabla_{γ} F (\tilde{γ})

associated with the ℓth group of

γ^{ℓ}

. This is precisely the form of the standard sparse group regularization problem:

{min}_{β \in R^{p}} \frac{1}{2 n} {∥Y - \sum_{ℓ = 1}^{p} X^{ℓ} β^{ℓ}∥}_{2}^{2} + λ_{2} ρ (γ; δ) .

This implies that (8) presents a standard sparse group regularization problem with a specific choice of penalty function

ρ (γ; δ)

.

The convergence of the above iterative search algorithm for updating

\tilde{γ}

for fixed (

α

,

β

,

λ_{1}

) can be justified by the proximal Gauss–Newton method [29]. Readers are referred to [30] for details on the proximal Gauss–Newton method. One of the key assumptions of the proximal Gauss–Newton method is the existence of a local minimizer. This condition is satisfied in the above (8). This is because according to Theorem 1, there exists a global minimizer.

Algorithm 1 summarizes these iterative steps, which is showed to satisfy a descent property:

J_{2} (α^{(r + 1)}, β^{(r + 1)}, γ^{(r + 1)})

\leq J_{2} (α^{(r)}, β^{(r)}, γ^{(r)})

under the convergence of the proximal Gauss–Newton algorithm for Step 2.2.

Algorithm 1 An iterative algorithm for optimization in FKMR.

1.1: Perform FPCA (e.g., the R package fdapace) to extract the functional component features for the p functional predictors, and store them in a grand vector for each individual subject ${\vec{z}}_{i} = [{(z_{i}^{1})}^{⊤}, \dots, {(z_{i}^{p})}^{⊤})]^{⊤}$ , $i = 1, \cdot \cdot \cdot, n$ ;
1.2: Initialize $γ$ to be a vector of ones. which translates to mapping the original component scores to itself. Set up a grid of possible tuning parameters for $λ_{1}$ and $λ_{2}$ , respectively. Set the kernel bandwidth parameter, which may depend on $λ_{1}$ . For each pair of $(λ_{1}, λ_{2})$ from our grid, perform Steps 2.1–2.3 and 3.1 below.
2.1: At the $(r + 1)$ -th step in the algorithm, first solve the LSKM problem with fixed ( $γ^{(r)}, λ_{1}$ ) (based on a closed-form solution) to update $β^{(r + 1)}$ and $α^{(r + 1)}$ .
2.2: Solve the group regularity problem (8) with fixed $\tilde{γ} = γ^{(r)}$ and fixed ( $α^{(r + 1)}$ , $β^{(r + 1)}$ , $λ_{1}$ , $λ_{2}$ ) using the $r + 1$ updates from the previous iteration. At this step, the proximal Gauss–Newton algorithm produces an update $γ^{(r + 1)}$ at convergence.
2.3: Repeat Steps 2.1–2.2 until convergence.
3.1: Perform cross-validation over all pairs of ( $λ_{1}, λ_{2}$ ) to determine the final $(α, β, γ)$ .

To speed up Algorithm 1, we propose the following operational schemes that avoid setting up the pairs of (

λ_{1}

,

λ_{2}

) and performing Step 3.1. Here are a few remarks on the two algorithms. (i) Algorithm 2 depends on good starting values in order to enjoy a fast search. (ii) The main difference between Algorithms 1 and 2 is that

λ_{2}

is fixed in Algorithm 1, while it is changing in Algorithm 2. Some similar algorithms with changing tuning parameters have been proposed in the literature, such as the single index model [31]. (iii) There is no guarantee that both algorithms converge to a global minimizer, and the proximal Gauss–Newton method used in the implementation can only find stationary points. Numerical solvers for the optimization problem in (5) or in (6) indeed remain an open problem in the field of nonlinear and nonconvex optimization.

Algorithm 2 A fast operational scheme of Algorithm 1.

1.: Step 2.1 of Algorithm 1 is performed by running the linear mixed model with our initial fixed $γ$ from Step 1.2 of Algorithm 1 to obtained updated values of $λ_{1}$ , $β$ , and $α$ .
2.: Step 2.2 is performed with solving the group regularity problem (8) through the Gauss–Newton algorithm using cross-validation-based tuning (e.g., R package oem).
3.: Rerun Step 2.1 using the updated $γ$ from Step 2.2 to obtain the estimates for $β$ and $α$ .

4. Theoretical Guarantees

Our theoretical analysis focuses on the finite-sample

L_{2}

error bounds for the estimators

(\hat{h}, \hat{γ})

obtained by (5) or (6). Consequently, we are able to establish the estimation consistency. For simplicity, we set

β = 0

and consider a general setting of random vectors

z_{1}, \dots, z_{n}

so that the FPC features

{\vec{z}}_{1}, \dots, {\vec{z}}_{n}

correspond to a special case. Along similar lines as those of [15,32], the estimation consistency is proven in the case of the SGL penalty function. We define a map

Γ

with an s-element vector

γ \in R^{s}

, which gives rise to a collection of all scaling map functions:

A = {Γ : R^{s} \mapsto R^{s} ∣ Γ (z) = γ \circ z, z \in R^{s} and γ \in R^{s}}

. Since

Γ

is a linear (and bounded) operator,

A

is a real vector space where

(c_{1} Γ_{1} + c_{2} Γ_{2}) (z) = c_{1} Γ_{1} (z) + c_{2} Γ_{2} (z)

with any

c_{1}, c_{2} \in R

and

Γ_{1}, Γ_{2} \in A

. To perform a group regularization estimation, we define an SGL penalty by a norm on

A

for a fixed

δ \in [0, 1]

as follows:

{∥Γ∥}_{S G L} = δ \sum_{ℓ = 1}^{p} {∥γ^{ℓ}∥}_{2} + (1 - δ) {∥γ∥}_{1} .

(9)

Consequently, the SGL regularization estimation requires the following constrained optimization:

min_{Γ \in A, h \in H_{K}} J_{3} (Γ, h) = min_{Γ \in A, h \in H_{K}} {∥Y - h \circ Γ∥}_{n}^{2} + λ_{1} {∥h∥}_{H_{K}}^{2} + λ_{2} {∥Γ∥}_{S G L},

(10)

where

{∥Y - h \circ Γ∥}_{n}^{2} = \frac{1}{n} \sum_{i = 1}^{n} {\{y_{i} - (h \circ Γ) (z_{i})\}}^{2}

. Lemma 3 below provides the essential finite-sample inequalities that lead to the estimation consistency.

Lemma 3

(Basic inequality). Let

\hat{h} \circ \hat{Γ}

be the minimizer of (10). Let

h_{0} \circ Γ_{0}

be the true function. Then, we have:

J_{3} (\hat{Γ}, \hat{h}) \leq 2 {(ϵ, \hat{h} \circ \hat{Γ} - h_{0} \circ Γ_{0})}_{n} + λ_{1} {∥h_{0}∥}_{H_{K}}^{2} + λ_{2} {∥Γ_{0}∥}_{S G L},

(11)

where

2 {(ϵ, \hat{h} \circ \hat{Γ} - h_{0} \circ Γ_{0})}_{n} = \frac{2}{n} \sum_{i = 1}^{n} ϵ_{i} \{(\hat{h} \circ \hat{Γ}) (z_{i}) - (h_{0} \circ Γ_{0}) (z_{i})\}

.

We need the following notation before presenting our theoretical guarantees. Let

N (δ, M, P_{n})

denote the minimal

δ

covering number of the function set

M

under the empirical metric

P_{n}

based on the random vectors

z_{1}, \dots, z_{n}

. Let

N = N (δ, M, P_{n})

be a shorthand notation. This means that there exist functions

m_{1}, \dots, m_{N}

(not necessarily in the set

M

) such that for every function

m \in M

, there exists a

j \in {1, \dots, N}

such that

{∥m - m_{j}∥}_{P_{n}} \leq δ

, with

{∥m - m_{j}∥}_{P_{n}} : = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {m (z_{i}) - m_{j} (z_{i})}^{2}}

. Define the

δ

-entropy of

M

for the empirical metric,

P_{n}

, as

H (δ, M, P_{n}) : = l o g (N (δ, M, P_{n}))

. Consider a functional space of the form:

B = \{b : = b (h, Γ) = \frac{h \circ Γ - h_{0} \circ Γ_{0}}{{∥h∥}_{H_{K}}^{2} + {∥h_{0}∥}_{H_{K}}^{2} + {∥Γ∥}_{S G L}^{2} + {∥Γ_{0}∥}_{S G L}^{2}} | h \in H_{K}, Γ \in A\} .

We postulate the following assumptions.

Assumption A1.

The error term

ϵ = {(ϵ_{1}, \dots, ϵ_{n})}^{⊤}

is uniformly sub-Gaussian; that is, for constants

C_{1}

and

C_{2}

,

max_{n \geq 1} max_{i = 1, \dots, n} C_{1}^{2} [E \{e x p (\frac{{ϵ_{i}}^{2}}{C_{1}^{2}})\} - 1] \leq C_{2} .

Clearly, the moment condition is bounded below from zero.

Assumption A2.

{∥Γ_{0}∥}_{S G L}^{2} + {∥h_{0}∥}_{H_{K}}^{2} > 0

, and the entropy of space

B

with respect to the empirical metric

P_{n}

is bounded as follows:

H (δ, B, P_{n}) \leq C_{3} δ^{- 2 ψ},

where

C_{3}

is some constant and

ψ \in (0, 1)

.

Assumption A3.

{sup}_{b \in B} {∥b∥}_{P_{n}} \leq C_{4}

for some constant

C_{4}

.

Theorem 2.

(Consistency) Under Assumptions 1–3 above, if tuning parameters

λ_{1}

and

λ_{2}

satisfy

λ_{2}^{- 1} = n^{\frac{1}{1 + ψ}} {({∥h_{0}∥}_{H_{K}}^{2} + {∥Γ_{0}∥}_{S G L})}^{\frac{1 - ψ}{1 + ψ}}, a n d λ_{1} = O_{p} (1) λ_{2},

then we have

{∥\hat{h} \circ \hat{Γ} - h_{0} \circ Γ_{0}∥}_{n} = O_{p} (n^{- \frac{1}{2 + 2 ψ}}) {({∥h∥}_{H_{K}}^{2} + {∥Γ∥}_{S G L})}^{\frac{ψ}{1 + ψ}}, a n d

(12)

{∥\hat{h}∥}_{H_{K}}^{2} + {∥\hat{Γ}∥}_{S G L} = O_{p} (1) ({∥h_{0}∥}_{H_{K}}^{2} + {∥Γ_{0}∥}_{S G L}) .

(13)

Theorem 2 implies estimation consistency under the right rates for the two tuning parameters

λ_{1}

and

λ_{2}

. Due to the potential identifiability issues explained in detail in Appendix B, although the estimator

(\hat{h}, \hat{Γ})

may not be unique, the sum of

\hat{h}

and

\hat{Γ}

is not too far away from the sum of the true

h_{0}

and

Γ_{0}

.

Corollary 1.

If the RKHS,

H_{K}

, contains differentiable functions

\nabla h (z)

whose norm

{∥\nabla h (z)∥}_{H_{K}}

is uniformly bounded for all functions

h \in H_{K}

and

z \in R^{s}

, then Assumption 2 holds when Theorem 2 is replaced by

H (δ, H_{K}, P_{n}) \leq C_{1} δ^{- 2 ψ}, for all δ \geq 0

.

The proofs of Theorem 2 and Corollary 1 are given in Appendix A.4 and Appendix A.5, respectively. Often, when we are only interested in a subset of functions in the RKHS (e.g., functions with norm less than one), we can substitute the full space

H_{K}

in Corollary 1 with the subspace of interest. Refer to [15] or [32], where both considered an RKHS (i.e., Sobolev space) with functions of norm less than or equal to one.

5. Simulation Experiments

We performed extensive simulation to investigate the performance of our proposed procedure, including the performance of SGL variable selection and its overall accuracy. Due to the limitations of space, we include results from two simulation experiments in this section, and more results may be found in the first author’s Ph.D. dissertation [30].

5.1. Setup

In the evaluation of the performance accuracy, following [15], we used both quasi-

R^{2}

and adjusted quasi-

R^{2}

defined as follows:

\begin{matrix} R_{Q}^{2} : = 1 - \frac{\sum_{i = 1}^{n} {(y_{i} - \hat{y_{i}})}^{2}}{\sum_{i = 1}^{n} {(y_{i} - \bar{y_{i}})}^{2}}, a n d R_{A Q}^{2} : = 1 - (1 - R_{Q}^{2}) (\frac{n - 1}{n - (k + 1)}) . \end{matrix}

The latter is known to be appealing for the comparison of the estimation sparsity. There is another performance metric of interest in addition to model accuracy. Performance in variable selection is summarized in terms of the stability measured by sensitivity and specificity for both functional and variable selections under these simulation experiments. Our algorithm uses existing R packages, including emmreml, kspm, and oem.

Specifically, we designed the following two simulation settings.

Scenario 1: A single functional predictor with sparsity in the FPC features.

Scenario 2: Multiple functional predictors with sparsity in the functional predictors and with sparsity in the FPC features of important functional predictors.

Each of these two scenarios would be handled using certain suitable penalty functions to address the designed sparsity; for example, in Scenario 2 we used a two-level variable selection penalty (e.g., SGL) to deal with two types of sparsity in the true model. In all analyses, we used the Gaussian kernel

K (u, v) = exp (- \frac{1}{p} {∥u - v∥}^{2})

in our estimation, where p was set as the number of features, which is equivalent to dividing the

γ

vector by

\sqrt{p}

. This scaling parameter may be either estimated or set to the number of features to overcome the identifiability issue according to [33], where theoretical justification was given for the use of the number of features for the bandwidth parameter in the case of the Gaussian kernel.

According to [23], due to the difficulty of the graphical display for the estimated s-dimensional function

h (\cdot)

of

z

, we summarized the goodness-of-fit by regressing the true h on the estimated

\hat{h}

, with both being evaluated at the design points. From this concordance regression analysis, we may measure the goodness-of-fit on

\hat{h}

through the average intercepts, slopes, and R-squared (also known as the coefficient of determination) obtained over the number of replications. Clearly, a high-quality fit is reflected by (i) the intercept being close to zero, (ii) the slope being close to one, and (iii) the R-squared being close to one. Moreover, we graphically display the estimated function

\hat{h}

by setting all variables equal to 0.5 except the one of interest over a grid of 100 equally spaced points on the interval

[0, 1]

. Such visualization of the functional estimation at each margin further facilitates the evaluation of the proposed algorithm in addition to the results obtained from the concordance regression analyses.

In all scenarios, we generated 1000 IID functional paths, of which 750 paths were assigned to the training set and 250 paths were assigned to the test set for an external performance evaluation. It is the test set that we used to display the performance accuracy. We used a one-dimensional covariate

x_{i}

to show the flexibility of our model in a semi-parametric setting, with independent copies of

x_{i} \sim N (0, 1)

. We chose the true coefficients in the kernel machine model similar to those given in [23].

5.2. Simulation in Scenario 1

In this simple scenario with a single functional predictor, we simulated data from a model with sparsity in its FPC features. To do so, we generated a single functional predictor based on the first 15 eigenbasis of the Fourier basis functions over the interval

[0, 1]

:

Z (t) = \sum_{j = 1}^{15} \sqrt{ς_{j}} ξ_{j} ϕ_{j} (t)

. That is, a functional predictor was created as a linear combination of the 15 basis functions, where

ϕ_{j} (\cdot)

is the

j t h

Fourier basis function,

ς_{j}

is the jth eigenvalue of Z, and

ξ_{j}

is the jth FPC feature that is simulated from a normal distribution detailed as follows.

There were 100 sampled points that were first equally spaced in the interval

[0, 1]

and then varied with certain small deviations drawn from

ν \sim N (0, 0.001)

. Set

ς_{j} = 45 \times 0 . 64^{j}

and

ξ_{j} \sim N (0, 1)

independently over

j = 1, \dots, 15

. As was done in [17], instead of directly using

ξ_{j}

, we used

ζ_{j} = Φ (ξ_{j})

, where

Φ

is the CDF of the standard normal. This resulted in

\vec{z} = {(ζ_{1}, \dots, ζ_{15})}^{⊤}

. We chose the second,

ζ_{2}

, and ninth,

ζ_{9}

, features as important features in the following true nonlinear non-additive model:

y_{i} = 2 x_{i} + 20 cos (2 π ζ_{i 2}) - 10 sin (2 π ζ_{i 9}) + ζ_{i 2} ζ_{i 9} + ϵ_{i},

with

ϵ_{i} \overset{i i d}{\sim} N (0, 1)

. FPCA was performed by the R package PACE [34], producing the estimated FPC scores,

\hat{ξ_{j}}

, as well as the estimated eigenvalues,

\hat{ς_{j}}

, which in turn enabled us to compute

{\hat{ζ}}_{j}

,

j = 1, \dots, 15

.

We applied both LASSO and MCP penalty functions in our implementation, termed as

F K M R_{L a s s o}

and

F K M R_{M C P}

, respectively. We compared the results of our method with the standard linear approach with both LASSO and MCP under the assumption of linear functional relationships, as well as the COSSO method for functional additive regression [15] using the R package COSSO [15,34]. Since the COSSO package is built for nonparametric regression (and not partial linear models), we adopted the backfitting strategy and regressed the residuals with our estimated effect of

x_{i}

removed.

In addition, we compared our method with an oracle FKMR estimator, called

F K M R^{o r a c l e}

, that assumed the full knowledge of the true

ζ_{j}

containing two true nonzero signals,

ζ_{2}

and

ζ_{9}

. We also considered two oracle versions of our proposed algorithm,

F K M R_{L a s s o}^{o r a c l e}

and

F K M R_{M C P}^{o r a c l e}

, both of which used the knowledge of true

ζ_{j}

in order to evaluate the performance of the FPCA procedure. This evaluation is important as our proposed procedure can be in principle used in simpler cases that do not involve functional covariates. Note that once we used FPCA to obtain

{\hat{ζ}}_{j}

features, our algorithm essentially works in a standard regression setting with the sparsity of covariates. Thus, our proposed procedure can be in principle used in simpler cases with scalar covariates. In Scenario 1, due to the highly nonlinear relationships between the FPC features and the outcome, as expected, the naive linear model performed poorly in terms of both model selection and model consistency. The detailed simulation results for Scenario 1 can be found in the first author’s Ph.D. dissertation [30]. In brief, our proposed method worked well in all aspects. In this setting, COSSO also worked well in terms of model fit, but it tended to select noisy features more frequently than our proposed method, leading to more false positives.

5.3. Simulation in Scenario 2

Now, we generated four functional predictors of the form:

Z^{ℓ} (t) = \sum_{j = 1}^{9} \sqrt{ς_{j}^{ℓ}} ξ_{j}^{ℓ} ϕ_{j}^{ℓ} (t)

,

ℓ = 1, \dots, 4,

where

ϕ_{j}^{ℓ}

,

ς_{j}^{ℓ}

, and

ξ_{j}^{ℓ}

were set in the same way as those given in Scenario 1. It follows that

\vec{z} = {(ζ_{1}^{1}, \dots, ζ_{9}^{1}, \dots, ζ_{1}^{4}, \dots, ζ_{9}^{4})}^{⊤}

, where

ζ_{j}^{ℓ}

is the jth

Φ

-transformed feature for the ℓth functional covariate. Sparsity was specified as follows: the first and second functional covariates,

Z^{1}

and

Z^{2}

, were chosen as important signals in which these transformed FPC features,

{ζ_{1}^{1}, ζ_{3}^{1}, ζ_{4}^{1}, ζ_{2}^{2}, ζ_{7}^{2}}

, are five important features (three features from the

Z^{1}

and two features from

Z^{2}

) that are related to the outcome:

\begin{matrix} y_{i} = & 2 x_{i} + ζ_{i 1}^{1} + ζ_{i 3}^{1} + ζ_{i 4}^{1} + ζ_{i 2}^{2} + ζ_{i 7}^{2} + 10 cos (2 π ζ_{i 1}^{1}) - 10 {(ζ_{i 2}^{2})}^{2} + 10 {(ζ_{i 7}^{2})}^{2} - 10 {(ζ_{i 3}^{1})}^{2} \\ + 10 exp (- ζ_{i 3}^{1}) ζ_{i 4}^{1} - 8 sin (2 π ζ_{i 7}^{2}) cos (2 π ζ_{i 3}^{1}) + 20 ζ_{i 1}^{1} ζ_{i 7}^{2} + ϵ_{i}, i = 1, \dots, n, \end{matrix}

where

ϵ_{i} \overset{i i d}{\sim} N (0, 1)

. This model specifies both group sparsity (two of the four functional predictors) and within-group sparsity (three of the nine FPC features in

Z^{1}

and two of the nine FPC features in

Z^{2}

). In addition, we specified non-additive relationships in the true model across multiple functional covariates.

We fit the data using the proposed methods, including

F K M R_{G M C P}^{o r a c l e}

,

F K M R_{L a s s o}

,

F K M R_{G L a s s o}

,

F K M R_{S G L}

,

F K M R_{M C P}

, and

F K M R_{G M C P}

, and the results based on 100 replicates are summarized in Table 1. For comparison, we also fit the simulated data by existing methods, including the linear model (denoted by LM + penalty), COSSO functional additive regression, and the oracle method using the knowledge of true important features in the analysis, as done in the above simulation of Scenario 1. From Table 1 regarding the goodness-of-fit, we see that all of our FKMR estimators outperformed the standard linear estimators in terms of

R_{A Q}^{2}

among all of our penalty functions, and they outperformed COSSO for penalties that accounted for group sparsity. In the concordance regression analysis, we see that all intercepts were close to zero, all slopes close to one, and all

R^{2}

close to one, indicating a high goodness-of-fit for functional estimation. COSSO tended to perform on par for penalties that did not account for group sparsity (LASSO and MCP). It is evident that using a group sparsity penalty function (SGL, GLasso, and GMCP) clearly outperformed the methods that did not regularize the grouping of covariates (Lasso and MCP). In addition, our FKMR estimators (except

F K M R_{L a s s o}

) performed as well as the oracle estimator

F K M R_{G M C P}^{o r a c l e}

both in terms of

R_{A Q}^{2}

and in terms of our estimate of functional h. The results also indicated that there were little differences between using a concave (MCP or GMCP) penalty function or using a convex (GLasso or SGL) penalty function.

As regards the group sparsity, Table 2 indicates that the all methods had a high sensitivity of detecting functional signals, while the proposed FKMR methods had better specificity than both sparse linear models and COSSO. Concerning the within-group sparsity, it is interesting to note that a bigger difference was seen in terms of what type of penalty function was being used in feature selection. As shown in Table 3 and Table 4, using a general penalty (e.g., Lasso and MCP) that does not take the grouping structure into account tended to under-select important features within a group. COSSO tended to perform well within group sparsity. Moreover, Figure 2 shows that the FKMR method estimated the five signal functions (

Z^{1}

and

Z^{2}

) well.

6. Data Example

To show the usefulness of our proposed methodology, we analyzed data of 550 children recruited by the ELEMENTS study [35], who had consent to wear an actigraph (ActiGraph GT3X+; ActiGraph LLC. Pensacola, FL, USA). This wearable was to be placed on their non-dominant wrist for five to seven days with no interruption. The actigraph measured tri-axis accelerometer data sampled at 30 Hz, which captured three different directions of a person’s movement. The BMI was the outcome of interest as it is biomarker of obesity. Sex and age were confounding factors used in the analysis. Due to some missing data, our analysis only included children who wore the device properly for 85% or more over the study period, which resulted in 395 participants, consisting of 189 males and 206 females. Other studies such as [36] have excluded days of accelerometer data with more than five percent missing. The mean ± SD BMI of the study cohort was 21.5 ± 4.1. The mean age of the study participants was 14.3 ± 2.1 y. A more detailed description of the dataset used for this paper can be found in [37]. Our primary interest was to see if the BMI is associated with physical activity in the presence of other covariates, specifically sex and age. We preprocessed the activity counts over the 7 d of wear by taking the median in the 1 min epoch over the entire 7 d of wear. For example, since all the participants started wearing the device at 3 p.m., the first data point for each individual was a median of 7 ACs (each for one day) for the 1 min epoch of 3:00–3:01 p.m. This procedure that takes the medians across the minutes from different days has been considered in other applications such as [36]. See Figure 3 as an example of the resulting time series of medians derived from the AC data displayed in Figure 1.

We applied the following five models, labeled as M0–M4 for convenience, to analyze the data with the 24 h median ACs as functional predictors. Let

ξ_{i j}^{k}

be the ith person’s kth FPC score for functional predictor j.

M0:: Linear model (LM) with only the fixed features: $B M I_{i} \sim β_{0} + β_{1} A g e_{i} + β_{2} S e x_{i};$
M1:: Linear model with SGL penalty (LM+SGL) using the FPCA features: $B M I_{i} \sim β_{0} + β_{1} A g e_{i} + β_{2} S e x_{i} + \sum_{j = 1}^{3} \sum_{k = 1}^{s_{k}} β_{j}^{k} ξ_{i j}^{k};$
M2:: LSKM using the FPCA features: $B M I_{i} \sim β_{0} + β_{1} A g e_{i} + β_{2} S e x_{i} + h (z_{i});$
M3:: FKMR model with SGL penalty ( $F K M R_{S G L}$ ) using the FPCA features: $B M I_{i} \sim β_{0} + β_{1} A g e_{i} + β_{2} S e x + h (γ \circ z_{i});$
M4:: COSSO using the FPCA features: $r e s (B M I_{i}) | z_{i} \sim \sum_{j = 1}^{3} \sum_{k = 1}^{s_{k}} f_{i j} (ξ_{i j}^{k}) .$ In order for a direct application of the COSSO R package, we used residuals $r e s (B M I_{i}) = B M I_{i} - \hat{β_{0}} + \hat{β_{1}} A g e_{i} + \hat{β_{2}} S e x_{i}$ in the COSSO model fit, with ${\hat{β}}_{0}, {\hat{β}}_{1}$ and ${\hat{β}}_{2}$ being the estimates of the coefficients from Model M0.

The BMI and age were mean centered and scaled to be a standard deviation of one, so

β_{0}

was absent in the models. Here are some key findings from the data analyses. First, in terms of the goodness-of-fit, Table 5 suggests that M3, i.e., our proposed model FKMR with the SGL penalty, gave the best performance, where the adjusted

R^{2}

of M3 was nearly twice as big as all the other four models. Second, it is interesting to note that both the COSSO and the

F K M R_{S G L}

did not select the FPC scores associated with the Z-axis. Third, as shown in Table 6, all of the FPC components chosen by COSSO were also chosen by the

F K M R_{S G L}

. It is worth noting that the linear model together with the SGL penalty selected the highest number of FPC components, yet performed the worst in terms of the model fit.

7. Conclusions

In this paper, we proposed a method to model the nonlinear relationship between multiple functional predictors and a scalar outcome in the presence of other scalar confounders. We used the FPCA to decompose the functional predictors for feature extraction and used the LSKM framework to model the functional relationship between the outcome and principal components. We developed a simultaneous procedure to select important functional predictors and important features within selected functionals. We proposed a computationally efficient algorithm to implement our regularization method, which was easily programmed in R with the utility of multiple existing R packages. It should be noted that although we focused on functional regression in this paper, the method proposed can be applied to non-functional predictors. In effect, by using functional principal components, we essentially bypassed the infinite-dimensional problem and worked effectively in a non-functional framework with the FPC features. Through simulation and using data from the ELEMENT dataset, we demonstrated how the FKMR estimator outperformed existing methods in terms of both variable selection and model fit. It should be noted that the existing COSSO method did perform well in terms of variable selection, as shown in Section 5.

A technical issue pertains to identifiability limitations with regard to the bandwidth parameter and to the RKHS estimator. To overcome this, we suggested fixing the bandwidth parameter; see the detailed discussion in Section 3. We established key theoretical guarantees for our proposed estimator. In the case where there are multiple proposed estimators (and thus the identifiability issues arise), the established theoretical properties in Section 4 apply to any of those estimators.

Variable section on functional predictors presents many technical challenges, and there are many methodological problems that remain unsolved. This paper demonstrated a possible framework to regularize estimation with a bi-level sparsity of functional group sparsity and within-group sparsity. In the LSKM paper [23], it was briefly mentioned that if the relationship between the scalar outcome and p genetic pathways is additive, we can tweak the model as

y_{i} = x_{i}^{⊤} β + h_{1} (z_{i}^{1}) + \dots + h_{p} (z_{i}^{p}) + ϵ_{i}

where each

h_{j}

belongs to its own RKHS. It is easy to extend our method and algorithms to handle this case. For future research, an extension on longitudinal outcomes may be considered via a mixed-effects model

y_{i j} = x_{i}^{⊤} β + h (z_{i j}) + u_{i j}^{⊤} v_{i} + ϵ_{i j}

where

u_{i j}^{⊤} v_{i}

are the random effects. Other useful extensions to the proposed paradigm would be on the lines of generalized linear models and Cox regression models.

Author Contributions

Conceptualization, P.X.S. and J.N.; Formal analysis, J.N.; Methodology, J.N. and P.X.S.; Supervision, P.X.S.; Writing—original draft, J.N.; Writing—review & editing, P.X.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by NSF DMS#2113564.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The used data of physical activity counts, BMI and demographic variables (sex and age) are available upon request through a formal data request procedure outlined by the ELEMENT Cohort Study. Contact the corresponding author of this paper for the detail.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Technical Assumptions and Proofs

Appendix A.1. Proof of Lemma 1

It suffices to show that for any

J_{1} (h, β, γ)

in (5) we can always find

α \in R^{n}

such that

J_{1} (\tilde{h} = \sum_{i = 1}^{n} α_{i} K (\cdot, γ \circ {\vec{z}}_{i}), γ, β) \leq J_{1} (h, β, γ)

where

\tilde{h}

is the projection of h onto the linearly spanned space given by

s p a n {K (\cdot, γ \circ {\vec{z}}_{i}), \dots, K (\cdot, γ \circ {\vec{z}}_{n})}

. For any h we can write

h = h^{⊥} + \tilde{h}

where

h^{⊥} \in s p a n {K (\cdot, γ \circ {\vec{z}}_{1}), \cdot \cdot \cdot, K (\cdot, γ \circ {\vec{z}}_{n})}^{⊥}

. Since

H_{k}

is a reproducing kernel Hilbert space we can rewrite (5) as follows:

\begin{matrix} J_{1} (h, γ, β) & = \frac{1}{2 n} \sum_{i = 1}^{n} {y_{i} - x_{i}^{⊤} β - < h, K (\cdot, γ \circ {\vec{z}}_{i}) >}^{2} + \frac{1}{2} λ_{1} {∥h∥}_{H_{k}}^{2} + λ_{2} ρ (γ; δ) . \end{matrix}

Since

< h^{⊥}, K (\cdot, γ \circ {\vec{z}}_{i}) > = 0

for every i, we obtain

\begin{matrix} J_{1} (h, γ, β) & = \frac{1}{2 n} \sum_{i = 1}^{n} {\{y_{i} - x_{i}^{⊤} β - \sum_{k = 1}^{n} α_{k} K (γ \circ {\vec{z}}_{i}, γ \circ {\vec{z}}_{k}))\}}^{2} + \frac{1}{2} λ_{1} {∥h^{⊥} + \tilde{h}∥}_{H_{k}}^{2} + λ_{2} ρ (γ; δ) \\ \geq \frac{1}{2 n} \sum_{i = 1}^{n} {\{y_{i} - x_{i}^{⊤} β - \sum_{k = 1}^{n} α_{k} K (γ \circ {\vec{z}}_{i}, γ \circ {\vec{z}}_{k}))\}}^{2} + \frac{1}{2} λ_{1} {∥\tilde{h}∥}_{H_{k}}^{2} + λ_{2} ρ (γ; δ) \\ = J_{1} (\tilde{h}, γ, β) . \end{matrix}

Appendix A.2. Proof of Lemma 2

The equivalence of forms become clear once we rewrite (6) in the matrix notation. Equation (6) can be written as follows:

\begin{matrix} min_{α, β, γ} J_{2} (α, β, γ) = min_{α, β, γ} \frac{1}{2 n} {∥Y - X β - K (γ; Z) α∥}_{2}^{2} + \frac{1}{2} λ_{1} α^{⊤} K (γ; Z) α + λ_{2} ρ (γ; δ) . \end{matrix}

(A1)

For fixed

α

,

β

and

λ_{1}

, minimizing the function in (A1) with respect to

γ

is equivalent to

min_{γ} \{\frac{1}{2 n} {∥(Y - X β - \frac{n}{2} λ_{1} α) - K (γ; Z) α∥}_{2}^{2} + λ_{2} ρ (γ; δ)\} .

(A2)

Appendix A.3. Proof of Theorem 1

With loss of the generality we use the penalty function for sparse group lasso but this proof can easily be modified for other penalty functions. Also, we fix

λ_{1} = λ_{2} = δ = 1

, and consider

β \in R

as well as set the design matrix

X

(or vector in this case) scaled to have norm 1. The case of

β \in R^{q}

will follow along similar lines of arguments. Let

γ \in D_{3}

with

D_{3} = {γ : {∥γ∥}_{1} \leq \frac{1}{2 n} {∥Y∥}_{2}^{2}}

. Define

f (γ) = ∥K (γ; Z)∥ = η_{m a x} (K (γ; Z)) \geq 0

, where

η_{m a x} (K (γ; Z))

denotes the largest eigenvalue of

K (γ; Z)

with the operator norm (the norm of

K (γ; Z)

) defined in its usual way

∥K (γ; Z)∥ = s u p {{∥K (γ; Z) x∥}_{2}^{2} : {∥x∥}_{2}^{2} = 1}

. Since

D_{3}

is compact and

K (γ; Z)

is continuous with respect to

γ

it achieves its maximum over

D_{3}

. Thus, we define

η^{★} = s u p_{γ \in D_{3}} f (γ) \geq 0

. Define

D_{2} = {β : ∣ β ∣ \leq (1 + η^{★}) {∥Y∥}_{2}}

, where the upper bound is denoted by

b^{★} = (1 + η^{★}) {∥Y∥}_{2} \geq 0 .

Moreover, define

D_{1} = {α : {∥α∥}_{2} \leq \sqrt{n} ({∥Y∥}_{2} + b^{★})} .

Since

D_{1}, D_{2}

and

D_{3}

are compact there exists a

(α^{★}, β^{★}, γ^{★})

such that

J_{2} (α^{★}, β^{★}, γ^{★}) \leq J_{2} (α, β, γ)

for all

(α, β, γ) \in D_{1} \times D_{2} \times D_{3}

. Note that

J_{2} (0, 0, 0) = \frac{1}{2 n} {∥Y∥}_{2}^{2}

and

(0, 0, 0) \in D_{1} \times D_{2} \times D_{3}

. We claim that

(α^{★}, β^{★}, γ^{★})

is a global minimizer, which is proved below by contradiction.

Suppose that there exists

(\tilde{α}, \tilde{β}, \tilde{γ}) \notin D_{1} \times D_{2} \times D_{3}

where

J_{2} (\tilde{α}, \tilde{β}, \tilde{γ}) < J_{2} (α^{★}, β^{★}, γ^{★})

. We must have that

\tilde{γ} \in D_{3}

; if not, we have

J_{2} (\tilde{α}, \tilde{β}, \tilde{γ}) \geq {∥\tilde{γ}∥}_{1} \geq J_{2} (0, 0, 0) \geq J_{2} (α^{★}, β^{★}, γ^{★})

. Let

q_{1}, \cdot \cdot \cdot, q_{n}

be the orthonormal vectors of

K (\tilde{γ}; Z)

with its associated eigenvalues

η_{1} \geq \dots \geq η_{n} \geq 0

. We can write out

\tilde{α}, X, Y

in terms of these basis functions where

\tilde{α} = \sum_{i = 1}^{n} < \tilde{α}, q_{i} > q_{i}

,

Y = \sum_{i = 1}^{n} < Y, q_{i} > q_{i}

and

X = \sum_{i = 1}^{n} < X, q_{i} > q_{i}

. Let

C_{i}^{\tilde{α}} = < \tilde{α}, q_{i} >

,

C_{i}^{Y} = < Y, q_{i} >

and

C_{i}^{X} = < X, q_{i} >

. It follows that

J_{2} (\tilde{α}, \tilde{β}, \tilde{γ}) \geq \frac{1}{2 n} {∥\sum_{i = 1}^{n} C_{i}^{Y} q_{i} - \sum_{i = 1}^{n} C_{i}^{X} \tilde{β} q_{i} - \sum_{i = 1}^{n} C_{i}^{\tilde{α}} η_{i} q_{i}∥}_{2}^{2} + \frac{1}{2} \sum_{i = 1}^{n} {(C_{i}^{\tilde{α}})}^{2} η_{i},

which is equal to

\frac{1}{2 n} \sum_{i = 1}^{n} {(C_{i}^{Y} - C_{i}^{X} \tilde{β} - C_{i}^{\tilde{α}} η_{i})}^{2} + \frac{1}{2} \sum_{i = 1}^{n} {(C_{i}^{\tilde{α}})}^{2} η_{i}

. We can minimize the above objective function with respect to

C_{i}^{\tilde{α}}

and

\tilde{β}

. First, note that for any

η_{i} = 0

we can let

C_{i}^{\tilde{α}} = 0

as it will not affect the expression above. It is sufficient to consider

η_{i} > 0

. Taking the first derivative and setting it equal to zero, we obtain the score equations the minimizer must satisfy, for our minimum

\tilde{β}

and

C_{i}^{\tilde{α}}

β = \sum_{i = 1}^{n} C_{i}^{X} (C_{i}^{Y} - C_{i}^{\tilde{α}} η_{i})

(A3)

C_{i}^{\tilde{α}} = \frac{1}{n + η_{i}} (C_{i}^{Y} - C_{i}^{X} \tilde{β}) .

(A4)

In the above derivation we used the fact that

1 = {∥X∥}_{2}^{2} = \sum_{i = 1}^{n} {(C_{i}^{X})}^{2}

. Plugging (A4) into (A3), we obtain

β = \frac{\sum_{i = 1}^{n} C_{i}^{X} C_{i}^{Y} (1 - \frac{η_{i}}{n + η_{i}})}{1 - \sum_{i = 1}^{n} {(C_{i}^{X})}^{2} \frac{η_{i}}{n + η_{i}}} .

(A5)

It follows that

β \leq \frac{\sum_{i = 1}^{n} ∣ C_{i}^{X} C_{i}^{Y} ∣}{1 - \sum_{i = 1}^{n} {(C_{i}^{X})}^{2} \frac{η^{★}}{n + η^{★}}} \leq \frac{{∥X∥}_{2} {∥Y∥}_{2}}{{∥X∥}_{2}^{2} (1 - \frac{η^{★}}{n + η^{★}})} \leq \frac{{∥Y∥}_{2}}{(1 - \frac{η^{★}}{1 + η^{★}})} = b^{★} .

Thus, the

β

that minimizes

J_{2}

for a given

γ \in D_{3}

is in

D_{2}

. Also, (A4) implies that

∣ C_{i}^{\tilde{α}} ∣ \leq ({∥Y∥}_{2} + {∥X∥}_{2} {∥β∥}_{2})

; consequently, the optimal

α

for the given

\tilde{γ} \in D_{3}

and

β \in D_{2}

that minimizes

J_{2}

satisfies

{∥α∥}_{2} \leq \sqrt{n} ({∥Y∥}_{2} + b^{★})

. As a result,

α \in D_{2}

. This suggests that for any

(\tilde{α}, \tilde{β}, \tilde{γ}) \notin D_{1} \times D_{2} \times D_{3}

we can find an

(α, β, γ) \in D_{1} \times D_{2} \times D_{3}

such that

J_{2} (\tilde{α}, \tilde{β}, \tilde{γ}) \geq J_{2} (α, β, γ)

.

Appendix A.4. Proof of Theorem 2

By Lemma 8.4 on page 129 in [32], Assumptions 1, 2, and 3 imply:

P (sup_{b \in B} \frac{\frac{1}{\sqrt{n}} | \sum_{i = 1}^{n} ϵ_{i} b (z_{i}) |}{{∥b∥}_{P_{n}}^{1 - ψ}} \geq T) \leq c exp (- \frac{T^{2}}{c^{2}}), T \geq c

(A6)

where the constant c is dependent on

C_{1}, C_{2}, C_{3}, C_{4},

and

ψ

. It follows that

sup_{b \in B} \frac{\frac{1}{\sqrt{n}} | \sum_{i = 1}^{n} ϵ_{i} b (z_{i}) |}{{∥b∥}_{P_{n}}^{1 - ψ}} = O_{p} (1) .

(A7)

Therefore, for any

h \in H_{K}

and a scaling map function

Γ \in A

, we obtain

\frac{\sqrt{n} {(ϵ, h \circ Γ - h_{0} \circ Γ_{0})}_{n} {({∥h∥}_{H_{K}}^{2} + {∥h_{0}∥}_{H_{K}}^{2} + {∥Γ∥}_{S G L}^{2} + {∥Γ_{0}∥}_{S G L}^{2})}^{- ψ}}{{∥h \circ Γ - h_{0} \circ Γ_{0}∥}_{P_{n}}^{1 - ψ}} = O_{p} (1) .

(A8)

For our estimators,

\hat{h}

and

\hat{Γ}

, it is easy to see that

\begin{matrix} {(ϵ, \hat{h} \circ \hat{Γ} - h_{0} \circ Γ_{0})}_{n} = \\ O_{p} (n^{- \frac{1}{2}}) {∥\hat{h} \circ \hat{Γ} - h_{0} \circ Γ_{0}∥}_{n}^{1 - ψ} & {({∥\hat{h}∥}_{H_{K}}^{2} + {∥h_{0}∥}_{H_{K}}^{2} + {∥\hat{Γ}∥}_{S G L}^{2} + {∥Γ_{0}∥}_{S G L}^{2})}^{ψ} . \end{matrix}

(A9)

From (A9), we obtain the following inequality:

\begin{matrix} {∥\hat{h} \circ \hat{Γ} - h_{0} \circ Γ_{0}∥}_{n}^{2} + λ_{1} {∥\hat{h}∥}_{H_{K}}^{2} + λ_{2} {∥\hat{Γ}∥}_{S G L}^{2} \leq \\ O_{p} (n^{- \frac{1}{2}}) {∥\hat{h} \circ \hat{Γ} - h_{0} \circ Γ_{0}∥}_{n}^{1 - ψ} {({∥\hat{h}∥}_{H_{K}}^{2} + {∥h_{0}∥}_{H_{K}}^{2} + {∥\hat{Γ}∥}_{S G L}^{2} + {∥Γ_{0}∥}_{S G L}^{2})}^{ψ} \\ + λ_{1} {∥h_{0}∥}_{H_{K}}^{2} + λ_{2} {∥Γ_{0}∥}_{S G L}^{2} . \end{matrix}

(A10)

We require

λ_{1} = O_{p} (1) λ_{2}

, namely

λ_{2}

and

λ_{1}

go to zero at the same rate. We will show at the end of the proof what happens if they are not of the same order. Therefore, without loss of generality, we set

λ_{1} = λ_{2}

, denoted by

λ

. In what follows, we divide (A10) into two cases.

Case 1: Suppose that

\begin{matrix} O_{p} (n^{- \frac{1}{2}}) {∥\hat{h} \circ \hat{Γ} - h_{0} \circ Γ_{0}∥}_{n}^{1 - ψ} {({∥\hat{h}∥}_{H_{K}}^{2} + {∥h_{0}∥}_{H_{K}}^{2} + {∥\hat{Γ}∥}_{S G L}^{2} + {∥Γ_{0}∥}_{S G L}^{2})}^{ψ} \\ \geq λ ({∥h_{0}∥}_{H_{K}}^{2} + {∥Γ_{0}∥}_{S G L}^{2}) . \end{matrix}

In this case, we have

\begin{matrix} {∥\hat{h} \circ \hat{Γ} - h_{0} \circ Γ_{0}∥}_{n}^{2} + λ ({∥\hat{h}∥}_{H_{K}}^{2} + {∥\hat{Γ}∥}_{S G L}^{2}) \leq \\ O_{p} (n^{- \frac{1}{2}}) {∥\hat{h} \circ \hat{Γ} - h_{0} \circ Γ_{0}∥}_{n}^{1 - ψ} {({∥\hat{h}∥}_{H_{K}}^{2} + {∥h_{0}∥}_{H_{K}}^{2} + {∥\hat{Γ}∥}_{S G L}^{2} + {∥Γ_{0}∥}_{S G L}^{2})}^{ψ} . \end{matrix}

(A11)

Above (A11) is further discussed separately in two sub-cases.

Case 1a: If

{∥h_{0}∥}_{H_{K}}^{2} + {∥Γ_{0}∥}_{S G L}^{2} \leq {∥\hat{h}∥}_{H_{K}}^{2} + {∥\hat{Γ}∥}_{S G L}^{2}

, then we have

\begin{matrix} {∥\hat{h} \circ \hat{Γ} - h_{0} \circ Γ_{0}∥}_{n}^{2} + λ ({∥\hat{h}∥}_{H_{K}}^{2} + {∥\hat{Γ}∥}_{S G L}^{2}) \leq \\ O_{p} (n^{- \frac{1}{2}}) {∥\hat{h} \circ \hat{Γ} - h_{0} \circ Γ_{0}∥}_{n}^{1 - ψ} {({∥\hat{h}∥}_{H_{K}}^{2} + {∥\hat{Γ}∥}_{S G L}^{2})}^{ψ} . \end{matrix}

(A12)

Therefore,

\begin{matrix} {({∥\hat{h}∥}_{H_{K}}^{2} + {∥\hat{Γ}∥}_{S G L}^{2})}^{ψ} \leq O_{p} (n^{- \frac{ψ}{2 (1 - ψ)}}) {∥\hat{h} \circ \hat{Γ} - h_{0} \circ Γ_{0}∥}_{n}^{ψ} λ^{- \frac{ψ}{1 - ψ}} . \end{matrix}

(A13)

It follows that

\begin{matrix} {∥\hat{h} \circ \hat{Γ} - h_{0} \circ Γ_{0}∥}_{n} = O_{p} (n^{- \frac{1}{2 (1 - ψ)}}) O_{p} (λ^{- \frac{ψ}{1 - ψ}}), \\ {∥\hat{h}∥}_{H_{K}}^{2} + {∥\hat{Γ}∥}_{S G L}^{2} = O_{p} (n^{- \frac{1}{1 - ψ}}) O_{p} (λ^{- \frac{1 + ψ}{1 - ψ}}) . \end{matrix}

(A14)

Case 1b: If

{∥h_{0}∥}_{H_{K}}^{2} + {∥Γ_{0}∥}_{S G L}^{2} \geq {∥\hat{h}∥}_{H_{K}}^{2} + {∥\hat{Γ}∥}_{S G L}^{2}

, then:

{∥\hat{h}∥}_{H_{K}}^{2} + {∥\hat{Γ}∥}_{S G L}^{2} = O_{p} ({∥h_{0}∥}_{H_{K}}^{2} + {∥Γ_{0}∥}_{S G L}^{2}) O_{p} (1) .

Therefore,

{∥\hat{h} \circ \hat{Γ} - h_{0} \circ Γ_{0}∥}_{n} = O_{p} (n^{- \frac{1}{2 (1 + ψ)}}) {({∥h_{0}∥}_{H_{K}}^{2} + {∥Γ∥}_{S G L}^{2}])}^{\frac{ψ}{1 + ψ}} .

Consequently, we obtain

\begin{matrix} {∥\hat{h} \circ \hat{Γ} - h_{0} \circ Γ_{0}∥}_{n} = O_{p} (n^{- \frac{1}{2 (1 - ψ)}}) O_{p} (λ^{- \frac{ψ}{1 - ψ}}), \\ {∥\hat{h}∥}_{H_{K}}^{2} + {∥\hat{Γ}∥}_{S G L}^{2} = O_{p} (n^{- \frac{1}{1 - ψ}}) O_{p} (λ^{- \frac{1 + ψ}{1 - ψ}}) . \end{matrix}

(A15)

Both terms in (A15) are the same rates as those in (A14).

Case 2: Suppose that

\begin{matrix} O_{p} (n^{- \frac{1}{2}}) {∥\hat{h} \circ \hat{Γ} - h_{0} \circ Γ_{0}∥}_{n}^{1 - ψ} {({∥\hat{h}∥}_{H_{K}}^{2} + {∥h_{0}∥}_{H_{K}}^{2} + {∥\hat{Γ}∥}_{S G L}^{2} + {∥Γ_{0}∥}_{S G L}^{2})}^{ψ} \\ \leq λ ({∥h_{0}∥}_{H_{K}}^{2} + {∥Γ_{0}∥}_{S G L}^{2}) . \end{matrix}

Then, we have

{∥\hat{h} \circ \hat{Γ} - h_{0} \circ Γ_{0}∥}_{n}^{2} + λ ({∥\hat{h}∥}_{H_{K}}^{2} + {∥\hat{Γ}∥}_{S G L}^{2}) \leq 2 λ ({∥h_{0}∥}_{H_{K}}^{2} + {∥Γ_{0}∥}_{S G L}^{2}) .

This implies that

\begin{matrix} {∥\hat{h} \circ \hat{Γ} - h_{0} \circ Γ_{0}∥}_{n} = O_{p} (λ^{\frac{1}{2}}) {({∥h_{0}∥}_{H_{K}}^{2} + {∥Γ_{0}∥}_{S G L}^{2})}^{\frac{1}{2}}, \\ {∥\hat{h}∥}_{H_{K}}^{2} + {∥\hat{Γ}∥}_{S G L}^{2} = O_{p} (1) ({∥h_{0}∥}_{H_{K}}^{2} + {∥Γ_{0}∥}_{S G L}^{2}) . \end{matrix}

(A16)

In order to make (A14) and (A16) have the same rates we first equate the two term

O_{p} (λ^{\frac{1}{2}}) {({∥h∥}_{H_{K}}^{2} + {∥Γ∥}_{S G L}^{2})}^{\frac{1}{2}}

and

O_{p} (n^{- \frac{1}{2 (1 - ψ)}}) O_{p} (λ^{- \frac{ψ}{1 - ψ}})

, and then solve for a common

λ

. The solution is given as follows:

λ^{- 1} = n^{\frac{1}{1 + ψ}} {({∥h∥}_{H_{K}}^{2} + {∥Γ∥}_{S G L}^{2})}^{\frac{1 - ψ}{1 + ψ}} .

Under this

λ

value we obtain that (A14)–(A16) as of the form:

\begin{matrix} {∥\hat{h} \circ \hat{Γ} - h_{0} \circ Γ_{0}∥}_{n} = O_{p} (n^{- \frac{1}{2 (1 + ψ)}}) {({∥h_{0}∥}_{H_{K}}^{2} + {∥Γ_{0}∥}_{S G L}^{2})}^{\frac{ψ}{1 + ψ}}, \end{matrix}

(A17)

\begin{matrix} {∥\hat{h}∥}_{H_{K}}^{2} + {∥\hat{Γ}∥}_{S G L}^{2} = O_{p} (1) ({∥h_{0}∥}_{H_{K}}^{2} + {∥Γ_{0}∥}_{S G L}^{2}) . \end{matrix}

(A18)

This completes the proof of Theorem 2.

Now we discuss the situation where the tuning parameters

λ_{1}

and

λ_{2}

are not of the same order. As seen blow, the selection consistency may not be guaranteed. Take Case 2 as an example. Suppose that

\begin{matrix} O_{p} (n^{- \frac{1}{2}}) {∥\hat{h} \circ \hat{Γ} - h_{0} \circ Γ_{0}∥}_{n}^{1 - ψ} {({∥\hat{h}∥}_{H_{K}}^{2} + {∥h_{0}∥}_{H_{K}}^{2} + {∥\hat{Γ}∥}_{S G L}^{2} + {∥Γ_{0}∥}_{S G L}^{2})}^{ψ} \\ \leq λ_{1} {∥h_{0}∥}_{H_{K}}^{2} + λ_{2} {∥Γ_{0}∥}_{S G L}^{2} . \end{matrix}

Let us consider two cases.

Case 2a: If

λ_{1} {∥h_{0}∥}_{H_{K}}^{2} \leq λ_{2} {∥Γ_{0}∥}_{S G L}^{2}

, following the same arguments above, we have

\begin{matrix} {∥\hat{h} \circ \hat{Γ} - h_{0} \circ Γ_{0}∥}_{n} = O_{p} (λ_{2}^{\frac{1}{2}}) {∥Γ_{0}∥}_{S G L}), \\ {∥\hat{h}∥}_{H_{K}}^{2} = O_{p} (\frac{λ_{2}}{λ_{1}}) {∥Γ_{0}∥}_{S G L}^{2}, \\ {∥\hat{Γ}∥}_{S G L}^{2} = O_{p} (1) {∥Γ_{0}∥}_{S G L}^{2} . \end{matrix}

(A19)

Case 2b: If

λ_{1} {∥h_{0}∥}_{H_{K}}^{2} \geq λ_{2} {∥Γ_{0}∥}_{S G L}^{2}

, then following the same logic as before:

\begin{matrix} {∥\hat{h} \circ \hat{Γ} - h_{0} \circ Γ_{0}∥}_{n} = O_{p} (λ_{1}^{\frac{1}{2}}) {∥h_{0}∥}_{H_{K}}), \\ {∥\hat{Γ}∥}_{S G L}^{2} = O_{p} (\frac{λ_{1}}{λ_{2}}) {∥h_{0}∥}_{H_{K}}^{2}, \\ {∥\hat{h}∥}_{H_{K}}^{2} = O_{p} (1) {∥h_{0}∥}_{H_{K}}^{2} . \end{matrix}

(A20)

Both terms involve

O_{p} (\frac{λ_{1}}{λ_{2}})

and

O_{p} (\frac{λ_{2}}{λ_{1}})

, indicating that these two tuning parameters

λ_{1}

and

λ_{2}

should go to zero at the same rates. Moreover, we can think of our estimator

\hat{h} \circ \hat{Γ}

as one operational object. See Appendix B for more details on this, which can further explain the need of one rate for the two penalties.

Appendix A.5. Proof of Corollary 1

For convenience, we present the following lemma proved by [32] (on page 20).

Lemma A1.

(Geer’s Lemma) A d dimensional ball of radius R,

B_{d} (R)

, in

R^{d}

with Euclidean metric can be covered by

{(\frac{4 R + δ}{δ})}^{d}

balls of radius δ.

We have shown in the proof of Theorem 1 that the optimal

γ

vector is restricted to be within a ball of a radius that depends on the norm of

Y

. For the sake of simplicity let us confine our

γ

to be within a norm ball of radius 1,

γ \in G = {γ : {∥γ∥}_{2}^{2} \leq 1}

. We then confine our set which we called

A

to be restricted to those

γ

, that is

A = {Γ : Γ (z) = γ \circ z, γ \in G}

. Since our

γ \in R^{s}

, we can use above Lemma A1 and cover our set

A

with

N_{1} = {(\frac{4 + δ}{δ})}^{s}

number of functions in the following sense. The ball of radius 1 in

R^{s}

can be covered (using the Euclidean metric) by

{γ_{1}, \dots γ_{N_{1}}}

. Since there is a one to one relationship between the functions

Γ

and

γ

, take the set

{Γ_{1}, \dots, Γ_{N_{1}}}

and define the metric between some

Γ_{j}

and

Γ_{k}

in the set

A

as

d (Γ_{j}, Γ_{k}) = {∥γ_{j} - γ_{k}∥}_{2}

. Then, the set of functions

{Γ_{1}, \dots, Γ_{N_{1}}}

is a

δ

-covering for

A

under this metric with entropy s

l o g (\frac{4 + δ}{δ})

. For each

Γ_{j}

we have an induced RKHS,

H_{K \circ Γ_{j}} = {h \circ Γ_{j} : h \in H_{K}}

with entropy no larger than that of

H_{K}

, which according to the assumption, has entropy

\leq A δ^{- 2 ψ}

for some

ψ \in (0, 1)

and

A \in R

. Therefore, the covering number

N_{2} = N (δ, H_{K \circ Γ_{j}}, P_{n}) \leq exp {A δ^{- 2 ψ}}

. This implies that for every

Γ_{j}

there exists a set

{h_{j_{1}} \circ Γ_{j}, \dots, h_{j_{N_{2}}} \circ Γ_{j}}

such that for every

h \circ Γ_{j} \in H_{K \circ Γ_{j}}

there exists an integer

i \in {1, \dots, N_{2}}

we have

{∥h \circ Γ_{j} - h_{j_{i}} \circ Γ_{j}∥}_{P_{n}} \leq δ

. Set

B

is essentially the union of the different Hilbert spaces of the form

H_{K \circ Γ}

. Under the setup, a natural estimate of the

d e l t a

-covering number of this set would be approximately of size

N_{1} \times N_{2}

where functions take the form of

{h_{1_{1}} \circ Γ_{1}, \dots, h_{1_{N_{2}}} \circ Γ_{1}, \dots, h_{{N_{1}}_{1}} \circ Γ_{N_{1}}, \dots, h_{{N_{1}}_{N_{2}}} \circ Γ_{N_{1}}}

. In addition, we add

N_{2}

functions from the set

{h_{1} \circ Γ_{0}, \dots, h_{N_{2}} \circ Γ_{0}}

where

Γ_{0}

is the true

Γ_{0}

(or one of the true

Γ_{0}

). Since

H_{K \circ Γ_{j}}

is a Hilbert space for every j, if

h \circ Γ_{j} \in H_{K \circ Γ_{j}}

so is

\frac{h \circ Γ_{j}}{{∥h∥}_{H_{K}}^{2} + {∥h_{0}∥}_{H_{K}}^{2} + {∥Γ_{j}∥}_{S G L}^{2} + {∥Γ_{0}∥}_{S G L}^{2}}

. We can simply ignore the denominator and substitute

\frac{h \circ Γ_{j}}{{∥h∥}_{H_{K}}^{2} + {∥h_{0}∥}_{H_{K}}^{2} + {∥Γ_{j}∥}_{S G L}^{2} + {∥Γ_{0}∥}_{S G L}^{2}}

with

\tilde{h} \circ Γ_{j} \in H_{K \circ Γ_{j}}

where

\tilde{h} = \frac{h}{{∥h∥}_{H_{K}}^{2} + {∥h_{0}∥}_{H_{K}}^{2} + {∥Γ_{j}∥}_{S G L}^{2} + {∥Γ_{0}∥}_{S G L}^{2}} .

We now prove Corollary 1.

Proof.

Set

M = {sup}_{h} < \nabla h (z), \nabla h (z) >

where the inner product is the standard Euclidean inner product. This is for a fixed

z

, or under the assumption that the gradient is uniformly bounded, we can take the

{sup}_{h \in H_{K}, z \in R^{s}} < \nabla h (z), \nabla h (z) >

. Let

N_{1} = {\frac{4 + (\frac{δ}{3 M^{\frac{1}{2}}})}{(\frac{δ}{3 M^{\frac{1}{2}}})}}^{s}

which is the number of balls needed to provide a

(\frac{δ}{3 M^{\frac{1}{2}}})

covering for a norm 1 ball in

R^{s}

. Let

N_{2} = exp \{(A {(\frac{δ}{3})}^{- 2 ψ})\}

which is the covering number needed to provide a

\frac{δ}{3}

cover of our space

H_{K}

. Let:

\begin{matrix} \tilde{\hat{h}} \circ \hat{Γ} - {\tilde{h}}_{0} \circ Γ_{0} = \\ \frac{\hat{h} \circ \hat{Γ}}{{∥\hat{h}∥}_{H_{K}}^{2} + {∥h_{0}∥}_{H_{K}}^{2} + {∥\hat{Γ}∥}_{S G L}^{2} + {∥Γ_{0}∥}_{S G L}^{2}} - \frac{h_{0} \circ Γ_{0}}{{∥\hat{h}∥}_{H_{K}}^{2} + {∥h_{0}∥}_{H_{K}}^{2} + {∥\hat{Γ}∥}_{S G L}^{2} + {∥Γ_{0}∥}_{S G L}^{2}} \end{matrix}

be an arbitrary function in the set

B

. There exists a

Γ_{j}

where

j \in {1, \dots, N_{1}}

such that

d (Γ_{j}, \hat{Γ}) \leq \frac{δ}{3 max_{i = 1, \dots, n} {∥z_{i}∥}_{2} \sqrt{M}}

, and there exists an i where

i \in {1, \dots, N_{2}}

such that

{∥\tilde{\hat{h}} \circ Γ_{j} - h_{j_{i}} \circ Γ_{j}∥}_{P_{n}} \leq \frac{δ}{3}

.

Similarly, there exists a

t \in {1, \dots, N_{2}}

such that

{∥{\tilde{h}}_{0} \circ Γ_{0} - h_{t} \circ Γ_{0}∥}_{P_{n}} \leq \frac{δ}{3}

. We construct our approximating function of

\tilde{\hat{h}} \circ \hat{Γ} - {\tilde{h}}_{0} \circ Γ_{0}

as

h_{j_{i}} \circ Γ_{j} - h_{t} \circ Γ_{0}

. We now show that this function is within

δ

of our arbitrary function

\tilde{\hat{h}} \circ \hat{Γ} - {\tilde{h}}_{0} \circ Γ_{0}

. Applying the mean value theorem for multivariate functions,

\tilde{\hat{h}} \circ \hat{Γ} (z) = \tilde{\hat{h}} \circ Γ_{j} (z) + \nabla \tilde{\hat{h}} (C (z)) (\hat{Γ (z)} - Γ_{j} (z))

, we have:

\begin{matrix} ‖ (\tilde{\hat{h}} \circ & \hat{Γ} - {\tilde{h}}_{0} \circ Γ_{0}) - (h_{j_{i}} \circ Γ_{j} - h_{t} \circ Γ_{0}) ‖_{P_{n}} \\ \leq {∥\tilde{\hat{h}} \circ \hat{Γ} - h_{j_{i}} \circ Γ_{j}∥}_{P_{n}} + {∥{\tilde{h}}_{0} \circ Γ_{0} - h_{t} \circ Γ_{0}∥}_{P_{n}} \\ \leq {∥\tilde{\hat{h}} \circ \hat{Γ} - h_{j_{i}} \circ Γ_{j}∥}_{P_{n}} + \frac{δ}{3} \\ = {∥\tilde{\hat{h}} \circ Γ_{j} - h_{j_{i}} \circ Γ_{j} + \nabla \tilde{\hat{h}} (C (\cdot)) (\hat{Γ} - Γ_{j})∥}_{P_{n}} + \frac{δ}{3} \end{matrix}

where vector

z \in R^{s}

lies in the segment from

γ_{j} \circ z

and

\hat{γ} \circ z

, and

C (\cdot)

is an unknown function that maps from

R^{s}

into

R^{s}

that allows for the formula to hold. Continuing our chain of inequalities, we obtain:

\begin{matrix} {∥\tilde{\hat{h}} \circ Γ_{j} - h_{j_{i}} \circ Γ_{j} + \nabla \tilde{\hat{h}} (C (\cdot)) (\hat{Γ} - Γ_{j})∥}_{P_{n}} + \frac{δ}{3} \leq \\ {∥\nabla \tilde{\hat{h}} (C (\cdot)) (\hat{Γ} - Γ_{j})∥}_{P_{n}} + \frac{δ}{3} + \frac{δ}{3} = \\ \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(\nabla \tilde{\hat{h}} (C (z_{i})) (\hat{Γ} (z_{i}) - Γ_{j} (z_{i})))}^{2}} + \frac{δ}{3} + \frac{δ}{3} \leq \\ \sqrt{\frac{1}{n} \sum_{i = 1}^{n} M {∥\hat{γ} \circ z_{i} - γ_{j} \circ z_{i}∥}_{2}^{2}} + \frac{δ}{3} + \frac{δ}{3} \leq \\ \sqrt{M {(\frac{δ}{3 max_{i = 1, \dots, n} {∥z_{i}∥}_{2} \sqrt{M}})}^{2} \max_{i = 1, \dots, n} {∥z_{i}∥}_{2}^{2}} + \frac{δ}{3} + \frac{δ}{3} = \\ \frac{δ}{3} + \frac{δ}{3} + \frac{δ}{3} = δ . \end{matrix}

Therefore, to provide a

δ

cover we need

N_{1} \times N_{2} + N_{2}

number of functions or:

\begin{matrix} exp {(A {(\frac{δ}{3})}^{- 2 ψ})} {(\frac{4 + (\frac{δ}{3 M^{\frac{1}{2}}})}{(\frac{δ}{3 M^{\frac{1}{2}}})})}^{s} + exp \{(A {(\frac{δ}{3})}^{- 2 ψ})\} = \\ exp {\tilde{A} δ^{- 2 ψ}} {(\frac{C + δ}{δ})}^{s} + exp {\tilde{A} δ^{- 2 ψ}}, \end{matrix}

where

\tilde{A} = \frac{A}{3^{- 2 ψ}}

and

C = 12 M^{\frac{1}{2}}

. Taking the log we see the entropy is

\leq \tilde{A} δ^{- 2 ψ} + log ({(\frac{C + δ}{δ})}^{s} + 1)

which is of the same order as

\leq \tilde{A} δ^{- 2 ψ}

(the

l o g

term is dominated by the first term). Therefore a sufficient (but not necessary) condition for our set

B

to have the same entropy as that of the original RKHS

H_{K}

is for the

{sup}_{h} < \nabla h (z), \nabla h (z) >

to be bounded. Having bounded derivatives is reasonable for any RKHS since every RKHS satisfies the Lipschitz condition of the form:

\begin{matrix} | h (X) - h (Y) | = | < h, K_{X} > - < h, K_{Y} > | \leq {∥h∥}_{H_{K}} < K_{X}, K_{Y} >^{\frac{1}{2}} = {∥h∥}_{H_{K}} d (X, Y), \end{matrix}

where the distance metric in

R^{s}

is defined as

d {(X, Y)}^{2} = K (X, X) - 2 K (X, Y) + K (Y, Y)

. If we restrict our functions in the RKHS of norm

\leq C

for some constant C then we have a universal Lipschitz constant C to ensure bounded derivatives. □

Appendix B. Discussion about the FKMR Estimator

We introduce

γ

as a way of performing variable selection on our vector of FPC features. We want to illustrate this technical trick with some concrete examples and discuss identifiability issues with the resulting estimator. There are two ways of looking at the estimation of the unknown functions

h_{0}

and

Γ_{0}

. The first way is to view our feature vector,

z

, as being related to the dependent variable y through the composite function

h \circ Γ

, as explained in Section 4. The second and equivalent way is to view our features as unknown. The true features take the form of

γ \circ z

, where in this case the ∘ denotes the Hadamard product. We are given

z

and need to estimate the “true" features

γ \circ z

. In addition, we need to estimate the relationship between

γ \circ z

and y, which is done through the function

h \in H_{K}

.

The first way is to estimate the function

h_{0} \circ Γ_{0}

. The function belongs to the RKHS

H_{K \circ Γ}

. We essentially consider many different function spaces to construct our estimator. The intersection between the function spaces is not necessarily empty, implying that our estimator may not be unique. We proceed this discussion more formally. Let

K : R^{s} \times R^{s} \mapsto R

be a positive definite function. Let

Γ : R^{s} \mapsto R^{s}

. We define

K \circ Γ : R^{s} \times R^{s} \mapsto R

as the function given by

K \circ Γ (s, t) = K (Γ (s), Γ (t))

. This new function,

K \circ Γ

is positive definite. There is a relationship between the original RKHS,

H_{K}

and the new RKHS,

H_{K \circ Γ}

. This results in

H_{K \circ Γ} = {h \circ Γ : h \in H_{K}}

. For any vector

u \in H_{K \circ Γ}

, we have that

{∥u∥}_{H_{K \circ Γ}} = i n f {{∥h∥}_{H_{K}} : u = h \circ Γ}

. In general,

H_{K \circ Γ} \neg \subset H_{K}

. In (5), we take the norm with respect to the original space

H_{K}

. Our iterative procedure essentially presents the second way in which the true features are unknown, whereas our theoretical arguments are justified through the first way. Given the knowledge of the features (which translates to fixing a

γ

), we are confined to just one RKHS,

H_{K}

. Take the linear kernel,

K (x_{1}, x_{2}) = x_{1}^{⊤} x_{2}

as an example. Suppose the truth is that y is related to a one-dimensional feature

z_{0}

through the following formulation:

y = h_{0} (z_{0}) + ε

where

h_{0} \in H_{K_{1}}

, where

K_{1}

is the kernel that maps from

R \times R \mapsto R

. Therefore, if we knew the feature

z_{1}

, we would proceed to optimize (6) using the standard LSKM. However, when each y is associated with a two-dimensional vector

z = (z_{1}, z_{2})

, where

z_{2}

is a “noisy” feature and unrelated to y. Suppose that a priori we do not know this information. Typically we use a model

y = h (z_{1}, z_{2}) + ε

where

h \in H_{K}

, where

K

is the kernel that maps from

R^{2} \times R^{2} \mapsto R

. In this case, we introduce our

γ

vector

(γ_{1}, γ_{2})

and formulate

y = h (γ_{1} z_{1}, γ_{2} z_{2}) + ϵ

. All functions, h in the space

H_{K}

, are of the form

h (z) = x^{⊤} z

for some two-dimensional vector

x = (x_{1}, x_{2})

. There is a one-to-one relationship between h and

x

. The true function,

h_{0}

, has an associated real number c where

h_{1} (z_{1}) = c z_{1}

. We can recover

h_{1} \in H_{K_{1}}

from our estimation of h and

γ

if we set

γ = (1, 0)

and

x = (c, ★)

, where "★" is any real number. Equivalently, we can recover

h_{1}

under

γ = (1, 1)

where

x = (c, 0)

. There are many functions that may recover the original function in the RKHS corresponding to the linear space kernel. Formulating our problem in the first way, through function composition, we can estimate

Γ_{0}

with the

γ

being

(1, 0)

or

(1, 1)

.

We can now see that in the intersection between

H_{K \circ Γ_{1}}

and

H_{K \circ Γ_{2}}

, where

Γ_{1}

has associated

γ_{1} = (1, 0)

and

Γ_{2}

has associated

γ_{2} = (1, 1)

, lies our estimate of

h_{1}

. In truth, for the linear space RKHS, there is no need to apply our method since

h_{0} \in H_{K_{1}}

can be estimated directly from the larger space

H_{K}

where we set

h (z) = x^{⊤} z

where

x = (c, 0)

. We can never hope to have variable selection consistency nor can we hope to have identifiability of our estimator for these types of spaces. However, from a goodness-of-fit standpoint, we are able to do just as good a job with many types of function compositions. Our hope is that we can glean some variable selection by penalizing the

γ

vector with the

ρ (γ; δ)

term which, going back to the above scenario, should give preference to

γ = (1, 0)

over

γ = (1, 1)

. For the RKHS associated with the Gaussian Kernel, the “larger dimensional space”, a Gaussian Kernel mapping from higher dimensions, does not necessarily contain the functions from a “lower dimensional space”, a Gaussian Kernel mapping from lower dimensions. However through the introduction of the

γ

transformation of the features, we can recover the equivalent functions of the "lower dimensional space”.

References

Chandler, J.L.; Brazendale, K.; Beets, M.W.; Mealing, B.A. Classification of Physical Activity Intensities Using a Wrist-worn Accelerometer in 8–12-Year-old Children. Pediatric Obes. 2016, 11, 120–127. [Google Scholar] [CrossRef] [PubMed]
Chen, K.Y.; Bassett, D.R. The Technology of Accelerometry-based Activity Monitors: Current and Future. Med. Sci. Sport. Exerc. 2005, 37, S490–S500. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Bai, J.; Di, C.; Xiao, L.; Evenson, K.R.; LaCroix, A.Z.; Crainiceanu, C.M.; Buchner, D.M. An Activity Index for Raw Accelerometry Data and Its Comparison with Other Activity Metrics. PLoS ONE 2016, 11, e0160644. [Google Scholar] [CrossRef] [PubMed] [Green Version]
John, D.; Freedson, P. ActiGraph and Actical Physical Activity Monitors: A Peek under the Hood. Med. Sci. Sport. Exerc. 2012, 44, S86–S89. [Google Scholar] [CrossRef] [Green Version]
Kim, Y.; Lee, J.M.; Peters, B.P.; Gaesser, G.A.; Welk, G.J. Examination of Different Accelerometer Cut-points for Assessing Sedentary Behaviors in Children. PLoS ONE 2014, 9, e90630. [Google Scholar] [CrossRef]
Bai, J.; Sun, Y.; Schrack, J.A.; Crainiceanu, C.M.; Wang, M.C. A Two-stage Model for Wearable Device Data. Biometrics 2018, 74, 744–752. [Google Scholar] [CrossRef]
Sasaki, J.E.; Hickey, A.M.; Staudenmayer, J.W.; John, D.; Kent, J.A.; Freedson, P.S. Performance of Activity Classification Algorithms in Free-Living Older Adults. Med. Sci. Sport. Exerc. 2016, 48, 941–950. [Google Scholar] [CrossRef] [Green Version]
Di, C.Z.; Crainiceanu, C.M.; Caffo, B.S.; Punjabi, N.M. Multilevel Functional Principal Component Analysis. Ann. Appl. Stat. 2009, 3, 458–488. [Google Scholar] [CrossRef]
Goldsmith, J.; Liu, X.; Rundle, A.; Jacobson, J. New Insights into Activity Patterns in Children, Found Using Functional Data Analyses. Med. Sci. Sport. Exerc. 2016, 48, 1723–1729. [Google Scholar] [CrossRef] [Green Version]
Li, H.; Keadle, S.K.; Staudenmayer, J.; Assaad, H.; Huang, J.Z.; Carroll, R.J. Methods to Assess An Exercise Intervention Trial Based on 3-Level Functional Data. Biostatistics 2015, 16, 754–771. [Google Scholar] [CrossRef] [Green Version]
Zhang, Y.; Li, H.; Keadle, S.K.; Matthews, C.E.; Carroll, R.J. A Review of Statistical Analyses on Physical Activity Data Collected from Accelerometers. Stat. Biosci. 2019, 11, 465–476. [Google Scholar] [CrossRef] [PubMed]
Ramsay, J.O.; Silverman, B.W. Functional Data Analysis; Springer Series in Statistics; Springer: Berlin/Heidelberg, Germany, 2005. [Google Scholar]
Cardot, H.; Ferraty, F.; Sarda, P. Spline Estimators for the Functional Linear model. Stat. Sin. 2003, 13, 571–591. [Google Scholar]
Cardot, H.; Ferraty, F.; Sarda, P. Functional Linear Model. Stat. Probab. Lett. 1999, 45, 11–22. [Google Scholar] [CrossRef]
Zhu, H.; Yao, F.; Zhang, H.H. Structured Functional Additive Regression in Reproducing Kernel Hilbert Spaces. J. R. Stat. Soc. Ser. (Stat. Methodol.) 2014, 76, 581–603. [Google Scholar] [CrossRef]
Ferraty, F.; Mas, A.; Vieu, P. Nonparametric Regression on Functional Data: Inference and Practical Aspects. Aust. N. Z. J. Stat. 2007, 49, 267–286. [Google Scholar] [CrossRef] [Green Version]
McLean, M.W.; Hooker, G.; Staicu, A.M.; Scheipl, F.; Ruppert, D. Functional Generalized Additive Models. J. Comput. Graph. Stat. 2014, 23, 249–269. [Google Scholar] [CrossRef] [Green Version]
Bosq, D. Linear Processes in Function Spaces; Lecture Notes in Statistics; Springer: New York, NY, USA, 2000; Volume 149. [Google Scholar]
Hall, P.; Müller, H.G.; Wang, J.L. Properties of Principal Component Methods for Functional and Longitudinal Data Analysis. Ann. Stat. 2006, 34, 1493–1517. [Google Scholar] [CrossRef] [Green Version]
Hall, P.; Hosseini-Nasab, M. On Properties of Functional Principal Components Analysis. J. R. Stat. Soc. Ser. (Stat. Methodol.) 2006, 68, 109–126. [Google Scholar] [CrossRef]
Müller, H.G.; Yao, F. Functional Additive Models. J. Am. Stat. Assoc. 2008, 103, 1534–1544. [Google Scholar] [CrossRef]
Lin, Y.; Zhang, H.H. Component Selection and Smoothing in Multivariate Nonparametric Regression. Ann. Stat. 2006, 34, 2272–2297. [Google Scholar] [CrossRef] [Green Version]
Liu, D.; Lin, X.; Ghosh, D. Semiparametric Regression of Multidimensional Genetic Pathway Data: Least-Squares Kernel Machines and Linear Mixed Models. Biometrics 2007, 63, 1079–1088. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Wood, S.N. Generalized Additive Models: An Introduction with R; Chapman and Hall: London, UK, 2006. [Google Scholar]
Lin, X.; Zhang, D. Inference in Generalized Additive Mixed Models by Using Smoothing Splines. J. R. Stat. Soc. Ser. (Stat. Methodol.) 1999, 61, 381–400. [Google Scholar] [CrossRef] [Green Version]
Yuan, M.; Lin, Y. Model Selection and Estimation in Regression with Grouped Variables. J. R. Stat. Soc. Ser. (Stat. Methodol.) 2006, 68, 49–67. [Google Scholar] [CrossRef]
Simon, N.; Friedman, J.; Hastie, T.; Tibshirani, R. A Sparse-Group Lasso. J. Comput. Graph. Stat. 2013, 22, 231–245. [Google Scholar] [CrossRef]
Breiman, L. Better Subset Regression Using the Nonnegative Garrote. Technometrics 1995, 37, 373–384. [Google Scholar] [CrossRef]
Salzo, S.; Villa, S. Convergence Analysis of a Proximal Gauss–Newton Method. Comput. Optim. Appl. 2012, 53, 557–589. [Google Scholar] [CrossRef] [Green Version]
Naiman, J. Multivariate Functional Kernel Machine Regression and Feature Selection with Applications to Accelerometer Mobile Health Devices. Ph.D. Dissertation, University of Michigan, Ann Arbor, MI, USA, 2020. [Google Scholar]
Peng, H.; Huang, T. Penalized Least Squares for Single Index Models. J. Stat. Plan. Inference 2011, 141, 1362–1379. [Google Scholar] [CrossRef]
Geer, S.A. Empirical Processes in M-Estimation; Cambridge Series in Statistical and Probabilistic Mathematics; Cambridge University Press: Cambridge, UK, 2000. [Google Scholar]
Hainmueller, J.; Hazlett, C. Kernel Regularized Least Squares: Reducing Misspecification Bias with a Flexible and Interpretable Machine Learning Approach. Political Anal. 2014, 22, 143–168. [Google Scholar] [CrossRef]
Yao, F.; Müller, H.G.; Wang, J.L. Functional Data Analysis for Sparse Longitudinal Data. J. Am. Stat. Assoc. 2005, 100, 577–590. [Google Scholar] [CrossRef]
Lewis, R.C.; Meeker, J.D.; Peterson, K.E.; Lee, J.M.; Pace, G.G.; Cantoral, A.; Téllez-Rojo, M.M. Predictors of Urinary Bisphenol A and Phthalate Metabolite Concentrations in Mexican Children. Chemosphere 2013, 93, 2390–2398. [Google Scholar] [CrossRef] [Green Version]
Schrack, J.A.; Zipunnikov, V.; Goldsmith, J.; Bai, J.; Simonsick, E.M.; Crainiceanu, C.; Ferrucci, L. Assessing the Physical Cliff: Detailed Quantification of Age-related Differences in Daily Patterns of Physical Activity. J. Gerontol. Ser. Biol. Sci. Med. Sci. 2014, 69, 973–979. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Jansen, E.C.; Dunietz, G.L.; Chervin, R.D.; Baylin, A.; Baek, J.; Banker, M.; Song, P.X.K.; Cantoral, A.; Tellez Rojo, M.M.; Peterson, K.E. Adiposity in Adolescents: The Interplay of Sleep Duration and Sleep Variability. J. Pediatr. 2018, 203, 309–316. [Google Scholar] [CrossRef] [PubMed] [Green Version]

Figure 1. Activity counts over 7 d from a tri-axis (X-, Y- and Z-axis) accelerometer of a subject.

Figure 2. Five marginal estimates of important feature functions with 95% shaded confidence bands evaluated at 100 grid points while holding all other components equal to

0.5

in Scenario 2.

Figure 2. Five marginal estimates of important feature functions with 95% shaded confidence bands evaluated at 100 grid points while holding all other components equal to

0.5

in Scenario 2.

Figure 3. The 24 h minute-by-minute medians of 7 d ACs for one subject.

Table 1. Goodness-of-fit and the concordance regression for Scenario 2.

Model	$R_{AQ}^{2}$	$β$	Reg of h on $\hat{h}$
Model	$R_{AQ}^{2}$	$β$	Intercept	Slope	$R^{2}$
$F K M R_{L a s s o}$	0.830	2.00	−0.062	1.01	0.848
$F K M R_{G L a s s o}$	0.937	1.99	−0.055	1.01	0.972
$F K M R_{S G L}$	0.928	2.00	−0.051	1.01	0.955
$F K M R_{M C P}$	0.835	2.01	−0.062	1.01	0.856
$F K M R_{G M C P}$	0.935	1.99	−0.056	1.01	0.970
$F K M R_{G M C P}^{o r a c l e}$	0.911	1.99	−0.049	1.01	0.937
COSSO	0.832	–	–	–	–
LM + Lasso	0.453	–	–	–	–
LM + GLasso	0.324	–	–	–	–
LM + SGL	0.450	–	–	–	–
LM + MCP	0.513	–	–	–	–
LM + GMCP	0.307	–	–	–	–

Table 2. Sensitivity and specificity of functional selection for Scenario 2.

Model	Selection Frequency
Model	$\hat{Z^{1}}$	$\hat{Z^{2}}$	$\hat{Z^{3}}$	$\hat{Z^{4}}$
$F K M R_{L a s s o}$	100	100	0	0
$F K M R_{G L a s s o}$	100	100	4	4
$F K M R_{S G L}$	100	100	0	0
$F K M R_{M C P}$	100	100	0	0
$F K M R_{G M C P}$	100	100	3	4
COSSO	100	100	5	6
LM + Lasso	100	100	19	21
LM + GLasso	94	99	7	8
LM + SGL	100	100	19	18
LM + MCP	100	100	20	19
LM + GMCP	93	99	7	8

Table 3. FPC feature selection for signal functional

Z^{1}

in Scenario 2.

Table 3. FPC feature selection for signal functional

Z^{1}

in Scenario 2.

Model	Selection Frequency
Model	$\hat{ζ_{1}^{1}}$	$\hat{ζ_{2}^{1}}$	$\hat{ζ_{3}^{1}}$	$\hat{ζ_{4}^{1}}$	$\hat{ζ_{5}^{1}}$	$\hat{ζ_{6}^{1}}$	$\hat{ζ_{7}^{1}}$	$\hat{ζ_{8}^{1}}$	$\hat{ζ_{9}^{1}}$
$F K M R_{L a s s o}$	100	1	97	0	0	0	0	0	0
$F K M R_{G L a s s o}$	100	100	100	100	100	100	100	100	100
$F K M R_{S G L}$	100	21	100	71	26	20	17	16	15
$F K M R_{M C P}$	100	1	99	1	0	0	0	0	0
$F K M R_{G M C P}$	100	100	100	100	100	100	100	100	100
COSSO	100	2	100	93	1	0	0	1	0
LM + Lasso	100	10	100	100	10	8	7	10	5
LM + GLasso	94	94	94	94	94	94	94	94	94
LM + SGL	100	12	100	100	10	8	8	11	5
LM + MCP	100	10	100	100	9	8	9	7	5
LM + GMCP	93	93	93	93	93	93	93	93	93

Table 4. FPC feature selection for signal functional

Z^{2}

in Scenario 2.

Table 4. FPC feature selection for signal functional

Z^{2}

in Scenario 2.

Model	Selection Frequency
Model	$\hat{ζ_{1}^{2}}$	$\hat{ζ_{2}^{2}}$	$\hat{ζ_{3}^{2}}$	$\hat{ζ_{4}^{2}}$	$\hat{ζ_{5}^{2}}$	$\hat{ζ_{6}^{2}}$	$\hat{ζ_{7}^{2}}$	$\hat{ζ_{8}^{2}}$	$\hat{ζ_{9}^{2}}$
$F K M R_{L a s s o}$	0	3	0	0	0	0	100	0	0
$F K M R_{G L a s s o}$	100	100	100	100	100	100	100	100	100
$F K M R_{S G L}$	16	100	14	7	16	23	100	15	7
$F K M R_{M C P}$	0	11	0	0	0	1	100	0	0
$F K M R_{G M C P}$	100	100	100	100	100	100	100	100	100
COSSO	8	97	5	5	5	15	100	3	3
LM + Lasso	17	100	14	7	16	23	100	15	6
LM + GLasso	99	99	99	99	99	99	99	99	99
LM + SGL	17	100	14	7	16	23	100	15	7
LM + MCP	17	100	13	6	16	23	100	15	8
LM + GMCP	99	99	99	99	99	99	99	99	99

Table 5. Goodness-of-fit for the five models used in the data analysis.

Model	Adjusted $R^{2}$
M0: LM	0.07
M1: LM + SGL	0.13
M2: LSKM	0.18
M3: $F K M R_{S G L}$	0.30
M4: COSSO	0.14

Table 6. Axis-specific FPC feature selection.

Model	X-Axis						Y-Axis					Z-Axis
Model	$\hat{ζ_{1}^{1}}$	$\hat{ζ_{2}^{1}}$	$\hat{ζ_{3}^{1}}$	$\hat{ζ_{4}^{1}}$	$\hat{ζ_{5}^{1}}$	$\hat{ζ_{6}^{1}}$	$\hat{ζ_{1}^{2}}$	$\hat{ζ_{2}^{2}}$	$\hat{ζ_{3}^{2}}$	$\hat{ζ_{4}^{2}}$	$\hat{ζ_{5}^{2}}$	$\hat{ζ_{1}^{3}}$	$\hat{ζ_{2}^{3}}$	$\hat{ζ_{3}^{3}}$	$\hat{ζ_{4}^{3}}$
$F K M R_{S G L}$		✓	✓	✓		✓	✓		✓		✓
COSSO				✓			✓		✓
LM + SGL	✓			✓	✓	✓	✓	✓	✓						✓

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Naiman, J.; Song, P.X. Multivariate Functional Kernel Machine Regression and Sparse Functional Feature Selection. Entropy 2022, 24, 203. https://doi.org/10.3390/e24020203

AMA Style

Naiman J, Song PX. Multivariate Functional Kernel Machine Regression and Sparse Functional Feature Selection. Entropy. 2022; 24(2):203. https://doi.org/10.3390/e24020203

Chicago/Turabian Style

Naiman, Joseph, and Peter Xuekun Song. 2022. "Multivariate Functional Kernel Machine Regression and Sparse Functional Feature Selection" Entropy 24, no. 2: 203. https://doi.org/10.3390/e24020203

APA Style

Naiman, J., & Song, P. X. (2022). Multivariate Functional Kernel Machine Regression and Sparse Functional Feature Selection. Entropy, 24(2), 203. https://doi.org/10.3390/e24020203

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Multivariate Functional Kernel Machine Regression and Sparse Functional Feature Selection

Abstract

1. Introduction

1.1. Functional Regression

1.2. Least-Squares Kernel Machine

1.3. Feature Selection

2. Model and Estimation

3. Implementation and Algorithm

4. Theoretical Guarantees

5. Simulation Experiments

5.1. Setup

5.2. Simulation in Scenario 1

5.3. Simulation in Scenario 2

6. Data Example

7. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Appendix A. Technical Assumptions and Proofs

Appendix A.1. Proof of Lemma 1

Appendix A.2. Proof of Lemma 2

Appendix A.3. Proof of Theorem 1

Appendix A.4. Proof of Theorem 2

Appendix A.5. Proof of Corollary 1

Appendix B. Discussion about the FKMR Estimator

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI