Precise Tensor Product Smoothing via Spectral Splines

Abstract: Tensor product smoothers are frequently used to include interaction effects in multiple nonparametric regression models. Current implementations of tensor product smoothers either require using approximate penalties, such as those typically used in generalized additive models, or costly parameterizations, such as those used in smoothing spline analysis of variance models. In this paper, I propose a computationally efficient and theoretically precise approach for tensor product smoothing. Specifically, I propose a spectral representation of a univariate smoothing spline basis, and I develop an efficient approach for building tensor product smooths from marginal spectral spline representations. The developed theory suggests that current tensor product smoothing methods could be improved by incorporating the proposed tensor product spectral smoothers. Simulation results demonstrate that the proposed approach can outperform popular tensor product smoothing implementations, which supports the theoretical results developed in the paper.


Introduction
Consider a multiple nonparametric regression model [1] of the form

Y = f(X) + ε,   (1)

where Y ∈ R is the observed response variable, X = (X_1, ..., X_p) ∈ X is the observed predictor vector, X = X^(1) × ⋯ × X^(p) is the product domain with X^(j) denoting the domain of the j-th predictor, f : X → R is the (unknown) real-valued function connecting the response and predictors, and ε is an error term that satisfies E(ε) = 0 and E(ε²) = σ² < ∞.
Note that this implies that E(Y|X) = f(X), i.e., the function f(•) is the conditional expectation of the response variable Y given the predictor vector X. Given a sample of training data, the goal is to estimate the unknown mean function f without having any a priori information about the parametric nature of the functional relationship (e.g., without assuming linearity). Let {(x_i, y_i)}_{i=1}^n denote a sample of n independent observations from the model in Equation (1), where y_i ∈ R is the i-th observation's realization of the response variable, and x_i = (x_{i1}, ..., x_{ip}) ∈ X is the i-th observation's realization of the predictor vector. To estimate f, it is typical to minimize a penalized least squares functional of the form

(1/n) ∑_{i=1}^n (y_i − f(x_i))² + λ P(f),   (2)

where P(f) ≥ 0 denotes some non-negative penalty that describes the complexity of f, i.e., if P(f) > P(g), then the function f is more complex (less smooth) than the function g, and the tuning parameter λ ≥ 0 controls the influence of the penalty. To find a reasonable balance between fitting (the data) and smoothing (the function), λ is often chosen via cross-validation, information theory, or maximum likelihood estimation [2].
To find the f̂_λ that minimizes Equation (2), it is first necessary to specify the assumed model form. For example, with p = 2 predictors, we could consider one of two forms:

f(X) = f_0 + f_1(X_1) + f_2(X_2)   (additive model)
f(X) = f_0 + f_1(X_1) + f_2(X_2) + f_12(X_1, X_2)   (interaction model)

where X = (X_1, X_2) ∈ X = X^(1) × X^(2) is the bidimensional predictor, f_0 ∈ R is an intercept, f_1 : X^(1) → R is the main effect of the first predictor, f_2 : X^(2) → R is the main effect of the second predictor, and f_12 : X → R is the two-way interaction effect. Note that these models are nested given that the additive model is equivalent to the interaction model if f_12 = 0.
For additive models, f_j is typically represented by a spline basis of rank r_j, such as f_j(X_j) = Z_j'β_j. Note that Z_j = (Z_1^(j)(X_j), ..., Z_{r_j}^(j)(X_j))' ∈ R^{r_j} denotes the known spline basis vector that depends on the chosen knots (later described), and β_j ∈ R^{r_j} is the unknown coefficient vector. To define the complexity of each (additive) effect, it is typical to consider penalties of the form P_j(f_j) = β_j'Q_j β_j, where Q_j is a positive semi-definite matrix. Using these representations of the function evaluation and penalty, Equation (2) can be written as

(1/n) ∑_{i=1}^n (y_i − β_0 − ∑_{j=1}^p z_{ij}'β_j)² + ∑_{j=1}^p λ_j β_j'Q_j β_j,

where z_{ij} = (Z_1^(j)(x_{ij}), ..., Z_{r_j}^(j)(x_{ij}))' is the i-th observation's realization of the Z_j vector, and λ_j ≥ 0 are tuning parameters that control the influence of each penalty.
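As a concrete illustration of the penalized least squares problem above, the following sketch fits a single smooth term using a hypothetical truncated linear spline basis and a diagonal penalty matrix Q that penalizes only the hinge coefficients. The basis, knot placement, and tuning parameter value are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: penalized least squares for one smooth term with an
# assumed truncated linear basis. Q penalizes only the hinge coefficients,
# so the intercept and linear trend form the (unpenalized) null space.
import numpy as np

rng = np.random.default_rng(0)
n, r, lam = 200, 10, 1e-3
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

# Truncated linear basis: intercept, linear term, and hinges at knots.
knots = np.quantile(x, np.linspace(0.1, 0.9, r))
Z = np.column_stack([np.ones(n), x] + [np.maximum(x - t, 0) for t in knots])

# Positive semi-definite penalty: zero for intercept/linear, one per hinge.
Q = np.diag([0.0, 0.0] + [1.0] * r)

# Closed-form minimizer of (1/n)||y - Z b||^2 + lam * b' Q b.
beta = np.linalg.solve(Z.T @ Z / n + lam * Q, Z.T @ y / n)
fhat = Z @ beta
```

Increasing `lam` shrinks the hinge coefficients toward zero, pulling the fit toward the unpenalized (linear) null space.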
When the model contains interaction effects, different approaches can be used to represent and penalize the interaction terms. In generalized additive models (GAMs), it is typical to (i) represent interaction effects by taking an outer (Kronecker) product of marginal basis vectors, and (ii) penalize interaction effects using an equidistant (grid) approximation (see [7], pp. 227-237). In smoothing spline analysis of variance (SSANOVA) models, it is typical to represent and penalize interaction effects using a tensor product reproducing kernel (RK) function (see [5], pp. 40-48). For a thorough comparison of the two approaches, see Helwig [1]. In both frameworks, estimation of interaction effects can be costly when using a moderate to large number of knots, which is true even when using scalable parameterizations and algorithms [9,17,18]. This is because efficient computational tools for exact tensor product function representation and penalization are lacking from the literature, which hinders the widespread application of tensor product smoothing splines.
To address this practical issue, this paper (i) proposes a spectral representer theorem for univariate smoothing spline estimators, and (ii) develops efficient computational strategies for constructing tensor product smoothing splines from marginal spectral representations. The marginal spectral spline representation that I propose is similar to that proposed by [19]; however, the version that I consider penalizes all of the non-constant functions of each predictor. The tensor product basis construction approach that I propose generally follows the idea proposed by [20], where tensor products are built from outer (Kronecker) products of marginal bases. However, unlike this approach, I leverage reproducing kernel theory to develop exact analytical penalties for tensor product smooth terms. The proposed approach makes it possible to fit tensor product smoothing spline models (a) with interaction effects between any combination of predictors, and (b) using any linear mixed modeling software.
The remainder of this paper is organized as follows: Section 2 provides background on the reproducing kernel Hilbert space theory relevant to univariate smoothing splines; Section 3 provides background on the tensor product smoothing splines; Section 4 proposes an alternative tensor product smoothing spline (like) framework that penalizes all nonconstant functions of the predictor; Section 5 develops the spectral representer theories necessary for efficiently computing exact tensor product penalties; Section 6 conducts a simulation study to compare the proposed approach to an existing (comparable) method; Section 7 demonstrates the proposed approach using a real dataset; and Section 8 discusses potential extensions of the proposed approach.

Reproducing Kernel Hilbert Spaces
Consider a single predictor (i.e., p = 1) that satisfies x ∈ X, and let H denote a reproducing kernel Hilbert space (RKHS) of functions on X. The unknown function f from Equation (2) is assumed to be an element of H, which will be denoted by f ∈ H. Suppose that the space H can be decomposed into two orthogonal subspaces, such as H = H_0 ⊕ H_1, where ⊕ denotes the tensor sum. Note that H_0 = {f : P(f) = 0, f ∈ H} is the null space, which contains all functions (in H) that have zero penalty, and H_1 = {f : P(f) > 0, f ∈ H} is the contrast space, which contains all functions (in H) that have a non-zero penalty. In some cases, the null space can be further decomposed, such as H_0 = H_00 ⊕ H_01, where H_00 is a constant (intercept) subspace and H_01 is a space of non-constant functions (unpenalized). For example, when using a cubic smoothing spline, H_01 contains the linear effect of X, which is unpenalized.
The inner product of H will be denoted by ⟨f, g⟩ for any f, g ∈ H, and the corresponding norm will be written as ‖f‖ = √⟨f, f⟩ for any f ∈ H. Given the tensor sum decomposition of H, the inner product can be written as a summation of the corresponding subspaces' inner products, such as ⟨f, g⟩ = ⟨f, g⟩_0 + ⟨f, g⟩_1. Note that ⟨f, g⟩_0 is the null space inner product for any f, g ∈ H_0, and ⟨f, g⟩_1 is the contrast space inner product for any f, g ∈ H_1. The corresponding norms will be denoted by ‖f‖_0 = √⟨f, f⟩_0 and ‖f‖_1 = √⟨f, f⟩_1. When the null space consists of non-constant functions, i.e., when H_0 = H_00 ⊕ H_01, the null space inner product can be written as ⟨f, g⟩_0 = ⟨f, g⟩_00 + ⟨f, g⟩_01, where ⟨f, g⟩_00 is the inner product of H_00 for any f, g ∈ H_00, and ⟨f, g⟩_01 is the inner product of H_01 for any f, g ∈ H_01. The corresponding norm can be written as ‖f‖_0² = ‖f‖_00² + ‖f‖_01², where ‖f‖_00 = √⟨f, f⟩_00 and ‖f‖_01 = √⟨f, f⟩_01 denote the norms of H_00 and H_01, respectively.
The RK of H will be denoted by R(x, z) = R_z(x) = R_x(z) for any x, z ∈ X. Note that the RK is an element of the RKHS, i.e., R ∈ H for any x, z ∈ X. By definition, the RK is the representer of the evaluation functional in H, which implies that the RK satisfies f(x) = ⟨R_x(z), f(z)⟩ for any f ∈ H and any x, z ∈ X. This important property, which is referred to as the "reproducing property" of the (reproducing) kernel function, implies that any function in H can be evaluated through the inner product and RK function. Following the decompositions of the inner product, the RK function can be written as R(x, z) = R_0(x, z) + R_1(x, z), where R_0 ∈ H_0 and R_1 ∈ H_1 denote the RKs of H_0 and H_1, respectively. Furthermore, when H_0 = H_00 ⊕ H_01, the null space RK can be decomposed such as R_0(x, z) = R_00(x, z) + R_01(x, z), where R_00 ∈ H_00 and R_01 ∈ H_01 denote the RKs of H_00 and H_01, respectively. By definition, R_00(x, z) = β_0 for all x, z ∈ X, where β_0 ∈ R is some constant.
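The reproducing property can be checked numerically for a simple kernel. The sketch below uses the first-order Sobolev kernel R(x, z) = min(x, z) on [0, 1] as an assumed example, with the inner product ⟨f, g⟩ = ∫ f′(z) g′(z) dz on functions satisfying f(0) = 0, and verifies that ⟨R_x, R_a⟩ recovers the kernel evaluation R(x, a).

```python
# Numerical check of the reproducing property f(x) = <R_x, f> for the
# (assumed) first-order Sobolev kernel R(x, z) = min(x, z) on [0, 1],
# whose inner product is <f, g> = integral of f'(z) g'(z) over [0, 1].
import numpy as np

x, a = 0.37, 0.62
z = np.linspace(0.0, 1.0, 200001)
dz = z[1] - z[0]

fprime = (z < a).astype(float)  # derivative of R_a(z) = min(z, a)
gprime = (z < x).astype(float)  # derivative of R_x(z) = min(z, x)
inner = float(np.sum(fprime * gprime) * dz)  # Riemann sum for <R_x, R_a>

# The inner product should recover the kernel evaluation min(x, a).
```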
The tensor sum decomposition H = H_0 ⊕ H_1 implies that any function f ∈ H can be written as a summation of two components, such as f(x) = f_0(x) + f_1(x), where f_0 ∈ H_0 is the null space contribution and f_1 ∈ H_1 is the contrast space contribution. Furthermore, when H_0 = H_00 ⊕ H_01, the null space component can be further decomposed into its constant and non-constant contributions, such as f_0(x) = f_00(x) + f_01(x), where f_00(x) ∝ 1 for all x ∈ X. Let P_0 denote the projection operator for the null space, such that P_0 f = f_0 for any f ∈ H. Similarly, let P_1 denote the projection operator for the contrast space, such that P_1 f = f_1 for any f ∈ H. Note that f_0 ∈ H_0 is referred to as the "parametric component" of f, given that H_0 is a finite dimensional subspace. In contrast, f_1 ∈ H_1 is the "nonparametric component" of f, given that H_1 is an infinite dimensional subspace.

Representer Theorem
Still consider a single predictor (i.e., p = 1) that satisfies x ∈ X with H = H_0 ⊕ H_1 denoting a RKHS of functions on X. Now, suppose that the penalty functional in Equation (2) is defined to be the squared norm of the function's projection into the contrast space, i.e., P(f) = ‖P_1 f‖² = ‖f_1‖_1². Note that the second equality is due to the fact that ‖f_1‖_0² = 0 for any f_1 ∈ H_1, which is a consequence of the orthogonality of H_0 and H_1. More specifically, given {(x_i, y_i)}_{i=1}^n, consider the problem of finding the function

f̂_λ = arg min_{f ∈ H} (1/n) ∑_{i=1}^n (y_i − f(x_i))² + λ ‖P_1 f‖²,   (4)

where P_1 is the projection operator for the contrast space H_1. Note that the solution is subscripted with λ to emphasize the dependence on the tuning parameter. Suppose that the null space has dimension m ≥ 1. Note that m = 1 when H_0 only consists of the constant (intercept) subspace, whereas m ≥ 2 when H_0 = H_00 ⊕ H_01. Let {N_0, N_1, ..., N_{m−1}} denote a basis for the null space H_0, such that any f_0 ∈ H_0 can be written as f_0(x) = ∑_{j=0}^{m−1} β_j N_j(x) for some coefficient vector β = (β_0, ..., β_{m−1})' ∈ R^m. The representer theorem of Kimeldorf and Wahba [21] reveals that the optimal smoothing spline estimator from Equation (4) has the form

f̂_λ(x) = ∑_{j=0}^{m−1} β_j N_j(x) + ∑_{i=1}^n α_i R_1(x, x_i),   (5)

where R_1 ∈ H_1 is the RK of the contrast space, and α = (α_1, ..., α_n)' ∈ R^n is the coefficient vector that combines the training data RK evaluations. The representer theorem in Equation (5) reveals that the smoothing spline estimator can be written as f̂_λ = f̂_0λ + f̂_1λ, where f̂_0λ(x) = ∑_{j=0}^{m−1} β_j N_j(x) is the null space contribution and f̂_1λ(x) = ∑_{i=1}^n α_i R_1(x, x_i) is the contrast space contribution. Using the optimal representation from Equation (5), the penalty has the form

P(f̂_λ) = ‖P_1 f̂_λ‖²
= ∑_{i=1}^n ∑_{i′=1}^n α_i α_{i′} ⟨R_1(•, x_i), R_1(•, x_{i′})⟩_1
= α'Qα,   (6)

where Q = [R_1(x_i, x_{i′})] evaluates the RK function at all combinations of i, i′ ∈ {1, ..., n}. Note that the first line is due to the fact that P_1 f̂_λ = f̂_1λ for any f̂_λ ∈ H, the second line is due to the bilinear nature of the inner product, and the third line is due to the reproducing property of the RK function.

Scalable Computation
The optimal solution given by the representer theorem in Equation (5) uses all n training data points to represent f̂_1λ, which could be computationally costly when n is large. For more scalable computation, it is typical to approximate f̂_1λ by evaluating the contrast space RK at all combinations of r < n knots, which are typically placed at the quantiles of the training data predictor scores. Using this type of (low-rank) smoothing spline approximation, the approximate representer theorem becomes

f̂_λ(x) = ∑_{j=0}^{m−1} β_j N_j(x) + ∑_{ℓ=1}^{r} α_ℓ R_1(x, x*_ℓ),   (7)

where {x*_ℓ}_{ℓ=1}^{r} are the chosen knots. As long as enough knots are used in the representation, the approximate representer theorem in Equation (7) can produce theoretically optimal function estimates [22,23]. For optimal asymptotic properties, the number of knots should be on the order of r = O(n^{2/(4δ+1)}), where δ ∈ [1, 2] depends on the smoothness of the unknown true function. Note that δ = 1 is necessary when P(f) < ∞ is barely satisfied, whereas δ = 2 can be used when f is sufficiently smooth (see [5,22,23]).
Using the approximate representer theorem in Equation (7), the penalized least squares functional from Equation (4) becomes a penalized least squares problem of the form

(β̂_λ, α̂_λ) = arg min_{β, α} (1/n) ∑_{i=1}^n (y_i − N_i'β − R_i'α)² + λ α'Qα,   (8)

where N_i = (N_0(x_i), ..., N_{m−1}(x_i))' is the i-th observation's null space basis function vector, R_i = (R_1(x_i, x*_1), ..., R_1(x_i, x*_r))' is the i-th observation's contrast space basis function vector, and Q = [R_1(x*_u, x*_v)] evaluates the contrast space RK at all combinations of knots. Given a choice of the smoothing parameter λ, the solution has the form

(β̂_λ', α̂_λ')' = (X'X + nλP)† X'y,   (9)

where X = (N, R) with N = (N_1, ..., N_n)' denoting the null space design matrix with N_i as rows and R = (R_1, ..., R_n)' denoting the contrast space design matrix with R_i as rows, P is the block-diagonal penalty matrix with an m × m zero block and Q as its diagonal blocks, y = (y_1, ..., y_n)' is the response vector, and (•)† denotes the Moore-Penrose pseudoinverse [24,25].
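A minimal numerical sketch of the knot-based computation described above, assuming the first-order Sobolev contrast kernel R_1(x, z) = min(x, z) and an intercept-only null space (m = 1); the kernel, knot scheme, and λ value are illustrative assumptions rather than the paper's implementation.

```python
# Sketch of the knot-based smoothing spline solution: build the null and
# contrast design matrices, then solve the ridge-type system with a
# pseudoinverse. Kernel and tuning choices here are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n, r, lam = 300, 15, 1e-4
x = rng.uniform(0, 1, n)
y = np.cos(2 * np.pi * x) + rng.normal(0, 0.2, n)

xs = np.quantile(x, (np.arange(r) + 0.5) / r)  # knots at data quantiles
N = np.ones((n, 1))              # null space design (intercept only)
R = np.minimum.outer(x, xs)      # contrast design: R1(x_i, knot_l)
Q = np.minimum.outer(xs, xs)     # penalty: R1 at all knot pairs

X = np.hstack([N, R])
P = np.zeros((r + 1, r + 1))
P[1:, 1:] = Q                    # null space block is unpenalized
coef = np.linalg.pinv(X.T @ X + n * lam * P) @ X.T @ y
fhat = X @ coef                  # fitted values at the training data
```

The pseudoinverse mirrors the closed form in Equation (9); for well-conditioned problems an ordinary solve would suffice.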

Marginal Function Space Notation
Now, consider the multiple nonparametric regression model in Equation (1), where X = (X_1, ..., X_p) ∈ X is the observed predictor vector. Note that X = X^(1) × ⋯ × X^(p) is the product domain with X^(j) denoting the domain of X_j. Following the discussion from Section 2.1, let H^(j) denote a RKHS of functions on X^(j) for j = 1, ..., p. Suppose that the complexity (i.e., lack of smoothness) for each predictor's marginal RKHS is defined according to some non-negative penalty functional P_j. This implies that each RKHS can be decomposed such as H^(j) = H_0(j) ⊕ H_1(j), where H_0(j) = {f : P_j(f) = 0, f ∈ H^(j)} is the j-th predictor's null space, which contains all functions (in H^(j)) that have zero penalty, and H_1(j) = {f : P_j(f) > 0, f ∈ H^(j)} is the j-th predictor's contrast space, which contains all functions (in H^(j)) that have a non-zero penalty. When relevant, the j-th predictor's null space can be further decomposed such as H_0(j) = H_00(j) ⊕ H_01(j), where H_00(j) is a constant (intercept) subspace, and H_01(j) contains non-constant functions that are unpenalized.
The inner product of H^(j) will be denoted by ⟨f, g⟩_(j) for any f, g ∈ H^(j), and the corresponding norm will be written as ‖f‖_(j) = √⟨f, f⟩_(j) for any f ∈ H^(j). Each inner product can be decomposed into its null and contrast contributions, such as ⟨f, g⟩_(j) = ⟨f, g⟩_0(j) + ⟨f, g⟩_1(j), and the corresponding norms will be denoted by ‖f‖_0(j) and ‖f‖_1(j). When H_0(j) = H_00(j) ⊕ H_01(j), the null space inner product can be written as ⟨f, g⟩_0(j) = ⟨f, g⟩_00(j) + ⟨f, g⟩_01(j), where ⟨f, g⟩_00(j) is the inner product of H_00(j) for any f, g ∈ H_00(j), and ⟨f, g⟩_01(j) is the inner product of H_01(j) for any f, g ∈ H_01(j). The corresponding norm can be written as ‖f‖_0(j)² = ‖f‖_00(j)² + ‖f‖_01(j)², where ‖f‖_00(j) = √⟨f, f⟩_00(j) and ‖f‖_01(j) = √⟨f, f⟩_01(j) denote the norms of H_00(j) and H_01(j), respectively.
The RK of H^(j) will be denoted by R^(j)(x, z) for any x, z ∈ X^(j), and note that the RK is an element of the j-th predictor's RKHS, i.e., R^(j) ∈ H^(j) for any x, z ∈ X^(j). By definition, the RK is the representer of the evaluation functional in H^(j), which implies that the RK satisfies f(x) = ⟨R^(j)(x, z), f(z)⟩_(j) for any f ∈ H^(j) and any x, z ∈ X^(j). Note that R^(j)(x, z) = R_0(j)(x, z) + R_1(j)(x, z), where R_0(j) ∈ H_0(j) is the null space RK and R_1(j) ∈ H_1(j) is the contrast space RK. Furthermore, when H_0(j) = H_00(j) ⊕ H_01(j), the null space RK can be decomposed such as R_0(j)(x, z) = R_00(j)(x, z) + R_01(j)(x, z), where R_00(j) ∈ H_00(j) and R_01(j) ∈ H_01(j) denote the RKs of H_00(j) and H_01(j), respectively. Note that R_00(j)(x, z) = β_0(j) for all x, z ∈ X^(j), where β_0(j) ∈ R is some constant, given that H_00(j) is assumed to be a constant (intercept) subspace for all p predictors.

Tensor Product Function Spaces
Consider the construction of a tensor product function space H that is formed by combining the marginal spaces {H^(1), ..., H^(p)}. The largest space that could be constructed includes all possible main and interaction effects, such as

H = H{0} ⊕ H{1} ⊕ ⋯ ⊕ H{p},   (10)

where H{0} is the constant (intercept) subspace, and each H{j} consists of (p choose j) orthogonal subspaces that capture different main and/or interaction effects of the predictors. For example, H{1} consists of p main effect subspaces, and H{2} consists of p(p−1)/2 two-way interaction effect subspaces, etc. Note that different (more parsimonious) statistical models can be formed by excluding subspaces from the tensor product RKHS defined in Equation (10). For example, the tensor product space corresponding to the additive model has the form H = H{0} ⊕ H{1}. For the model that includes all main effects and two-way interactions, the tensor product RKHS has the form H = H{0} ⊕ H{1} ⊕ H{2}.
Let X = (X_1, ..., X_p) ∈ X and Z = (Z_1, ..., Z_p) ∈ X denote two arbitrary predictor vectors. To evaluate functions in H, the tensor product RK can be defined as

R(X, Z) = R{0}(X, Z) + R{1}(X, Z) + ⋯ + R{p}(X, Z),   (11)

where R{0}(X, Z) = 1 is the constant (intercept) term, and each R{j} consists of a summation of (p choose j) RKs from orthogonal subspaces that capture different main and/or interaction effects of the predictors. For example, R{1} consists of the summation of p main effect RKs, and R{2} consists of the summation of p(p−1)/2 two-way interaction effect RKs, etc. When different (more parsimonious) models are formed by excluding subspaces of the tensor product RKHS, the corresponding components of the tensor product RK are also excluded. For example, the tensor product RK corresponding to the additive model has the form R = R{0} + R{1}, and the tensor product RK for the model that includes all main effects and two-way interactions has the form R = R{0} + R{1} + R{2}. The inner product of the tensor product RKHS H can be written as

⟨f, g⟩ = ⟨f, g⟩{0} + ⟨f, g⟩{1} + ⋯ + ⟨f, g⟩{p},   (12)

where ⟨f, g⟩{0} is the inner product of H{0}, and ⟨f, g⟩{j} consists of a summation of (p choose j) inner products corresponding to orthogonal subspaces that capture different main and/or interaction effects of the predictors. For example, ⟨f, g⟩{1} consists of the summation of p main effect inner products, and ⟨f, g⟩{2} consists of the summation of p(p−1)/2 two-way interaction effect inner products. The specifics of each subspace's inner product will depend on the type of spline used for each predictor. This is because each subspace's inner product (and, consequently, penalty) aggregates information across the penalized components after "averaging out" information from unpenalized components (see [5], pp. 40-48).

Representation and Computation
Given an assumed model form, the tensor product RKHS can be written as

H = H_0 ⊕ H_1 with H_1 = ⊕_{k=1}^{K} H_k,   (13)

where H_0 = {f : P(f) = 0, f ∈ H} is the tensor product null space with P(•) denoting the tensor product penalty (later defined), and H_k is the k-th orthogonal subspace of the tensor product contrast space H_1. Note that H_k corresponds to the different main and/or interaction effect subspaces that are included in the assumed model form.
The corresponding inner product and RK can be written as

⟨f, g⟩ = ⟨f, g⟩_0 + ∑_{k=1}^{K} θ_k^{−1} ⟨f, g⟩_k and R(X, Z) = R_0(X, Z) + ∑_{k=1}^{K} θ_k R_k(X, Z),   (14)

where ⟨f, g⟩_0 and R_0(X, Z) denote the inner product and RK of H_0 (the tensor product null space), ⟨f, g⟩_k and R_k(X, Z) denote the inner product and RK of H_k for k = 1, ..., K, and the θ_k > 0 are additional (strictly positive) tuning parameters that control the influence of each subspace's contribution. Note that including the θ_k parameters is essential given that the different subspaces do not necessarily have comparable metrics. Suppose that the tensor product penalty P(f) is defined to be the squared norm of the function's projection into the (tensor product) contrast space, i.e.,

P(f) = ‖P_1 f‖² = ∑_{k=1}^{K} θ_k^{−1} ‖f_k‖_k²,   (15)

where f_k = P_k f denotes the projection of f into H_k. Using this definition of the penalty, the function minimizing the penalized least squares functional in Equation (2) can be written according to the representer theorem in Equation (5). In this case, the set of functions {N_0, N_1, ..., N_{m−1}} forms a basis for the tensor product null space H_0, and the RK of the contrast space is defined as R_1(X, Z) = ∑_{k=1}^{K} θ_k R_k(X, Z). Using this optimal representation, the penalty can be written according to Equation (6) with the penalty matrix defined as Q = ∑_{k=1}^{K} θ_k Q_k, where Q_k = [R_k(x_i, x_{i′})] evaluates the k-th subspace's RK function at all combinations of training data points.
For scalable computation as n becomes large, the approximate representer theorem in Equation (7) can be applied using the knots {x*_ℓ}_{ℓ=1}^{r}, where x*_ℓ = (x*_{ℓ1}, ..., x*_{ℓp}) ∈ X for all ℓ = 1, ..., r. Using the approximately optimal representation from Equation (7), the penalized least squares problem can be written according to Equation (8), and the optimal coefficients can be written according to Equation (9). In the tensor product case, the optimal coefficients should really be subscripted with λ = (λ, θ_1, ..., θ_K), given that these estimates depend on the overall tuning parameter λ, as well as the K tuning (hyper)parameters for each of the contrast subspaces. Note that the penalty only depends on (λ_1, ..., λ_K), where λ_k = λ/θ_k for k = 1, ..., K. However, it is often helpful (for tuning purposes) to separate the overall tuning parameter λ from the tuning parameters that control the individual effect functions, i.e., the θ_k tuning parameters.

Smoothing Spline Like Estimators
Consider a single predictor (i.e., p = 1) that satisfies x ∈ X, and let H denote a RKHS of functions on X. Consider a decomposition of the function space such as H = H_00 ⊕ H_11, where H_11 = H_01 ⊕ H_1 is a space of non-constant functions that either sum to zero (for categorical x) or integrate to zero (for continuous x) across the domain X. The inner product of H can be written as ⟨f, g⟩ = ⟨f, g⟩_00 + ⟨f, g⟩_11 for any f, g ∈ H, where ⟨f, g⟩_11 = ⟨f, g⟩_01 + ⟨f, g⟩_1 is the inner product of H_11. The corresponding RK can be written as R(x, z) = R_00(x, z) + R_11(x, z) for any x, z ∈ X, where R_11(x, z) = R_01(x, z) + R_1(x, z) is the RK of H_11. Now, consider the problem of finding the function

f̂_λ = arg min_{f ∈ H} (1/n) ∑_{i=1}^n (y_i − f(x_i))² + λ ‖f‖_11²,   (16)

where ‖f‖_11² = ⟨f, f⟩_11 is the squared norm of the projection of f into H_11. The f̂_λ defined in Equation (16) is a smoothing spline if H_0 = H_00, which will be the case for nominal, ordinal, and linear smoothing splines. However, for cubic (and higher-order) smoothing splines, the H_01 subspace consists of non-constant lower-order polynomial terms, which are unpenalized. Note that the f̂_λ in Equation (16) penalizes all non-constant terms, so it will not be equivalent to a cubic smoothing spline, even when H is the same RKHS used for cubic smoothing spline estimation.
Theorem 1 (Representer Theorem). The f ∈ H that minimizes Equation (16) has the form

f(x) = β + ∑_{i=1}^n α_i R_11(x, x_i),

where β ∈ R is an intercept parameter and α = (α_1, ..., α_n)' ∈ R^n is a vector of coefficients that combine the reproducing kernel function evaluations.

Proof. The theorem is simply a version of the representer theorem from Equation (5) where the null space has dimension one.
Corollary 1 (Low-Rank Approximation). The minimizer of Equation (16) can be approximated as f(x) = β + ∑_{ℓ=1}^{r} α_ℓ R_11(x, x*_ℓ), where {x*_ℓ}_{ℓ=1}^{r} are the chosen knots.

Proof. The corollary is simply a version of the approximate representer theorem from Equation (7) where the null space has dimension one.
These results imply that the penalized least squares functional from Equation (16) can be rewritten as the penalized least squares problem in Equation (8), where (i) the null space only contains the intercept column, i.e., N_i = 1 and β = β, and (ii) the contrast space RK R_1 is replaced by R_11 in the function and penalty representation, i.e., R_i = (R_11(x_i, x*_1), ..., R_11(x_i, x*_r))' and Q = [R_11(x*_u, x*_v)]. Using these modifications, the optimal coefficients can be written according to Equation (9).

Tensor Product Formulation
Now, consider the model in Equation (1) with p ≥ 2 predictors. Given an assumed model form, the tensor product RKHS H can be written according to the tensor sum decomposition in Equation (13) with H_0 = {f : f(X) ∝ 1 ∀ X ∈ X} denoting the constant (intercept) subspace. Similarly, the inner product and RK of H can be written according to Equation (14), and the tensor product penalty can be written according to Equation (15). Unlike the previous tensor product treatment, this tensor product formulation assumes that H_0 contains only the constant (intercept) subspace, which implies that the H_k subspaces contain all non-constant functions of the predictors. Furthermore, this implies that the proposed formulation of the tensor product penalty in Equation (15) penalizes all non-constant functions of the predictors. Note that if all p predictors have a null space dimension of one, i.e., if H_0(j) = H_00(j) for all j = 1, ..., p, then the proposed formulation will be equivalent to the classic formulation. However, if H_01(j) exists for any predictor, then the proposed formulation will differ from the classic formulation because the functions in H_01(j) will be penalized using the proposed formulation.
Given a sample of n observations {(x_i, y_i)}_{i=1}^n with x_i = (x_{i1}, ..., x_{ip}) ∈ X and y_i ∈ R, consider the problem of finding the function f ∈ H that satisfies

f̂_λ = arg min_{f ∈ H} (1/n) ∑_{i=1}^n (y_i − f(x_i))² + λ ∑_{k=1}^{K} ω_k ‖P_k f‖_k²,   (17)

where the ω_k ≥ 0 are additional tuning parameters (penalty weights) that control the influence of each component function's penalty contribution.
Theorem 2 (Tensor Product Representer Theorem). The f ∈ H that minimizes Equation (17) has the form f_λ = ∑_{k=0}^{K} f_kλ, where f_0λ ∈ R is an intercept, and the k-th effect function has the form f_kλ(x) = ∑_{i=1}^n α_ik R_k(x, x_i) with coefficients α_k = (α_1k, ..., α_nk)' ∈ R^n for k = 1, ..., K.

Proof. The result in Theorem 2 can be considered a generalization of the typical result used in tensor product smoothing spline estimators (see [3,5]). More specifically, the SSANOVA approach assumes that the function can be represented according to the form in Theorem 2 with the coefficients defined as α_ik = α_i θ_k, where the vector α = (α_1, ..., α_n)' is common to all K terms. Compared to the tensor product representation used in the SSANOVA modeling approach, the proposed approach combines the marginal RK information in a more flexible manner, such as

∑_{k=1}^{K} ∑_{i=1}^n α_ik R_k(x, x_i) (proposed) versus ∑_{k=1}^{K} θ_k ∑_{i=1}^n α_i R_k(x, x_i) (SSANOVA).

Clearly, the two representations are equivalent when α_ik = α_i θ_k for all i ∈ {1, ..., n} and all k ∈ {1, ..., K}. However, such a constraint is not necessary in practice. At first glance, it may appear that the proposed approach has made the estimation problem more challenging, given that the number of parameters has increased from n + K to nK. However, for estimation and inference purposes, it is beneficial to allow each term to have unique coefficients, given that this makes it possible to treat the tuning parameters as variance components in a linear mixed effects modeling framework [2,26,27].

Scalable Computation
The tensor product representer theorem in Theorem 2 is computationally costly for large n and/or K, given that it requires estimation of nK coefficients. For more practical computation, it is possible to apply knot-based approximations in a tensor product function space, as described in the following corollary.
Corollary 2 (Tensor Product Low-Rank Approximation). The minimizer of Equation (17) has the form f_λ = ∑_{k=0}^{K} f_kλ, and the effect functions can be approximated via

f_kλ(x) ≈ ∑_{ℓ=1}^{r_k} α_ℓk R_k(x, x*_ℓk),

where {x*_ℓk}_{ℓ=1}^{r_k} are the selected knots for the k-th effect with r_k denoting the k-th effect's number of knots. The proposed representation also allows for more flexible knot placement within each of the K subspaces of the tensor product contrast space. In particular, each of the K contrast subspaces is permitted to have a different number of knots r_k using this formulation. Furthermore, note that x*_ℓk only needs to contain knot values for the predictors that are included in the k-th effect, e.g., x*_ℓk is a scalar for main effects, a vector of length two for two-way interactions, etc. For main effects, it is typical to place the knots at the (univariate) data quantiles for each predictor. For two-way interactions, many different knot placement strategies are possible, e.g., fixed grid, random sample, bivariate quantiles, strategic placement, etc. In this paper, I only consider multivariate knot placements that involve taking combinations of univariate knots (as in [20]), but my ideas are easily applicable to other knot placement schemes.
Theorem 3 (Tensor Product Penalties). Suppose that K ≥ p and H_k captures the k-th predictor's main effect for k = 1, ..., p. Given any x = (x_1, ..., x_p) ∈ X, the k-th basis vector is defined as R_k = (R_11(k)(x_k, x*_1k), ..., R_11(k)(x_k, x*_{r_k}k))' and the k-th penalty matrix is Q_k = [R_11(k)(x*_uk, x*_vk)], where R_11(k) is the non-constant portion of each predictor's marginal RK function for k = 1, ..., p. Now, suppose that H_k (for some k > p) captures the interaction effect between X_a and X_b for some a, b ∈ {1, ..., p}. If the basis vector is defined as R_k = R_a ⊗ R_b, where ⊗ denotes the Kronecker product, then the penalty matrix has the form Q_k = Q_a ⊗ Q_b. Similarly, suppose that H_k (for some k > p) captures the three-way interaction between (X_a, X_b, X_c) for some a, b, c ∈ {1, ..., p}. If the basis vector is defined as R_k = R_a ⊗ R_b ⊗ R_c, then the penalty matrix has the form Q_k = Q_a ⊗ Q_b ⊗ Q_c. Basis vectors and penalty matrices for higher-order interactions can be efficiently constructed in a similar fashion.
Proof. To prove the theorem, it suffices to prove the result for two-way interactions, given that three-way (and higher-order) interactions can be built by recursively applying the results from the two-way interaction scenario. Specifically, it suffices to show that Q_k = Q_a ⊗ Q_b. The ℓ-th entry of R_k can be written in terms of the corresponding entries of R_a and R_b, such as R_k(x, x*_ℓk) = R_a(x_a, x*_ua) R_b(x_b, x*_vb), where x = (x_a, x_b) is the bivariate vector at which the RK is evaluated, and x*_ℓk = (x*_ua, x*_vb) is the bivariate knot. Note that ℓ = v + r_b(u − 1) indexes the tensor product vector R_k, and u ∈ {1, ..., r_a} and v ∈ {1, ..., r_b} index the marginal R_a and R_b vectors. Letting α_k = (α_1k, ..., α_{r_k}k)' ∈ R^{r_k} denote an arbitrary coefficient vector, the penalty for the k-th term has the form

‖∑_{ℓ=1}^{r_k} α_ℓk R_k(•, x*_ℓk)‖_k²
= ∑_{ℓ=1}^{r_k} ∑_{ℓ′=1}^{r_k} α_ℓk α_ℓ′k ⟨R_k(•, x*_ℓk), R_k(•, x*_ℓ′k)⟩_k
= ∑_{ℓ=1}^{r_k} ∑_{ℓ′=1}^{r_k} α_ℓk α_ℓ′k R_k(x*_ℓk, x*_ℓ′k)
= α_k' (Q_a ⊗ Q_b) α_k,

where the first line is due to the bilinearity of the inner product, the second line is due to the reproducing property of the RK function, and the third line is a straightforward (algebraic) simplification of the second line.
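Theorem 3's two-way result can be verified numerically. The sketch below uses min(x, z) as an assumed marginal contrast kernel for both predictors, builds the tensor product knot grid in the ordering ℓ = v + r_b(u − 1), and checks that the knot-pair penalty matrix of the product kernel equals the Kronecker product of the marginal penalty matrices.

```python
# Numerical check of Theorem 3 (two-way case): for a product kernel
# R((xa, xb), (za, zb)) = Ra(xa, za) * Rb(xb, zb) evaluated on a grid of
# marginal knot combinations, the penalty matrix equals kron(Qa, Qb).
# The marginal kernel min(x, z) is an illustrative assumption.
import numpy as np

ra, rb = 4, 3
ka = np.linspace(0.1, 0.9, ra)   # marginal knots for predictor a
kb = np.linspace(0.2, 0.8, rb)   # marginal knots for predictor b

Qa = np.minimum.outer(ka, ka)    # marginal penalty matrices
Qb = np.minimum.outer(kb, kb)

# Tensor product knots ordered so that l = v + rb * (u - 1).
grid = [(a, b) for a in ka for b in kb]
Qk = np.array([[min(x1, z1) * min(x2, z2) for (z1, z2) in grid]
               for (x1, x2) in grid])

kron = np.kron(Qa, Qb)           # Kronecker product of marginal penalties
```

The same recursion extends to three-way terms by appending a third factor to both the grid and the Kronecker product.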

Spectral Representer Theorem
For a more convenient representation of univariate smoothing spline (like) estimators, I introduce the spectral version of the representer theorem from Theorem 1, which will be particularly useful for tensor product function building.
Theorem 4 (Spectral Representer Theorem). Let R = (R_11(x, x_1), ..., R_11(x, x_n))' denote the vector of RK evaluations at the training data for an arbitrary x ∈ X, and let Q = [R_11(x_i, x_{i′})] denote the corresponding penalty matrix. Consider an eigen-decomposition of Q of the form Q = VD²V', where V = (v_1, ..., v_n) is the matrix of eigenvectors, and D² = diag(d_1², ..., d_n²) is the diagonal matrix of eigenvalues (d_i > 0 is the i-th singular value). The function f ∈ H that minimizes Equation (16) can be written as

f(x) = β + ∑_{i=1}^n γ_i S_i(x),

where γ = (γ_1, ..., γ_n)' ∈ R^n is a vector of coefficients, and S_i(x) = d_i^{−1} v_i'R. The spectral basis functions satisfy ⟨S_i, S_{i′}⟩_11 = δ_{ii′}, where δ_{ii′} is Kronecker's delta, which implies that ‖S'γ‖_11² = ∑_{i=1}^n γ_i² for any γ ∈ R^n, where S = (S_1(x), ..., S_n(x))'.
Proof. To prove the first part of the theorem, we need to prove that R'α = S'γ, where S = (S_1(x), ..., S_n(x))'. To establish the connection between the classic and spectral representations, first note that we can write the transformed (spectral) basis as S = D^{−1}V'R, and the corresponding transformed coefficients as γ = DV'α. This implies that S'γ = R'VD^{−1}DV'α = R'α, given that VD^{−1}DV' = I_n, which completes the proof of the first part of the theorem. To prove the second part of the theorem, note that ⟨S_i, S_{i′}⟩_11 = d_i^{−1} d_{i′}^{−1} v_i'Qv_{i′} = δ_{ii′}, which is a consequence of the fact that ⟨R, R'⟩_11 = Q due to the reproducing property, and the fact that v_i'v_{i′} = δ_{ii′} due to the orthonormality of the eigenvectors.
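The spectral transformation in Theorem 4 is easy to verify numerically: with Q = VD²V', the transformed basis S = RVD^{−1} has an identity penalty matrix, so the smoothing penalty reduces to a ridge penalty on γ. The min(x, z) kernel below is an assumed stand-in for R_11.

```python
# Sketch of the spectral transformation: diagonalize the penalty matrix Q
# and rescale the eigenvectors so the transformed penalty is the identity.
# The kernel min(x, z) stands in for R11 (an illustrative assumption).
import numpy as np

rng = np.random.default_rng(2)
xs = np.sort(rng.uniform(0.1, 1.0, 8))  # points playing the role of x_i
Q = np.minimum.outer(xs, xs)            # penalty matrix [R11(x_i, x_i')]

evals, V = np.linalg.eigh(Q)            # Q = V diag(evals) V'
T = V / np.sqrt(evals)                  # columns are v_i / d_i

x = rng.uniform(0, 1, 50)               # arbitrary evaluation points
S = np.minimum.outer(x, xs) @ T         # spectral basis evaluations

penalty = T.T @ Q @ T                   # transformed penalty matrix
```

Because `penalty` is the identity, the coefficient penalty γ'(T'QT)γ collapses to ∑γ_i², which is exactly the ridge penalty used by standard mixed-model software.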
Note that Theorem 4 reveals that the modified representation in Theorem 1 can be equivalently expressed in terms of the empirical eigen-decomposition of the penalty matrix, which we refer to as the spectral representation of the smoothing spline. Furthermore, note that the theorem reveals that the spectral basis functions {S_1, S_2, ..., S_n} serve as empirical eigenfunctions for H, in the sense that these functions are a sample dependent basis that is orthonormal with respect to the contrast space inner product. These eigenfunctions have the typical sign-changing behavior that is characteristic of spectral representations, such that S_{i+1} has more sign changes than S_i for i = 1, ..., n − 1; see Figure 1. Note that the (scaled) ordinal and linear smoothing spline spectra are nearly identical to one another, which is not surprising given the asymptotic equivalence of these kernel functions [28]. Furthermore, note that the (scaled) cubic and quintic smoothing spline spectra are rather similar in appearance, especially for the first four empirical eigenfunctions.

Figure 1. Spectral basis functions for different types of reproducing kernel functions using five equidistant knots. Basis functions were evaluated at x_i = (i − 1)/100 for i = 1, ..., 101 and were scaled for visualization purposes. Produced by R [29] using the rk() function in the grpnet package [30].

Tensor Product Formulation
For a more convenient representation of tensor product smoothing spline-like estimators, I introduce the spectral version of the representer theorem from Theorem 2, which will be particularly useful for tensor product function building.
Theorem 5 (Spectral Tensor Product Representer Theorem). The minimizer of Equation (17) has the form f_λ = ∑_{k=0}^{K} f_kλ, where f_0λ ∈ R is an intercept, and f_kλ ∈ H_k is the k-th effect function for k = 1, . . ., K. The optimal effect functions can be expressed as f_kλ(x) = ∑_{i=1}^{n} γ_ik S_ik(x) for all x ∈ X, where γ_k = (γ_1k, . . ., γ_nk)ᵀ ∈ Rⁿ is the coefficient vector and {S_ik}_{i=1}^{n} are the spectral basis functions for k = 1, . . ., K. The spectral basis functions can be defined to satisfy ⟨S_ik, S_i′k⟩_k = δ_ii′, where δ_ii′ is Kronecker's delta, which implies that the k-th penalty reduces to the squared Euclidean norm of the coefficients, i.e., ⟨f_kλ, f_kλ⟩_k = γ_kᵀγ_k.

Proof. The result in Theorem 5 is essentially a combination of the results in Theorem 2 and Theorem 4. To prove the result, let R_k = (R_k(x, x_1), . . ., R_k(x, x_n))ᵀ denote the vector of RK evaluations at the training data for an arbitrary x ∈ X, and let Q_k = V_k D_k² V_kᵀ denote the eigen-decomposition of the penalty matrix, where V_k = (v_1k, . . ., v_nk) is the matrix of eigenvectors, and D_k² = diag(d_1k², . . ., d_nk²) is the diagonal matrix of eigenvalues (d_ik > 0 is the i-th singular value). Then the spectral basis functions can be defined as S_k = D_k⁻¹ V_kᵀ R_k.

Using the spectral tensor products, multiple and generalized nonparametric regression models can be easily fit using standard mixed effects modeling software, such as lme4 [31]. See Figure 2 for a visualization of the spectral tensor product basis functions.
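Because the spectral basis makes each penalty an identity matrix, the penalized fit reduces to a single ridge-type solve on the stacked spectral bases. The sketch below is a hypothetical illustration, not the paper's implementation (which tunes smoothing parameters via lme4): the `spectral_basis` helper is an assumed name, the linear smoothing spline kernel min(x, z) stands in for a marginal RK, and a fixed ridge penalty λ replaces REML tuning.

```python
import numpy as np

rng = np.random.default_rng(1)

def rk(x, z):
    """Linear smoothing spline RK, used as a stand-in marginal kernel."""
    return np.minimum(x, z)

def spectral_basis(x, knots):
    """Spectral basis S = D^{-1} V^T R built from the knot penalty matrix."""
    Q = rk(knots[:, None], knots[None, :])       # r x r penalty matrix
    evals, V = np.linalg.eigh(Q)
    D = np.sqrt(evals)
    R = rk(x[:, None], knots[None, :])           # n x r kernel evaluations
    return (R @ V) / D                           # n x r spectral basis matrix

n, r = 200, 5
x1, x2 = rng.uniform(size=n), rng.uniform(size=n)
knots = np.linspace(0.1, 0.9, r)

S1 = spectral_basis(x1, knots)                   # main effect basis for X1
S2 = spectral_basis(x2, knots)                   # main effect basis for X2
# Row-wise Kronecker product gives the interaction basis (Theorem 6)
S12 = np.einsum('ij,ik->ijk', S1, S2).reshape(n, -1)

X = np.column_stack([np.ones(n), S1, S2, S12])   # 1 + r + r + r^2 columns
y = np.sin(2 * np.pi * x1) + (x2 - 0.5) ** 2 + rng.normal(scale=0.1, size=n)

# Identity penalty on all non-constant coefficients -> one ridge solve
lam = 0.1
P = np.eye(X.shape[1]); P[0, 0] = 0.0            # leave the intercept unpenalized
beta = np.linalg.solve(X.T @ X + lam * P, X.T @ y)
print(beta.shape)                                # (36,)
```

In practice, a common λ would be replaced by one variance parameter per effect function, which is exactly what mixed effects software such as lme4 estimates.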

Scalable Computation
For large n, the spectral basis functions defined in Theorem 5 are not computationally feasible, given that computing the eigen-decomposition of the penalty matrix requires O(n³) flops. For more scalable computation, I present a spectral version of Corollary 2.
Corollary 3 (Spectral Tensor Product Low-Rank Approximation). The minimizer of Equation (17) has the form f_λ = ∑_{k=0}^{K} f_kλ, and the effect functions can be approximated via f_kλ(x) ≈ ∑_{ℓ=1}^{r_k} γ_ℓk S_ℓk(x), where S_k = (S_1k(x), . . ., S_{r_k,k}(x))ᵀ is the vector of spectral basis functions corresponding to {x*_ℓk}_{ℓ=1}^{r_k}, which are the selected knots for the k-th effect with r_k ≤ n. Using the low-rank approximation, the penalty matrix is the r_k × r_k identity matrix. For the main effects, the spectral basis functions can be defined as S_k = D_k⁻¹ V_kᵀ R_k, where Q_k = V_k D_k² V_kᵀ is the eigen-decomposition of the (knot-based) penalty matrix with (d_ℓk², v_ℓk) denoting the ℓ-th eigenvalue/vector pair. As will be demonstrated in the subsequent theorem, spectral basis functions for interaction effects can be defined in a more efficient fashion via the computational tools from Theorem 3.
Theorem 6 (Spectral Tensor Product Penalties). Suppose that K ≥ p and H_k captures the k-th predictor's main effect for k = 1, . . ., p. Given any x = (x_1, . . ., x_p) ∈ X, the k-th basis vector is defined as R_k = (R_k(x_k, x*_1k), . . ., R_k(x_k, x*_{r_k,k}))ᵀ and the k-th penalty matrix is Q_k = {R_k(x*_ℓk, x*_mk)}, where R_k(·, ·) is the non-constant portion of each predictor's marginal RK function for k = 1, . . ., p. Then the k-th spectral basis vector is defined as S_k = D_k⁻¹ V_kᵀ R_k, and the corresponding penalty matrix is the identity matrix. Now, suppose that H_k (for some k > p) captures the interaction effect between X_a and X_b for some a, b ∈ {1, . . ., p}. If the basis vector is defined as S_k = S_a ⊗ S_b, where ⊗ denotes the Kronecker product, then the penalty matrix is the identity matrix. Now, suppose that H_k (for some k > p) captures the three-way interaction between (X_a, X_b, X_c) for some a, b, c ∈ {1, . . ., p}. If the basis vector is defined as S_k = S_a ⊗ S_b ⊗ S_c, then the penalty matrix is the identity matrix. Basis vectors for higher-order interactions can be efficiently constructed in a similar fashion.
Proof. To prove the theorem, it suffices to prove the result for two-way interactions, given that three-way (and higher-order) interactions can be built by recursively applying the results from the two-way interaction scenario. Specifically, it suffices to show that the penalty matrix corresponding to S_k = S_a ⊗ S_b is the identity matrix. Letting α_k ∈ R^{r_k} and γ_k ∈ R^{r_k} denote arbitrary coefficient vectors, the representation for the k-th term is S_kᵀγ_k, where the reparameterized basis and coefficient vector can be written as S_k = S_a ⊗ S_b and γ_k = γ_a ⊗ γ_b. Now, note that the squared Euclidean norm of the reparameterized coefficients is ‖γ_k‖² = (γ_a ⊗ γ_b)ᵀ(γ_a ⊗ γ_b) = (γ_aᵀ ⊗ γ_bᵀ)(γ_a ⊗ γ_b) = (γ_aᵀγ_a) ⊗ (γ_bᵀγ_b) = (α_aᵀQ_aα_a)(α_bᵀQ_bα_b), where the first line plugs in the definition of the squared Euclidean norm, the second line uses the fact that (A ⊗ B)ᵀ = Aᵀ ⊗ Bᵀ, the third line uses the fact that (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD), and the fourth line plugs in the definition of the penalty matrices.
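The Kronecker-product step can be checked with a small numerical example. Here each marginal spectral basis is modeled by an arbitrary orthogonal matrix (an assumption standing in for the property that each marginal Gram/penalty matrix is the identity); the mixed-product rule then forces the tensor product penalty to be the identity.

```python
import numpy as np

rng = np.random.default_rng(2)
ra, rb = 4, 3

# Proxies for marginal spectral bases: any matrices with identity Gram
# (penalty) matrices, i.e., Sa^T Sa = I_ra and Sb^T Sb = I_rb.
Sa, _ = np.linalg.qr(rng.normal(size=(ra, ra)))
Sb, _ = np.linalg.qr(rng.normal(size=(rb, rb)))

K = np.kron(Sa, Sb)   # tensor product basis S_k = S_a (x) S_b

# (A (x) B)^T (A (x) B) = (A^T A) (x) (B^T B) = I_ra (x) I_rb = I_{ra*rb}
print(np.allclose(K.T @ K, np.eye(ra * rb)))   # True
```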

Simulated Example
To demonstrate the potential of the proposed approach, I designed a simple simulation study to compare the performance of the proposed tensor product smoothing approach with the approach of Wood et al. [20], which is implemented in the popular gamm4 package [32] in R [29]. The gamm4 package [32] uses the mgcv package [33] to build the smooth basis matrices, and then uses the lme4 package [31] to tune the smoothing parameters (which are treated as variance parameters). For a fair comparison, I have implemented the proposed tensor product spectral smoothing (TPSS) approach using the lme4 package to tune the smoothing parameters, which I refer to as tpss4. This ensures that any difference in the results is due to the employed (reparameterized) basis functions rather than to differences in the tuning procedure.
Given p = 2 predictors with X = [0, 1] × [0, 1], the true mean function is defined as f(x_1, x_2) = f_1(x_1) + f_2(x_2) + f_12(x_1, x_2), where f_1(x_1) = 4 cos(2π[x_1 − π]) is the main effect of the first predictor, and f_2(x_2) = 120(x_2 − 0.6)⁵ is the main effect of the second predictor. The interaction effect is defined as f_12(x_1, x_2) = 0 for the additive function, and as a nonzero interaction surface for the interaction function. Note that this interaction function has been used in previous simulation work that explored tensor product smoothers (see [9,34]). Two different sample sizes were considered, n ∈ {1000, 2000}. For each sample size and data-generating mean function, n observations were (independently) randomly sampled from X, and the response was defined as y_i = f(x_i1, x_i2) + ε_i, where ε_i follows a standard normal distribution.
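A minimal sketch of the additive data-generating process is given below. The nonzero interaction surface from [9,34] is not reproduced here, and uniform sampling of the predictors is an assumption for illustration (the paper only states that observations are randomly sampled from X).

```python
import numpy as np

rng = np.random.default_rng(3)

def f1(x1):
    """Main effect of the first predictor: 4 cos(2*pi*[x1 - pi])."""
    return 4 * np.cos(2 * np.pi * (x1 - np.pi))

def f2(x2):
    """Main effect of the second predictor: 120 (x2 - 0.6)^5."""
    return 120 * (x2 - 0.6) ** 5

n = 1000                          # the paper considers n in {1000, 2000}
x1 = rng.uniform(size=n)          # assumed uniform on [0, 1]
x2 = rng.uniform(size=n)

# Additive scenario: f12(x1, x2) = 0, standard normal errors
y = f1(x1) + f2(x2) + rng.normal(size=n)
print(y.shape)                    # (1000,)
```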
For both the gamm4 package and the proposed tpss4 implementation, (i) I fit the model using r_k ∈ {5, 6, . . ., 10} marginal knots for each predictor, and (ii) I used restricted maximum likelihood (REML) to tune the smoothing parameters. For the gamm4 package, the tensor product smooth was formed using the t2() function, which allows for main and interaction effects of the predictors. For the tpss4 method, the implementation in the smooth2d() function (see Supplementary Materials) allows for both main and interaction effects. Thus, for both methods, the fitted model is misspecified for additive models and correctly specified for interaction models.
I compared the quality of the solutions using the root mean squared error (RMSE) and the mean absolute error (MAE), i.e., RMSE = [n⁻¹ ∑_{i=1}^{n} (f(x_i1, x_i2) − f̂_λ(x_i1, x_i2))²]^{1/2} and MAE = n⁻¹ ∑_{i=1}^{n} |f(x_i1, x_i2) − f̂_λ(x_i1, x_i2)|, where f(x_i1, x_i2) is the data-generating mean function and f̂_λ is the estimated function.
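Both criteria are standard; a minimal implementation, assuming evaluation at the training inputs as in the definitions above:

```python
import numpy as np

def rmse(f_true, f_hat):
    """Root mean squared error between true and estimated function values."""
    return np.sqrt(np.mean((f_true - f_hat) ** 2))

def mae(f_true, f_hat):
    """Mean absolute error between true and estimated function values."""
    return np.mean(np.abs(f_true - f_hat))

# Toy example: errors are (0, 0, 2)
f_true = np.array([1.0, 2.0, 3.0])
f_hat = np.array([1.0, 2.0, 5.0])
print(rmse(f_true, f_hat))   # sqrt(4/3) ~ 1.1547
print(mae(f_true, f_hat))    # 2/3 ~ 0.6667
```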
The data generation and analysis procedure was repeated 100 times for each sample size. Box plots of the RMSE and MAE for each method under each combination of n ∈ {1000, 2000} and r_k ∈ {5, 6, . . ., 10} are displayed in Figures 3 and 4. As expected, both the RMSE and MAE decrease as the number of knots increases for both methods. For each r_k, the proposed tpss4 method tends to result in smaller RMSE and MAE values compared to the gamm4 implementation. For the (misspecified) additive function, the benefit of the proposed approach is noteworthy and persists across all r_k. For the interaction model, the benefit of the proposed tpss4 approach is particularly noticeable for small r_k, but is still present for larger numbers of knots.
The runtime for each method is displayed in Figure 5. The proposed tpss4 method produces runtimes that are slightly larger than those of the gamm4 method in most situations. Despite using the same number of marginal knots for each predictor, the gamm4 approach uses an approximation that estimates slightly fewer coefficients, which is likely the cause of the timing differences. However, it is possible that these timing differences are due to running compiled code (in gamm4) versus uncompiled code (in tpss4). Regardless of their source, the timing differences are rather small and disappear as r_k increases, which demonstrates the practicality of the proposed approach.

Real Data Example
To demonstrate the proposed approach using real data, I make use of the Bike Sharing Dataset [35] from the UCI Machine Learning Repository [36]. This dataset contains the number (count) of bikes rented from the Capital Bike Share system in Washington DC. The rental counts are recorded by the hour for the years 2011 and 2012, which produced a dataset with n = 17,379 observations. In addition to the counts, the dataset contains various situational factors that might affect the number of rented bikes. In this example, I will focus on modeling the number of bike rentals as a function of the hour of the day (which takes values 0, 1, . . ., 23) and the month of the year (which takes values 1, 2, . . ., 12).
The proposed approach was used to fit a tensor product spectral smoother (TPSS) to the data using 12 knots for the hour variable and 6 knots for the month variable. The counts were modeled on the log10 scale, and then transformed back to the original (data) scale for visualization purposes. As in the simulation study, the smoothing parameters were tuned using the REML method in the lme4 package. Figure 6 displays the average number of bike rentals by hour and month, as well as the TPSS model predictions. As is evident from the figure, the TPSS solution closely resembles the average data; however, the model predictions are substantially smoother, which improves the interpretation.
Looking at the bike rental patterns by hour of the day, it is evident that there are two surges in the number of rentals: (1) during the morning rush hour (∼8:00-9:00) and (2) during the evening rush hour (∼17:00-18:00). The results also reveal another (smaller) surge that occurs during the lunch hour (∼12:00-13:00). Interestingly, the predictions in Figure 6 reveal that the bike rental surge during the morning rush hour is more localized in time (lasting about one hour), whereas the evening surge is more temporally diffuse (lasting 2-3 h). The bike rentals tend to peak during the evening rush hour, and are at their lowest expected value during the overnight hours (∼23:00-06:00). The month effect is less pronounced than the hour effect, but it still produces some interpretable insights. In particular, we see that there are fewer people using the bikes during the winter months (Dec, Jan, Feb), which is expected. The drops in the number of rentals during the winter are particularly noticeable during the lunch surge, which suggests that fewer people use the bikes to commute for lunch during the winter. The peak in the rentals occurs during the summer months (Jun, Jul, Aug). Combining the hour and month information suggests that the evening rush hour during the summer months is when the Capital Bike Share system sees its greatest surge in demand.

Discussion
This paper proposes efficient and flexible approaches for fitting tensor product smoothing spline-like models. The refined smoothing spline approach developed in Section 4 offers an alternative approach for tensor product smoothing splines that penalizes all non-constant effects of the predictors. In particular, Theorem 1 proposes a representer theorem for univariate smoothing spline-like estimators that penalizes all non-constant functions, Theorem 2 provides a tensor product extension of the proposed estimator, and Theorem 3 develops efficient computational tools for forming tensor product penalties. Furthermore, the spectral tensor product approach developed in Section 5 makes it possible to use exact (instead of approximate) tensor product penalties, which can be easily implemented in any standard mixed effects modeling software. In particular, Theorem 4 presents a spectral representer theorem for univariate smoothing, Theorem 5 provides a tensor product extension of the spectral representation, and Theorem 6 develops efficient computational tools for forming tensor product penalties.
The principal results in this paper reveal that if basis functions are formed by taking Kronecker products of spectral spline representations, then the resulting (exact) penalty matrix is the identity matrix. This implies that it is no longer necessary to choose between approximate penalties and costly parameterizations. Note that the results in this paper provide some theoretical support for the tensor product approach of Wood et al. [20], which uses a similar approach with different basis functions. The simulation results support the theoretical results, given that the proposed approach (which uses the exact penalty) outperforms the approach of Wood et al. [20] in gamm4 [32]. As a result, I expect that the proposed approach will be quite useful for fitting (generalized) nonparametric models using modern mixed effects and penalized regression modeling software such as lme4 or grpnet. Furthermore, I expect that the proposed approach will be useful for conducting inference with tensor product smoothing splines, e.g., using nonparametric permutation tests [37] or standard hypothesis tests for variance components [38].

Figure 1.
Figure 1. Spectral basis functions for different types of reproducing kernel functions using five equidistant knots. Basis functions were evaluated at x_i = (i − 1)/100 for i = 1, . . ., 101 and were scaled for visualization purposes. Produced by R [29] using the rk() function in the grpnet package [30].

Figure 2.
Figure 2. Spectral tensor product basis functions formed from p = 2 cubic smoothing spline marginals with r_k = 5 equidistant knots for each predictor. From left to right, the basis functions become less smooth with respect to X_2. From top to bottom, the basis functions become less smooth with respect to X_1. Produced by R [29] using the rk() function in the grpnet package [30].

Figure 3. Figure 4. Figure 5.
Figure 3. Box plots of the root mean squared error (RMSE) of the function estimate for each method. Rows show results for the additive function (top) and interaction function (bottom). Columns show the results as a function of the number of knots for n = 1000 (left) and n = 2000 (right). Gray boxes denote the results using the gamm4 package, whereas white boxes denote the results using the proposed tpss4 approach. Each box plot summarizes the results across the 100 simulation replications.

Figure 6.
Figure 6. Real data results. (left) Average number of bike rentals by hour and month. (right) Predicted number of bike rentals by hour and month.