Robust Permutation Tests for Penalized Splines

Abstract: Penalized splines are frequently used in applied research for understanding functional relationships between variables. In most applications, statistical inference for penalized splines is conducted using the random effects or Bayesian interpretation of a smoothing spline. These interpretations can be used to assess the uncertainty of the fitted values and the estimated component functions. However, statistical tests about the nature of the function are more difficult, because such tests often involve testing a null hypothesis that a variance component is equal to zero. Furthermore, valid statistical inference using the random effects or Bayesian interpretation depends on the validity of the utilized parametric assumptions. To overcome these limitations, I propose a flexible and robust permutation testing framework for inference with penalized splines. The proposed approach can be used to test omnibus hypotheses about functional relationships, as well as more flexible hypotheses about conditional relationships. I establish the conditions under which the methods will produce exact results, as well as the asymptotic behavior of the various permutation tests. Additionally, I present extensive simulation results to demonstrate the robustness and superiority of the proposed approach compared to commonly used methods.


Penalized Spline Prevalence
Penalized splines are used within modern multiple and generalized nonparametric regression frameworks [1,2], such as the smoothing spline analysis of variance models [3][4][5] and generalized additive models [6][7][8], to discover unknown functional relationships between a response variable Y and a collection of predictors X_1, ..., X_d. Penalized splines and their variants have been applied to understand functional relationships in data from a variety of different disciplines. For example, recent applications include the use of penalized splines to model spatiotemporal patterns in US homelessness [9], biomechanical analysis of locomotion data [10,11], fear learning curves in veterans with PTSD [12], functional properties of successful smiles [13], self-esteem trajectories across the lifespan [14], and spatiotemporal trends in social media [15].
In addition to being a commonly used tool in applied research, penalized splines are also frequently the focus of theoretical and computational statistics research. For example, there has been recent interest in fitting penalized spline models with ordinal predictors and responses [16,17], mixed-effects models with penalized spline components [18][19][20], efficient approximation for large samples [21,22], and efficient algorithms for fitting penalized spline models to big data [23,24]. Furthermore, given the frequent usage of penalized splines for the analysis of real world data, there has been a growing interest in the robustness of penalized spline tuning and inference methods under model misspecification [25]. Recently, there has also been the development of alternative spline penalization approaches for nonparametric function estimation from noisy data [26].

Penalized Spline Definition
Consider a multiple nonparametric regression model [1,2] of the form

Y_i = η(X_i) + ε_i,  i = 1, ..., n,  (1)

where Y_i is the i-th realization of the response variable Y ∈ R, X_i = (X_i1, ..., X_id) is the i-th realization of the predictor variable X ∈ X = X_1 × ··· × X_d, and ε_i is the i-th realization of the error term ε ∈ R. The error terms are assumed to satisfy E(ε_i) = 0 and E(ε_i²) = σ_i² with σ_i² < ∞ denoting the i-th observation's error variance. To estimate η, it is typical to minimize the penalized least squares (PLS) functional

(1/n) ∑_{i=1}^n (Y_i − η(X_i))² + λ J(η),

where J(·) is the penalty functional, and λ ≥ 0 is the smoothing parameter, which controls the balance between fitting and smoothing.
Given λ, the Kimeldorf-Wahba representer theorem [27] reveals that the minimizer of the PLS functional has the form

η(X) = a + ∑_{v=1}^{m−1} b_v N_v(X) + ∑_{i=1}^{r} c_i R(X, X*_i),

where {N_v}_{v=0}^{m−1} are known functions that span the null space H_0 = {η : J(η) = 0} with N_0(X) = 1, the symmetric and bivariate function R(·,·) is the known reproducing kernel of the contrast space H_1 = {η : J(η) < ∞}, the collection of predictor scores {X*_i}_{i=1}^{r} are the selected spline knots, a ∈ R is the unknown intercept, and b = (b_1, ..., b_{m−1})' and c = (c_1, ..., c_r)' are the unknown basis function coefficient vectors. The representer theorem uses all design points as knots (i.e., r = n and X*_i = X_i), but reasonable approximations can be obtained using r < n knots [21,22,28].

Penalized Spline Estimation
Using the representer theorem, the nonparametric regression model can be written as

Y_i = a + N_i'b + R_i'c + ε_i,

where N_i = (N_1(X_i), ..., N_{m−1}(X_i))' is the null space basis function vector for the i-th observation, and R_i = (R(X_i, X*_1), ..., R(X_i, X*_r))' is the reproducing kernel function evaluated at X_i and the selected knots. Let Y^c_i = Y_i − Ȳ denote the centered response, and let N^c_i = N_i − N̄ and R^c_i = R_i − R̄ denote the centered basis functions, where Ȳ = (1/n)∑_{i=1}^n Y_i, N̄ = (1/n)∑_{i=1}^n N_i, and R̄ = (1/n)∑_{i=1}^n R_i. Then, the PLS functional can be rewritten as

(1/n) ‖Y^c − N^c b − R^c c‖² + λ c'Qc,

where Y^c = (Y^c_1, ..., Y^c_n)', N^c = [N^c_1, ..., N^c_n]', and R^c = [R^c_1, ..., R^c_n]' are the mean centered basis function matrices, and Q = [R(X*_i, X*_j)] is the penalty matrix. Let d = (b_1, ..., b_{m−1}, c_1, ..., c_r)' denote the combined basis function coefficient vector, and let K^c = (N^c, R^c) denote the combined (centered) basis function matrix. Given the smoothing parameter λ, it is well known that the basis function coefficients that minimize the PLS functional can be written as

d̂_λ = (K^c'K^c + nλQ*)† K^c'Y^c,

where Q* = bdiag(0_{(m−1)×(m−1)}, Q) is the block diagonal penalty matrix, and (·)† denotes the Moore-Penrose pseudo-inverse [29,30]. The least squares estimate of the intercept term can then be written as â = Ȳ − K̄'d̂_λ, where K̄ = (N̄', R̄')'. The fitted values have the form η̂_λ = â + η̂_0λ + η̂_1λ, where η̂_0λ = N^c b̂_λ is the (non-constant) null space contribution, and η̂_1λ = R^c ĉ_λ is the contrast space contribution with d̂_λ = (b̂_λ', ĉ_λ')'.
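To make the closed-form solution concrete, the following sketch fits a penalized spline by direct linear algebra. It is written in Python for illustration (the paper's computations use R); the function names, the textbook cubic-spline reproducing kernel R(x, z) = min(x,z)²(3 max(x,z) − min(x,z))/6, and the single null space basis function N_1(x) = x (i.e., m = 2) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cubic_rk(x, z):
    # A textbook reproducing kernel for cubic smoothing splines on [0, 1]
    # (illustrative choice; the paper's R(.,.) depends on the chosen penalty J).
    lo, hi = np.minimum(x, z), np.maximum(x, z)
    return lo ** 2 * (3.0 * hi - lo) / 6.0

def fit_penalized_spline(x, y, knots, lam):
    # Solve d_hat = (Kc'Kc + n*lam*Qstar)^+ Kc' Yc and return the fitted values.
    n = len(y)
    N = x.reshape(-1, 1)                           # null space basis N_1(x) = x
    R = cubic_rk(x[:, None], knots[None, :])       # kernel evaluated at knots
    Q = cubic_rk(knots[:, None], knots[None, :])   # penalty matrix Q
    ybar = y.mean()
    Kc = np.hstack([N - N.mean(0), R - R.mean(0)])  # centered combined basis
    Qstar = np.zeros((1 + len(knots), 1 + len(knots)))
    Qstar[1:, 1:] = Q                              # zero block for the null space
    d = np.linalg.pinv(Kc.T @ Kc + n * lam * Qstar) @ Kc.T @ (y - ybar)
    # Fitted values: a_hat + K_i'd = ybar + Kc d (since a_hat = ybar - Kbar'd).
    return ybar + Kc @ d
```

The centering trick means the intercept never enters the penalized solve, matching the estimation strategy described above.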

Bayesian Interpretation
Given the estimated function η̂_λ, statistical inference about the unknown function η is often conducted using the Bayesian interpretation of a smoothing spline [31,32]. This approach assumes that η = η_0 + η_1, where the null space function η_0 has a vague prior, and the contrast space function η_1 has a Gaussian process prior with mean zero and covariance proportional to Q†. Given these prior assumptions, it can be demonstrated that the posterior distribution of η given Y is multivariate normal with mean vector η̂_λ and covariance matrix σ² K(K'K + nλQ*)†K', where K = (1_n, N, R). When the smoothing parameter λ is chosen via the generalized cross-validation (GCV) criterion [33], confidence intervals formed using the Bayesian covariance matrix tend to have an "across-the-function" coverage property [32]. See [25] for an investigation of the Bayesian CI coverage properties.
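As an illustration of how such intervals are formed in practice, the sketch below builds Bayesian-style pointwise intervals from the smoothing matrix A_λ = K(K'K + nλQ*)†K'. The truncated power basis, the identity penalty on the knot coefficients, and the fixed λ are simplifying assumptions for illustration only, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = np.linspace(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)

# Truncated power basis: intercept, linear term, and cubic knot terms.
knots = np.quantile(x, np.linspace(0.1, 0.9, 5))
K = np.column_stack([np.ones(n), x] + [np.clip(x - k, 0.0, None) ** 3 for k in knots])
Qstar = np.diag([0.0, 0.0] + [1.0] * len(knots))  # penalize only the knot coefficients
lam = 1e-3

A = K @ np.linalg.pinv(K.T @ K + n * lam * Qstar) @ K.T  # smoothing matrix A_lambda
eta_hat = A @ y                                          # fitted values
nu = np.trace(A)                                         # effective degrees of freedom
sigma2 = np.sum((y - eta_hat) ** 2) / (n - nu)           # error variance estimate
se = np.sqrt(sigma2 * np.diag(A))                        # Bayesian-style pointwise SEs
lower, upper = eta_hat - 1.96 * se, eta_hat + 1.96 * se  # 95% intervals
```

With a data-driven λ (e.g., GCV), these intervals exhibit the "across-the-function" coverage property discussed above; here λ is fixed purely to keep the sketch short.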
The Bayesian confidence intervals can be used to assess the uncertainty of the fitted values and the component functions, i.e., the main and interaction effects of the d predictors [34,35]. However, testing hypotheses about the nature of the function η is a more difficult problem. Various different approaches have been proposed for testing hypotheses about penalized smoothers [36][37][38][39][40][41][42][43]. However, these methods are primarily designed for testing specific hypotheses (e.g., H 0 : η ∈ H 0 ) under the assumption of homoscedastic Gaussian errors. The exception is Wood's (2013a) approach, which is designed to be more general, but the validity of Wood's approach depends on having a correctly specified parametric (exponential family) distribution. As a result, the generalizability of the existing inference methods is limited, and the validity of these methods is suspect when the parametric assumptions are questionable.

Proposed Approach
When working with real data, it can be difficult to determine whether or not the parametric assumptions that are required for valid hypothesis testing are reasonably met. Furthermore, if the error terms are non-Gaussian and/or heteroscedastic, the previously mentioned hypothesis tests may produce substantially misleading inferential results. To overcome this important practical issue, I propose a flexible permutation testing framework for robust inference in nonparametric regression models. The proposed approach extends recent advances in robust permutation tests for linear models [44] to generalized ridge regression (GRR) and penalized smoothing problems. As I demonstrate in the following sections, the proposed framework can be used for overall (omnibus) tests about functional relationships, as well as more specific (conditional) tests.
The remainder of this paper is organized as follows: Section 2 develops the foundations and theory for omnibus permutation tests using GRR estimators, Section 3 extends the permutation testing framework to conditional tests of effects, Section 4 presents extensive simulation results to validate the theoretical results derived in the previous sections, and Section 5 discusses how the proposed framework can be flexibly adapted for testing a variety of hypotheses about semi- and non-parametric regression models.

Model and Estimation
Given an independent sample of n observations, consider the linear regression model

Y_i = α + X_i'β + ε_i  (2)

for i ∈ {1, ..., n}, where Y_i is the i-th realization of the response variable Y ∈ R, X_i = (X_i1, ..., X_ip)' is the i-th realization of the predictor vector X = (X_1, ..., X_p)' ∈ R^p, and ε_i is the i-th realization of the error term ε ∈ R. The error terms are assumed to satisfy E(ε_i) = 0 and E(ε_i²) = σ_i² with σ_i² < ∞ denoting the i-th observation's (finite) error variance. Without loss of generality, we can assume that the response and predictors are centered, which implies that the intercept α can be dropped from the model for estimation purposes. Let Y^c_i = Y_i − Ȳ and X^c_i = X_i − X̄ denote the mean centered response and predictor vector for the i-th observation, where Ȳ = (1/n)∑_{i=1}^n Y_i and X̄ = (1/n)∑_{i=1}^n X_i. To estimate the coefficients in β, consider minimizing the GRR loss function [45]

(1/n) ‖Y^c − X^c β‖² + β'∆β,  (3)

where Y^c = (Y^c_1, ..., Y^c_n)', X^c = [X^c_1, ..., X^c_n]' is the centered design matrix, and ∆ is a p × p symmetric and positive semi-definite penalty matrix. The coefficients that minimize the GRR problem in Equation (3) have the form

β̂_∆ = (X^c'X^c + n∆)^{-1} X^c'Y^c,  (4)

which is subscripted to emphasize that the estimated coefficients depend on the penalty matrix ∆. Given the estimated slope vector β̂_∆, the least squares estimate of the intercept has the form α̂ = Ȳ − X̄'β̂_∆.
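A minimal numerical sketch of the GRR estimator in Equation (4), written in Python for illustration (the function name grr_fit is hypothetical). With ∆ = 0_{p×p} the estimator reduces to ordinary least squares, which gives a convenient sanity check.

```python
import numpy as np

def grr_fit(X, y, Delta):
    # Generalized ridge regression: minimize (1/n)||yc - Xc b||^2 + b' Delta b,
    # which gives b_hat = (Xc'Xc/n + Delta)^{-1} Xc'yc/n.
    n = len(y)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    beta = np.linalg.solve(Xc.T @ Xc / n + Delta, Xc.T @ yc / n)
    alpha = y.mean() - X.mean(axis=0) @ beta  # intercept: alpha_hat = ybar - xbar'beta_hat
    return alpha, beta
```

Because the loss is quadratic, the solution is a single linear solve; larger penalty matrices shrink the slope vector toward zero.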

Asymptotic Distributions
Consider the linear regression model from Equation (2) with assumptions A1-A3, which include the requirement that (1/n)X^c'X^c + ∆ is almost surely invertible. In the fixed predictors case, the mean vector is µ_X = (1/n)∑_{i=1}^n X_i and the covariance matrix terms are defined as Σ_X = (1/n)X^c'X^c and Ω_X = (1/n)X^c'ΨX^c, where Ψ = diag(σ_1², ..., σ_n²). Given assumptions A1-A3, the GRR estimator β̂_∆ = Σ̂_{X∆}^{-1}σ̂_XY provides an estimate of β_∆ = Σ_{X∆}^{-1}σ_XY, with Σ̂_X = (1/n)X^c'X^c and σ̂_XY = (1/n)X^c'Y^c. This implies that the GRR estimator can be written as β̂_∆ = Σ̂_{X∆}^{-1}Σ̂_X β̂, where Σ̂_{X∆} = Σ̂_X + ∆ and β̂ = Σ̂_X^{-1}σ̂_XY is the ordinary least squares (OLS) estimator.
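The identity β̂_∆ = Σ̂_{X∆}^{-1}Σ̂_X β̂ can be checked numerically; a small sketch (all variable names illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 4
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

Xc = X - X.mean(axis=0)
yc = y - y.mean()
S_X = Xc.T @ Xc / n                    # Sigma_hat_X
s_XY = Xc.T @ yc / n                   # sigma_hat_XY
Delta = 0.5 * np.eye(p)                # arbitrary positive definite penalty
S_XD = S_X + Delta                     # Sigma_hat_{X Delta}

beta_ols = np.linalg.solve(S_X, s_XY)             # OLS estimator
beta_grr = np.linalg.solve(S_XD, s_XY)            # GRR estimator
beta_alt = np.linalg.solve(S_XD, S_X @ beta_ols)  # Sigma_{X Delta}^{-1} Sigma_X beta_hat
```

The two GRR expressions agree because S_X beta_ols = s_XY by the OLS normal equations.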

Lemma 1.
Given assumptions A1-A3, the GRR estimator β̂_∆ from Equation (4) is asymptotically normal with mean vector β_∆ and covariance matrix (1/n)Σ_{X∆}^{-1}Ω_X Σ_{X∆}^{-1}, i.e., √n(β̂_∆ − β_∆) →d N(0_p, Σ_{X∆}^{-1}Ω_X Σ_{X∆}^{-1}) as n → ∞, where the notation →d denotes convergence in distribution.

Lemma 1 can be proved using results of White [46], who demonstrated that the OLS estimator β̂ is asymptotically normal with mean vector β and covariance matrix (1/n)Σ_X^{-1}Ω_X Σ_X^{-1} under assumptions A1-A3. Noting that β̂_∆ = Σ̂_{X∆}^{-1}Σ̂_X β̂ and β_∆ = Σ_{X∆}^{-1}Σ_X β completes the proof of Lemma 1, given that Σ̂_X is a consistent estimator of Σ_X. See Appendix A.1 for details.

Lemma 2.
Consider the linear model in Equation (2) with the assumptions β ∼ N(0_p, (σ²/n)∆^{-1}) and ε_i iid∼ N(0, σ²), and suppose that β and ε_i are independent of one another. Under these assumptions, the posterior distribution of β given Y is multivariate normal with mean vector β̂_∆ and covariance matrix (σ²/n)Σ̂_{X∆}^{-1}. The asymptotic mean vector is β_∆ and the asymptotic covariance matrix is (σ²/n)Σ_{X∆}^{-1}.

Lemma 2 can be proved by using the results of Henderson [47,48], who derived the covariance matrix of the best linear unbiased estimator (BLUE) and the best linear unbiased predictor (BLUP) in linear mixed models (e.g., see [49]). The asymptotic mean vector and covariance matrix result from the facts that Σ̂_X and σ̂_XY are consistent estimators of Σ_X and σ_XY, respectively. See Appendix A.2 for details.

Test Statistics
Consider the linear model in Equation (2), and suppose that we want to test the null hypothesis H_0: β = 0_p versus the alternative hypothesis H_1: β ≠ 0_p, where the notation 0_p denotes a p × 1 vector of zeros. Assuming that β ∼ N(0_p, (σ²/n)∆^{-1}) and ε_i iid∼ N(0, σ²) are independent of one another, one may consider using the F statistic

F = n β̂_∆'Σ̂_{X∆}β̂_∆ / (p σ̂²),  (5)

where σ̂² = (1/(n−p−1))∑_{i=1}^n ε̂_i² is the estimated error variance, and ε̂_i = Y_i − Ŷ_i is the residual with Ŷ_i = α̂ + X_i'β̂_∆ denoting the fitted value for the i-th observation. Note that if H_0 is true and β is fixed, then the F statistic would approach an F distribution with degrees of freedom parameters p and n − p − 1 as ∆ → 0. However, under the assumptions of Lemma 2, the F statistic in Equation (5) will not follow an F_{p,n−p−1} distribution under H_0, and it may produce asymptotically invalid results when used in a permutation test.
The assumption ε_i iid∼ N(0, σ²) may be questionable in many real data situations, where the error terms may be non-Gaussian and/or heteroscedastic. If the error terms are heteroscedastic, i.e., if E(ε_i²) = σ_i² depends on i, then the F statistic may not produce valid results even in a permutation test (see [44,50,51]). Note that even if ε_i iid∼ (0, σ²) for some non-Gaussian distribution, the F statistic may produce asymptotically invalid results when used in a permutation test. Instead, consider the Wald test statistic

W = n β̂_∆'Σ̂_{X∆}Ω̂_X^{-1}Σ̂_{X∆}β̂_∆,  (6)

where Ω̂_X = (1/n)X^c'D²_ε̂ X^c with D_ε̂ = diag(ε̂_1, ..., ε̂_n). Under assumptions A1-A3, the W statistic asymptotically follows a χ²_p distribution when H_0 is true, which is a result of Lemma 1 and the consistency of the estimators Σ̂_X and Ω̂_X under H_0.
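A sketch of the robust Wald statistic, assuming the form W = n β̂_∆'Σ̂_{X∆}Ω̂_X^{-1}Σ̂_{X∆}β̂_∆ reconstructed above (the function name is hypothetical, and the sketch is illustrative rather than the paper's implementation):

```python
import numpy as np

def wald_statistic(X, y, Delta):
    # Heteroscedasticity-robust Wald statistic for H0: beta = 0_p.
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    S_XD = Xc.T @ Xc / n + Delta                     # Sigma_hat_{X Delta}
    beta = np.linalg.solve(S_XD, Xc.T @ yc / n)      # GRR estimate
    resid = yc - Xc @ beta
    Omega = (Xc * (resid ** 2)[:, None]).T @ Xc / n  # Omega_hat = Xc' D_e^2 Xc / n
    v = S_XD @ beta
    return float(n * v @ np.linalg.solve(Omega, v))
```

Because Ω̂_X weights each x_i x_i' term by its own squared residual, the statistic remains asymptotically χ²_p under H_0 even when the error variances differ across observations.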

Permutation Inference
Let the vector π = (π_1, ..., π_n) denote some permutation of the integers {1, ..., n}, and define Y_π = (Y^c_{π_1}, ..., Y^c_{π_n})' to be the permuted (and centered) response vector using the permutation vector π. Furthermore, let F(Y_π, X) and W(Y_π, X) denote the test statistics from Equations (5) and (6) calculated using the permuted response vector Y_π. When X_i is independent of ε_i, the permutation test conducted using W(Y_π, X) will be exact, given that Ω_X = E(σ_i²)Σ_X when X_i and ε_i are independent. However, the permutation test using F(Y_π, X) is not guaranteed to be exact or asymptotically valid, given that the asymptotic sampling distribution of F(Y, X) may not be the same as the permutation distribution (see [44]). When there is dependence between X_i and ε_i, the permutation test conducted using F(Y_π, X) will be inexact and asymptotically invalid, whereas the permutation test conducted using W(Y_π, X) will be inexact and asymptotically valid. Define the additional assumption A4: E(Y_i⁴) < ∞ and E(X_ij⁴) < ∞ ∀j.

Theorem 1. Consider the linear model with assumptions A1-A4. When β = 0_p, the permutation distribution of W(Y_π, X) converges to a χ²_p distribution as n → ∞.
Theorem 1 can be proved by combining the results in Lemma 1 with the results in Theorem 3.1 of DiCiccio and Romano [44], who derived the asymptotic nature of the permutation distribution W(Y π , X) for the OLS estimatorβ. Note that Theorem 1 reveals that a permutation test using W(Y π , X) will produce asymptotically level α rejection rates when the null hypothesis H 0 : β = 0 p is true.
Let the vector ψ = (ψ_1, ..., ψ_n) denote a resigning vector where ψ_i ∈ {−1, 1} ∀i, and define Y_ψ = (ψ_1 Y^c_1, ..., ψ_n Y^c_n)' to be the resigned (and centered) response vector using the resigning vector ψ. Furthermore, let F(Y_ψ, X) and W(Y_ψ, X) denote the test statistics from Equations (5) and (6) calculated using the resigned response vector Y_ψ. When X_i is independent of ε_i and the errors are symmetric, the permutation test conducted using W(Y_ψ, X) will be exact, but the permutation test using F(Y_ψ, X) may be invalid. When X_i is not independent of ε_i, the permutation test conducted using F(Y_ψ, X) will be inexact and asymptotically invalid, whereas the permutation test conducted using W(Y_ψ, X) will be inexact and asymptotically valid, regardless of whether or not the errors are symmetric.

Theorem 2. Consider the linear model with assumptions A1-A4. When β = 0_p, the permutation distribution of W(Y_ψ, X) converges to a χ²_p distribution as n → ∞.
Theorem 2 can be proved by combining the results in Lemma 1 with the results in Theorem 3.2 of DiCiccio and Romano [44], who derived the asymptotic nature of the permutation distribution W(Y ψ , X) for the OLS estimatorβ. Note that Theorem 2 reveals that a permutation test using W(Y ψ , X) will produce asymptotically level α rejection rates when the null hypothesis H 0 : β = 0 p is true.
Corollary 1. Define Y_πψ = (ψ_1 Y^c_{π_1}, ..., ψ_n Y^c_{π_n})' to be the permuted and resigned (centered) response vector using the permutation vector π and resigning vector ψ. Consider the linear model with assumptions A1-A4. When β = 0_p, the permutation distribution of W(Y_πψ, X) converges to a χ²_p distribution as n → ∞, where W(Y_πψ, X) is the test statistic in Equation (6), calculated using the permuted and resigned response vector Y_πψ.
Corollary 1 follows directly from the results of Theorems 1 and 2.
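The permutation, sign-flipping, and combined resampling schemes can be sketched in a few lines. This is an illustrative Monte Carlo version in Python (the paper's experiments use np.reg.test() from the nptest R package; the function names here are hypothetical):

```python
import numpy as np

def wald_stat(Xc, yc, Delta):
    # Robust Wald statistic W(y, X) for H0: beta = 0_p (centered inputs).
    n = len(yc)
    S = Xc.T @ Xc / n + Delta
    b = np.linalg.solve(S, Xc.T @ yc / n)
    e = yc - Xc @ b
    Omega = (Xc * (e ** 2)[:, None]).T @ Xc / n
    v = S @ b
    return float(n * v @ np.linalg.solve(Omega, v))

def perm_test(X, y, Delta, method="permute", n_resamples=999, seed=0):
    # Monte Carlo p-value comparing the observed W with W on permuted
    # ("permute"), sign-flipped ("flip"), or permuted-and-flipped ("both") copies.
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w_obs = wald_stat(Xc, yc, Delta)
    count = 1  # the observed statistic belongs to the resampling distribution
    for _ in range(n_resamples):
        yp = yc.copy()
        if method in ("permute", "both"):
            yp = yp[rng.permutation(len(yp))]
        if method in ("flip", "both"):
            yp = yp * rng.choice([-1.0, 1.0], size=len(yp))
        count += wald_stat(Xc, yp, Delta) >= w_obs
    return count / (n_resamples + 1)
```

Counting the observed statistic in both the numerator and denominator yields a valid p-value for any number of Monte Carlo resamples.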

Model and Estimation
Given an independent sample of n observations, consider the linear regression model

Y_i = α + X_i'β + Z_i'γ + ε_i  (7)

for i ∈ {1, ..., n}, where Y_i is the i-th realization of the response variable Y ∈ R, X_i = (X_i1, ..., X_ip)' and Z_i = (Z_i1, ..., Z_iq)' are the i-th realizations of the predictor vectors X = (X_1, ..., X_p)' ∈ R^p and Z = (Z_1, ..., Z_q)' ∈ R^q, and ε_i is the i-th realization of the error term ε ∈ R. The predictor variables are assumed to be partitioned into two sets, such that X contains the variables of interest for inference purposes, and Z contains the covariates that will be conditioned on. Let M_i = (X_i', Z_i')' denote the combined predictor vector, and let θ = (β_1, ..., β_p, γ_1, ..., γ_q)' denote the combined coefficient vector. Furthermore, let M^c_i = M_i − M̄ denote the mean centered (combined) predictor vector, where M̄ = (1/n)∑_{i=1}^n M_i is the sample average of the combined predictor vector. To estimate the coefficients in θ, consider minimizing the GRR loss function

(1/n) ‖Y^c − M^c θ‖² + θ'∆θ,  (8)

where M^c = [M^c_1, ..., M^c_n]' is the centered design matrix, and ∆ is an r × r symmetric and positive semi-definite penalty matrix (where r = p + q is the total number of slope coefficients). The coefficients that minimize Equation (8) can be written as

θ̂_∆ = (M^c'M^c + n∆)^{-1} M^c'Y^c,  (9)

which is subscripted to emphasize that the estimated coefficients depend on the penalty matrix ∆. Given the estimated slope vector θ̂_∆, the least squares estimate of the intercept has the form α̂ = Ȳ − M̄'θ̂_∆.

Lemma 3.
Given assumptions B1-B3, the GRR estimator θ̂_∆ from Equation (9) is asymptotically normal with mean vector θ_∆ and covariance matrix (1/n)Σ_{M∆}^{-1}Ω_M Σ_{M∆}^{-1}, i.e., √n(θ̂_∆ − θ_∆) →d N(0_r, Σ_{M∆}^{-1}Ω_M Σ_{M∆}^{-1}) as n → ∞, where the notation →d denotes convergence in distribution.

Lemma 4.
Consider the linear model in Equation (7) with the assumptions θ ∼ N(0_r, (σ²/n)∆^{-1}) and ε_i iid∼ N(0, σ²), and suppose that θ and ε_i are independent of one another. Under these assumptions, the posterior distribution of θ given Y is multivariate normal with mean vector θ̂_∆ and covariance matrix (σ²/n)Σ̂_{M∆}^{-1}. The asymptotic mean vector is θ_∆ and the asymptotic covariance matrix is (σ²/n)Σ_{M∆}^{-1}.
Note that Lemmas 3 and 4 can be proved using analogues of the results that were used to prove Lemmas 1 and 2. Specifically, for Lemma 3, we can use a direct analogue of the proof for Lemma 1 with θ ∆ replacing β ∆ and the matrices Ω M and Σ M∆ replacing the matrices Ω X and Σ X∆ . For Lemma 4, we can use a direct analogue of the proof for Lemma 2 with θ ∆ replacing β ∆ and Σ M∆ replacing Σ X∆ . For both cases, we need to replace the unpenalized coefficient vector β with the unpenalized coefficient vector θ.

Test Statistics
Consider the linear model in Equation (7), and suppose that we want to test the null hypothesis H_0: β = 0_p versus the alternative hypothesis H_1: β ≠ 0_p. This is the same null hypothesis that was considered in the previous section, but now the nuisance effects Z_i'γ are included in the model (i.e., conditioned on) while testing the significance of the β vector. Assuming that θ ∼ N(0_r, (σ²/n)∆^{-1}) and ε_i iid∼ N(0, σ²) are independent of one another, we could use the F test statistic

F = n θ̂_∆'S'(S Σ̂_{M∆}^{-1} S')^{-1} S θ̂_∆ / (p σ̂²),  (10)

where S = [I_p, 0_{p×q}] is a p × r selection matrix such that Sθ_∆ = β_∆ (note that I_p is the p × p identity matrix and 0_{p×q} is a p × q matrix of zeros), σ̂² = (1/(n−r−1))∑_{i=1}^n ε̂_i² is the estimated error variance, and ε̂_i = Y_i − Ŷ_i is the residual with Ŷ_i = α̂ + M_i'θ̂_∆ denoting the fitted value for the i-th observation.
When H_0 is true and the assumptions in Lemma 4 are met, the F statistic approaches an F distribution with degrees of freedom parameters p and n − r − 1 as ∆ → 0. For non-zero penalties, the F statistic in Equation (10) will not follow an F_{p,n−r−1} distribution, and may produce asymptotically invalid results when used in a permutation test, especially when the error terms are heteroscedastic (see [44,50,51]). In such cases, the Wald test statistic should be preferred:

W = n (Sθ̂_∆)'(S Σ̂_{M∆}^{-1} Ω̂_M Σ̂_{M∆}^{-1} S')^{-1}(Sθ̂_∆),  (11)

where Ω̂_M = (1/n)M^c'D²_ε̂ M^c with D_ε̂ = diag(ε̂_1, ..., ε̂_n). Under assumptions B1-B3, the W statistic asymptotically follows a χ²_p distribution when H_0 is true, which is a result of Lemma 3 (and the consistency of the estimators Σ̂_M and Ω̂_M).

Table 1 depicts eight different permutation methods that have been proposed for testing the significance of regression coefficients in the presence of nuisance parameters. The eight methods can be split into three different groups: (i) methods that permute the rows of X [52][53][54], (ii) methods that permute Y with Z included in the model [55][56][57], and (iii) methods that permute Y after partialling out Z [58][59][60]. All of these methods were originally proposed for use with the OLS estimator θ̂ and the F test statistic. Recent works have incorporated the use of the robust W test statistic with these permutation methods [44,50,51]. However, these authors only considered a theoretical analysis of the DS and FL permutation methods, and no previous works seem to have studied these methods using the GRR estimators from Equations (4) and (9).

To understand the motivation of the various permutation methods in Table 1, assume that the penalty matrix is block diagonal, i.e., ∆ = bdiag(∆_X, ∆_Z), where ∆_X and ∆_Z denote the penalty matrices for β and γ, respectively. Using the well-known form for the inverse of a block matrix [61][62][63][64], the coefficient estimates from Equation (9) can be partitioned into separate expressions for β̂_∆ and γ̂_∆.

Permutation Inference
In the partitioned expression, the matrix P_Z = I_n − H_Z is the residual forming matrix for the model that only includes the nuisance effects, i.e., Y_i = α + Z_i'γ + ε_i. This implies that the (centered) fitted values for the nuisance-only model can be written as Ŷ_Z = H_Z Y^c, where H_Z = Z^c(Z^c'Z^c + n∆_Z)^{-1}Z^c' is the hat matrix for the linear model that only includes the nuisance effects. Thus, when ∆_Z = 0_{q×q}, all of the permutation methods except SW will produce the same observed F statistic (when the permutation matrix is P = I_n).
Consider the additional assumption that the response and predictor variables have finite fourth moments, i.e., B4: E(Y_i⁴) < ∞, E(X_ij⁴) < ∞ ∀j, and E(Z_ik⁴) < ∞ ∀k. Assume B1-B4 and that the null hypothesis H_0: β = 0_p is true. Using the W test statistic from Equation (11), the following can be said about the finite sample and asymptotic properties of the various permutation methods in Table 1: the DS method is exact when X ⊥⊥ (Y, Z) and asymptotically valid otherwise; the MA method is exact when Y ⊥⊥ (X, Z) and asymptotically valid otherwise; the SW method is inexact and is asymptotically valid only when E[(X − µ_X)(Z − µ_Z)'] = 0_{p×q}; the other five methods (OS, FL, TB, KC, HJ) are inexact and asymptotically valid.
The asymptotic behaviors of the DS and FL methods were proved by DiCiccio and Romano [44]. The asymptotic validity of the OS method can be proved using a similar result as used for the DS method. The asymptotic validity of the MA method can also be proved using a similar result as used for the DS method. The asymptotic validity of the TB method can be proved using a similar result as used for the FL method. Finally, note that the KC and HJ methods are asymptotically equivalent, and the SW method is asymptotically equivalent to the KC and HJ methods when X and Z are uncorrelated. The asymptotic validity of the KC and HJ methods follows from the results in Theorem 1, given that these methods permute the response after partialling out the nuisance effects. It is important to note that if X and Z are correlated, the SW method will produce asymptotically invalid results because Z is partialled out of Y but not X.
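As one concrete example from Table 1, the Freedman-Lane (FL) strategy permutes the residuals of the nuisance-only fit and adds them back to the nuisance fitted values before recomputing the statistic. A hedged Python sketch, with ∆ = 0 for simplicity (so the nuisance fit is OLS) and hypothetical function names:

```python
import numpy as np

def robust_wald(M, y, p):
    # Robust Wald statistic for the first p coefficients of an OLS fit of y on M
    # (the GRR version would add the penalty matrix to S below).
    n = M.shape[0]
    Mc = M - M.mean(axis=0)
    yc = y - y.mean()
    S = Mc.T @ Mc / n
    theta = np.linalg.solve(S, Mc.T @ yc / n)
    e = yc - Mc @ theta
    Omega = (Mc * (e ** 2)[:, None]).T @ Mc / n
    Sinv = np.linalg.inv(S)
    cov = Sinv @ Omega @ Sinv / n          # sandwich covariance of theta_hat
    b = theta[:p]                          # coefficients of interest (X block first)
    return float(b @ np.linalg.solve(cov[:p, :p], b))

def freedman_lane_test(X, Z, y, n_resamples=499, seed=0):
    # FL permutation: fit the nuisance-only model, permute its residuals,
    # rebuild the response, and recompute the statistic with Z in the model.
    rng = np.random.default_rng(seed)
    n = len(y)
    Z1 = np.column_stack([np.ones(n), Z])
    gamma, *_ = np.linalg.lstsq(Z1, y, rcond=None)
    fitted = Z1 @ gamma
    resid = y - fitted
    M = np.column_stack([X, Z])
    w_obs = robust_wald(M, y, X.shape[1])
    count = 1
    for _ in range(n_resamples):
        y_star = fitted + resid[rng.permutation(n)]
        count += robust_wald(M, y_star, X.shape[1]) >= w_obs
    return count / (n_resamples + 1)
```

Because Z stays in the model for every resample, the test conditions on the nuisance effects while breaking any dependence between X and the permuted errors.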

Simulation A
The first simulation study was designed to explore the validity of the claims made about the omnibus permutation tests discussed in Section 2. Simulation A was a fully crossed design that manipulated three design factors: (i) the data generating distribution (three levels: left skew normal, standard normal, right skew normal), (ii) the error standard deviation (three levels: constant, increasing, and parabolic), and (iii) the sample size (five levels: n ∈ {10, 25, 50, 100, 200}); see Figure 1. For each of the 45 combinations of simulation design parameters (3 F × 3 σ × 5 n), I generated 10,000 independent copies of the data from the model in Equation (1) with X_i = (i − 1)/(n − 1) and η(X_i) = 0, i.e., with no functional relationship between X and Y. For each generated sample of data, the ss() function in the npreg R package [65] was used to fit a cubic smoothing spline with r = 5 knots, which were placed at the quantiles of the predictor scores. The generalized cross-validation (GCV) criterion [33] was used to select the smoothing parameter λ.
Ten different inference methods (six permutation tests and four parametric tests) were used to test the null hypothesis of no functional relationship. The six permutation tests were formed using the two test statistics (F and W) combined with the three permutation methods discussed in Section 2: permute Y, sign-flip Y, and both permuting and sign flipping. The permutation tests were implemented using the np.reg.test() function in the nptest R package [66] using the default number of resamples (the default uses R = min(R_0, 9999) resamples, where R_0 is the number of elements of the exact null distribution: R_0 = n! for the permute method, R_0 = 2^n for the sign-flip method, and R_0 = n!2^n for the combined method). The first two parametric tests were formed by comparing the F statistic from Equation (5) to an F_{p,n−p−1} distribution, and the W statistic from Equation (6) to a χ²_p distribution. The other two parametric tests were the F tests that are implemented by the mgcv R package [67] and the npreg R package [65]. Note that these F tests compare the F statistic from Equation (5) to an F_{ν,n−ν−1} distribution, where ν is an estimate of the effective degrees of freedom. The npreg package defines ν as the trace of the smoothing matrix, whereas the mgcv package uses a more complex estimate of ν (see [42]).

Figure 2 displays the type I error rate for each inference method in each combination of Simulation A conditions. All of the inference methods perform similarly across the three data generating distributions, so the following discussions of the results apply to each of the three data generating distributions. When using the W test statistic, the three permutation methods produced accurate type I error rates for all combinations of data generating conditions (see red square, blue circle, and green triangle), and the parametric χ²_p approximation produced asymptotically accurate results (see purple plus sign).
When using the F test statistic, the permutation methods produced inflated type I error rates (see orange x, yellow diamond, and brown upside-down triangle), and the parametric F p,n−p−1 approximation produced inconsistent results across the different error standard deviation conditions (see pink square with x). The parametric F ν,n−ν−1 tests implemented by the mgcv and npreg packages also produce inconsistent results across the different error standard deviation conditions, such that the type I error rate is inflated in the increasing and parabolic conditions (see gray asterisk and black diamond with plus sign).

Simulation B
The second simulation study was designed to explore the validity of the claims made about the conditional permutation tests discussed in Section 3. Simulation B was a fully crossed design that manipulated two design factors: (i) the error standard deviation (three levels: constant, increasing, and parabolic), and (ii) the sample size (five levels: n ∈ {10, 25, 50, 100, 200}). Given that the results in Simulation A did not noticeably differ across the three data generating distributions, errors were generated from a normal distribution throughout Simulation B. For each of the 15 combinations of data generating parameters (3 σ × 5 n), I generated 10,000 independent copies of the data from the model in Equation (1) with X i = (i − 1)/(n − 1) and η(X i ) = X i (unlike Simulation A, the data generating mean function now includes a linear effect). As in the previous simulation, the ss() function in the npreg R package [65] was used to fit a cubic smoothing spline with r = 5 knots, and the GCV criterion was used to select the smoothing parameter.
The null hypothesis of a linear relationship was tested using a total of 19 inference methods: 16 permutation tests and 3 parametric tests. The 16 permutation tests were formed using the two test statistics (F and W) combined with the eight permutation methods in Table 1. As in Simulation A, the permutation tests were implemented using the np.reg.test() function in the nptest R package [66] using the default number of resamples (i.e., R = 9999) to form the permutation distribution. The first two parametric tests were formed by comparing the F statistic from Equation (10) to an F_{p,n−r−1} distribution, and the W statistic from Equation (11) to a χ²_p distribution. The other parametric test is the F test that is implemented by the npreg package, which compares the F statistic from Equation (10) to an F_{ν−m,n−ν−1} distribution. Note that the "cardinal" spline parameterization used in the mgcv R package does not separate the linear and non-linear portions of the function; thus, a test of linearity is not possible using this package.

Figure 3 displays the type I error rate for each inference method in each combination of Simulation B conditions. When the errors have constant variance, using the F statistic produces (i) inflated type I error rates using all permutation methods, (ii) deflated type I error rates using the parametric test, and (iii) asymptotically accurate error rates using the npreg test. In contrast, when using the W test statistic, all of the inference methods except SW produce asymptotically accurate error rates when the errors have constant variance. As expected, the F statistic produces asymptotically invalid results when the errors have non-constant variance, with the performance being the worst in the parabolic condition. In contrast, the W statistic produced asymptotically valid type I error rates when the errors have non-constant variance (using all methods except SW).

Summary of Findings
Penalized splines are frequently used in applied research for understanding functional relationships between variables. Although there has been a considerable body of work on the estimation and computation of penalized splines, there have been relatively few papers on statistical inference about the nature of the functional relation. In most applications, the statistical inference for penalized splines is conducted using the random effects or Bayesian interpretation of a smoothing spline. These inferential frameworks rely on parametric assumptions about the unknown function and error terms, i.e., that η is a Gaussian process and i are iid Gaussian variables. Even when these parametric assumptions are met, valid statistical inference can be challenging due to the need for a reasonable estimate of the degrees of freedom of the penalized spline estimate. Furthermore, when these parametric assumptions are incorrect (e.g., due to heteroscedastic and/or non-Gaussian errors), the standard inferential tools can produce substantially misleading results.
In this paper, I developed a flexible and robust permutation testing framework for inference using penalized splines. Unlike the existing methods for statistical inference with penalized splines, the proposed methods can provide asymptotically valid inferential results under a variety of data-generating situations. Furthermore, unlike a majority of the existing methods for hypothesis testing with penalized splines, the proposed approach can be flexibly adapted to test a variety of standard and non-standard hypotheses about the nature of the functional relationships between variables. The omnibus test in Section 2 is frequently of interest in practical applications, but has been ignored by most inferential tests for penalized splines (e.g., see [42]). The conditional test in Section 3 can be used to test the classic hypothesis η ∈ H_0 (which has been considered in many papers), as well as more specialized hypotheses about the main and/or interaction effects of predictors.
The simulation results in Section 4 clearly demonstrate the benefits of the proposed approach over the standard methods used for inference with penalized splines. In particular, the simulation results demonstrate that classic (parametric) methods that rely on an F test statistic can produce substantially inflated type I error rates when the error terms are heteroscedastic. Furthermore, the simulation results demonstrate that the F test statistic can produce inaccurate results even when used in a permutation test. In contrast, the permutation tests using the robust W test statistic can produce exact results for the omnibus tests and asymptotically valid results for the conditional tests, even when the errors depend on the predictor(s). Moreover, the χ²_p approximation using the robust W statistic produced asymptotically valid results that were reasonably close to the nominal α = 0.05 rate with only n = 100 (for the omnibus test) or n = 200 (for the conditional test).
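The contrast between the classic F statistic and the robust W statistic under heteroscedasticity can be reproduced in miniature. The Python sketch below uses an assumed heteroscedastic design of my own (it is not the paper's Simulation B setup, and the function names are hypothetical): null data that are truly linear but whose error variance grows with x², with the parametric F test compared to the χ²_p approximation of a sandwich-based W statistic. The expectation, mirroring the simulation findings, is that the F test rejects well above the nominal 0.05 level while the robust W test stays near it.

```python
import numpy as np
from scipy.stats import chi2, f

def one_null_dataset(rng, n=200):
    """Linear truth with error s.d. that grows in x^2 (heteroscedastic null)."""
    x = rng.uniform(-1, 1, n)
    y = 1 + 2 * x + (0.2 + x**2) * rng.normal(size=n)
    return x, y

def reject_flags(x, y, alpha=0.05):
    """Do the classic F test and the robust W test reject linearity?"""
    n = len(y)
    X0 = np.column_stack([np.ones(n), x])        # null (linear) design
    X = np.column_stack([X0, x**2, x**3])        # add curvature columns
    b1, *_ = np.linalg.lstsq(X, y, rcond=None)
    b0, *_ = np.linalg.lstsq(X0, y, rcond=None)
    e1, e0 = y - X @ b1, y - X0 @ b0
    p, dfres = 2, n - X.shape[1]
    F = ((e0 @ e0 - e1 @ e1) / p) / (e1 @ e1 / dfres)    # classic F statistic
    bread = np.linalg.inv(X.T @ X)
    V = bread @ (X.T @ (X * e1[:, None] ** 2)) @ bread   # HC0 sandwich
    b = b1[2:]                                           # curvature coefficients
    W = float(b @ np.linalg.solve(V[2:, 2:], b))
    return F > f.ppf(1 - alpha, p, dfres), W > chi2.ppf(1 - alpha, p)

rng = np.random.default_rng(1)
flags = np.array([reject_flags(*one_null_dataset(rng)) for _ in range(1000)])
rej_F, rej_W = flags.mean(axis=0)   # Monte Carlo type I error estimates
```

Because the high-variance observations coincide with the high-leverage points of the curvature columns, the classical error-variance estimate understates the sampling variability of those coefficients, which is exactly the mechanism behind the inflated F rejection rates.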

Future Directions
Although the simulation studies only explored the omnibus and conditional tests with a single predictor, the theoretical results in Sections 2 and 3 can be readily applied to penalized spline models with multiple predictors. With d > 1 predictors, the number of possible hypotheses that could be tested increases, given that it is possible to conduct omnibus or conditional tests about any of the main or interaction effects. The single-predictor simulation results revealed that the permutation methods of Kennedy and Cade [59] and Huh and Jhun [60] performed the best for conditional tests, so I hypothesize that these procedures will also be ideal for tests of main and/or interaction effects with d > 1 predictors. However, future theoretical and Monte Carlo simulation work is needed to determine which permutation strategies should be preferred for conditional tests of main and/or interaction effects of multiple predictors.
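As a concrete illustration of the residualization idea behind the Kennedy and Cade style of conditional permutation test, the Python sketch below projects the nuisance design out of both the response and the test columns, then permutes the residualized response. It is a schematic approximation under assumed names, not the procedure as implemented in nptest.

```python
import numpy as np

def kennedy_conditional_pvalue(Z, X, y, R=999, seed=1):
    """Permutation p-value for H0: X contributes nothing beyond the
    nuisance design Z (Kennedy-style residualization strategy)."""
    rng = np.random.default_rng(seed)
    H = Z @ np.linalg.pinv(Z)           # orthogonal projector onto span(Z)
    ry, RX = y - H @ y, X - H @ X       # residualized response / test columns
    def score(yy):
        # proportion of residual variation explained by the test columns
        b, *_ = np.linalg.lstsq(RX, yy, rcond=None)
        fit = RX @ b
        return float(fit @ fit / ((yy - fit) @ (yy - fit)))
    s_obs = score(ry)
    # permuting the residualized response breaks any association with X,
    # while the nuisance structure has already been removed from both sides
    s_perm = [score(rng.permutation(ry)) for _ in range(R)]
    return (1 + sum(s >= s_obs for s in s_perm)) / (R + 1)
```

With a strong conditional effect, the observed score dominates every permuted score, so the p-value collapses to its lower bound of 1/(R + 1).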
The theoretical results in Sections 2 and 3 were developed for the classic formulation of penalized splines, which solves a GRR problem to obtain the function estimate. Note that the ridge penalty in the penalized spline GRR problem is a special case of the elastic net penalty [68], which is used in the recently proposed kernel eigenvector smoothing and selection operator (kesso) regression method [26]. Although the theorems derived in Sections 2 and 3 assume a ridge penalty, it is straightforward to develop extensions of these formulas for the elastic net penalties used in kesso regression. Specifically, the vector version of the coefficients, which is given after Equation (8) in [26], can be used to express the kesso coefficients as a modification of the least squares coefficients. However, future Monte Carlo research is needed to explore the accuracy of the various permutation strategies when using elastic net penalties to fit penalized spline regression models.
Finally, it is worth noting that this paper only considered the GCV tuning method, which is one of several smoothing parameter selection methods available in the ss() function. Recent work has demonstrated that the different tuning methods tend to produce similar results across a wide variety of data-generating conditions, including non-normal and/or heteroscedastic errors (see [25]). However, it was noted that the maximum likelihood based tuning method tended to perform (i) worse than the cross-validation based methods when the sample size was small and (ii) better than the cross-validation based methods when the errors were correlated. Given these past findings, I would not expect the simulation results in this paper to change substantially if a different tuning criterion were used instead of the GCV. However, future research should explore the performance of the proposed permutation testing approach using different combinations of tuning methods and permutation strategies across a wide variety of data-generating conditions.

Funding: This research was funded by the following National Institutes of Health (NIH) grants: R01EY030890, R01MH115046, U01DA046413, and R43AG074740.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: The supporting information contains the following files:
simA: R script for reproducing Simulation A (.R file)
simB: R script for reproducing Simulation B (.R file)
sstest: R function for the omnibus and conditional smoothing spline permutation tests (.R file)

Conflicts of Interest: The author declares no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A. Proof of Lemma 1
To prove Lemma 1, one needs to demonstrate that under assumptions A1-A3, the OLS estimate $\hat{\beta} = \hat{\Sigma}_X^{-1} \hat{\sigma}_{XY}$ is asymptotically normal with mean vector $\beta = \Sigma_X^{-1} \sigma_{XY}$ and covariance matrix $\frac{1}{n} \Sigma_X^{-1} \Omega_X \Sigma_X^{-1}$. In other words, one needs to demonstrate that
$$ \sqrt{n} \, (\hat{\beta} - \beta) \overset{d}{\to} \mathrm{N}\big( 0_p, \; \Sigma_X^{-1} \Omega_X \Sigma_X^{-1} \big), $$
where $\overset{d}{\to}$ denotes that the random vector on the left-hand side converges in distribution to the probability distribution specified on the right-hand side. This result, which is originally due to White [46], can be derived using basic rules of expectation and covariance operators in combination with a multivariate central limit theorem.
First, under assumptions A1-A3, note that the expectation of $\hat{\beta}$ is
$$
\begin{aligned}
E(\hat{\beta}) &= E\{(X_c' X_c)^{-1} X_c' Y_c\} \\
&= E\{(X_c' X_c)^{-1} X_c' Y\} \\
&= \beta + E\{(X_c' X_c)^{-1} X_c' \epsilon\} \\
&= \beta ,
\end{aligned}
$$
where the second line is due to the fact that $X_c' Y_c = X_c' Y$, the third line is due to the facts that $X_c' 1_n = 0_p$ and $X_c' X = X_c' X_c$ by definition, and the fourth line is due to the fact that $E(X_c' \epsilon) = 0_p$ by assumption A2. Second, under assumptions A1-A3, note that the covariance matrix of $\hat{\beta}$ satisfies
$$
\begin{aligned}
\mathrm{Cov}(\hat{\beta}) &= E(\hat{\beta} \hat{\beta}') - \beta \beta' \\
&= E\big\{\big(\beta + (X_c' X_c)^{-1} X_c' \epsilon\big)\big(\beta + (X_c' X_c)^{-1} X_c' \epsilon\big)'\big\} - \beta \beta' \\
&= E\{(X_c' X_c)^{-1} X_c' \epsilon \epsilon' X_c (X_c' X_c)^{-1}\} \\
&\asymp \tfrac{1}{n} \Sigma_X^{-1} \Omega_X \Sigma_X^{-1} ,
\end{aligned}
$$
where the first line uses the previous result from the expectation derivation, the second line uses the fact that $\beta$ is a constant vector, the third line uses the fact that $E\{(X_c' X_c)^{-1} X_c' \epsilon\} = 0_p$ by assumption A2, and the fourth line uses the definitions of $\Sigma_X$ and $\Omega_X$ given in assumption A3. Note that the notation $\asymp$ should be read as 'is asymptotically equal to'.
Thus, under assumptions A1-A3, the OLS coefficient estimates have mean vector $E(\hat{\beta}) = \beta$ and asymptotic covariance matrix $\mathrm{Cov}(\hat{\beta}) \asymp \frac{1}{n} \Sigma_X^{-1} \Omega_X \Sigma_X^{-1}$. The asymptotic multivariate normality of $\hat{\beta}$ results from applying the multivariate central limit theorem using the consistent estimators $\hat{\Sigma}_X = \frac{1}{n} X_c' X_c$ and $\hat{\Omega}_X = \frac{1}{n} X_c' \, \mathrm{diag}(\hat{\epsilon}_1^2, \ldots, \hat{\epsilon}_n^2) \, X_c$ in place of the unknown asymptotic covariance matrix components $\Sigma_X$ and $\Omega_X$.
Finally, note that the asymptotic expectation of $\hat{\beta}_\Delta$ has the form
$$ E(\hat{\beta}_\Delta) \asymp \Sigma_{X\Delta}^{-1} \Sigma_X \beta = \beta_\Delta , $$
and the asymptotic covariance of $\hat{\beta}_\Delta$ has the form
$$ \mathrm{Cov}(\hat{\beta}_\Delta) \asymp \Sigma_{X\Delta}^{-1} \Sigma_X \big( \tfrac{1}{n} \Sigma_X^{-1} \Omega_X \Sigma_X^{-1} \big) \Sigma_X \Sigma_{X\Delta}^{-1} = \tfrac{1}{n} \Sigma_{X\Delta}^{-1} \Omega_X \Sigma_{X\Delta}^{-1} , $$
given that $\hat{\beta}_\Delta = \hat{\Sigma}_{X\Delta}^{-1} \hat{\Sigma}_X \hat{\beta}$, which completes the proof.
where the second line plugs in the definitions of the parameters, the third line uses Equation (17) of Henderson and Searle [70] to more conveniently write the matrix inverse $\Sigma_{YY}^{-1}$, and the last line results from straightforward algebraic simplification of the third line. Similarly, the posterior covariance matrix of $\beta$ given $Y$ has the form where the second line plugs in the definitions of the parameters, the third line plugs in the convenient representation of $\Sigma_{YY}^{-1}$ from Equation (17) of Henderson and Searle [70], and the final line results from straightforward algebraic manipulation of the third line. Noting that $\mu_{\beta|Y} = \hat{\beta}_\Delta$ and $\Sigma_{\beta|Y} = \frac{\sigma^2}{n} \hat{\Sigma}_{X\Delta}^{-1}$ completes the proof.