Cross-Validation, Information Theory, or Maximum Likelihood? A Comparison of Tuning Methods for Penalized Splines

Functional data analysis techniques, such as penalized splines, have become common tools used in a variety of applied research settings. Penalized spline estimators are frequently used in applied research to estimate unknown functions from noisy data. The success of these estimators depends on choosing a tuning parameter that provides the correct balance between fitting and smoothing the data. Several different smoothing parameter selection methods have been proposed for choosing a reasonable tuning parameter. The proposed methods generally fall into one of three categories: cross-validation methods, information theoretic methods, or maximum likelihood methods. Despite the well-known importance of selecting an ideal smoothing parameter, there is little agreement in the literature regarding which method(s) should be considered when analyzing real data. In this paper, we address this issue by exploring the practical performance of six popular tuning methods under a variety of simulated and real data situations. Our results reveal that maximum likelihood methods outperform the popular cross-validation methods in most situations—especially in the presence of correlated errors. Furthermore, our results reveal that the maximum likelihood methods perform well even when the errors are non-Gaussian and/or heteroscedastic. For real data applications, we recommend comparing results using cross-validation and maximum likelihood tuning methods, given that these methods tend to perform similarly (differently) when the model is correctly (incorrectly) specified.


Introduction
Functional data analysis (FDA) considers the analysis of data that are (noisy) realizations of a functional process [1][2][3]. Such data can be found in many fields [4,5] and are becoming more common in the biomedical and social sciences, e.g., in the form of ecological momentary assessments [6,7] collected using smart phone apps. Most FDA techniques can be interpreted as functional extensions of standard methods used in applied statistics. For example, the nonparametric regression model considered in this paper can be interpreted as a functional extension of the simple model Y = µ + , which is assumed for a one sample location test. A fundamental aspect of many FDA applications is choosing a method to smooth the (noisy) functional data, and splines are one of the most popular smoothing methods used in applications of FDA [2,4]. Note that spline smoothers assume a nonparametric regression model of the form Y i = η(X i ) + i , where η(·) is the unknown mean function.
Nonparametric regression models are frequently used in applied research to estimate unknown functional relationships between variables (e.g., see [8][9][10][11][12][13][14]). Unlike parametric regression models, which assume that the functional relationship between variables has a known form that depends on unknown parameters, nonparametric regression models do not assume that the form of the relationship between variables is known [15,16]. Instead of assuming a particular functional form for the relationship, nonparametric regression models attempt to learn the form of the functional relationship from a sample of noisy data. As a result, nonparametric regression models are a type of statistical learning (e.g., see [17]), given that the collected data enable the researcher to discover functional forms that describe relations between variables. The overarching goal of nonparametric regression is to estimate a function that fits the data well while still maintaining a parsimonious (i.e., smooth) estimate of the functional relationship.
Penalized splines are a popular approach for estimating unknown functional relationships from noisy data. Note that penalized splines are a form of generalized ridge regression [18], where a quadratic smoothness penalty is added to the ordinary least squares loss function. The influence of the smoothness penalty is controlled by a nonnegative smoothing (or tuning) parameter λ ≥ 0, which controls the trade-off between fitting the data well and obtaining a smooth estimate. This paper focuses on the Gaussian-type response, so the fit to the data is measured by the ordinary least squares loss function. More generally, penalized splines can be viewed as a form of penalized likelihood estimation [19], where the goal is to find the function that minimizes where the first term quantifies the fidelity to the data (with n denoting the sample size) and the second term quantifies the (lack of) parsimony of the estimate. As the smoothing parameter λ → 0, the log-likelihood term dominates the loss functional, which causes the estimator to converge to the maximum likelihood estimator. In contrast, as λ → ∞, the penalty term dominates the loss functional, which causes the estimator to converge to a "perfectly smooth" estimator (later defined). When working with finite samples of noisy data, it is desirable to select a λ ∈ (0, ∞) that provides an ideal balance between fitting and smoothing the data. If the signal to noise level is relatively large, it may be possible to manually select a reasonable value of λ via visual inspection. However, for reliable and valid smoothing parameter selection across multiple noise levels, some automated method for selecting λ should be preferred. A variety of different methods have been proposed for automatically selecting an ideal value of λ for a given sample of data. However, there is no general consensus as to which method should be preferred for general situations.
In this paper, we compare six popular tuning methods that can be categorized into three distinct groups: (i) cross-validation based methods, which include ordinary crossvalidation (OCV) and generalized cross-validation (GCV); (ii) information theoretic methods, which include an information criterion (AIC) and the Bayesian information criterion (BIC); and (iii) maximum likelihood methods, which include standard (ML) and restricted maximum likelihood estimation (REML). The cross-validation tuning methods are often the default choice for smoothing parameter selection, e.g., the popular smooth.spline function in R [20] offers both the GCV (default) and OCV tuning options. Despite the popularity of the OCV and GCV, it is known that these tuning criteria can breakdown when the model error terms are correlated [21,22]. In such situations, these CV criteria tend to under-smooth the data (i.e., chooses a λ that is too small) because the structure in the error terms is perceived to be part of the mean structure.
When a researcher has a priori knowledge that the errors are correlated, it is possible to incorporate that knowledge into the estimation problem (e.g., see [23][24][25]). However, in most real data applications, the researcher lacks prior knowledge about the nature of the error correlation structure, so it is not possible to incorporate such information into the estimation process. One (naive) option would be to fit a penalized spline and then to inspect the model residuals in an attempt to learn about the error correlation structure. However, it has been shown that estimating the correlation structure from residuals is difficult, given that the residuals often look uncorrelated even when the error terms are correlated [22]. This is due to the (previously mentioned) issue that the error correlation is often absorbed into the estimate of the mean function when using popular tuning methods such as the OCV and GCV. Consequently, when the error terms may be correlated, some robust alternative tuning approach is needed.
Past research has shown that the cross-validation and maximum likelihood tuning methods have several common features [26]; however, certain tuning criteria may be more robust than the OCV and GCV in the presence of misspecified error structures. Specifically, Krivobokova and Kauermann [27] showed that the REML tuning criterion should be expected to outperform the OCV, GCV, and AIC (with respect to mean function recovery) when the errors are autocorrelated, and Lee [28] found that the AIC is more robust than the cross-validation criteria when the errors have non-constant variance. However, these findings focused on the situation when the errors follow a Gaussian distribution. To the best of our knowledge, no study has thoroughly compared the various tuning criteria under a wide variety of different combinations of error variance, error correlation, and error distribution. In this study, we explore how the distributional properties of the error terms affect not only the mean function recovery (which has been the focus in past studies) but also the standard errors used for inference about the unknown function (which has been largely ignored in past works).
The remainder of this paper is organized as follows. Section 2 provides some background about estimation and inference for smoothing splines. Section 3 presents the six smoothing parameter selection methods: OCV, GCV, AIC, BIC, ML, and REML. Section 4 conducts a thorough simulation study comparing the performance of the tuning methods under a variety of different data generating conditions. Section 5 demonstrates that the different tuning methods can produce noteworthy differences when analyzing real data. Finally, Section 6 discusses the important conclusions drawn from the current work as well as future research directions related to robust estimation and inference for penalized splines.

Smoothing Spline Definition
denote a set of n observations, where y i ∈ R is the real-valued response variable for the i th observation and x i ∈ [a, b] is the predictor variable for the i th observation. Note that the predictor is assumed to be bounded, and we can assume that a = 0 and b = 1 without loss of generality. Consider a nonparametric regression model of the form where η(·) is the unknown smooth function relating x i to y i , and i is the error term for the i th observation. In standard applications of nonparametric regression, the errors are assumed to be independent and identically distributed realizations of a continuous random variable with mean zero, i.e., i iid ∼ (0, σ 2 ). Note that this implies that the unknown function η(·) determines the conditional mean of Y given X. If the errors are correlated and/or heteroscedastic, then η(·) still determines the conditional mean of Y given X, as long as the error terms satisfy E( i ) = 0.

Representation and Computation
The Kimeldorf-Wahba representer theorem [29] reveals that the minimizer of the penalized least squares functional in Equation (3) can be written as where {φ j } m−1 j=0 are known functions that span the null space, κ(·, ·) is the reproducing kernel of the contrast space, and {x * k } r k=1 are the selected knots. Note that the representer theorem uses all observed design points as knots (i.e., r = n and x * i = x i ); however, it is possible to obtain good approximations using r n selected design points as knots [30]. For practical computation, note that the reproducing kernel function has the form where ψ m are Bernoulli polynomials [16,31]. Using the Kimeldorf-Wahba representation in Equation (4), the function estimation reduces to the estimation of the unknown coefficient vectors β = (β 0 , . . . , β m−1 ) and γ = (γ 1 , . . . , γ r ) . The Kimeldorf-Wahba representer theorem implies that the penalized least squares functional in Equation (3) can be written as where y = (y 1 , . . . , y n ) is the response vector, is the contrast space basis function matrix, and Q = [κ(x * j , x * k )] is the penalty matrix. Given λ, the optimal coefficient estimates have the form where A † denotes the Moore-Penrose pseudoinverse of A [32,33]. Note that the coefficient estimates in Equation (7) are unique if the selected knots satisfy 0 ≤ x * 1 < · · · x * r < 1, in which case the matrix Z is a full column rank (assuming that r < n).
In nonparametric regression, we are not typically interested in the values of the coefficients. Instead, we are interested in the fitted values, which have the form where is the "smoothing matrix", which is the nonparametric regression analogue of the "hat matrix" in linear models. Note that the fitted values and smoothing matrix are subscripted with λ, given that the coefficient estimates (and, consequently,ŷ λ and S λ ) change as a function of the smoothing parameter. The trace of the smoothing matrix is denoted by ν λ = tr(S λ ), which is often referred to as the effective degrees of freedom of the function estimate.
Given these assumptions, the posterior distribution of α given y is multivariate normal (α|y) ∼ N(µ α|y , Σ α|y ) with mean vector and covariance matrix which is a well-known property of the multivariate Gaussian distribution (e.g., see [36]). Defining θ 2 = σ 2 nλ , it can be shown that, as τ 2 → ∞, we have the relationŝ where M = [X, Z] is the model matrix and Q * = bdiag(0, Q) is a block diagonal penalty matrix where the zeros correspond to the X portion of M. Note thatμ α|y is the coefficient estimates from Equation (7) and thatΣ α|y is the inner portion of the smoothing matrix from Equation (9). Thus, the smoothing spline estimator can be interpreted as a posterior mean estimator under the specified prior distribution assumptions. The Bayesian interpretation of a smoothing spline can be useful for assessing the uncertainty of the predictions from a fit smoothing spline. First, note that the model predictions can be written asη Using the results in Equation (11), the posterior distribution of η(x) given y is univariate normal (η(x)|y) ∼ N(µ η(x)|y , σ 2 η(x)|y ), where the posterior mean is µ η(x)|y =η λ (x) = ξ μ α|y and the posterior variance is σ 2 η(x)|y = ξ Σ α|y ξ. This implies that the 100(1 − α)% Bayesian "confidence interval" for η(x) has the form where Z 1−α/2 denotes the standard normal critical value that cuts off α/2 in the upper tail. When the model assumptions are reasonable, the Bayesian confidence intervals tend to have "across the function" coverage properties, such that the 100(1 − α)% confidence interval is expected to contain about 100(1 − α)% of the true η(x) values [34,35].

Cross-Validation Methods
Ordinary cross-validation (OCV), which is also referred to as leave-one-out crossvalidation, is a special case of k-fold cross-validation where k (the number of folds) is equal to n (the sample size). This means that each observation has a turn being the "test set" or "test observation" since only one observation is held at a time. The use of OCV for model selection and model assessment was independently discovered by Allen [37] and Stone [38] in the context of regression. Wahba and Wold [39] later suggested the use of OCV when fitting smoothing spline models. The OCV method seeks to find the λ that minimizes where η λ (x i ) is the function that minimizes the penalized least squares functional, leaving out the i th data pair (x i , y i ). More specifically, is the minimizer of the leave-one-out version of the penalized least squares functional. The form of the OCV given in Equation (13) suggests that evaluating the OCV criterion for a given λ requires fitting the model n different times (once for each x i ). Fortunately, the leave-one-out function evaluation can be written in terms of the solution fit to the full sample of data, such as where s ii(λ) is the i th diagonal element of S λ [16]. This implies that the OCV criterion can be evaluated for a given λ via a single fitting of the model. Plugging the form of η which is the computational form of the OCV criterion. The computational form of the OCV in Equation (16) reveals that the OCV can be interpreted as a weighted least squares criterion, where the weights are defined (1 − s ii(λ) ) −2 . The leverage scores satisfy s ii(λ) ∈ (0, 1) so each observation can have a notably different amount of influence on the OCV criterion. To equalize the influence of the observations on the smoothing parameter selection, the generalized cross-validation (GCV) criterion of Craven and Wahba [40] replaces the leverage scores with their average value. More specifically, the GCV method seeks to find the λ that minimizes where ν λ = tr(S λ ) is the effective degrees of freedom of the estimator η λ . The GCV criterion is typically preferred over the OCV criterion, especially when there are replicate x i scores in the sample. Furthermore, assuming that i iid ∼ (0, σ 2 ), the GCV is known to have desirable asymptotic properties (e.g., see [41]).

Information Theory Methods
The information theoretic approaches for selecting λ require more assumptions than are required by the cross-validation based methods. More specifically, the information theory methods assume that i iid ∼ N(0, σ 2 ), which makes it possible to explicitly define the likelihood of the generated data (under the assumption that η(x) is an unknown constant given x). The assumption of iid Gaussian error terms implies that the distribution of the response vector is y ∼ N(η, σ 2 I), where the mean vector is η = Xβ + Zγ. Given a sample of n independent observations and assuming that i iid ∼ N(0, σ 2 ), the log-likelihood function has the form which depends on λ and the error variance σ 2 . The maximum likelihood estimate of the error variance has the formσ 2 2 , and plugging this into the log-likelihood function has the form which only depends on λ throughσ 2 λ . An information criterion (AIC) was proposed by Akaike [42] to compare a set of candidate models, with the goal being to select the model that loses the least amount of information about the (unknown) true data generating process. The AIC method for selecting λ involves selecting the λ that minimizes where ν λ = tr(S λ ) is the effective degrees of freedom of the function estimate. Note that it is possible to use other degrees of freedom estimates for η λ ; however, we prefer the ν λ estimate given that this estimate is used for the GCV criterion. The Bayesian information criterion (BIC) proposed by Schwarz [43] has the form which is similar to the AIC but uses a different weight on the penalty. Assuming that n ≥ 8, the BIC penalty weight of log(n) is larger than the AIC penalty weight of 2, which implies that the BIC will tends to select larger values of λ (i.e., smoother models).

Maximum Likelihood Methods
The maximum likelihood approaches for selecting λ exploit the computational relationship between penalized splines and linear mixed effects models [24,44,45]. This approach uses similar arguments to the Bayesian confidence intervals with the exception that the null space coefficients are treated as fixed effects. More specifically, assume that , and assume that γ is independent of i ∀i. This implies that the response vector is y ∼ N(Xβ, σ 2 Σ λ ), where the null space representation contains the "fixed effects" and the contrast space representation contains the "random effects", with covariance matrix proportional to Σ λ = 1 nλ ZQ † Z + I. Given a sample of n independent observations, the log-likelihood function has the form where r = y − Xβ. The maximum likelihood estimate for σ 2 has the formσ 2 λ(ML) = 1 n r Σ −1 λ r, and plugging this into the log-likelihood function produces which is the (profile) maximum likelihood criterion for selecting λ.
Restricted maximum likelihood (REML) estimation takes into account the reduction in degrees of freedom due to estimating the m null space coefficients [46]. The REML log-likelihood function has the form and the REML estimate for σ 2 has the formσ 2 λ(REML) = 1 n−m r Σ −1 λ r. Plugging this error variance estimate into the log-likelihood function produces the (profile) REML criterion for selecting λ, which has the form whereñ = n − m is the degrees of freedom ofσ 2 λ(REML) . Note that the REML criterion is generally preferred over the ML criterion for variance component estimation, particularly when the sample size is small to moderate.

Simulation Design
To investigate the performance of the tuning parameter selection methods discussed in the previous section, we designed a simulation study that compares the methods under a variety of different data generating conditions. More specifically, we designed a fully crossed simulation study that compares the performance of the tuning methods under all combinations of five different design factors: the function smoothness (three levels: later defined), the error standard deviation (three levels: constant, increasing, and parabolic), the error correlation (three levels: ρ ∈ {0, 1/3, 2/3}), the error distribution (three levels: normal, t 5 , and uniform), and the sample size (four levels: n ∈ {50, 100, 200, 400}). For each combination of data generating conditions, the predictor scores were defined as The three levels of the function smoothness are depicted in Figure 1 (left) and are from the simulation studies of Wahba [34]. The three levels of the error standard deviation are depicted in Figure 1 (right) and are defined as σ(x) = 1 (constant), σ(x) = x + 1/2 (increasing), and σ(x) = 4(x − 1/2) 2 + 1/2 (parabolic). To generate correlated errors, we use an autoregressive process of order one, i.e., AR(1) process, so that Cor(x i , x j ) = ρ |i−j| for all i, j. The generation of correlated multivariate normal (or t) data is straightforward given that the AR(1) correlation structure can be incorporated into the covariance matrix. To generate correlated uniform data, we use the method proposed by Falk [47], which produces uniformly distributed data with desired correlations.

Simulation Analyses
For each of the 324 levels of the simulation design (3 η × 3 σ × 3 ρ × 3 F × 4 n), we generated data according to the model in Equation (2). Within each cell of the simulation design, we generated R = 1000 independent replications of the data. For each sample of generated data, we fit the model using the six smoothing parameter selection methods discussed in the previous section. The models were fit in R [20] via the ss() function in the npreg package [48] using r = 20 knots placed evenly across the range of the x i scores.
To evaluate the performance of the different tuning methods, we calculated the root mean squared error (RMSE) of the estimator, i.e., so smaller values of RMSE indicate better recovery of the true mean function. We also calculated the coverage of the 95% Bayesian confidence interval for each tuning method where I{·} denotes an indicator function, a(η λ ( is the upper bound for the 95% Bayesian CI. Note that, when evaluated at the n design points, the estimated posterior variance has the formσ 2 η(x i )|y =σ 2 λ s ii(λ) . Figure 2 displays the RMSE for each tuning method in each combination of data generating function η, autocorrelation ρ, and sample size n when the errors are Gaussian and homoscedastic. The results reveal that, for all combinations of η and ρ, all of the methods tend to result in better function recovery (i.e., smaller RMSE) as n increases, which was expected. The interesting finding is that the maximum likelihood-based methods (REML and ML) tend to produce RMSE values that are similar to or smaller than the RMSE values produced by the cross-validation-based methods (OCV and GCV) and the information theory-based methods (AIC and BIC). The only exception is that the crossvalidation methods tend to outperform the maximum likelihood methods when all three of the following conditions are true: (i) the sample size is small, (ii) the mean function is rather jagged, and (iii) the errors are independent. When ρ > 0, the maximum likelihood methods produce substantially smaller RMSE values compared with the other methods. This effect exists for all combinations of η and n, but the RMSE difference diminishes as the function roughness and/or the sample size increases. Figure 3 displays the coverage of the 95% Bayesian confidence intervals for each tuning method in each combination of data generating function η, autocorrelation ρ, and sample size n when the errors are Gaussian and homoscedastic. The results reveal that the performance of the Bayesian confidence intervals depends on the combination of the function smoothness, the error autocorrelation, the sample size, and the chosen tuning method. When there is no autocorrelation present, all of the tuning methods except the BIC tend to produce better coverage rates (i.e., closer to the nominal level) as the sample size n increases, regardless of the function smoothness. Furthermore, when there is no autocorrelation present, the maximum likelihood methods and the BIC result in noteworthy under-coverage when the function is jagged and the sample size is small. When there exists moderate autocorrelation in the errors (i.e., ρ = 1/3), all of the tuning methods result in noteworthy under-coverage, with the maximum likelihood methods performing better for smooth functions and the cross-validation methods performing better for more jagged functions. When the autocorrelation is larger (i.e., ρ = 2/3), all of the methods have similarly poor performance.  Our discussion of the results focused on a subset of the simulation conditions, which are sufficient for understanding the primary findings of the simulation study. Specifically, we focused on the results when the errors are Gaussian with constant variance, whereas the results for non-Gaussian and heteroscedastic errors are presented in Appendix A (Gaussian), Appendix B (t 5 ), and Appendix C (uniform). Interestingly, we found that the RMSE and coverage results were quite similar for the non-Gaussian and heteroscedastic cases. In other words, the primary conclusions that were made about the results in Figures 2 and 3 also apply to the results for non-Gaussian distributions with non-constant variance. When the errors followed a multivariate t 5 distribution, the RMSE values tended to be a bit larger for all methods, which is not surprising. However, the primary conclusions regarding the effect of autocorrelation remained the same. It is rather interesting to note that the REML criterion tends to perform relatively well-especially in the presence of autocorrelation in the errors-even when the errors do not follow a Gaussian distribution. Consequently, the REML criterion seems to be a reasonable default tuning criterion as long as the sample size is not too small.

Global Warming Example
Our first example uses global land-ocean temperature data, which are freely available from NASA's Goddard Institute for Space Studies [49,50]. The dataset contains the global land-ocean temperature index from the years 1880 to 2020. Note that the global landocean temperature index is the change in global surface temperatures (in degree Celsius) relative to the 1951-1980 average temperatures. Positive values indicate that the average temperature for a given year was above the average temperature for the years 1951-1980, and negative values indicate that the average temperature for a given year was below the average temperature for the years 1951-1980. In our example, we compare the results of the trend estimate using the GCV and REML criteria for selecting the tuning parameter of the smoothing spline. We use the .nknots.smspl function in R [20] to select the number of knots, which results in the selection of 76 knots. For both tuning methods, we use the same 76 knot values to fit the cubic smoothing spline.
The results are plotted in Figure 4, which reveals that the GCV and REML tuning methods produce drastically different pictures of the temperature change across time. Although both methods show that the global land-ocean temperature index has increased over time (particularly since the 1970s), the GCV and REML solutions tell notably different stories about the nature of the increase. The GCV solution has an estimated degrees of freedom ofν λ = 47.52 and suggests that the temperature index is rather volatile from year to year. In contrast, the REML solution has an estimated degrees of freedom of ν λ = 11.43 and suggests that the temperature index changes in a rather smooth fashion from year to year. Based on the simulation results, it seems likely that the GCV criterion is under-smoothing the data, which may have some noteworthy autocorrelation in the error structure. Consequently, we contend that the REML solution should be preferred for interpreting the nature of the yearly changes in the global land-ocean temperature index.

Motorcycle Accident Example
Our second example uses acceleration data from a simulated motorcycle accident. This dataset was first considered by Silverman [51] and has since been popularized via its inclusion in the popular MASS R package (see mcycle, [52]). The dataset contains the head acceleration (in g) as a function of time (in milliseconds) for n = 133 points of simulated data. The data are meant to simulate the acceleration curve of the head after a motorcycle accident and were simulated for the purpose of evaluating motorcycle helmets. Due to the simulation procedure, the resulting data are noisy realizations of the true acceleration curve, so the data need to be smoothed in order to estimate the expected head acceleration as a function of time. As in the previous example, (i) we compare the results of the curve estimate using the GCV and REML criteria for selecting the tuning parameter of the smoothing spline and (ii) we use the .nknots.smspl function in R to select the number of knots, which resulted in the selection of 61 knots. For both tuning methods, we use the same 61 knot values to fit the cubic smoothing spline.
The results are plotted in Figure 5, which reveals that the GCV and REML tuning methods produce rather similar estimates in this case. The GCV criterionν λ = 12.21 selected a slightly smaller degree of freedom estimate compared with the REML solution ν λ = 13.86. However, from a practical perspective, the two tuning methods result in smoothing spline estimates that produce the same interpretation of the acceleration curve. The estimated acceleration curve reveals that the head experiences negative acceleration (15-30 ms) followed by a rebound effect of positive acceleration (30-40 ms) before loosely stabilizing around zero (40+ ms). Finally, it is worth noting that the data in this example violate the homogeneity of variance assumption, which is required for the Bayesian confidence intervals. Therefore, although the GCV and REML tuning criteria may provide satisfactory estimates of the acceleration curve, the Bayesian confidence intervals do not provide satisfactory estimates of uncertainty-in this case, they suggest too much uncertainty in the estimate from 0 to 15 ms.

Overview
Due to their unique combination of flexibility and interpretability, smoothing splines are frequently used to understand functional relationships in applied research. Unlike standard parametric regression methods (which make strict assumptions) and standard machine learning methods (which produce black-box predictions), smoothing splines are able (i) to learn functional forms and (ii) to produce insightful visualizations. To provide a valid estimate of unknown functional relationships, the smoothing spline estimator requires selecting a smoothing parameter that provides the "correct" balance between fitting and smoothing the data. Specifically, the success of a smoothing spline depends on choosing the tuning parameter λ that satisfies a Goldilocks phenomenon: if λ is too small, the estimator has too much variance, and if λ is too large, the estimator has too much bias. Despite the well-known importance of selecting the "correct" λ, there is little agreement in the literature regarding which method should be used. Tuning methods can produce different results [26,27] so the choice of tuning method matters.
To address this issue, we explored the relative performance of six popular methods used to select the smoothing parameter λ. Unlike previous studies on this topic, (i) we compared a diverse collection of tuning methods, which included cross-validation, information theoretic, and maximum likelihood methods; (ii) we designed an extensive simulation study that evaluated each method's performance under a variety of data generating conditions; and (iii) we assessed the performance of the methods with respect to both function estimation and statistical inference. Furthermore, we used both simulated and real data examples to demonstrate the substantial differences that different smoothing methods can have on the solution. As we elaborate in the following paragraphs, the primary take-home message from our work is that any real data application should compare the results using both the GCV and REML smoothing parameter selection criteria-which is rarely performed in practice.

Summary of Results
Our simulation results replicate several important findings and provide novel insights about the performance of different tuning methods for smoothing splines. The finding that common tuning methods (i.e., OCV, GCV, and AIC) can breakdown in the presence of autocorrelated errors replicates several past studies on the topic [21,22]. Furthermore, our finding that the REML and ML tuning criteria are relatively robust in the presence of autocorrelated errors supports the theoretical results of Krivobokova and Kauermann [27]. In addition to replicating these known results, our simulation produced several important and novel findings: (i) the performance of the Bayesian confidence intervals deteriorates as the degree of autocorrelation increases; (ii) the cross-validation tuning methods only show an advantage over REML/ML when n is small, η is rough, and ρ = 0; and (iii) the superior performance of the REML/ML tuning methods persists even when the errors are non-Gaussian and/or heteroscedastic.
Our real data results demonstrate the important role that the smoothing parameter selection method plays in understanding functional relationships from a fit smoothing spline. Using the GCV versus the REML criterion, a researcher would arrive at a remarkably different interpretation of the global warming trends. Since these are real data, we cannot be sure whether the GCV or REML solution is closer to the truth. However, in this case, it seems likely that the GCV criterion has capitalized on autocorrelation in the error terms, which manifests itself as a part of the mean structure. In contrast, the REML solution seems to provide a rather parsimonious and intuitive estimate of the global warming trends, which agrees with visual intuition about the nature of the trends across time. Of course, the specifics of the global warming example do not generalize to other datasets, e.g., the two tuning methods performed similarly for the motorcycle data. However, this example illustrates how (i) the GCV criterion can under-smooth data and (ii) the REML criterion can overcome this under-smoothing issue.

Limitations and Future Directions
This paper only considers the nonparametric regression model in Equation (2), where the unknown function η(·) describes the conditional mean of Y given X. Accordingly, this paper only compares tuning methods that are applicable to the penalized least squares problem in Equation (3). Although quite general, the model in Equation (2) can be extended in a variety of different ways. For example, in a generalized nonparametric regression model, the function η(·) describes the conditional mean of an exponential family response variable as a function of a predictor [16]. As another example, in nonparametric quantile regression, the function η(·) describes the condi-tional quantile of a response variable as a function of a predictor [53]. Furthermore, penalized splines can be incorporated into other types of nonparametric estimators, e.g., M-estimators [54]. These extensions require different estimation and tuning methods, so the results in this paper cannot be generalized to such extensions. Thus, future work is needed to determine which tuning methods should be preferred when penalized splines are used for various FDA extensions of the simple nonparametric regression model in Equation (2).

Conclusions
When using smoothing splines for real data analysis, it seems that many researchers do not give much thought to the smoothing parameter selection method. This is likely because many software implementations of smoothing splines do not emphasize the importance of the tuning method. Furthermore, most softwares only implement one or two tuning methods, so researchers rarely have the option to explore a multitude of tuning methods. For example, the smooth.spline() function in R [20] only offers the OCV and GCV tuning methods, and most users seem to (purposefully or unwittingly) use the default GCV tuning method. It is important to note that the GCV method is also the default in the mgcv R package [55], the bigsplines R package [56], and the npreg R package [48]; however, these packages offer more tuning options. For a flexible alternative to R's smooth.spline() function, we recommend the ss() function from the npreg package, given that it has nearly identical syntax to the smooth.spline() function and makes it possible to easily compare the results using multiple tuning criteria. Funding: This research was funded by the following National Institutes of Health (NIH) grants: R01EY030890, R01MH115046, U01DA046413, and R01MH112583.

Data Availability Statement:
The supporting materials published with this paper include the data and R code needed to reproduce the simulation and real data results. The global temperature anomaly data were obtained from https://data.giss.nasa.gov/gistemp/ accessed on 17 May 2021.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:   Figure A16. Boxplots of the coverage rate for the 95% Bayesian confidence interval across the 1000 simulation replications for uniform errors with parabolic standard deviation.