Bandwidth Selection in Nonparametric Regression with Large Sample Size

In nonparametric regression estimation, the behaviour of kernel methods such as the Nadaraya-Watson or local linear estimators is heavily influenced by the bandwidth parameter, which governs the trade-off between bias and variance. Selecting an optimal bandwidth, in the sense of minimizing some risk function (MSE, MISE, etc.), is therefore a crucial issue. However, estimating an optimal bandwidth from the whole sample can be very expensive in computing time in the context of Big Data, owing to the computational complexity of some of the most widely used bandwidth selection algorithms (leave-one-out cross-validation, for example, has $O(n^2)$ complexity). To overcome this problem, we propose two methods that estimate the optimal bandwidth for several subsamples of the large dataset and then extrapolate the result to the original sample size by means of the asymptotic expression of the MISE bandwidth. Preliminary simulation studies show that the proposed methods drastically reduce computing time, while statistical precision decreases only slightly.

Let us consider a sample of size $n$, $\{(x_i, y_i)\}_{i=1,\ldots,n}$, drawn from the nonparametric regression model $y_i = m(x_i) + \varepsilon_i$. We assume random design, $E[\varepsilon \mid x] = 0$ and $E[\varepsilon^2 \mid x] = \sigma^2(x) < \infty$. In this context, we deal with the Nadaraya-Watson estimator [1] of the regression function $m$, which is characterized by the kernel function $K$ and the bandwidth or smoothing parameter $h > 0$. Under suitable conditions, the asymptotically optimal bandwidth (in the sense of minimum AMISE) satisfies

$$h_{\mathrm{AMISE},n} = c\,n^{-1/5}, \qquad (1)$$

where $c > 0$ is a constant that does not depend on $n$. Since we are assuming that the sample size, $n$, is very large, computing a bandwidth selector from the whole sample would be too computationally expensive. For example, the leave-one-out cross-validation (LOO CV) bandwidth selector has complexity $O(n^2)$.
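
To make the quadratic cost concrete, the following minimal Python sketch implements the Nadaraya-Watson estimator and a brute-force LOO CV selector over a grid of candidate bandwidths. The Gaussian kernel, the grid search, the function names and the simulated data are illustrative assumptions; this is not the authors' implementation.

```python
import math
import random


def gaussian_kernel(u):
    """Standard Gaussian kernel (an assumed choice of K)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nadaraya_watson(x0, xs, ys, h):
    """Nadaraya-Watson estimate of m(x0): a kernel-weighted average of ys."""
    weights = [gaussian_kernel((x0 - xi) / h) for xi in xs]
    total = sum(weights)
    if total == 0.0:  # every weight underflowed: h is too small at x0
        return float("nan")
    return sum(w * yi for w, yi in zip(weights, ys)) / total


def loo_cv_score(xs, ys, h):
    """Mean squared leave-one-out prediction error for bandwidth h.

    The nested loop over (i, j) is what makes the criterion O(n^2).
    """
    n = len(xs)
    sse = 0.0
    for i in range(n):
        num = den = 0.0
        for j in range(n):
            if j != i:  # leave observation i out
                w = gaussian_kernel((xs[i] - xs[j]) / h)
                num += w * ys[j]
                den += w
        if den == 0.0:  # h is too small to predict observation i
            return float("inf")
        sse += (ys[i] - num / den) ** 2
    return sse / n


def loo_cv_bandwidth(xs, ys, grid):
    """Grid-search LOO CV selector (hypothetical helper)."""
    return min(grid, key=lambda h: loo_cv_score(xs, ys, h))


if __name__ == "__main__":
    # Simulated data, chosen only for illustration.
    rng = random.Random(0)
    n = 200
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2.0 * math.pi * x) + rng.gauss(0.0, 0.3) for x in xs]
    grid = [0.01 * 1.3 ** k for k in range(12)]
    print("LOO CV bandwidth:", loo_cv_bandwidth(xs, ys, grid))
```

Each evaluation of `loo_cv_score` touches every ordered pair of observations, so doubling the sample size roughly quadruples the work per candidate bandwidth.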

Bandwidth Selection
The idea behind our proposal is to find the LOO CV bandwidth for several subsamples and then extrapolate the result to the original sample size using the asymptotic expression of the MISE bandwidth (1).

One Subsample Size (OSS)
The idea behind this method is to draw several subsamples of size r, much smaller than n, then compute the LOO CV selector and finally use Equation (1) to extrapolate the CV bandwidth for the original sample size (this idea was already proposed in [2] in the context of kernel density estimation to reduce the variance of the CV bandwidth selector).

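1. Obtain $s$ subsamples of size $r \ll n$ by subsampling without replacement from the original dataset.

2. For each subsample, find the LOO CV bandwidth.

3. Let $\hat{h}_r$ denote the average of these bandwidths.

4. Extrapolate to the original sample size: the resulting estimate of the AMISE bandwidth is $\hat{h}_{\mathrm{AMISE},n} = (r/n)^{1/5}\,\hat{h}_r$, which follows from the $n^{-1/5}$ rate in Equation (1).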
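
The OSS procedure can be sketched in a few lines of Python. The sketch assumes a Gaussian kernel, a grid-search LOO CV selector and the $n^{-1/5}$ rate of Equation (1) for the extrapolation; the names `loo_cv_bandwidth` and `oss_bandwidth`, the simulated data and the values of $r$ and $s$ are hypothetical choices made for illustration only.

```python
import math
import random
from statistics import mean


def loo_cv_bandwidth(xs, ys, grid):
    """Grid-search LOO CV selector for the Nadaraya-Watson estimator.

    Uses a Gaussian kernel (assumed); its normalising constant cancels
    in the Nadaraya-Watson ratio, so it is omitted.
    """
    def score(h):
        sse = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            num = den = 0.0
            for j, (xj, yj) in enumerate(zip(xs, ys)):
                if j != i:  # leave observation i out
                    w = math.exp(-0.5 * ((xi - xj) / h) ** 2)
                    num += w * yj
                    den += w
            if den == 0.0:  # h is too small to predict observation i
                return float("inf")
            sse += (yi - num / den) ** 2
        return sse / len(xs)

    return min(grid, key=score)


def oss_bandwidth(xs, ys, r, s, grid, rng):
    """One Subsample Size selector (illustrative sketch)."""
    n = len(xs)
    cv_bandwidths = []
    for _ in range(s):
        idx = rng.sample(range(n), r)  # subsample without replacement
        sub_x = [xs[i] for i in idx]
        sub_y = [ys[i] for i in idx]
        cv_bandwidths.append(loo_cv_bandwidth(sub_x, sub_y, grid))
    h_r = mean(cv_bandwidths)  # average LOO CV bandwidth at size r
    # Rescale from size r to size n, assuming h is proportional to n^(-1/5).
    return h_r * (r / n) ** 0.2


if __name__ == "__main__":
    # Simulated data, chosen only for illustration.
    rng = random.Random(0)
    n = 20000
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2.0 * math.pi * x) + rng.gauss(0.0, 0.3) for x in xs]
    grid = [0.01 * 1.3 ** k for k in range(12)]
    print("OSS bandwidth:", oss_bandwidth(xs, ys, r=100, s=5, grid=grid, rng=rng))
```

The cost is of order $s\,r^2$ per candidate bandwidth instead of $n^2$, which is where the saving in computing time comes from when $r \ll n$.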

Several Subsample Sizes (SSS)
We now propose a method that considers several subsamples of different sizes.

1. Consider a grid of subsample sizes, $r_1, \ldots, r_s$, with $r_j \ll n$.

2. For each $r_j$, compute the LOO CV bandwidth, $\hat{h}_j$ (several subsamples of each size could be considered).

3. Solve the ordinary least squares problem (or a robust analogue) given by

$$(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0, \beta_1} \sum_{j=1}^{s} \left(\log \hat{h}_j - \beta_0 - \beta_1 \log r_j\right)^2,$$

in which case $\hat{c} = e^{\hat{\beta}_0}$, and $\hat{p} = \hat{\beta}_1$ is our estimate of the order of convergence of the AMISE bandwidth.

4. Our estimate of the AMISE bandwidth for the original sample size, $n$, is $\hat{h}_{\mathrm{AMISE},n} = \hat{c}\,n^{\hat{p}}$.
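
The regression and extrapolation steps of the SSS procedure amount to a simple linear fit on the log-log scale. The Python sketch below uses the closed-form ordinary least squares solution; the function names are hypothetical, and the bandwidths in the demonstration are synthetic placeholders generated from an assumed power law, not LOO CV bandwidths and not results from this study.

```python
import math
import random


def fit_log_log(sizes, bandwidths):
    """OLS fit of log(h_j) = b0 + b1 * log(r_j); returns (c_hat, p_hat)."""
    u = [math.log(r) for r in sizes]
    v = [math.log(h) for h in bandwidths]
    u_bar = sum(u) / len(u)
    v_bar = sum(v) / len(v)
    sxx = sum((ui - u_bar) ** 2 for ui in u)
    sxy = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    b1 = sxy / sxx             # slope: estimated order of convergence
    b0 = v_bar - b1 * u_bar    # intercept: log of the estimated constant
    return math.exp(b0), b1


def sss_bandwidth(sizes, bandwidths, n):
    """Extrapolate the fitted power law c_hat * n ** p_hat to sample size n."""
    c_hat, p_hat = fit_log_log(sizes, bandwidths)
    return c_hat * n ** p_hat


if __name__ == "__main__":
    rng = random.Random(0)
    # Uniform grid of 20 subsample sizes (assumed values).
    sizes = [100 * (k + 1) for k in range(20)]
    # Synthetic bandwidths: 0.8 * r^(-1/5) with multiplicative noise.
    # In the actual method these would be LOO CV bandwidths.
    bandwidths = [0.8 * r ** -0.2 * math.exp(rng.gauss(0.0, 0.05)) for r in sizes]
    c_hat, p_hat = fit_log_log(sizes, bandwidths)
    print("c_hat:", c_hat, "p_hat:", p_hat)
    print("Extrapolated bandwidth:", sss_bandwidth(sizes, bandwidths, 10 ** 6))
```

Unlike the OSS rescaling, this fit does not fix the exponent at $-1/5$ but estimates it from the data, at the price of computing bandwidths at several subsample sizes.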

It is clear from Figure 1 that the OSS selector outperforms the SSS selector in terms of statistical precision. Moreover, in many cases bandwidths that are quite distant from the optimum do not incur a large error (in terms of AMISE). In addition, as Tables 1 and 2 show, the OSS selector is substantially faster than the SSS selector, because it works with a single subsample size which, in turn, is even smaller than most of those considered for the SSS selector. The source code for both selectors was written in C++ and run in parallel on an Intel Core i5-8600K 3.6 GHz processor.

Figure 1. Sampling distributions of $\hat{h}/h_{\mathrm{MISE},n}$ (left figure) and log … (axis label: maximum subsample size).

Table 1. CPU elapsed times for the OSS selector with $n = 10^6$. Ten subsamples of the corresponding size were considered.

Table 2. CPU elapsed times for the SSS selector with $n = 10^6$, considering uniform grids (of 20 elements) of subsample sizes ranging from 100 to the corresponding maximum size. Ten subsamples of each of the corresponding sizes were considered.