Extended Abstract

Bandwidth Selection in Nonparametric Regression with Large Sample Size †

by Daniel Barreiro-Ures *,‡, Ricardo Cao and Mario Francisco-Fernández
Department of Mathematics, Faculty of Computer Science, University of A Coruña, A Coruña 15008, Spain
* Author to whom correspondence should be addressed.
† Presented at the XoveTIC Congress, A Coruña, Spain, 27–28 September 2018.
‡ These authors contributed equally to this work.
Proceedings 2018, 2(18), 1166; https://doi.org/10.3390/proceedings2181166
Published: 17 September 2018
(This article belongs to the Proceedings of XoveTIC Congress 2018)

Abstract

In the context of nonparametric regression estimation, the behaviour of kernel methods such as the Nadaraya-Watson or local linear estimators is heavily influenced by the value of the bandwidth parameter, which determines the trade-off between bias and variance. This clearly implies that the selection of an optimal bandwidth, in the sense of minimizing some risk function (MSE, MISE, etc.), is a crucial issue. However, the task of estimating an optimal bandwidth using the whole sample can be very expensive in terms of computing time in the context of Big Data, due to the computational complexity of some of the most widely used algorithms for bandwidth selection (leave-one-out cross-validation, for example, has $O(n^2)$ complexity). To overcome this problem, we propose two methods that estimate the optimal bandwidth for several subsamples of our large dataset and then extrapolate the result to the original sample size making use of the asymptotic expression of the MISE bandwidth. Preliminary simulation studies show that the proposed methods lead to a drastic reduction in computing time, while the statistical precision is only slightly decreased.

1. Scenario

Let us consider a sample of size $n$, $\{(x_i, y_i)\}_{i=1,\ldots,n}$, drawn from a nonparametric regression model $y_i = m(x_i) + \varepsilon_i$. We assume random design, $E[\varepsilon \mid x] = 0$ and $E[\varepsilon^2 \mid x] = \sigma^2(x) < \infty$. In this context, we deal with the Nadaraya-Watson estimator [1] for the regression function, $m$, which is characterized by the kernel function $K$ and the bandwidth or smoothing parameter $h > 0$. Under suitable conditions, the asymptotically optimal (in the sense of minimum AMISE) bandwidth satisfies
$$h_{AMISE,n} = c_0 n^{-1/5}. \qquad (1)$$
Since we are assuming that the sample size, $n$, is very large, the task of computing a bandwidth selector using the whole sample would be too computationally expensive. For example, the leave-one-out cross-validation (LOO CV) bandwidth selector has complexity $O(n^2)$.
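To make the quadratic cost concrete, the following is a minimal Python sketch of the Nadaraya-Watson estimator and a grid-search LOO CV selector. It is an illustration under stated assumptions, not the authors' C++ implementation: the function names, the bandwidth grid and the toy data are hypothetical, the kernel is Gaussian, and the weight function used later in the simulation study is omitted for brevity.

```python
import math
import random


def gaussian_kernel(u):
    """Standard Gaussian kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nw_estimate(x0, xs, ys, h):
    """Nadaraya-Watson estimate of m(x0) with bandwidth h."""
    weights = [gaussian_kernel((x0 - xi) / h) for xi in xs]
    total = sum(weights)
    if total == 0.0:
        return float("nan")  # no observation carries weight at x0
    return sum(w * yi for w, yi in zip(weights, ys)) / total


def loo_cv_score(xs, ys, h):
    """Mean squared leave-one-out prediction error for bandwidth h.

    The double loop over (i, j) is the source of the O(n^2) cost
    per candidate bandwidth.
    """
    n = len(xs)
    score = 0.0
    for i in range(n):
        num = 0.0
        den = 0.0
        for j in range(n):
            if j == i:
                continue  # leave observation i out
            w = gaussian_kernel((xs[i] - xs[j]) / h)
            num += w * ys[j]
            den += w
        if den == 0.0:
            return math.inf  # bandwidth too small to predict point i
        score += (ys[i] - num / den) ** 2
    return score / n


def loo_cv_bandwidth(xs, ys, grid):
    """Bandwidth in `grid` that minimizes the LOO CV criterion."""
    return min(grid, key=lambda h: loo_cv_score(xs, ys, h))


# Toy data (hypothetical, not the model used in the simulation study).
rng = random.Random(0)
xs = [rng.random() for _ in range(200)]
ys = [math.sin(2.0 * math.pi * x) + rng.gauss(0.0, 0.2) for x in xs]
grid = [0.01 * k for k in range(1, 16)]
h_cv = loo_cv_bandwidth(xs, ys, grid)
print("LOO CV bandwidth on the toy sample:", h_cv)
```

Each evaluation of `loo_cv_score` visits every ordered pair of observations, so with $n = 10^6$ a single candidate bandwidth already requires on the order of $10^{12}$ kernel evaluations.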

2. Bandwidth Selection

The idea behind our proposal is to find the LOO CV bandwidth for several subsamples and then extrapolate the result to the original sample size using the asymptotic expression of the MISE bandwidth (1).

2.1. One Subsample Size (OSS)

The idea behind this method is to draw several subsamples of size r, much smaller than n, then compute the LOO CV selector and finally use Equation (1) to extrapolate the CV bandwidth for the original sample size (this idea was already proposed in [2] in the context of kernel density estimation to reduce the variance of the CV bandwidth selector).
  • Obtain $s$ subsamples of size $r \ll n$ by subsampling without replacement from our original dataset.
  • For each subsample, find the LOO CV bandwidth.
  • Let $\hat{h}_r$ denote the average of these bandwidths.
  • We estimate the unknown constant $c_0$ by $\hat{c}_0 = \hat{h}_r r^{1/5}$.
  • Therefore, our estimate of the AMISE bandwidth would be $\hat{h}_{AMISE,n} = \hat{c}_0 n^{-1/5} = \hat{h}_r (r/n)^{1/5}$.
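
The steps above can be sketched in Python as follows. This is only an illustration under assumptions: `oss_bandwidth`, the grid-search LOO CV helper, the bandwidth grid and the toy data are hypothetical names and choices, the kernel is Gaussian, and the weight function of the simulation study is left out.

```python
import math
import random


def loo_cv_score(xs, ys, h):
    """Mean squared leave-one-out error of the Nadaraya-Watson fit."""
    n = len(xs)
    score = 0.0
    for i in range(n):
        num = 0.0
        den = 0.0
        for j in range(n):
            if j != i:
                # Unnormalized Gaussian kernel; the constant cancels out.
                w = math.exp(-0.5 * ((xs[i] - xs[j]) / h) ** 2)
                num += w * ys[j]
                den += w
        if den == 0.0:
            return math.inf
        score += (ys[i] - num / den) ** 2
    return score / n


def loo_cv_bandwidth(xs, ys, grid):
    """Grid-search LOO CV bandwidth (hypothetical helper)."""
    return min(grid, key=lambda h: loo_cv_score(xs, ys, h))


def oss_bandwidth(xs, ys, r, s, grid, rng):
    """One Subsample Size selector: average CV bandwidth, then rescale."""
    n = len(xs)
    bandwidths = []
    for _ in range(s):
        idx = rng.sample(range(n), r)  # subsample without replacement
        sub_x = [xs[i] for i in idx]
        sub_y = [ys[i] for i in idx]
        bandwidths.append(loo_cv_bandwidth(sub_x, sub_y, grid))
    h_r = sum(bandwidths) / s          # average CV bandwidth at size r
    c0_hat = h_r * r ** 0.2            # c0_hat = h_r * r^(1/5)
    return c0_hat * n ** (-0.2)        # h_hat = c0_hat * n^(-1/5)


# Toy data (hypothetical, much smaller than the n = 10^6 of the paper).
rng = random.Random(0)
n = 5000
xs = [rng.random() for _ in range(n)]
ys = [math.sin(2.0 * math.pi * x) + rng.gauss(0.0, 0.2) for x in xs]
grid = [0.01 * k for k in range(1, 16)]
h_oss = oss_bandwidth(xs, ys, r=100, s=5, grid=grid, rng=rng)
print("OSS bandwidth estimate:", h_oss)
```

The cost is driven by $s \cdot r^2$ rather than $n^2$, which is why the full sample only enters through the final rescaling factor $(r/n)^{1/5}$.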

2.2. Several Subsample Sizes (SSS)

We now propose a method that considers several subsamples of different sizes.
  • Consider a grid of subsample sizes, $r_1, \ldots, r_s$, with $r_j \ll n$.
  • For each $r_j$, compute the LOO CV bandwidth, $\hat{h}_j$ (several subsamples of each size could be considered).
  • Solve the ordinary least squares problem (or a robust analogue) given by $(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0, \beta_1} \sum_{j=1}^{s} \left( \log(\hat{h}_j) - \beta_0 - \beta_1 \log(r_j) \right)^2$, where $\hat{c} = e^{\hat{\beta}_0}$ and $\hat{p} = \hat{\beta}_1$ is our estimate of the order of convergence of the AMISE bandwidth.
  • Our estimate of the AMISE bandwidth for the original sample size, $n$, would be $\hat{h}_{AMISE,n} = \hat{c}\, n^{\hat{p}}$.
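
The regression and extrapolation steps of this method can be sketched in Python as below. The function names are hypothetical, and the bandwidths in the usage example are synthetic values generated from an exact power law purely to check the arithmetic; they are not results from the paper. In practice the inputs would be the LOO CV bandwidths computed on subsamples of each size.

```python
import math


def fit_power_law(sizes, bandwidths):
    """OLS fit of log(h_j) = beta0 + beta1 * log(r_j).

    Returns (c_hat, p_hat) with c_hat = exp(beta0) and p_hat = beta1.
    """
    log_r = [math.log(r) for r in sizes]
    log_h = [math.log(h) for h in bandwidths]
    k = len(log_r)
    mean_r = sum(log_r) / k
    mean_h = sum(log_h) / k
    sxx = sum((a - mean_r) ** 2 for a in log_r)
    sxy = sum((a - mean_r) * (b - mean_h) for a, b in zip(log_r, log_h))
    beta1 = sxy / sxx
    beta0 = mean_h - beta1 * mean_r
    return math.exp(beta0), beta1


def sss_bandwidth(sizes, bandwidths, n):
    """Extrapolate the fitted power law to the full sample size n."""
    c_hat, p_hat = fit_power_law(sizes, bandwidths)
    return c_hat * n ** p_hat


# Synthetic check: bandwidths that follow h = 0.5 * r^(-1/5) exactly.
sizes = [100, 200, 400, 800, 1600]
bandwidths = [0.5 * r ** (-0.2) for r in sizes]
c_hat, p_hat = fit_power_law(sizes, bandwidths)
h_sss = sss_bandwidth(sizes, bandwidths, n=10 ** 6)
print("c_hat:", c_hat, "p_hat:", p_hat, "h_sss:", h_sss)
```

Unlike OSS, the exponent is estimated from the data instead of being fixed at $-1/5$, so noise in the subsample bandwidths affects both the constant and the rate when extrapolating to a much larger $n$.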

3. Simulation Study

Let us consider samples of size $n = 10^6$ drawn from the model $Y = m(X) + \varepsilon$, where $X \sim Beta(2, 2)$, $\varepsilon \sim N(0, 0.2^2)$ and $m(x) = 1 + x \sin(5.5\pi x)^2$. Furthermore, we have considered a Gaussian kernel and, as a weight function, $w(x) = \mathbb{1}\{F_X^{-1}(0.05) \le x \le F_X^{-1}(0.95)\}$, where $F_X^{-1}$ denotes the marginal quantile function of $X$.
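As a rough illustration of this setup, the sketch below draws data from the stated model and builds the weight function in Python. Several points are assumptions: the regression function is read as $m(x) = 1 + x \sin^2(5.5\pi x)$, which is one possible reading of the formula above; the function names are hypothetical; and the demo uses a much smaller sample than $n = 10^6$ to keep it fast.

```python
import math
import random


def regression_function(x):
    """Assumed reading of the model: m(x) = 1 + x * sin(5.5*pi*x)^2."""
    return 1.0 + x * math.sin(5.5 * math.pi * x) ** 2


def beta22_cdf(x):
    """CDF of Beta(2, 2) on [0, 1]: F(x) = 3x^2 - 2x^3."""
    return 3.0 * x * x - 2.0 * x ** 3


def beta22_quantile(p, tol=1e-12):
    """Quantile function of Beta(2, 2), obtained by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if beta22_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def simulate(n, rng):
    """Draw (x_i, y_i) with X ~ Beta(2, 2) and eps ~ N(0, 0.2^2)."""
    xs = [rng.betavariate(2.0, 2.0) for _ in range(n)]
    ys = [regression_function(x) + rng.gauss(0.0, 0.2) for x in xs]
    return xs, ys


# Weight function: indicator of the central 90% of the design distribution.
q_lo = beta22_quantile(0.05)
q_hi = beta22_quantile(0.95)


def weight(x):
    return 1.0 if q_lo <= x <= q_hi else 0.0


rng = random.Random(0)
xs, ys = simulate(10_000, rng)  # demo size only
share = sum(weight(x) for x in xs) / len(xs)
print("quantiles:", q_lo, q_hi, "share of points with w(x) = 1:", share)
```

The weight function discards the 5% of design points in each tail, where observations are sparse and the kernel estimate is least stable.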
It is clear from Figure 1 that the OSS selector outperforms the SSS selector in terms of statistical precision. Moreover, in many cases bandwidths that are quite distant from the optimum do not have an associated large error (in terms of AMISE). On the other hand, as we can observe in Table 1 and Table 2, the OSS selector is substantially faster than the SSS selector, because the former works with a single subsample size which, in turn, is even smaller than most of those considered for the SSS selector. It should be noted that the source code for both selectors was written in C++ and run in parallel on an Intel Core i5-8600K 3.6 GHz.

Author Contributions

Conceptualization, D.B.-U., R.C. and M.F.-F.; Methodology, D.B.-U., R.C. and M.F.-F.; Software, D.B.-U., R.C. and M.F.-F.; Validation, D.B.-U., R.C. and M.F.-F.; Formal Analysis, D.B.-U., R.C. and M.F.-F.; Investigation, D.B.-U., R.C. and M.F.-F.; Resources, D.B.-U., R.C. and M.F.-F.; Data Curation, D.B.-U., R.C. and M.F.-F.; Writing—Original Draft Preparation, D.B.-U., R.C. and M.F.-F.; Writing—Review & Editing, D.B.-U., R.C. and M.F.-F.; Visualization, D.B.-U., R.C. and M.F.-F.; Supervision, D.B.-U., R.C. and M.F.-F.; Project Administration, D.B.-U., R.C. and M.F.-F.; Funding Acquisition, D.B.-U., R.C. and M.F.-F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

This research has been supported by MINECO grant MTM-2014-52876-R and by the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2016-015 and Centro Singular de Investigación de Galicia ED431G/01), all of them through the ERDF.

Conflicts of Interest

The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Nadaraya, E.A. On estimating regression. Theory Probab. Its Appl. 1964, 9, 141–142.
  2. Wang, Q.; Lindsay, B.G. Improving cross-validated bandwidth selection using subsampling-extrapolation techniques. Comput. Stat. Data Anal. 2015, 89, 51–71.
Figure 1. Sampling distributions of $\hat{h} / h_{MISE,n}$ (left figure) and $\log\left( AMISE(\hat{h}) / AMISE(h_{AMISE,n}) \right)$ (right figure) for the OSS (red) and SSS (green) bandwidth selectors.
Table 1. CPU elapsed times for the OSS selector with $n = 10^6$. Ten subsamples of the corresponding size were considered.
Subsample Size    CPU Elapsed Time (s)
500               1.62
1000              2.82
Table 2. CPU elapsed times for the SSS selector with $n = 10^6$ considering uniform grids (of 20 elements) of subsample sizes ranging from 100 to the corresponding maximum size. Ten subsamples of each of the corresponding sizes were considered.
Maximum Subsample Size    CPU Elapsed Time (s)
500                       9.21
1000                      17.9
1500                      32.1
2000                      51.0
2500                      75.1
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
