An Upper Bound of the Bias of Nadaraya-Watson Kernel Regression under Lipschitz Assumptions

The Nadaraya-Watson kernel estimator is among the most popular nonparametric regression techniques thanks to its simplicity. Its asymptotic bias was studied by Rosenblatt in 1969 and has been reported in a number of related works. However, Rosenblatt's analysis is only valid for infinitesimal bandwidth. In contrast, we propose in this paper an upper bound of the bias which holds for finite bandwidths. Moreover, contrary to the classical analysis, we allow for a discontinuous first-order derivative of the regression function, we extend our bounds to multidimensional domains, and, when a bound on the regression function exists and is known, we exploit it to obtain a tighter result. We believe that this work has potential applications in those fields where hard guarantees on the error are needed.


Introduction
Nonparametric regression and density estimation have been used in a wide spectrum of applications, ranging from economics (Bansal et al., 1995) and system dynamics identification (Wang et al., 2006; Nguyen-Tuong & Peters, 2010) to reinforcement learning (Ormoneit & Sen, 2002; Kroemer & Peters, 2011; Deisenroth & Rasmussen, 2011; Kroemer et al., 2012). In recent years, nonparametric density estimation and regression have been dominated by parametric methods such as those based on deep neural networks. These parametric methods have demonstrated an extraordinary capacity for dealing with both high-dimensional data (such as images, sounds or videos) and large datasets. However, it is difficult to obtain strong guarantees on such complex models, which have been shown to be easy to fool (Moosavi-Dezfooli et al., 2016). Nonparametric techniques have the advantage of being easier to understand, and recent work has overcome some of their limitations, e.g., by allowing linear memory and sub-linear query time for kernel density estimation (Charikar & Siminelakis, 2017; Backurs et al., 2019). These methods have allowed nonparametric kernel density estimation to be performed on datasets of 10^6 samples and up to 784 input dimensions. As such, nonparametric methods are a relevant choice when one is willing to trade performance for statistical guarantees; the contribution of this paper is to advance the state of the art on such guarantees.
Studying the error of a statistical estimator is important. It can be used, for example, to tune the hyper-parameters by minimizing the estimated error (Härdle & Marron, 1985; Ray & Tsay, 1997; Herrmann et al., 1992; Köhler et al., 2014). To this end, the estimation error is usually decomposed into an estimation bias and a variance. When it is not possible to derive these quantities exactly, one resorts to an asymptotic analysis or to the convergence of the error to a limiting distribution. While all the aforementioned analyses give interesting insights on the error and allow for hyper-parameter optimization, they do not provide any strong guarantee on the error, i.e., we are not able to upper bound it with absolute certainty.
Beyond hyper-parameter optimization, we argue that another important aspect of error analysis is to provide hard (non-probabilistic) bounds of the error for critical data-driven algorithms. We believe that in the close future, learning agents taking autonomous, data-driven, decisions will be increasingly present. These agents will for example be autonomous surgeons, self-driving cars or autonomous manipulators. In many critical applications involving these agents, it is of primary importance to bound the prediction error in order to provide some technical guarantees on the agent's behavior. In this paper we derive hard upper bounds of the estimation error in nonparametric regression with minimal assumptions on the problem such that the bound can be readily applied to a wide range of applications.
Specifically, we consider in this paper the Nadaraya-Watson kernel regression (Nadaraya, 1964; Watson, 1964), which can be seen as a conditional kernel density estimate, and we derive an upper bound of the estimation bias for the Gaussian kernel under weak local Lipschitz assumptions. The reason for our choice of estimator lies in its inherent simplicity, in comparison to more sophisticated techniques. The bias of the Nadaraya-Watson kernel regression was previously studied by Rosenblatt (1969), and has been reported in a number of related works (Mack & Müller, 1988; Fan, 1992; Fan & Gijbels, 1992; Wasserman, 2006). The main assumptions of Rosenblatt's analysis are h_n → 0 (where n is the number of samples) and nh_n → ∞, where h_n is the kernel's bandwidth. Rosenblatt's analysis carries an asymptotic error o(h_n^2), which means that for large bandwidths it is not accurate. In contrast, we derive an upper bound of the bias of the Nadaraya-Watson kernel regression which is valid for any choice of bandwidth.
Our analysis is built on weak Lipschitz assumptions (Miculescu, 2000), which are milder than the (global) Lipschitz condition, as we require only |f(x) − f(y)| ≤ L|x − y| ∀y ∈ C for a fixed x, instead of the classic |f(x) − f(y)| ≤ L|x − y| ∀x, y ∈ C, where C is the data domain. Moreover, the classical analysis requires the knowledge of m'', and therefore the continuity of m', where m'' and m' are respectively the second- and first-order derivatives of the regression function. We relax this assumption, which allows us to obtain a bias upper bound even for functions such as |x|, at points where m' is undefined. When the bandwidth h_n is large, Rosenblatt's bias analysis, being only valid for h_n → 0, tends to provide wrong estimates of the bias, as we observe in the experimental section. Furthermore, we consider a multidimensional input space, in order to open this analysis to more realistic settings.

Preliminaries
Consider the problem of estimating E[Y |X = x], where X ∼ f_X and Y = m(X) + ε, with zero-mean noise ε, i.e., E[ε] = 0. The noise can depend on x, but since our analysis is conducted point-wise for a given x, ε_x will simply be denoted by ε. Let m : R^d → R be the regression function and f_X a probability distribution of X called the design. In our analysis we consider X ∈ R^d and Y ∈ R. The Nadaraya-Watson kernel estimate of m(x) is

m̂_n(x) = ( Σ_{i=1}^n K_h(x − x_i) y_i ) / ( Σ_{i=1}^n K_h(x − x_i) ),

where K_h is a kernel function with bandwidth-vector h, the x_i are drawn from the design f_X and the y_i from m(x_i) + ε. Note that both the numerator and the denominator are proportional to Parzen-Rosenblatt kernel density estimates (Rosenblatt, 1956; Parzen, 1962). We are interested in the point-wise bias of this estimate, E[m̂_n(x)] − m(x).
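As a minimal illustration of the estimator (not the paper's code), the following sketch implements the one-dimensional Nadaraya-Watson estimate with a Gaussian kernel; the function name and the data-generating choices are ours.

```python
import numpy as np

def nw_estimate(x, xs, ys, h):
    """Nadaraya-Watson estimate at the point x with Gaussian kernel bandwidth h."""
    # Gaussian kernel weights K_h(x - x_i); the kernel's normalizing constant
    # cancels between numerator and denominator.
    w = np.exp(-0.5 * ((x - xs) / h) ** 2)
    return np.sum(w * ys) / np.sum(w)

# Noise-free linear target under a uniform design: away from the boundary the
# estimate should be close to m(x) = 2x.
rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, 5000)
ys = 2.0 * xs
print(nw_estimate(0.5, xs, ys, h=0.05))  # approximately 1.0
```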
In the prior analysis of Rosenblatt (1969), knowledge of m', m'', f_X, f'_X is required, and f_X and m'' must be continuous in a neighborhood of x. In addition, as discussed in the introduction, the analysis is limited to a one-dimensional design and to an infinitesimal bandwidth. For clarity of exposition, we briefly present the classical bias analysis of Rosenblatt (1969) before introducing our results.
Theorem 1 (Rosenblatt, 1969). Let x_1, ..., x_n be samples from a distribution with nonzero differentiable density f_X. Assume y_i = m(x_i) + ε_i, where the ε_i are i.i.d. and zero-mean. The bias of the Nadaraya-Watson kernel estimate in the limit of infinite samples, for h_n → 0 and nh_n → ∞, is

E[m̂_n(x)] − m(x) = h_n^2 ( m''(x)/2 + m'(x) f'_X(x)/f_X(x) ) ∫ u^2 K(u) du + o_P(h_n^2).

The o_P term denotes the asymptotic behavior w.r.t. the bandwidth. Therefore, for larger values of the bandwidth, the bias estimate becomes worse, as illustrated in Figure 1.
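Rosenblatt's asymptotic bias term h_n^2 (m''(x)/2 + m'(x) f'_X(x)/f_X(x)) can be evaluated directly once m and f_X are known (for the standard Gaussian kernel, ∫u^2 K(u)du = 1). The sketch below does so for m(x) = sin(5x), one of this paper's test functions, under a standard normal design; the function names are ours.

```python
import numpy as np

def rosenblatt_bias(x, h, m1, m2, f, f1):
    """Classical asymptotic bias h^2 * (m''/2 + m' f'/f) * mu2(K), with
    mu2(K) = 1 for the standard Gaussian kernel."""
    return h**2 * (0.5 * m2(x) + m1(x) * f1(x) / f(x))

# m(x) = sin(5x) with a standard normal design, for which f'(x)/f(x) = -x.
m1 = lambda x: 5.0 * np.cos(5.0 * x)          # m'
m2 = lambda x: -25.0 * np.sin(5.0 * x)        # m''
f  = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
f1 = lambda x: -x * f(x)
print(rosenblatt_bias(0.5, 0.1, m1, m2, f, f1))
```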

Main Result
In this section we present two bounds on the bias of the Nadaraya-Watson estimator. The first one considers a bounded regression function m, and allows the weak (local) Lipschitz conditions to hold only on a subset of the design's support. The second bound instead does not require the regression function to be bounded, but requires the weak Lipschitz continuity to hold on all of its support. The definition of weak Lipschitz continuity will be given below.
In order to develop our bound on the bias for multidimensional inputs, it is important to define some subsets of the space R^d. In more detail, we consider open n-dimensional intervals. We now formalize what is meant by weak (log-)Lipschitz continuity. This will prove useful, as our analysis requires knowledge of the weak Lipschitz constants of m and log f_X.
Definition 1. Weak Lipschitz continuity at x on the set C under the L1-norm. Let C ⊆ R^d and f : C → R. We call f weak Lipschitz continuous at x ∈ C with constant L if and only if |f(x) − f(y)| ≤ L|x − y| ∀y ∈ C, where | · | denotes the L1-norm.
Definition 2. Weak log-Lipschitz continuity at x on the set C under the L1-norm. Let C ⊆ R^d and f : C → R_{>0}. We call f weak log-Lipschitz continuous at x ∈ C with constant L if and only if |log f(x) − log f(y)| ≤ L|x − y| ∀y ∈ C.
Note that the set C can be a subset of the function's domain.
It is important to note that, in contrast to the global Lipschitz continuity, which requires |f (y) − f (z)| ≤ L|y − z| ∀y, z ∈ C, the weak Lipschitz continuity is defined at a specific point x and therefore allows the function to be discontinuous elsewhere. In the following we list the set of assumptions that we use in our theorems.
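As a numerical illustration of Definition 1 (our own sketch, with illustrative names), one can estimate a weak Lipschitz constant at a fixed point over a finite grid standing in for C. The example f(y) = |y| at x = 0 is the case mentioned earlier: the derivative is undefined there, yet the weak Lipschitz constant is 1.

```python
import numpy as np

def weak_lipschitz_constant(f, x, grid):
    """Smallest L with |f(x) - f(y)| <= L * |x - y| for all y in the grid,
    i.e. a numerical weak Lipschitz constant of f at the fixed point x."""
    ys = grid[grid != x]  # exclude x itself to avoid 0/0
    return np.max(np.abs(f(x) - f(ys)) / np.abs(x - ys))

grid = np.linspace(-1.0, 1.0, 2001)
# |y| is weak Lipschitz at x = 0 with constant 1, despite the kink at 0.
print(weak_lipschitz_constant(np.abs, 0.0, grid))  # 1.0
```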
In the following, we propose two different bounds of the bias. The first version considers a bounded regression function (M < +∞); this allows both the regression function and the design to be weak Lipschitz only on a subset of their domain. In the second version, instead, we consider the case of an unbounded regression function (M = +∞), or the case where the bound M is not known. In this case both the regression function and the design must be weak Lipschitz on the entire domain Υ.
Theorem 2. Bound on the Bias with Bounded Regression Function.
, n → ∞, and furthermore assuming there is a constant 0 ≤ M < +∞ such that |m(y) − m(z)| ≤ M ∀y, z ∈ Υ, the bias of the considered Nadaraya-Watson kernel regression is bounded by
In the case where M is unknown or infinite, we propose the following bound.
Theorem 3. Bound on the Bias with Unbounded Regression Function.
, n → ∞, and furthermore assuming that Υ ≡ D ≡ G, the bias of the considered Nadaraya-Watson kernel regression is bounded by
where ξ_{A_k}, ζ, Ψ, ϕ are defined as in Theorem 2 and
The proofs of both theorems are detailed in the Supplementary Material. Note that the conditions required by our theorems are mild, and they allow a wide range of random designs, including, but not limited to, the Gaussian, Cauchy, Pareto, Uniform and Laplace distributions. In general, every continuously differentiable density is also weak log-Lipschitz on some closed subset of its domain. For example, the Gaussian distribution does not have a finite Lipschitz constant on its entire domain, but on any closed interval there is a finite weak Lipschitz constant. Examples of densities that are weak log-Lipschitz are presented in Table 1.

Simulations
In this section we provide a numerical analysis of our bounds on the bias. We test our method on uni-dimensional input spaces for display purposes. We select a set of regression functions with different Lipschitz constants and different bounds:
• y = sin(5x), which has L_m = 5 and M = 1;
• y = log x, which for G ≡ (−1, +∞) has L_m = 1 and M = +∞;
• y = 60^{−1} log cosh 60x, which has L_m = 1, is unbounded, and has a particularly high second derivative at x = 0, with m''(0) = 60;
• y = √(x^2 + 1), which has L_m = 1 and is unbounded.
In order to provide as many different scenarios as possible, we also used the distributions from Table 1, covering both infinite-domain distributions, such as the Cauchy and Laplace, and finite-domain ones, such as the Uniform. In order to numerically estimate the bias, we approximate E[m̂_n(x)] with an ensemble of estimates N^{−1} Σ_{j=1}^N m̂_{n,j}(x), where each estimate m̂_{n,j} is built on a different dataset (drawn from the same distribution f_X). In order to "simulate" n → ∞ we used n = 10^5 samples.
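The ensemble procedure above can be sketched as follows. The design, noise level, and sample sizes here are illustrative choices for a fast run, not those of the paper (which uses n = 10^5); the kink of m(x) = |x| at x = 0 makes the smoothing bias clearly visible.

```python
import numpy as np

def nw(x, xs, ys, h):
    """Nadaraya-Watson estimate at x with a Gaussian kernel."""
    w = np.exp(-0.5 * ((x - xs) / h) ** 2)
    return np.sum(w * ys) / np.sum(w)

def mc_bias(x, m, h, n=2000, reps=200, seed=0):
    """Approximate E[m_hat(x)] - m(x) by averaging estimates built on
    independently drawn datasets (uniform design, Gaussian noise)."""
    rng = np.random.default_rng(seed)
    est = []
    for _ in range(reps):
        xs = rng.uniform(-2.0, 2.0, n)
        ys = m(xs) + rng.normal(0.0, 0.1, n)
        est.append(nw(x, xs, ys, h))
    return np.mean(est) - m(x)

# The kernel smooths the kink of |x| at 0, giving a clearly positive bias.
print(mc_bias(0.0, np.abs, h=0.3))
```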
In this section we provide some simulations of the bounds presented in Theorem 2 and Theorem 3; for Rosenblatt's case we use the classical asymptotic estimate of the bias. Since Rosenblatt's bias estimate is not an upper bound, the true bias can be higher (as well as lower) than this estimate, as can be seen in Figure 1. We present different scenarios, with both bounded and unbounded functions, infinite and finite design domains, and larger or smaller choices of bandwidth. It is possible to observe that, thanks to the knowledge of f, f', m', m'', Rosenblatt's estimate of the bias tends to be more accurate than our bound; however, it can largely overestimate the bias, as in the case of m(x) = 60^{−1} log cosh(60x) at x = 0, or underestimate it, most often in boundary regions. In contrast, our bound always overestimates the true bias and, despite not using knowledge of f, f', m', m'', is most often tight. Moreover, when the bandwidth is small, both our method and Rosenblatt's deliver an accurate estimate of the bias. In general, Rosenblatt's method tends to deliver a better estimate of the bias, but it does not behave as a bound, and in some situations it can also deliver larger mispredictions. In detail, plot (a) in Figure 1 shows that, with a tight bandwidth, both our method and Rosenblatt's method achieve good approximations of the bias, but only our method correctly upper bounds it. When increasing the bandwidth, we obtain both a larger bias and subsequently larger estimates of the bias. Our method consistently upper bounds the bias, while in many cases Rosenblatt's method underestimates it, especially in proximity of the boundaries (subplots b, d, e). An interesting case can be observed in subplot (c), where we test the function m(x) = 60^{−1} log cosh(60x), which has a high second-order derivative at x = 0: in this case, Rosenblatt's method largely overestimates the bias.
The figure shows that our bound is able to deal with different functions and random designs, and remains reasonably tight compared to Rosenblatt's estimate, which requires knowledge of the regression function, the design, and their respective derivatives.

Acknowledgment
The research is financially supported by the Bosch-Forschungsstiftung program.

A Appendix
In order to give the proofs of the stated theorems, we need to introduce some quantities and state some facts that will be used in our proofs.
Definition 3. Multivariate Gaussian Kernel. We define the multivariate Gaussian kernel with bandwidth-vector h ∈ R^d as

K_h(x) = ∏_{k=1}^d (2π h_k^2)^{−1/2} exp( −x_k^2 / (2 h_k^2) ).

In the following, we write the design as f_X(x) = e^{g(x)} / ∫_Υ e^{g(y)} dy, with |g(x) − g(y)| ≤ L_f |x − y| ∀y ∈ D.
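A direct transcription of the product-form Gaussian kernel with a per-dimension bandwidth vector (our own sketch, with illustrative names) is:

```python
import numpy as np

def gaussian_kernel(x, h):
    """Multivariate Gaussian kernel: prod over dimensions k of
    N(x_k | 0, h_k^2), with bandwidth vector h."""
    x = np.asarray(x, dtype=float)
    h = np.asarray(h, dtype=float)
    return np.prod(np.exp(-0.5 * (x / h) ** 2) / (h * np.sqrt(2 * np.pi)))

# At the origin the kernel equals the product of the per-dimension normal
# density peaks: 1/sqrt(2*pi*h_k^2) for each k.
print(gaussian_kernel([0.0, 0.0], [1.0, 2.0]))
```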
Lower bound of the denominator: the denominator is always positive, so the absolute value can be removed. Now, considering Proposition 2 and Proposition 3, we obtain Upper bound of the numerator: (x + φ−, x + φ+) ⊆ G, which we will later define at our convenience.
The first integral, instead, can be solved with Proposition 4, Proposition 6 and Proposition 7, while the second integral can be solved using Proposition 6. A good choice for φ is φ_i^− = min(γ_i^−, M/L_f) and φ_i^+ = min(γ_i^+, M/L_f), as in this way we obtain a tighter bound. Finally, letting