1. Introduction
Nonparametric regression and density estimation have been used in a wide spectrum of applications, ranging from economics [1] to system dynamics identification [2,3] and reinforcement learning [4,5,6,7]. In recent years, nonparametric density estimation and regression have been dominated by parametric methods such as those based on deep neural networks. These parametric methods have demonstrated an extraordinary capacity for dealing with both high-dimensional data (such as images, sounds, or videos) and large datasets. However, it is difficult to obtain strong guarantees on such complex models, which have been shown to be easy to fool [8]. Nonparametric techniques have the advantage of being easier to understand, and recent work has overcome some of their limitations by, e.g., allowing linear-memory and sub-linear query time for kernel density estimation [9,10]. These methods allow nonparametric kernel density estimation to be performed on large datasets with up to 784 input dimensions. As such, nonparametric methods are a suitable choice when one is willing to trade performance for statistical guarantees; the contribution of this paper is to advance the state of the art on such guarantees.
Studying the error of a statistical estimator is important. It can be used, for example, to tune the hyper-parameters by minimizing the estimated error [11,12,13,14]. To this end, the estimation error is usually decomposed into an estimation bias and a variance. When it is not possible to derive these quantities exactly, one resorts to an asymptotic analysis or to the convergence of the error to a limiting probability distribution. While all the aforementioned analyses give interesting insights on the error and allow for hyper-parameter optimization, they do not provide any strong guarantee on the error, i.e., we cannot upper bound it with absolute certainty.
Beyond hyper-parameter optimization, we argue that another critical aspect of error analysis is to provide hard (non-probabilistic) bounds on the error for critical data-driven algorithms. We believe that learning agents making autonomous, data-driven decisions will be increasingly present in the near future. These agents will, for example, be autonomous surgeons, self-driving cars, or autonomous manipulators. In many critical applications involving these agents, it is of primary importance to bound the prediction error in order to provide some technical guarantees on the agent's behavior. In this paper, we derive a hard upper bound on the estimation bias in nonparametric regression with minimal assumptions on the problem. The bound can be readily applied to a wide range of applications.
Specifically, we consider the Nadaraya-Watson kernel regression [15,16], which can be seen as a conditional kernel density estimate. We derive an upper bound on the estimation bias under weak local Lipschitz assumptions. The reason for our choice of estimator lies in its inherent simplicity compared to more sophisticated techniques. The bias of the Nadaraya-Watson kernel regression has been studied previously by [17], and has been reported in a number of related works [18,19,20,21]. The analysis of the bias conducted by Rosenblatt (1969) [17] still remains the main reference for this regression technique. The main assumptions of Rosenblatt's analysis are $nh \to \infty$ (where n is the number of samples) and $h \to 0$, where $h$ is the kernel's bandwidth. Rosenblatt's analysis suffers from an asymptotic error term $o(h^2)$, which means that for large bandwidths it is not accurate, making it inapplicable for deriving a hard upper bound. To the best of our knowledge, the only previously proposed bound on the bias requires the restrictive assumption that the samples must be placed evenly on a closed interval [22]. In contrast, we derive an upper bound on the bias of the Nadaraya-Watson kernel regression that is valid for a large class of designs and for any choice of bandwidth.
We build our analysis on weak Lipschitz assumptions [23], which are milder than the (global) Lipschitz assumption, as we require the condition $|f(\mathbf{x}) - f(\mathbf{z})| \le L \, \|\mathbf{x} - \mathbf{z}\|$ only for a fixed $\mathbf{x}$ and all $\mathbf{z}$ in a set of interest, instead of the classic condition for every pair of points in $X$, where $X$ is the data domain. Lipschitz assumptions are common in different fields and usually allow a wide family of admissible functions. This is particularly true when the Lipschitz condition is required for only a subset of the function's domain (as in our case). Moreover, notice that the classical analysis requires knowledge of the second derivative of the regression function m, and therefore the continuity of $m''$. Our Lipschitz assumption is less restrictive, allowing us to obtain a bias upper bound even at points where $m''$ is undefined (for instance, $m(x) = |x|$ at $x = 0$). Rosenblatt's analysis builds on a Taylor expansion of the estimator; therefore, when the bandwidth $h$ is large, it tends to provide wrong estimates of the bias, as observed in the experimental section. We consider a multidimensional input space, and we apply the bound to a realistic regression problem.
2. Preliminaries
Consider the problem of estimating $m(\mathbf{x}) = \mathbb{E}[y \mid \mathbf{x}]$, where $\mathbf{x}$ is the input and $y$ the output, with noise $\epsilon_{\mathbf{x}}$, i.e., $y = m(\mathbf{x}) + \epsilon_{\mathbf{x}}$. The noise can depend on $\mathbf{x}$, but since our analysis is conducted point-wise for a given $\mathbf{x}$, $\epsilon_{\mathbf{x}}$ will simply be denoted by $\epsilon$. Let $m$ be the regression function and $\mu$ a probability distribution on $X$ called the design. In our analysis we consider $\mathbf{x} \in X \subseteq \mathbb{R}^d$ and $y \in \mathbb{R}$. The Nadaraya-Watson kernel estimate of $m(\mathbf{x})$ is
$$
\hat{m}(\mathbf{x}) = \frac{\sum_{i=1}^{n} K_{\mathbf{h}}(\mathbf{x} - \mathbf{x}_i)\, y_i}{\sum_{i=1}^{n} K_{\mathbf{h}}(\mathbf{x} - \mathbf{x}_i)},
$$
where $K_{\mathbf{h}}$ is a kernel function with bandwidth-vector $\mathbf{h}$, the $\mathbf{x}_i$ are drawn from the design $\mu$, and the $y_i$ from $m(\mathbf{x}_i) + \epsilon$. Note that both the numerator and the denominator are proportional to Parzen-Rosenblatt kernel density estimates [24,25]. We are interested in the point-wise bias of this estimate, $\mathbb{E}[\hat{m}(\mathbf{x})] - m(\mathbf{x})$.
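To make the estimator concrete, the following is a minimal sketch of the Nadaraya-Watson estimate with a product Gaussian kernel; the array names, the bandwidth values, and the test function are illustrative choices and not taken from the paper.

```python
import numpy as np

def nadaraya_watson(x_query, X, y, h):
    """Nadaraya-Watson estimate at x_query.

    X: (n, d) design points, y: (n,) noisy targets,
    h: (d,) bandwidth vector of a product Gaussian kernel.
    """
    # Product Gaussian kernel: exp(-sum_j (x_query_j - x_ij)^2 / (2 h_j^2))
    diff = (X - x_query) / h
    weights = np.exp(-0.5 * np.sum(diff ** 2, axis=1))
    return np.sum(weights * y) / np.sum(weights)

# Tiny illustrative usage (all quantities hypothetical):
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))                      # design mu = N(0, 1)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)   # y = m(x) + eps
print(nadaraya_watson(np.array([0.5]), X, y, h=np.array([0.2])))
```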
In the prior analysis of [17], knowledge of $m''$ is required, and $m''$ must be continuous in a neighborhood of $\mathbf{x}$. In addition, as discussed in the introduction, that analysis is limited to a one-dimensional design and an infinitesimal bandwidth. For clarity of exposition, we briefly present the classical bias analysis of [17] before introducing our results.
Theorem 1. Classic Bias Estimation [17].
Let m be twice differentiable. Assume a set of samples $\{(x_i, y_i)\}_{i=1}^{n}$, where the $x_i$ are i.i.d. samples from a distribution with non-zero differentiable density $\mu$. Assume $y_i = m(x_i) + \epsilon_i$, with noise $\epsilon_i$ such that $\mathbb{E}[\epsilon_i] = 0$. The bias of the Nadaraya-Watson kernel estimate in the limit of infinite samples and for $h \to 0$ and $nh \to \infty$ is
$$
\mathbb{E}[\hat{m}(x)] - m(x) = h^2 \left( \frac{m''(x)}{2} + \frac{m'(x)\,\mu'(x)}{\mu(x)} \right) \int u^2 K(u)\, \mathrm{d}u + o(h^2). \qquad (2)
$$
Note that Equation (2) must be normalized with $\int K(u)\,\mathrm{d}u$ when the kernel function does not integrate to one. The $o(h^2)$ term denotes the asymptotic behavior w.r.t. the bandwidth. Therefore, for a larger value of the bandwidth, the bias estimate becomes worse, as is illustrated in Figure 1.
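As an illustration of Equation (2), the sketch below evaluates the classical asymptotic bias approximation for a regression function and design whose derivatives are known in closed form; the specific m and μ used here are hypothetical examples, not the ones used later in the paper.

```python
import numpy as np

def rosenblatt_bias(x, h, m1, m2, mu, mu1, kernel_second_moment=1.0):
    """Classical (asymptotic) bias approximation of Equation (2).

    m1, m2: first and second derivative of the regression function,
    mu, mu1: design density and its first derivative,
    kernel_second_moment: int u^2 K(u) du (1.0 for a standard Gaussian kernel).
    """
    return h ** 2 * (m2(x) / 2.0 + m1(x) * mu1(x) / mu(x)) * kernel_second_moment

# Hypothetical example: m(x) = sin(x), design mu = N(0, 1).
m1  = np.cos
m2  = lambda x: -np.sin(x)
mu  = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
mu1 = lambda x: -x * mu(x)
print(rosenblatt_bias(x=0.5, h=0.3, m1=m1, m2=m2, mu=mu, mu1=mu1))
```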
3. Main Result
In this section, we present two bounds on the bias of the Nadaraya-Watson estimator. The first one considers a bounded regression function m (i.e., $|m(\mathbf{z})| \le M$), and allows the weak Lipschitz conditions to hold only on a subset of the design's support. The second bound, instead, does not require the regression function to be bounded, but it requires the weak Lipschitz continuity to hold on all of its support. The definition of "weak" Lipschitz continuity is given below.
To develop our bound on the bias for multidimensional inputs, it is essential to define some subsets of the space. In more detail, we consider an open d-dimensional interval containing $\mathbf{x}$, defined as $\Omega = \{\mathbf{z} \in \mathbb{R}^d : a_i < z_i < b_i,\ i = 1, \dots, d\}$, where $\mathbf{a}, \mathbf{b} \in \mathbb{R}^d$ with $a_i < x_i < b_i$. We now formalize what is meant by weak (log-)Lipschitz continuity. This will prove useful, as our analysis requires knowledge of the weak Lipschitz constants of m and $\mu$.
Definition 1. Weak Lipschitz continuity at $\mathbf{x}$ on the set $A$ under the norm $\|\cdot\|$.
Let $f : X \to \mathbb{R}$, $\mathbf{x} \in X$, and $A \subseteq X$ with $\mathbf{x} \in A$.
We call f weak Lipschitz continuous at $\mathbf{x}$ on the set $A$ with constant $L_f$ if and only if $|f(\mathbf{x}) - f(\mathbf{z})| \le L_f \, \|\mathbf{x} - \mathbf{z}\|$ for every $\mathbf{z} \in A$, where $\|\cdot\|$ denotes the chosen norm.

Definition 2. Weak log-Lipschitz continuity at $\mathbf{x}$ on the set $A$ under the norm $\|\cdot\|$.
Let $f : X \to \mathbb{R}_{>0}$.
We call f weak log-Lipschitz continuous at $\mathbf{x}$ on the set $A$ with constant $L_{\log f}$ if and only if $|\log f(\mathbf{x}) - \log f(\mathbf{z})| \le L_{\log f} \, \|\mathbf{x} - \mathbf{z}\|$ for every $\mathbf{z} \in A$.

Note that the set $A$ can be a subset of the function's domain. It is important to note that, in contrast to global Lipschitz continuity, which requires the condition to hold for every pair of points in $X$, weak Lipschitz continuity is defined at a specific point $\mathbf{x}$ and therefore allows the function to be discontinuous elsewhere. The Lipschitz assumptions are not very restrictive and, in practice, require a bounded gradient. They have been widely used in various fields. Note that when the Lipschitz constants are not known, they can be estimated from the dataset [26]. In the following, we list the set of assumptions that we use in our theorems.
- A1. $\mu$ and m are defined on $X \subseteq \mathbb{R}^d$, and $\mathbf{x} \in X$;
- A2. $\mu$ is weak log-Lipschitz with constant $L_{\mu}$ at $\mathbf{x}$ on the set $\Omega$, and $\mu(\mathbf{x})$ is positive (note that this implies $\mu(\mathbf{z}) > 0$ for every $\mathbf{z} \in \Omega$);
- A3. m is weak Lipschitz with constant $L_m$ at $\mathbf{x}$ on the set $\Omega$.
To work out a bound on the bias valid for a wide class of kernels, we must enumerate some assumptions and quantify some integrals with respect to the kernel.
- A4. The multidimensional kernel can be decomposed into a product of independent uni-dimensional kernels, i.e., $K_{\mathbf{h}}(\mathbf{x} - \mathbf{z}) = \prod_{i=1}^{d} k_{h_i}(x_i - z_i)$, with $h_i > 0$;
- A5. the uni-dimensional kernels $k_{h_i}$ are non-negative and symmetric;
- A6. for every bandwidth $h_i > 0$ and every interval $(a_i, b_i)$, the kernel integrals appearing in the bound (detailed in Appendix B) are finite.
Assumptions A4–A6 are not really restrictive, and they include any kernel with both finite domain and co-domain, as well as kernels that are not heavy-tailed (e.g., Gaussian-like). Furthermore, Assumption A4 allows any independent composition of different kernel functions. In Appendix B we detail the integrals of Assumption A6 for different kernels. Note that when the integrals listed in A6 exist in closed form, the computation of the bound is straightforward and requires negligible computational effort.
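For intuition, the sketch below evaluates, for a Gaussian uni-dimensional kernel, the kind of truncated kernel integrals referred to in A6 (the zeroth and first absolute moments of the kernel over an interval); the exact set of integrals required by the bound is the one listed in Appendix B, and the point, interval, and bandwidth below are arbitrary.

```python
import numpy as np
from scipy.integrate import quad

def gaussian_kernel(u, h):
    """Uni-dimensional Gaussian kernel with bandwidth h (integrates to one)."""
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))

def truncated_kernel_integrals(x, a, b, h):
    """Zeroth and first absolute moment of k_h(x - z) for z in (a, b)."""
    i0, _ = quad(lambda z: gaussian_kernel(x - z, h), a, b)
    i1, _ = quad(lambda z: abs(x - z) * gaussian_kernel(x - z, h), a, b)
    return i0, i1

# Arbitrary point, interval, and bandwidth for illustration.
print(truncated_kernel_integrals(x=0.0, a=-1.0, b=2.0, h=0.5))
```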
In the following, we propose two different bounds on the bias. The first version considers a bounded regression function ($|m(\mathbf{z})| \le M$); this allows both the regression function and the design to be weak Lipschitz on only a subset of their domain. The second version, instead, considers the case of an unbounded regression function or an unknown bound M. In this case, both the regression function and the design must be weak Lipschitz on the entire domain $X$.
Theorem 2. Bound on the Bias with Bounded Regression Function.
Assume A1–A3, a positive vector of bandwidths $\mathbf{h}$, the multivariate kernel defined in A4–A6, and the Nadaraya-Watson kernel estimate built using n observations $(\mathbf{x}_i, y_i)$, with $\mathbf{x}_i \sim \mu$, $y_i = m(\mathbf{x}_i) + \epsilon_i$, and noise $\epsilon_i$ centered in zero ($\mathbb{E}[\epsilon_i] = 0$). Furthermore, assume there is a constant M such that $|m(\mathbf{z})| \le M$ for every $\mathbf{z} \in X$. Then the absolute bias $|\mathbb{E}[\hat{m}(\mathbf{x})] - m(\mathbf{x})|$ of the considered Nadaraya-Watson kernel regression is bounded by a closed-form expression in the quantities B and C, which are computed from the kernel integrals of A6; the interval $\Omega$ (i.e., the vectors $\mathbf{a}$ and $\mathbf{b}$) can be freely chosen to obtain a tighter bound.
In the case where M is unknown or infinite, we propose the following bound.
Theorem 3. Bound on the Bias with Unbounded Regression Function.
Assume A1–A3, a positive vector of bandwidths $\mathbf{h}$, the multivariate kernel defined in A4–A6, and the Nadaraya-Watson kernel estimate built using n observations $(\mathbf{x}_i, y_i)$, with $\mathbf{x}_i \sim \mu$, $y_i = m(\mathbf{x}_i) + \epsilon_i$, and noise $\epsilon_i$ centered in zero ($\mathbb{E}[\epsilon_i] = 0$). Furthermore, assume that m is weak Lipschitz and $\mu$ is weak log-Lipschitz at $\mathbf{x}$ on the entire domain $X$. Then the absolute bias of the considered Nadaraya-Watson kernel regression is bounded by a closed-form expression in the quantities B and C, which are defined as in Theorem 2. We detail the proofs of both theorems in Appendix A. Note that the conditions required by our theorems are mild, and they allow a wide range of random designs, including but not limited to Gaussian, Cauchy, Pareto, Uniform, and Laplace distributions. In general, every continuously differentiable density is weak log-Lipschitz on any closed subset of its domain on which it is positive. For example, the Gaussian distribution does not have a finite log-Lipschitz constant on its entire domain, but it has a finite weak log-Lipschitz constant on any closed interval. Examples of distributions that are weak log-Lipschitz are presented in Table 1.
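To illustrate the last point, the following sketch computes a weak log-Lipschitz constant for a Gaussian design on a closed interval by bounding the derivative of its log-density; the interval is arbitrary and the computation is a small illustration, not part of the paper's method.

```python
import numpy as np

def gaussian_log_lipschitz_constant(a, b, mean=0.0, std=1.0):
    """Weak log-Lipschitz constant of a Gaussian density on [a, b].

    d/dx log N(x; mean, std^2) = -(x - mean) / std^2, so its absolute
    value on [a, b] is maximized at the endpoint farthest from the mean.
    """
    return max(abs(a - mean), abs(b - mean)) / std ** 2

# On [-3, 3] the constant is finite (3.0 here), while on the whole real
# line the derivative of the log-density is unbounded.
print(gaussian_log_lipschitz_constant(-3.0, 3.0))
```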
4. Analysis
Although the bound applies to different kernel functions, in the following we analyze the most common case of the Gaussian kernel. It is worth noting that, for the Gaussian kernel, the integrals of A6 admit closed-form expressions (see Appendix B).
Note that we remove the subscripts from the functions B and C, as we consider only the Gaussian kernel. To provide a tight bound, we consider many quantities that describe the design's domain, the Lipschitz constants of the design and of the regression function, the bound of the image of the regression function, and the different bandwidths for each dimension of the space. This complexity results in an effective but poorly readable bound. In this section, we try to simplify the problem and to analyze the behavior of the bound in the limit.
Asymptotic Analysis: Let us consider, for the moment, the one-dimensional case (d = 1) with infinite domain and co-domain (M unknown and $\Omega = \mathbb{R}$). In this particular case, the bound simplifies considerably.
As expected, for $h \to 0$ or for $L_m \to 0$, the bound tends to zero. This result is in line with (2) (since $L_m = 0$ implies $m' = m'' = 0$). A completely flat design, e.g., a uniformly distributed one, does not imply a zero bias. This can be seen either in Rosenblatt's analysis or by simply considering the fact that, notoriously, the Nadaraya-Watson estimator suffers from the boundary bias. When we analyze our bound, we find in fact that it does not vanish for $L_{\mu} = 0$. It is also interesting to analyze the asymptotic behavior when these quantities tend to infinity. Similarly to (2), we observe that our bound grows quadratically w.r.t. the kernel's bandwidth h and scales linearly w.r.t. the Lipschitz constant of the regression function (which is linked to the first derivative $m'$). A further analysis brings us to the consideration that Rosenblatt's analysis is linear w.r.t. the ratio $\mu'(x)/\mu(x)$, since this ratio appears directly in (2). Our bound has a similar implication, as the log-Lipschitz constant is also related to the derivative of the logarithm of the design, since $\frac{\mathrm{d}}{\mathrm{d}x}\log \mu(x) = \mu'(x)/\mu(x)$.
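The limiting behavior w.r.t. the bandwidth can also be probed numerically. The sketch below estimates the empirical bias of the Nadaraya-Watson estimator at an interior point for increasing bandwidths by averaging over many resampled datasets; the regression function, design, and noise level are illustrative choices, not those of the paper's experiments.

```python
import numpy as np

def nw(xq, X, y, h):
    w = np.exp(-0.5 * ((X - xq) / h) ** 2)
    return np.sum(w * y) / np.sum(w)

def empirical_bias(xq, h, m, n=2000, trials=200, seed=0):
    """Monte Carlo estimate of E[m_hat(xq)] - m(xq) under a N(0, 1) design."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(trials):
        X = rng.normal(size=n)
        y = m(X) + 0.1 * rng.normal(size=n)
        estimates.append(nw(xq, X, y, h))
    return np.mean(estimates) - m(xq)

m = np.sin  # illustrative regression function
for h in (0.05, 0.1, 0.2, 0.4, 0.8):
    print(h, empirical_bias(0.5, h, m))
```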
Boundary Bias: The Nadaraya-Watson kernel estimator is affected by the boundary bias, an additive bias term affecting the estimation in regions close to the boundaries of the design's domain. Since our framework can accommodate a closed domain of the design, we can also inspect what happens close to the border. Let us still consider a one-dimensional regression, but this time with a design supported on a finite interval and with the query point close to its boundary. In this case, the bound takes a different form.
Using L'Hôpital's rule, we can observe that the bound is now exponential w.r.t. h and $L_{\mu}$ (i.e., it scales as $e^{L_{\mu} h}$), which implies that it is more sensitive to higher bandwidths or a less smooth design. Interestingly, the bound instead maintains its linear relation w.r.t. the Lipschitz constant $L_m$ of the regression function.
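The boundary effect can also be observed empirically. Assuming a uniform design on [0, 1] and reusing the simple Monte Carlo procedure above (redefined here so the sketch is self-contained), the empirical bias at a point near the boundary is markedly larger in magnitude than at the center; again, the regression function and noise are illustrative.

```python
import numpy as np

def nw(xq, X, y, h):
    w = np.exp(-0.5 * ((X - xq) / h) ** 2)
    return np.sum(w * y) / np.sum(w)

def empirical_bias_uniform(xq, h, m, n=2000, trials=200, seed=0):
    """Empirical bias under a Uniform(0, 1) design."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(trials):
        X = rng.uniform(size=n)
        y = m(X) + 0.1 * rng.normal(size=n)
        estimates.append(nw(xq, X, y, h))
    return np.mean(estimates) - m(xq)

m = lambda x: x  # illustrative linear regression function
print("center  :", empirical_bias_uniform(0.5, h=0.1, m=m))
print("boundary:", empirical_bias_uniform(0.02, h=0.1, m=m))
```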
Dimensionality: Let us now study the multidimensional case, supposing that each dimension has the same bandwidth and the same values for the boundaries. In this case, the bound contains a product of d identical per-dimension factors.
Therefore, the bound scales exponentially w.r.t. the dimension. We observe an exponential behavior when $\mathbf{x}$ is close to the boundary of the design's domain: in these regions, the per-dimension ratio appearing in the bound is particularly high. Of course, when the aforementioned ratio tends to one, the linearity w.r.t. d is predominant.
We can conclude the analysis by noticing that our bound has a limiting behavior similar to that of Rosenblatt's analysis, but it additionally provides a hard bound on the bias.
5. Numerical Simulation
In this section, we provide three numerical analyses of our bounds on the bias (the code of our numerical simulations can be found at
http://github.com/SamuelePolimi/UpperboundNWBias). The first analysis of our method is conducted on uni-dimensional input spaces for display purposes and aims to show the properties of our bounds in different scenarios. The second analysis aims instead at testing the behavior of our method on a multidimensional input space. The third analysis emulates a realistic scenario where our bound can be applied.
Uni-dimensional Analysis: We select a set of regression functions with different Lipschitz constants and different bounds M, including bounded functions with known Lipschitz constants, an unbounded function with a particularly high second derivative at one point, and a further unbounded function.
A zero-mean Gaussian noise has been added to the output y. Our theory applies to a wide family of kernels. In this analysis we consider a Gaussian kernel, with $k_h(u) \propto e^{-u^2/(2h^2)}$, a box kernel, with $k_h(u) \propto \mathbb{1}_{|u| \le h}$, and a triangle kernel, with $k_h(u) \propto \max(0, 1 - |u|/h)$. We further analyze the aforementioned kernels in Appendix B. In order to provide as many different scenarios as possible, we also used the distributions from Table 1, therefore covering both infinite-domain distributions, such as the Cauchy and the Laplace, and finite-domain ones, such as the Uniform. In order to numerically estimate the bias, we approximate $\mathbb{E}[\hat{m}(x)]$ with an ensemble of estimates, where each estimate is built on a different dataset (drawn from the same design distribution $\mu$). In order to "simulate" the asymptotic regime $n \to \infty$, we used a large number of samples per dataset, and to obtain high confidence in the bias estimate, we used a large number of models.
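A minimal sketch of this ensemble procedure is given below: the expected estimate $\mathbb{E}[\hat{m}(x)]$ is approximated by averaging many Nadaraya-Watson models, each fit on an independently drawn dataset. The sample size, number of models, regression function, and design below are placeholders, not the values used in the paper.

```python
import numpy as np

def nw(xq, X, y, h):
    w = np.exp(-0.5 * ((X - xq) / h) ** 2)
    return np.sum(w * y) / np.sum(w)

def ensemble_bias(xq, m, design_sampler, h, n_samples=5000, n_models=100, seed=0):
    """Approximate E[m_hat(xq)] - m(xq) with an ensemble of independent fits."""
    rng = np.random.default_rng(seed)
    predictions = []
    for _ in range(n_models):
        X = design_sampler(rng, n_samples)
        y = m(X) + 0.1 * rng.normal(size=n_samples)   # zero-mean Gaussian noise
        predictions.append(nw(xq, X, y, h))
    return np.mean(predictions) - m(xq)

# Placeholder scenario: Laplace design, illustrative regression function.
laplace = lambda rng, n: rng.laplace(size=n)
print(ensemble_bias(xq=1.0, m=np.tanh, design_sampler=laplace, h=0.3))
```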
In this section we provide some simulations of our bounds presented in Theorems 2 and 3, and for Rosenblatt's case we use the estimate of Equation (2).
Since Rosenblatt's bias estimate is not an upper bound, the true bias can be higher (as well as lower) than this estimate, as can be seen in Figure 1. We present different scenarios, with both bounded and unbounded functions, infinite and finite design domains, and larger or smaller bandwidth choices. It is possible to observe that, thanks to its knowledge of the derivatives of m and of the design, Rosenblatt's estimation of the bias tends to be more accurate than our bound. However, it can largely overestimate the bias, as in the case of the function with a high second-order derivative, or underestimate it, most often in boundary regions. In contrast, our bound always overestimates the true bias and, despite its lack of knowledge of these derivatives, it is most often tight. Moreover, when the bandwidth is small, both our method and Rosenblatt's deliver an accurate estimation of the bias. In general, Rosenblatt's method tends to deliver a better estimate of the bias, but it does not behave as a bound, and in some situations it can also deliver larger mispredictions. In detail, plot (a) in Figure 1 shows that with a tight bandwidth, both our method and Rosenblatt's method achieve good approximations of the bias, but only our method correctly upper bounds the bias. When increasing the bandwidth, we obtain both a larger bias and subsequently larger estimates of the bias. Our method consistently upper bounds the bias, while in many cases Rosenblatt's method underestimates it, especially in proximity of the boundaries (subplots b, d, e). An interesting case can be observed in subplot (c), where we test the function with a particularly high second-order derivative at one point: in this case, Rosenblatt's method largely overestimates the bias.
The figure shows that our bound can deal with different functions and random designs while remaining reasonably tight compared to Rosenblatt's estimation, which requires knowledge of the regression function, the design, and their respective derivatives.
Multidimensional Analysis: We want to study whether our bounds work in the multidimensional case and how much they overestimate the true bias (therefore, how tight they are). For this purpose, we took the linear function $m(\mathbf{x}) = \mathbf{1}_d^{\top}\mathbf{x}$, where $\mathbf{1}_d$ is a column-vector of d ones. This function, for any dimension d, has a known Lipschitz constant and is unbounded. We set a Gaussian design with zero mean and unit diagonal covariance. Since in higher dimensions the estimation's variance grows exponentially [22], we used a large number of samples, and we averaged over a thousand independent estimations. In Figure 2 we show how the "true" bias (estimated numerically by averaging over a thousand Nadaraya-Watson regressions) and our bound evolve with a growing number of dimensions d. Far from the low-density regions of the design, we notice that the bias tends to have a linear behavior, while close to the boundary the bias tends to be exponential. We can observe that our bound correctly bounds the bias in all the cases.
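A compact sketch of this multidimensional experiment is given below, assuming the linear regression function and the standard Gaussian design described above; the sample size, number of models, bandwidth, and query points are illustrative stand-ins for the paper's settings.

```python
import numpy as np

def nw(xq, X, y, h):
    """Nadaraya-Watson estimate with an isotropic Gaussian product kernel."""
    w = np.exp(-0.5 * np.sum(((X - xq) / h) ** 2, axis=1))
    return np.sum(w * y) / np.sum(w)

def bias_vs_dimension(dims, xq_scale, h=0.3, n=20000, models=50, seed=0):
    rng = np.random.default_rng(seed)
    biases = []
    for d in dims:
        m = lambda X: X.sum(axis=-1)           # m(x) = 1^T x
        xq = xq_scale * np.ones(d)             # query point, scaled per dimension
        preds = []
        for _ in range(models):
            X = rng.normal(size=(n, d))        # standard Gaussian design
            y = m(X) + 0.1 * rng.normal(size=n)
            preds.append(nw(xq, X, y, h))
        biases.append(np.mean(preds) - m(xq))
    return biases

print(bias_vs_dimension(dims=[1, 2, 3], xq_scale=1.5))
```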
Realistic Scenario: Let us consider the regression problem of the dynamics of an under-actuated pendulum of length l and mass m. In particular, the state of the pendulum can be described by its angle $\theta$ and its angular velocity $\dot{\theta}$. Furthermore, a force u can be applied to the pendulum. The full system is described by the pendulum's equation of motion,
$$
\ddot{\theta} = \frac{g}{l}\sin\theta + \frac{u}{m l^{2}},
$$
where g is the gravitational acceleration. In practice, when this model is discretized in time, the next state is estimated via numerical integration. For this reason, the Lipschitz constant $L_m$ is unknown. Notice that also the derivatives $m'$ and $m''$ required by Rosenblatt's analysis are unknown. We estimated the Lipschitz constant $L_m$ by selecting the highest ratio $|y_i - y_j| / \|\mathbf{x}_i - \mathbf{x}_j\|$ in the dataset. In our analysis, we want to predict all the next states while keeping two of the three input variables fixed and varying the remaining one. In order to generate the dataset, we use the simulator provided by gym [27]. To train our models, we generate tuples $(\theta, \dot{\theta}, u)$ by independently sampling each variable from a uniform distribution (hence, the design is uniform over a box). We fit 100 different models with 50,000 samples. We choose a Gaussian kernel with a fixed bandwidth.
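The Lipschitz-constant estimation described above can be sketched as follows: the constant is approximated by the largest ratio between output differences and input distances over pairs of points in the dataset. The dataset here is a random placeholder standing in for the pendulum transitions, and the pairwise search is done on randomly sampled pairs for simplicity; this mirrors the idea rather than the paper's exact implementation.

```python
import numpy as np

def estimate_lipschitz_constant(X, y, max_pairs=200000, seed=0):
    """Estimate L as the largest |y_i - y_j| / ||x_i - x_j|| over sampled pairs."""
    rng = np.random.default_rng(seed)
    n = len(X)
    i = rng.integers(0, n, size=max_pairs)
    j = rng.integers(0, n, size=max_pairs)
    keep = i != j
    i, j = i[keep], j[keep]
    dist = np.linalg.norm(X[i] - X[j], axis=1)
    return np.max(np.abs(y[i] - y[j]) / dist)

# Placeholder dataset (continuous inputs, illustrative target).
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(5000, 3))      # e.g., (theta, theta_dot, u), rescaled
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]             # illustrative regression target
print(estimate_lipschitz_constant(X, y))
```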
Figure 3 depicts our bound and the estimated bias. We notice that the bias is low and increases close to the boundaries. Our upper bound is tight but, as expected, it becomes overly pessimistic in the boundary region.