1. Introduction
Let X_1, …, X_n be independent identically distributed random variables (i.i.d. r.v.'s) having a probability density function f. In the typical non-parametric set-up, nothing is assumed about f except that it possesses a certain degree of smoothness, e.g., that it has r continuous derivatives. Estimating f via kernel smoothing is a sixty-year-old problem; M. Rosenblatt, who was one of its originators, discusses the subject's history and evolution in the monograph [1]. For some point x, the kernel smoothed estimator of f(x) is defined by

  f̂(x) = (1/(nh)) Σ_{i=1}^{n} K((x − X_i)/h),   (1)

where the kernel K is a bounded function satisfying ∫ K(s) ds = 1 and ∫ |K(s)| ds < ∞, and the positive bandwidth parameter h is a decreasing function of the sample size n.
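As a concrete illustration of the estimator in Equation (1), here is a minimal Python sketch; the Gaussian kernel used below is only an illustrative choice of a bounded K integrating to 1 (the paper itself advocates flat-top kernels):

```python
import numpy as np

def kde(x, data, h):
    """Kernel smoothed estimator (1): f_hat(x) = (1/(n h)) * sum_i K((x - X_i)/h),
    here with a Gaussian kernel K (an illustrative choice)."""
    data = np.asarray(data, dtype=float)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    u = (x[:, None] - data[None, :]) / h          # (len(x), n) matrix of scaled differences
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return k.sum(axis=1) / (data.size * h)

rng = np.random.default_rng(0)
sample = rng.normal(size=2000)
est = kde(0.0, sample, h=0.3)[0]   # true N(0,1) density at 0 is 1/sqrt(2*pi), about 0.3989
```

With n = 2000 and h = 0.3 the estimate at zero typically lands near the true value 0.3989, with a small smoothing bias and sampling noise.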
If K has finite moments up to the q-th order, with the moments of order 1 up to q − 1 equal to zero, then q is called the 'order' of the kernel K. Since the unknown function f is assumed to have r continuous derivatives, it typically follows that

  Bias(f̂(x)) = E f̂(x) − f(x) = h^min(q,r) B(x) + o(h^min(q,r))  and  Var(f̂(x)) = (nh)^(−1) V(x) + o((nh)^(−1)),

where B(·) and V(·) are bounded functions depending on K as well as f and its derivatives, cf. [1], p. 8.
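The definition of kernel order can be checked numerically. The sketch below verifies that the Gaussian kernel has order q = 2 (unit integral, vanishing first moment, non-zero second moment), while the classical Gaussian-based fourth-order kernel (3 − u²)φ(u)/2 kills the second moment as well; both kernels are illustrative choices, not ones used later in the paper:

```python
import numpy as np

u = np.linspace(-10.0, 10.0, 20001)
du = u[1] - u[0]
phi = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)   # Gaussian kernel: order 2
k4 = 0.5 * (3.0 - u ** 2) * phi                      # Gaussian-based kernel of order 4

def moment(k, j):
    """Numerical j-th moment int u^j K(u) du on a fine grid."""
    return float(np.sum(u ** j * k) * du)

moments = {
    "phi": [moment(phi, j) for j in range(5)],   # ~ [1, 0, 1, 0, 3]
    "k4": [moment(k4, j) for j in range(5)],     # ~ [1, 0, 0, 0, -3]
}
```

The first non-zero moment beyond the zeroth appears at j = 2 for the Gaussian kernel and at j = 4 for the second kernel, matching orders 2 and 4.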
The idea of choosing a kernel of order q greater than (or equal to) r in order to ensure the bias to be O(h^r) dates back to the early 1960s in the work of [2,3]; recent references on higher-order kernels include [4,5,6,7,8,9,10]. Note that since r is typically unknown and can be arbitrarily large, it is possible to use kernels of infinite order that achieve the minimal bias condition Bias(f̂(x)) = O(h^r) for any r; Ref. [11] gives many properties of kernels of infinite order. In this paper we will employ a particularly useful class of infinite order kernels, namely the flat-top family; see [12] for a general definition.
It is a well-known fact that optimal bandwidth selection is perhaps the most crucial issue in such non-parametric smoothing problems; see [13], as well as the book [14]. The goal typically is minimization of the large-sample mean squared error (MSE) of f̂(x). However, to perform this minimization, the practitioner needs to know the degree of smoothness r, as well as the constants appearing in the bias and variance expansions. Using an infinite order kernel and focusing just on optimizing the order of magnitude of the large-sample MSE, it is apparent that the optimal bandwidth h must be asymptotically of order n^(−1/(2r+1)); this yields a large-sample MSE of order n^(−2r/(2r+1)).
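These two rates can be tabulated directly; the small sketch below (rates up to unspecified constants) shows how a larger smoothness r permits a larger bandwidth and drives the MSE rate toward the parametric rate 1/n:

```python
def optimal_rates(n, r):
    """Order-of-magnitude MSE-optimal bandwidth h ~ n^(-1/(2r+1))
    and the resulting MSE order n^(-2r/(2r+1)); constants omitted."""
    h = n ** (-1.0 / (2.0 * r + 1.0))
    mse = n ** (-2.0 * r / (2.0 * r + 1.0))
    return h, mse

h1, m1 = optimal_rates(10_000, r=1)   # h = n^(-1/3), MSE rate n^(-2/3)
h4, m4 = optimal_rates(10_000, r=4)   # h = n^(-1/9), MSE rate n^(-8/9)
```

For fixed n, increasing r gives a larger optimal bandwidth and a smaller MSE order.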
A generalization of the above scenario is possible using a degree of smoothness r understood in a different sense, and not necessarily an integer. Let ⌊r⌋ denote the integer part of r, and define δ = r − ⌊r⌋; then, one may assume that f has ⌊r⌋ continuous derivatives, and that the ⌊r⌋-th derivative satisfies a Lipschitz condition of order δ. Interestingly, even in this case where f is assumed to belong to the Hölder class of degree r (the derivative of order ⌊r⌋ of the density satisfies the Lipschitz condition), the MSE-optimal bandwidth h is still of order n^(−1/(2r+1)) and again yields a large-sample MSE of order n^(−2r/(2r+1)) (see, e.g., [15,16,17,18] among others).
The problem, of course, is that, as previously mentioned, the underlying degree of smoothness r is typically unknown. In Section 4 of the paper at hand, we develop an estimator of r and prove its strong consistency; this is perhaps the first such result in the literature. In order to construct our estimator, we operate under a class of functions that is slightly more general than the aforementioned Hölder class; this class of functions is formally defined in Section 2 via Equation (3) or (4).
Under such a condition on the tails of the characteristic function, we are able to show in Section 3 that the optimized MSE of f̂(x) is again of order n^(−2r/(2r+1)) for possibly non-integer r; this is true, for example, when the characteristic function has tails of order |t|^(−(r+1)); see Example 2.
Furthermore, in Section 5 we develop an adaptive estimator that achieves the optimal MSE rate of n^(−2r/(2r+1)) within a logarithmic factor despite the fact that r is unknown; see the Examples after Theorem 3. A similar effect arises in the problem of adaptive estimation of densities from the Hölder class; see [18,19,20]. It should be pointed out that problems of asymptotically optimal adaptive density estimation over other classes have also been considered in the literature; see, e.g., [14,21,22,23].
The construction of the adaptive estimator is rather technical; it uses the new estimator of r, and it is inspired by the construction of sequential estimates, although we are in a fixed-n, non-sequential setting. As the major theoretical result of our paper, we prove a non-asymptotic upper bound for the MSE of the adaptive estimator that attains the above-mentioned optimal rate. Section 6 contains some simulation results showing the performance of the new estimator in practice. All proofs are deferred to Section 7, while Section 8 contains our conclusions and suggestions for future work.
2. Problem Set-Up and Basic Assumptions
Let X_1, …, X_n be i.i.d. having a probability density function f. Denote by φ(t) = ∫ e^(itx) f(x) dx the characteristic function of f, and by φ_n(t) = n^(−1) Σ_{j=1}^{n} e^(itX_j) the sample characteristic function. For some finite r, define two families of bounded (i.e., sup_x |f(x)| < ∞) and continuous functions f satisfying one of the following conditions, respectively:

In other words, the first family (introduced by M. Rosenblatt) consists of the functions satisfying (2) and (3), while the second family (introduced in this paper) consists of the functions satisfying (2) and (4). It should be noted that the new class is slightly wider than the classical one.
In addition, define two further families as the subsets of the first and the second family, respectively, consisting of those functions f whose characteristic function has monotonically decreasing tails.
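The sample characteristic function φ_n(t) = n^(−1) Σ_j exp(itX_j) is straightforward to compute; the sketch below evaluates it for Laplace data, whose true characteristic function 1/(1 + t²) has monotonically decreasing tails (a known closed form, used here purely as a convenient test case):

```python
import numpy as np

def ecf(t, data):
    """Sample characteristic function phi_n(t) = (1/n) * sum_j exp(i t X_j)."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    data = np.asarray(data, dtype=float)
    return np.exp(1j * t[:, None] * data[None, :]).mean(axis=1)

rng = np.random.default_rng(1)
sample = rng.laplace(size=5000)          # standard Laplace: phi(t) = 1/(1 + t^2)
phi_hat = ecf([0.0, 1.0, 3.0], sample)   # true values 1, 0.5, 0.1
```

The imaginary parts are close to zero for this symmetric density, and φ_n(t) fluctuates around φ(t) at the usual n^(−1/2) scale.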
Consider the class of non-parametric kernel smoothed estimators f̂(x) of f(x) as given in Equation (1). Note that we can alternatively express f̂(x) in terms of the Fourier transform of the kernel K, i.e.,

  f̂(x) = (1/(2π)) ∫ e^(−itx) λ(ht) φ_n(t) dt, where λ(s) = ∫ e^(isu) K(u) du.

In this paper, we will employ the family of flat-top infinite order kernels, i.e., we will let the function λ be of the form

  λ(s) = 1 for |s| ≤ c, and λ(s) = g(s) for |s| > c,

where c is a fixed number chosen by the practitioner, and g is some properly chosen continuous, real-valued function satisfying g(s) = g(−s) and |g(s)| ≤ 1 for any s, with g(c) = 1 and ∫ |g(s)| ds < ∞; see [12,24,25,26] for more details on the above flat-top family of kernels.
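A minimal numerical sketch of the flat-top construction, using the trapezoidal λ (equal to 1 on [−c, c] and decaying linearly to 0 at |s| = 1) with c = 0.5; this particular shape and all tuning values below are illustrative choices, and the estimate is computed through the Fourier form given above:

```python
import numpy as np

def lam(s, c=0.5):
    """Trapezoidal flat-top transform: 1 on [-c, c], linear down to 0 at |s| = 1."""
    return np.clip((1.0 - np.abs(np.asarray(s, dtype=float))) / (1.0 - c), 0.0, 1.0)

def flat_top_kde(x, data, h, c=0.5):
    """Estimate f(x) as (1/(2 pi)) * int exp(-i t x) lam(h t) phi_n(t) dt.
    lam(h t) vanishes for |t| >= 1/h, so the integral is over a finite range."""
    data = np.asarray(data, dtype=float)
    t = np.linspace(-1.0 / h, 1.0 / h, 1001)
    dt = t[1] - t[0]
    phi_n = np.exp(1j * t[:, None] * data[None, :]).mean(axis=1)
    integrand = np.exp(-1j * t * x) * lam(h * t, c) * phi_n
    return float(np.real(integrand.sum() * dt) / (2.0 * np.pi))

rng = np.random.default_rng(2)
sample = rng.laplace(size=2000)          # true Laplace density at 0 is 0.5
est = flat_top_kde(0.0, sample, h=0.1)
```

Because λ(ht) = 1 on a growing frequency band as h decreases, the low-frequency content of φ is passed through without attenuation, which is the mechanism behind the small bias of flat-top kernels.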
Define the partial derivative of the function f̂(x) with respect to the bandwidth h. We will also assume that condition (5) below holds for some constant. Denote, for every admissible argument, the auxiliary functions used in the sequel. From (3) and (5) it follows that these functions tend to the indicated limits in the corresponding parameter ranges, and vanish in the other cases. Define, finally, the corresponding classes of densities. The main aim of the paper is the estimation of the parameter r of these classes, and the adaptive estimation of densities from the class with the unknown parameter r.
3. Asymptotic Mean Square Optimal Estimation
The mean square error (MSE) of the estimators f̂(x) has the following form:

where the first summand is the principal term of the MSE. To minimize the principal term with respect to h, we set its first derivative with respect to h equal to zero, which gives the following equality for the optimal (in the mean square sense) value of the bandwidth. From the definition of the class of kernels we have, for h small enough and according to (6), the corresponding expansions. Then, by the definition of the class, for h small enough we obtain a further expansion and, from (8), the resulting asymptotics.
Define the number determined by the equality displayed below; obviously, the stated bounds hold. Then, from (7) and (9), as n → ∞ we obtain the corresponding limit relations. In this way we have proved the following theorem, which gives the rates of convergence of the resulting random quantities. We can loosely call them 'estimators', although it is clear that these functions cannot be considered estimators in the usual sense, in view of the dependence of the bandwidths on the unknown parameter r. Nevertheless, this theorem can be used for the construction of bona fide adaptive estimators with the optimal and suboptimal convergence rates; see Examples 1 and 2, as well as Section 5.3 in what follows.
Theorem 1. Under the stated assumptions, for the asymptotically optimal (with respect to the bandwidth h) in the MSE sense 'estimator' of the function f, and for the suboptimal 'estimator' based on (9), the following limit relations hold as n → ∞.

Remark 1. The definition (9) of the suboptimal bandwidth is essentially simpler than the definition (8) of the optimal bandwidth. From Theorem 1 it follows that the (slightly) suboptimal 'estimator' can be successfully used instead. It should be noted that the parameter entering these definitions is chosen by the practitioner, and in the relevant case we want to choose it close to 0.
In the sequel we shall write asymptotic relations of this form, as n → ∞, instead of the corresponding limit relations.
Example 1. Consider the problem of estimating a function satisfying the following additional condition, using the kernel estimator. By making use of (9) and (10) we find the rate of convergence of the MSE. To this end we calculate the relevant quantities, and it is easy to verify the resulting relations. Thus, from (9), as n → ∞, the suboptimal bandwidth is a solution of the corresponding equation, and therefore, as n → ∞, we obtain the stated rate. Consider the piecewise linear flat-top kernel introduced by [25] (see [26] as well), where (·)⁺ denotes the positive part function. Then, from (8), we obtain the optimal bandwidth and, for n large enough, the corresponding expansion; thus, similarly to the above, we find the stated rates.

Example 2. Consider the problem of estimating a function satisfying the following additional condition, using the kernel estimator. Using (9) and (10), we find the rate of convergence of the MSE. To this end, we calculate the relevant quantities, and it is easy to verify the resulting relations. Thus, from (9), as n → ∞, and similarly to Example 1, we find the stated rates.

4. Estimation of the Degree of Smoothness r
Let two sequences of positive numbers be given, chosen by the practitioner, the first tending to zero and the second tending to infinity as n → ∞. The first sequence represents the 'grid'-size in our search for the correct exponent, while the second represents an upper bound that limits this search.
Define the following sets of non-random sequences
Remark 2. Formally, the definitions of these sets and, consequently, of the estimators, as well as of the sets defined below, depend on the unknown function f. At the same time, the first set (and, consequently, the corresponding estimator and derived set) can be defined independently of f. Indeed, for appropriately chosen sequences the required inclusion holds (consider, for simplicity, the simplest case). According to the definition of the second class, however, it is impossible to find elements of the corresponding set independently of the function to be estimated without using some a priori information about it. Consider one simple example: suppose, in addition, that a known bound on the smoothness is available; then, for appropriately chosen sequences, the required inclusion again holds. Further examples are given in Example 3 (see also Remark 3 and Example 4).
For an arbitrary given value chosen by the practitioner, define the estimators of the parameter r in (3) and (4) as follows:
Example 3. For the functions from Examples 1 and 2, we can use the definitions (11) and (12) with the following choices of the sequences, arbitrary up to the stated constraints as n → ∞. Indeed, the required bounds hold for the functions of Example 1 and of Example 2, and it follows that the corresponding classes are not empty.

Lemma 1. Under the stated assumptions, for every admissible choice of parameters there exist positive numbers such that the displayed bounds hold for every n.

Define the sets
and
of non-random sequences
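Although the exact definitions (11) and (12) involve conditions not reproduced here, their underlying idea (scan a grid of candidate exponents and keep the largest one for which an empirical tail bound on |φ_n| still holds) can be sketched as follows; the grid, tail range, and constant C below are illustrative assumptions, not the paper's construction:

```python
import numpy as np

def estimate_r(data, delta=0.05, r_max=5.0, t_lo=5.0, t_hi=15.0, C=5.0):
    """Illustrative grid-search stand-in for estimators of type (11)-(12):
    keep the largest rho on the grid {delta, 2*delta, ...} such that
    |phi_n(t)| <= C * t^(-(rho + 1)) over a tail range of t."""
    data = np.asarray(data, dtype=float)
    t = np.linspace(t_lo, t_hi, 200)
    a = np.abs(np.exp(1j * t[:, None] * data[None, :]).mean(axis=1))
    grid = np.arange(delta, r_max + delta, delta)
    ok = [rho for rho in grid if np.all(a <= C * t ** (-(rho + 1.0)))]
    return max(ok) if ok else 0.0

rng = np.random.default_rng(3)
# Laplace: |phi(t)| = 1/(1 + t^2), tails of order |t|^(-2), i.e., r = 1.
r_hat = estimate_r(rng.laplace(size=20000))
```

On Laplace data the returned value is of the order of the true r = 1; the consistency properties of the paper's actual estimators are the subject of Theorem 2.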
Remark 3. It can be directly verified that, under the conditions of Remark 2, the indicated sequences belong to the corresponding sets. Moreover, under the conditions of Example 3, suitable sequences are obtained with the stated choices.

Example 4. Consider the functions from Examples 1 and 2 and suppose that the smoothness parameter is bounded by some known number. Then the required sequences are obtained with the stated choices.

Theorem 2. The estimators defined in (11) and (12), respectively, have the following properties: (a) under the corresponding conditions on the sequences, the first estimator converges almost surely; (b) under the analogous conditions, so does the second; furthermore, under additional conditions on the sequences, corresponding convergence properties hold for both.

6. Simulation Results
In this section we provide results of a simulation study regarding the estimators introduced in
Section 3.
Two flat-top kernels have been used in the simulation. The first one has the piecewise linear kernel characteristic function introduced in [
26], i.e.,
The piecewise linear characteristic function and corresponding kernel are shown in
Figure 1.
The second case refers to the infinitely differentiable flat-top kernel characteristic function defined in [
28], i.e.,
The characteristic function and kernel of the second case are shown in
Figure 2.
We examine kernel density estimators of the triangular, exponential, Laplace, and gamma (with various shape parameters) distributions.
Figure 3,
Figure 4 and
Figure 5 illustrate the estimator MSE as a function of the sample size.
Using the notation H(·) for the Heaviside step function, the triangular density is f(x) = (1/a)(1 − |x|/a) H(a − |x|), with characteristic function (sin(at/2)/(at/2))²; the Laplace density has characteristic function 1/(1 + b²t²), and the gamma density with shape parameter k has characteristic function (1 − ibt)^(−k), where a and b denote the corresponding scale parameters.
In all cases we choose the scale parameter so that the variance equals 1, and consider estimation of the density function at a fixed point.
All the above-mentioned characteristic functions satisfy condition (4): with r = 1 for the triangular and Laplace distributions, and r = k − 1 for the gamma distribution with shape parameter k; therefore, all of these distributions belong to the corresponding family with the respective value of r. In addition, all of the characteristic functions meet the requirements of Example 2. Thus, the bandwidth can be taken in the form prescribed there, and the expected convergence rate of the kernel estimator MSE is n^(−2r/(2r+1)).
The main goal of the simulation study is the investigation of the MSE behavior for the kernel estimator as the sample size grows. We generate sequences of 150 samples for sample sizes from 25 to 2000 with step 25 and, for some distributions, for sample sizes from 2000 to 20,000 with step 100 or 200. Then, for each sample size, we calculate the estimator MSE multiplied by n^(2r/(2r+1)) and expect visual stabilization of the sequence of resulting values as n grows.
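A condensed version of this experiment can be sketched as follows, using the trapezoidal flat-top kernel in its Fourier form, Laplace data (r = 1, density 0.5 at zero), the bandwidth h = n^(−1/(2r+1)), and far fewer replications and sample sizes than in the figures; the evaluation point and all constants are illustrative choices:

```python
import numpy as np

def flat_top_kde(x, data, h, c=0.5):
    """Flat-top estimate of f(x) via (1/(2 pi)) * int exp(-i t x) lam(h t) phi_n(t) dt,
    with the trapezoidal lam: 1 on [-c, c], linearly decaying to 0 at |s| = 1."""
    data = np.asarray(data, dtype=float)
    t = np.linspace(-1.0 / h, 1.0 / h, 801)
    dt = t[1] - t[0]
    lam = np.clip((1.0 - np.abs(h * t)) / (1.0 - c), 0.0, 1.0)
    phi_n = np.exp(1j * t[:, None] * data[None, :]).mean(axis=1)
    return float(np.real(np.sum(np.exp(-1j * t * x) * lam * phi_n) * dt) / (2.0 * np.pi))

rng = np.random.default_rng(4)
r, true_f = 1.0, 0.5                      # Laplace: smoothness r = 1, f(0) = 0.5
rate = 2.0 * r / (2.0 * r + 1.0)          # MSE is expected to decay like n^(-rate)
scaled = {}
for n in (200, 800):
    h = n ** (-1.0 / (2.0 * r + 1.0))
    errs = [(flat_top_kde(0.0, rng.laplace(size=n), h) - true_f) ** 2
            for _ in range(60)]
    scaled[n] = float(np.mean(errs)) * n ** rate   # should stabilize as n grows
```

The scaled MSE values for the two sample sizes are of the same order, which is the stabilization pattern the figures display on a much finer grid of sample sizes.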
Typical examples of the simulation results are presented in Figure 3, Figure 4, and Figure 5, corresponding to different distributions and values of r. The expected stabilization of the scaled MSE is observed in all cases. Moreover, increasing r enlarges the sample size needed to achieve the limiting asymptotic behavior: for the smaller values of r, stabilization is seen already at moderate sample sizes, while for the largest r considered the asymptotic behavior is observed to start from sample size 15,000.
8. Conclusions
Non-parametric kernel estimation crucially depends on the bandwidth choice which, in turn, depends on the smoothness of the underlying function. Focusing on estimating a probability density function, we define a smoothness class and propose a data-based estimator of the underlying degree of smoothness. Almost sure convergence rates of the proposed estimators are obtained. Adaptive density estimators over the given class, based on the constructed smoothness parameter estimators, are also presented, and their consistency is established. Simulation results illustrate the onset of the asymptotic behavior as the sample size grows.
Recently, there has been an increasing interest in nonparametric estimation with dependent data both in terms of theory as well as applications; see, e.g., [
15,
30,
31,
32,
33]. With respect to probability density estimation, many asymptotic results remain true when moving from i.i.d. data to data that are weakly dependent. For example, the estimator's variance, bias, and MSE have the same asymptotic expansions as in the i.i.d. case, subject to some limitations on the allowed bandwidth rate; fortunately, the optimal bandwidth rate of n^(−1/(2r+1)) is in the allowed range; see [34,35].
Consequently, it is conjectured that our proposed estimator of smoothness, as well as the resulting data-based bandwidth choice and probability density estimator, will retain their validity even when the data are weakly dependent. Future work may confirm this conjecture, especially since working with dependent data can be quite intricate. For example, [36] extended the results of [34] from the realm of linear time series to strong-mixing processes. In so doing, Remark 5 of [36] pointed to a nontrivial error in the work of [34] which is directly relevant to optimal bandwidth choice.