Abstract
In this paper, we estimate the Shannon entropy H(f) = −∫ f(x) log f(x) dx of a one-sided linear process with probability density function f(x). We employ the integral estimator H_n, which utilizes the standard kernel density estimator f_n(x) of f(x). We show that H_n converges to H(f) almost surely and in L² under reasonable conditions.
1. Introduction
Let f be the common probability density function of a sequence of identically distributed observations. The associated Shannon entropy
H(f) = −∫_ℝ f(x) log f(x) dx   (1)
of such an observation was first introduced by Shannon (1948). In his 1948 paper, Shannon utilized this tool in his mathematical investigation of the theory of communication. Today, entropy is widely applied in the fields of information theory, statistical classification, pattern recognition and so on, since it is a measure of the amount of uncertainty present in a probability distribution.
In the literature, several estimators of the Shannon entropy have been introduced; see Beirlant et al. (1997) for an overview. Many of these estimators have been studied in the case where the data are independent. Ahmad and Lin (1976) obtained results using the resubstitution estimator −(1/n) Σ_{i=1}^n log f_n(X_i) for independent data X_1, …, X_n, where f_n is the kernel density estimator. In particular, they showed consistency in the first and second mean under certain regularity conditions. Dmitriev and Tarasenko (1973) reported results for estimating functionals of a density and its derivatives, where the common density of the independent observations is assumed to have at least k derivatives. Plugging kernel density estimators (see their paper and references therein) into the arguments of H and integrating only over a symmetric interval determined by a sequence of a certain order, they provided a result for the estimation of the Shannon entropy using the estimator that Beirlant et al. (1997) refer to as the integral estimator. Their results give conditions for almost sure convergence.
Interestingly enough, Dmitriev and Tarasenko (1973) also provided (because their work is a more general investigation of functionals) a result for the estimation of the quadratic Rényi entropy −log ∫ f²(x) dx. Conditions are given specifically for the almost sure convergence of their estimator to the true value. The estimation of Rényi entropy in the dependent case is challenging. A dependent case is treated by Sang et al. (2018), who studied the estimation of the quadratic Rényi entropy for the one-sided linear process. Utilizing the Fourier transform along with the projection method, they demonstrated that the kernel entropy estimator satisfies a central limit theorem for short memory linear processes.
Studying the Shannon entropy of dependent data is also a challenging problem, and to the best of our knowledge, general results for the Shannon entropy estimation of regular time series data are still unknown. In this paper, we study the Shannon entropy of the one-sided linear process
X_n = Σ_{i=0}^∞ a_i ε_{n−i},   (2)
where the innovations ε_i are independent and identically distributed real-valued random variables on some probability space with mean zero and finite variance, and where the collection of real coefficients {a_i} satisfies Σ_{i=0}^∞ a_i² < ∞. Additionally, we will require that the common density of the innovations be bounded. The estimator we utilize employs the kernel method, which was first introduced by Rosenblatt (1956) and Parzen (1962). The kernel estimator will be denoted by
f_n(x) = (1/(n h_n)) Σ_{i=1}^n K((x − X_i)/h_n),   (3)
where the sequence {h_n} provides the bandwidths, and K is the kernel function, which satisfies ∫_ℝ K(u) du = 1. Typically, the kernel function is a probability density function.
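As a concrete illustration, the sketch below (ours; the Gaussian kernel, the geometric coefficients a_i = 2^{−i}, the truncation lag, and the bandwidth h_n = n^{−1/5} are all assumptions made for this example) simulates a truncated short memory linear process and evaluates its kernel density estimate on a grid.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_process(n, coeffs, rng):
    """Simulate X_t = sum_i coeffs[i] * eps_{t-i}, truncated at
    len(coeffs) lags, with i.i.d. standard normal innovations."""
    eps = rng.standard_normal(n + len(coeffs) - 1)
    return np.convolve(eps, coeffs, mode="valid")  # length n

def kde(x, data, h):
    """Kernel density estimator f_n(x) with a Gaussian kernel."""
    u = (x[:, None] - data[None, :]) / h
    return (np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)).mean(axis=1) / h

n = 5000
coeffs = 0.5 ** np.arange(25)        # sum |a_i| < infinity: short memory
X = linear_process(n, coeffs, rng)
h_n = n ** (-1 / 5)                  # an assumed, commonly used bandwidth order
grid = np.linspace(-5.0, 5.0, 1001)
f_n = kde(grid, X, h_n)
```

Since the kernel is a probability density, f_n is nonnegative and integrates to (approximately, on a finite grid) one.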
This method has proven to be successful in estimating probability density functions and their derivatives, regression functions, etc., in both the independent and dependent setting. For the independent setting, see the books (Devroye and Györfi (1985); Silverman (1986); Nadaraya (1989); Wand and Jones (1995); Schimek (2000); Scott (2015)) and the references therein. For the dependent setting, we refer the reader to (Tran (1992); Honda (2000); Wu and Mielniczuk (2002); Wu et al. (2010)). Bandwidth selection is an important issue in kernel density estimation, and there is a lot of research in this direction. See, e.g., Duin (1976); Rudemo (1982); Slaoui (2014, 2018).
A few remarks about notation and terms used in the paper follow. Let (u_n) and (v_n) be real-valued sequences. By u_n = o(v_n) we understand that u_n/v_n → 0, and u_n = O(v_n) means that |u_n| ≤ C|v_n| for some positive number C. Essentially, this is the standard Landau little-oh and big-oh notation. When we write u_n ≲ v_n, we mean u_n = O(v_n), and, as one might guess, u_n ≳ v_n means v_n = O(u_n). We also employ the notation u_n ≍ v_n to indicate that u_n ≲ v_n ≲ u_n. A function ℓ is referred to as slowly varying (at ∞) if it is positive and measurable on [A, ∞) for some A > 0 such that lim_{x→∞} ℓ(λx)/ℓ(x) = 1 holds for each λ > 0. The set of all functions which are Hölder continuous of some order r ∈ (0, 1] will be denoted as C^r. That is, for each g ∈ C^r there exists a constant C > 0 such that for all x, y, we have |g(x) − g(y)| ≤ C|x − y|^r, and when r = 1, we recognize this as the well-known Lipschitz condition. The notation L^p(E) with p ≥ 1 represents the set of all real-valued functions f defined on some measure space (E, 𝒜, μ) having the property that ∫_E |f|^p dμ < ∞. In the case that E ⊆ ℝ, and unless otherwise specified, the measure μ is tacitly understood to be Lebesgue measure and 𝒜 is assumed to contain the Borel sets. L^∞(E) refers to the set of real-valued functions defined on E which are bounded almost everywhere. Whenever the domain space of the function is understood, we may simply write L^p.
The following are bandwidth, kernel, and density conditions that we shall refer to throughout this paper:
- B.1
- ;
- K.1
- for some is bounded with bounded support;
- K.2
- ;
- D.1
- ;
- D.2
- ;
- D.3
- .
Notice that the bandwidth, kernel, and density conditions are prefixed using B, K, and D, respectively.
In this first section, we have provided an introduction to the problem, a survey of past research in this area, and the notation used throughout. The main results are reported in Section 2. In Section 3, we present the proofs of the main results. Finally, Appendix A introduces the reader to foundational results that will be required in the proofs of our main results.
2. Main Results
If {ε_i} is a sequence of independent and identically distributed random variables over a common probability space, with each ε_i ∈ L² and E[ε_i] = 0, and {a_i} is a sequence of real coefficients such that Σ_{i=0}^∞ a_i² < ∞, then the linear process given in (2) exists and is well-defined. For the case where the innovations have finite variance, we say that the process has short memory (short-range dependence) if Σ_{i=0}^∞ |a_i| < ∞ and Σ_{i=0}^∞ a_i ≠ 0, and long memory (long-range dependence) otherwise.
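The dichotomy can be checked numerically. In this sketch (our illustration; the coefficient choices are assumptions), geometrically decaying coefficients have convergent absolute partial sums, while a_i = (i + 1)^{−0.7} is square-summable yet not absolutely summable, so the corresponding process has long memory.

```python
import numpy as np

i = np.arange(200000)
a_short = 0.5 ** i            # sum |a_i| = 2 < infinity: short memory
a_long = (i + 1.0) ** -0.7    # sum a_i^2 < infinity, but sum |a_i| diverges

abs_partial_short = np.cumsum(np.abs(a_short))
abs_partial_long = np.cumsum(np.abs(a_long))

# Geometric partial sums stabilize at 1 / (1 - 0.5) = 2, while the
# polynomial partial sums keep growing (like N^{0.3}) without bound.
sq_sum_long = float(np.sum(a_long**2))  # finite, so the process exists
```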
Let f be the probability density function of the linear process {X_n} defined in (2). In this paper, we estimate the Shannon entropy of the linear process. To do this, we employ the integral estimator
H_n = −∫_{A_n} f_n(x) log f_n(x) dx,   (4)
where f_n is the standard kernel density estimator defined in (3). The (random) sets A_n are given by
A_n = {x ∈ ℝ : f_n(x) ≥ γ_n},
where (γ_n) is an appropriately defined sequence in (0, ∞) that converges to zero.
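A minimal numerical sketch of the integral estimator follows (ours, for illustration; the Gaussian kernel, the i.i.d. N(0, 1) sample standing in for the linear process, the bandwidth, and the threshold sequence, denoted gamma_n here, are all assumptions). The domain of integration is restricted to the random set where the kernel density estimate exceeds the vanishing threshold.

```python
import numpy as np

rng = np.random.default_rng(1)

def kde(x, data, h):
    """Gaussian-kernel density estimator f_n evaluated on a grid."""
    u = (x[:, None] - data[None, :]) / h
    return (np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)).mean(axis=1) / h

def integral_entropy(data, h, gamma_n, grid):
    """H_n = -integral over A_n of f_n log f_n, where
    A_n = {x : f_n(x) >= gamma_n}, approximated on a grid."""
    f_n = kde(grid, data, h)
    dx = grid[1] - grid[0]
    on_A_n = f_n >= gamma_n          # the (random) set A_n
    return float(-np.sum(f_n[on_A_n] * np.log(f_n[on_A_n])) * dx)

n = 4000
X = rng.standard_normal(n)           # i.i.d. stand-in for the linear process
H_n = integral_entropy(X, h=n ** (-1 / 5), gamma_n=1e-3,
                       grid=np.linspace(-6.0, 6.0, 1201))
# For N(0, 1), the true entropy is 0.5 * log(2 * pi * e) ~ 1.4189,
# so H_n should land near that value.
```

Restricting to A_n keeps log f_n bounded on the integration region, which is what makes the estimator well-behaved where the density estimate is small.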
Our estimator utilizes the kernel method of density estimation, and we will accordingly require adherence of the kernel to certain conditions. In addition, we impose some conditions on the bandwidths and on some of the densities of the problem. These conditions were listed in the previous section. Based on these conditions, let us consider the properties of the estimator (4). We proceed in a manner similar to the analysis done by Bouzebda and Elhattab (2011) for the independent case.
Theorem 1.
Let {X_n} be the linear process given in (2), and assume that it has short memory. Furthermore, assume that the entropy H(f) is finite. If the bandwidth, kernel, and density conditions listed earlier are satisfied, then
is bounded almost surely whenever the condition is imposed on the sequence .
Corollary 1.
If the conditions of Theorem 1 hold, then H_n → H(f) almost surely.
Theorem 2.
Let {X_n} be the linear process given in (2), and assume that it has short memory. Furthermore, assume that the entropy H(f) is finite. If the bandwidth, kernel, and density conditions listed earlier are satisfied, then
is bounded whenever the condition is imposed on the sequence .
Corollary 2.
If the conditions of Theorem 2 hold, then the mean squared error (MSE) satisfies E[(H_n − H(f))²] → 0 as n → ∞.
Remark 1.
In this paper, we work on entropy estimation for short memory linear processes by applying the integral method. It would be interesting to know whether similar results hold for long memory linear processes, and whether the resubstitution method works for dependent data such as linear processes. However, research in these directions is beyond the scope of this paper, and we leave it for future work.
Remark 2.
In a wide range of disciplines, including finance, geology, and engineering, many time series may be modeled as a linear process. In such instances, our result provides a method for estimating the associated Shannon entropy. One example is the discriminatory data on the arrival phases of earthquakes and explosions recorded at a seismic recording station. Another is data on returns from the New York Stock Exchange. See these and many other time series data sets in the book by Shumway and Stoffer (2011) and other books on time series.
3. Proofs
Lemma 1.
If the conditions of Theorem 1 (or Theorem 2) hold, then
almost surely.
Proof.
This lemma follows from Theorem 2 of Wu et al. (2010) (see their discussion immediately after the statement of Theorem 2 and in the penultimate paragraph of section 4.1). See also the discussion in the Appendix A on fundamental results. □
Lemma 2.
If the conditions of Theorem 1 (or Theorem 2) hold, then
Proof.
Because , there exists such that
for sufficiently large n. Therefore,
as , from which (8) follows. □
Note. Our use of Lemma 2 in the proofs of Theorems 1 and 2 will be tacit.
Lemma 3.
If ν is a finite signed measure that is absolutely continuous with respect to a measure μ, then corresponding to every positive number ε there is a positive number δ such that |ν|(E) < ε whenever E is a measurable set for which μ(E) < δ.
Proof.
This is a basic result from measure theory. See, for example, Theorem B of Halmos (1974) in section 30. □
Proof of Theorem 1.
We begin with the decomposition
where
and
First, we consider . Using the inequality
for , we notice that for all , we have
It follows that
since the kernel K integrates to unity over the real line.
Next, we consider . Since the set over which we are integrating may be changed to without affecting the value of , we may assume that f is positive on . Using the inequality
for , we notice that for all , we have
if we can justify the existence of . To that end, define
and note that for all , we have
Taking the supremum over x yields
by Lemma 1. Note that
since
This guarantees the existence we sought to establish. We continue with
since the kernel K integrates to unity over the real line.
Therefore,
where the last expression is constant almost surely by Lemma 1 and since . □
Proof of Corollary 1.
By the triangle inequality
where
and
Since almost surely by Theorem 1, we only need to contend with . That is, we need to show that
almost surely as .
For any Borel measurable set E, consider
and define the signed measure
Since , both and are finite measures, and thus, is a finite signed measure that is absolutely continuous with respect to P. Because of Lemma 3, it suffices for us to demonstrate that
almost surely. For any , we have . By Lemma 1, there exists such that almost surely, and hence, we have shown that almost surely, where
It is easy to see that
almost surely, since as . □
Proof of Theorem 2.
We start with
Recall inequality (11) in the proof of Theorem 1. Arguing in a similar manner as before, we can demonstrate the existence of so that
Notice also that
Therefore,
from which the result follows. □
Proof of Corollary 2.
Note the decomposition
By Theorem 2, . Now, let
and
Recall from (15) in the proof of Corollary 1 that almost surely. Because , it follows that , and moreover, . Hence,
and the Lebesgue Dominated Convergence Theorem guarantees that
thereby proving the corollary. □
Author Contributions
Conceptualization, T.F. and H.S.; Methodology, T.F. and H.S.; Formal Analysis, T.F. and H.S.; Investigation, T.F. and H.S.; Writing—Original Draft Preparation, T.F.; Writing—Review & Editing, H.S.; Supervision, H.S.; Funding Acquisition, H.S. Both authors have read and agreed to the published version of the manuscript.
Funding
This research is supported in part by the Simons Foundation Grant 586789.
Acknowledgments
The authors are grateful to the referees and Daniel J. Henderson for carefully reading the paper and for insightful suggestions that significantly improved the presentation of the paper. The research is supported in part by the Simons Foundation Grant 586789 and the College of Liberal Arts Faculty Grants for Research and Creative Achievement at the University of Mississippi.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A
Wu et al. (2010) establish results that are very useful for our proofs. Here, we briefly survey their definitions and results, which show that the kernel density estimator for one-sided linear processes enjoys properties similar to the independent case (see Stute 1982). Their work identifies conditions under which the kernel density estimator enjoys strong uniform consistency for a wide class of time series, including the linear process in (2).
As is common in the analysis of time series, we allude to an independent and identically distributed collection of random variables {ε_i}, typically referred to as the innovations. Note that many time series models fit the form
X_n = J(ε_n, ε_{n−1}, …),   (A1)
which regards the X_n as a system dependent on the innovations. Here, J is some measurable function, referred to as the filter. In this context, we also need to define the sigma algebras
where . In addition, let be an independent and identical copy of which is, of course, independent of all the . For , define
and for , put .
Define the l-step ahead conditional distribution by
where and . When it exists, the l-step ahead conditional density is
As Wu et al. (2010) note, a sufficient condition for the existence of a marginal density of (A1) is that exists and is uniformly bounded almost surely by some . We shall refer to this as the marginal condition. Similarly, , where if and if . Also, .
With this setup, the authors introduce the following measures of the dependence present in the system (A1). Now, for , define a pointwise measure of difference by
and an -integral measure of difference over by
Finally, define an overall measure of difference by
The distances on the derivatives are defined similarly, as given below.
With this setup, we can now report the following result of (Wu et al. 2010, Theorem 2).
Theorem A1.
Assume that, for some positive r and s, we have that is a bounded function with bounded support and that . Further, assume the marginal condition, and assume that , where and where is a slowly varying function. If , then
where is another slowly varying function.
Now consider our particular case when the filter is the linear process of (2). In view of our assumption that the innovations have finite variance and because we assume the coefficients are square-summable, . Moreover, we assume all of the bandwidth, kernel, and density conditions listed earlier, from which it easily follows that the marginal condition is satisfied. For the short memory linear process (under the aforementioned assumptions), Wu et al. (2010) demonstrated that . Also, notice that condition B.1 implies that . Therefore, the theorem of Wu et al. (2010) applies to (2).
In addition, the well-known Taylor series argument under the conditions K.2 and K.3, as well as D.3, yields
so, collectively, we see that
Basic methods of differential calculus show that is minimized when satisfies B.1. Indeed, the optimum value of has the exact order of .
References
- Ahmad, Ibrahim, and Pi-Erh Lin. 1976. A nonparametric estimation of the entropy for absolutely continuous distributions. IEEE Transactions on Information Theory 22: 372–75.
- Beirlant, Jan, Edward J. Dudewicz, László Györfi, and Edward C. van der Meulen. 1997. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences 6: 17–39.
- Bouzebda, Salim, and Issam Elhattab. 2011. Uniform-in-bandwidth consistency for kernel-type estimators of Shannon’s entropy. Electronic Journal of Statistics 5: 440–59.
- Devroye, Luc, and László Györfi. 1985. Nonparametric Density Estimation: The L1 View. New York: Wiley.
- Dmitriev, Yu G., and Felix P. Tarasenko. 1973. On the estimation of functions of the probability density and its derivatives. Theory of Probability and Its Applications 18: 628–33.
- Duin, Robert P. W. 1976. On the choice of smoothing parameters of Parzen estimators of probability density functions. IEEE Transactions on Computers C-25: 1175–79.
- Halmos, Paul R. 1974. Measure Theory. New York: Springer.
- Honda, Toshio. 2000. Nonparametric density estimation for a long-range dependent linear process. Annals of the Institute of Statistical Mathematics 52: 599–611.
- Nadaraya, Elizbar Akakevič. 1989. Nonparametric Estimation of Probability Densities and Regression Curves. Dordrecht: Kluwer Academic Publishers.
- Parzen, Emanuel. 1962. On estimation of a probability density function and mode. Annals of Mathematical Statistics 33: 1065–76.
- Rosenblatt, Murray. 1956. Remarks on some nonparametric estimates of a density function. Annals of Mathematical Statistics 27: 832–37.
- Rudemo, Mats. 1982. Empirical choice of histograms and kernel density estimators. Scandinavian Journal of Statistics 9: 65–78.
- Sang, Hailin, Yongli Sang, and Fangjun Xu. 2018. Kernel entropy estimation for linear processes. Journal of Time Series Analysis 39: 563–91.
- Schimek, Michael G. 2000. Smoothing and Regression: Approaches, Computation, and Application. Hoboken: John Wiley & Sons.
- Scott, David W. 2015. Multivariate Density Estimation: Theory, Practice, and Visualization, 2nd ed. Hoboken: John Wiley & Sons.
- Shannon, Claude E. 1948. A mathematical theory of communication. Bell System Technical Journal 27: 379–423.
- Shumway, Robert H., and David S. Stoffer. 2011. Time Series Analysis and Its Applications, 3rd ed. New York: Springer.
- Silverman, Bernard W. 1986. Density Estimation for Statistics and Data Analysis. London: Chapman and Hall.
- Slaoui, Yousri. 2014. Bandwidth selection for recursive kernel density estimators defined by stochastic approximation method. Journal of Probability and Statistics 2014: 739640.
- Slaoui, Yousri. 2018. Bias reduction in kernel density estimation. Journal of Nonparametric Statistics 30: 505–22.
- Stute, Winfried. 1982. A law of the logarithm for kernel density estimators. Annals of Probability 10: 414–22.
- Tran, Lanh Tat. 1992. Kernel density estimation for linear processes. Stochastic Processes and their Applications 41: 281–96.
- Wand, Matt P., and M. Chris Jones. 1995. Kernel Smoothing. London: Chapman and Hall.
- Wu, Wei Biao, Yinxiao Huang, and Yibi Huang. 2010. Kernel estimation for time series: An asymptotic theory. Stochastic Processes and their Applications 120: 2412–31.
- Wu, Wei Biao, and Jan Mielniczuk. 2002. Kernel density estimation for linear processes. Annals of Statistics 30: 1441–59.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).