Shannon entropy estimation for linear processes

In this paper, we estimate the Shannon entropy $S(f) = -\mathbb{E}[\log f(X)]$ of a one-sided linear process with probability density function $f(x)$, where $X$ is distributed according to $f$. We employ the integral estimator $S_n(f)$, which utilizes the standard kernel density estimator $f_n(x)$ of $f(x)$. We show that $S_n(f)$ converges to $S(f)$ almost surely and in $\mathcal{L}^2$ under reasonable conditions.


Introduction
Let $f(x)$ be the common probability density function of a sequence $\{X_n\}_{n=1}^{\infty}$ of identically distributed observations. The associated Shannon entropy of such an observation was first introduced by Claude Shannon [16]. In his 1948 paper, Shannon utilized this tool in his mathematical investigation of the theory of communication. Today, entropy is widely applied in the fields of information theory, statistical classification, pattern recognition and so on, since it is a measure of the amount of uncertainty present in a probability distribution.
In the literature, several estimators for the Shannon entropy have been introduced; see Beirlant et al. [2] for an overview. Many of these estimators have been studied in cases where the data are independent. In 1976, Ahmad and Lin [1] obtained results using the resubstitution estimator $H_n = -\frac{1}{n}\sum_{i=1}^{n} \ln f_n(X_i)$ for independent data $\{X_i\}_{i=1}^{n}$. In particular, they showed consistency in the first and second mean under certain regularity conditions; here, $f_n(x)$ is the kernel density estimator. Dmitriev and Tarasenko [5] reported results in 1973 for estimating functionals of the type $\int H\big(f(x), f'(x), \ldots, f^{(k)}(x)\big)\,dx$, where the common density $f(x)$ of the independent $X_i$ is assumed to have at least $k$ derivatives. Plugging in kernel density estimators (see their paper and the references therein) for the arguments of $H$ and integrating only over the symmetric interval $[-k_n, k_n]$, which is determined by a sequence $\{k_n\}_{n=1}^{\infty}$ of a certain order, they provided a result for the estimation of Shannon entropy using the estimator that Beirlant et al. [2] refer to as the integral estimator. Their results give conditions for almost sure convergence.
Interestingly enough, Dmitriev and Tarasenko [5] also provided (because their work is a more general investigation of functionals) a result for the estimation of the quadratic Rényi entropy $Q(f) = \int f^2(x)\,dx$. Conditions are provided specifically for the almost sure convergence of their estimator to the true value $Q(f)$. The estimation of Rényi entropy in the dependent case is challenging. A dependent case is treated by Sang, Sang, and Xu [13], who studied the estimation of the quadratic Rényi entropy for the one-sided linear process. Utilizing the Fourier transform along with the projection method, they demonstrate that the kernel entropy estimator satisfies a central limit theorem for short memory linear processes.
To study the Shannon entropy for dependent data is also a challenging problem, and to the best of our knowledge, general results for the Shannon entropy estimation of regular time series data are still unknown. In this paper, we study the Shannon entropy $S(f)$ for the one-sided linear process
$$X_n = \sum_{i=0}^{\infty} a_i \varepsilon_{n-i}, \quad n \in \mathbb{N}, \qquad (2)$$
where the innovations $\varepsilon_i$ are independent and identically distributed real-valued random variables on some probability space $(\Omega, \mathcal{F}, P)$ with mean zero and finite variance $\sigma_\varepsilon^2$, and where the collection $\{a_i\}_{i=0}^{\infty}$ of real coefficients is square summable. Additionally, we will require that the common density $f_\varepsilon(x)$ of the innovations be bounded. The estimator we utilize employs the kernel method, which was first introduced by Rosenblatt [12] and Parzen [10]. The kernel density estimator will be denoted by
$$f_n(x) = \frac{1}{n h_n} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h_n}\right), \qquad (3)$$
where the sequence $\{h_n\}_{n=1}^{\infty}$ provides the bandwidths, and $K : \mathbb{R} \to \mathbb{R}$ is the kernel function, which satisfies $\int_{\mathbb{R}} K(x)\,dx = 1$. Typically, the kernel function is a probability density function.
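As a concrete illustration (not part of the paper's development), the kernel density estimator in (3) can be sketched in a few lines of Python. The Gaussian kernel and the bandwidth below are illustrative assumptions; any kernel integrating to one may be substituted.

```python
import numpy as np

def kernel_density_estimate(x, data, h, kernel=None):
    """Kernel density estimator f_n(x) = (1/(n h)) * sum_i K((x - X_i)/h).

    `kernel` defaults to the Gaussian density; any K with integral 1 works.
    """
    if kernel is None:
        kernel = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    data = np.asarray(data, dtype=float)
    # broadcast the evaluation points against the sample
    u = (np.asarray(x, dtype=float)[..., None] - data) / h
    return kernel(u).sum(axis=-1) / (data.size * h)

# illustrative check on an i.i.d. N(0, 1) sample: f(0) = 1/sqrt(2*pi) ~ 0.399
rng = np.random.default_rng(0)
sample = rng.standard_normal(5000)
n = sample.size
h_n = (np.log(n) / n) ** 0.2          # bandwidth of order (n^{-1} log n)^{1/5}
estimate = kernel_density_estimate(0.0, sample, h_n)
```

For i.i.d. data this recovers the familiar pointwise consistency of $f_n$; the dependent case treated in this paper requires the machinery of Wu et al. [24] surveyed in the appendix.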
This method has proven to be successful in estimating probability density functions and their derivatives, regression functions, etc., in both the independent and the dependent setting. For the independent setting, see the books by Devroye and Györfi [4], Silverman [18], Nadaraya [9], Wand and Jones [23], Schimek [14], and Scott [15] and the references therein. For the dependent setting, we refer the reader to Tran [22], Honda [8], Wu and Mielniczuk [25], and Wu, Huang, and Huang [24]. Bandwidth selection is an important issue in kernel density estimation, and there is much research in this direction; see, e.g., Duin [6], Rudemo [12], and Slaoui [19,20].

A few remarks about the notation and terms used in the paper follow. Let $\{a_n\}_{n=1}^{\infty}$ and $\{b_n\}_{n=1}^{\infty}$ be real-valued sequences. By $a_n = o(b_n)$ we understand that $a_n/b_n \to 0$, and $a_n = O(b_n)$ means that $\limsup |a_n/b_n| < C$ for some positive number $C$; essentially, this is the standard Landau little-oh and big-oh notation. When we write $a_n \ll b_n$, we mean $a_n = o(b_n)$, and, as one might guess, $b_n \gg a_n$ means $a_n \ll b_n$. Also, we employ the notation $a_n \asymp b_n$ to indicate that $0 < \liminf_{n\to\infty} a_n/b_n \le \limsup_{n\to\infty} a_n/b_n < \infty$. A function $l : [0, \infty) \to \mathbb{R}$ is referred to as slowly varying (at $\infty$) if it is positive and measurable on $[A, \infty)$ for some $A \in \mathbb{R}^+$ such that $\lim_{x\to\infty} l(\lambda x)/l(x) = 1$ holds for each $\lambda \in \mathbb{R}^+$. The set of all functions $g : \mathbb{R} \to \mathbb{R}$ which are Hölder continuous of some order $r$ will be denoted by $C_r(\mathbb{R})$. That is, for each $g \in C_r(\mathbb{R})$ there exists $C_g \in \mathbb{R}^+$ such that for all $x, x' \in \mathbb{R}$ we have $|g(x) - g(x')| \le C_g |x - x'|^r$, and when $r = 1$, we recognize this as the well-known Lipschitz condition. The notation $L^p(E)$ with $0 < p < \infty$ represents the set of all real-valued functions $f$ defined on some measure space $(E, \mathcal{A}, \mu)$ having the property that $\int_E |f(x)|^p \,d\mu < \infty$.
In the case that $E = \mathbb{R}$, and unless otherwise specified, the measure $\mu$ is tacitly understood to be Lebesgue measure and $\mathcal{A}$ is assumed to contain the Borel sets. $L^\infty(E)$ refers to the set of real-valued functions defined on $E$ which are bounded almost everywhere. Whenever the domain space of the function is understood, we may simply write $L^p$.
The following are bandwidth, kernel, and density conditions that we shall refer to throughout this paper.
Notice that the bandwidth, kernel, and density conditions are prefixed using B, K, and D, respectively.
In this first section, we have provided an introduction to the problem, a survey of past research in this area, and the notation to be used throughout. The main results are reported in section two. In section three, we present the proofs of the main results. Finally, the appendix introduces the reader to foundational results which will be required in the proof of our main results.

Main Results
If $\{\varepsilon_i : i \in \mathbb{Z}\}$ is a sequence of independent and identically distributed random variables over a common probability space $(\Omega, \mathcal{F}, P)$ in $L^q(\Omega)$ for some $q > 0$, with $\mathbb{E}\,\varepsilon_i = 0$ when $q \ge 1$, and $\{a_i\}_{i=0}^{\infty}$ is a sequence of real coefficients such that $\sum_{i=0}^{\infty} |a_i|^{2 \wedge q} < \infty$, then the linear process $X_n$ given in (2) exists and is well-defined. For the case $q \ge 2$, where the innovations have finite variance, we say that the process has short memory (short range dependence) if $\sum_{i=0}^{\infty} |a_i| < \infty$ and $\sum_{i=0}^{\infty} a_i \neq 0$, and long memory (long range dependence) otherwise. Throughout, we assume that each $\varepsilon_i \in L^q$ with $q \ge 2$.
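For a concrete, hypothetical example of a short memory process, take geometric coefficients $a_i = \rho^i$ with $|\rho| < 1$: then $\sum_{i=0}^{\infty}|a_i| = \frac{1}{1-\rho} < \infty$ and $\sum_{i=0}^{\infty} a_i = \frac{1}{1-\rho} \neq 0$. A minimal Python sketch simulates such a process by truncating the infinite sum; the truncation point and the Gaussian innovation law are illustrative choices.

```python
import numpy as np

def linear_process(n, coeffs, rng):
    """Simulate X_t = sum_{i=0}^{m-1} a_i * eps_{t-i}, t = 1..n, with i.i.d.
    N(0, 1) innovations, truncating the one-sided sum at m = len(coeffs)."""
    m = len(coeffs)
    eps = rng.standard_normal(n + m - 1)          # all innovations needed
    # 'valid'-mode convolution forms the moving combination sum_i a_i eps_{t-i}
    return np.convolve(eps, coeffs, mode="valid")

rho = 0.5
coeffs = rho ** np.arange(50)     # a_i = rho^i: short memory, sum a_i = 2 != 0
X = linear_process(10_000, coeffs, np.random.default_rng(1))
# theoretical variance: sigma_eps^2 * sum_i a_i^2 = 1 / (1 - rho^2) = 4/3
```

The sample variance of `X` should be close to $\sum_i a_i^2 = \frac{1}{1-\rho^2} = \frac{4}{3}$, which offers a quick correctness check of the simulation.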
Let $f(x)$ be the probability density function of the linear process $X_n = \sum_{i=0}^{\infty} a_i \varepsilon_{n-i}$, $n \in \mathbb{N}$, defined in (2). In this paper, we estimate the Shannon entropy
$$S(f) = -\int f(x) \log f(x)\,dx$$
of the linear process. To do this, we employ the integral estimator
$$S_n(f) = -\int_{A_n} f_n(x) \log f_n(x)\,dx, \qquad (4)$$
where $f_n(x)$ is the standard kernel density estimator defined in (3). The (random) sets $A_n$ are given by
$$A_n = \{x \in \mathbb{R} : f_n(x) \ge \gamma_n\},$$
where $\{\gamma_n\}_{n=1}^{\infty}$ is an appropriately defined sequence in $\mathbb{R}^+$ that converges to zero. Our estimator utilizes the kernel method of density estimation, and we will accordingly require adherence of the kernel to certain conditions. In addition, we impose some conditions on the bandwidths and on some of the densities of the problem. These conditions were listed in the previous section. Based on these conditions, let us consider the properties of the estimator (4). We proceed in a manner similar to the analysis done by Bouzebda and Elhattab [3] for the independent case.
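To make the estimator (4) concrete, the sketch below (an illustration, not the paper's procedure) evaluates a Gaussian-kernel $f_n$ on a grid, restricts to the random set $A_n = \{x : f_n(x) \ge \gamma_n\}$, and approximates the integral by a Riemann sum; the sample size, bandwidth, grid, and threshold are all illustrative choices.

```python
import numpy as np

def integral_entropy_estimate(data, h, gamma, grid):
    """Integral estimator S_n(f) = -int_{A_n} f_n(x) log f_n(x) dx with
    A_n = {x : f_n(x) >= gamma}; the integral is a Riemann sum on `grid`."""
    u = (grid[:, None] - data) / h
    fn = np.exp(-0.5 * u**2).sum(axis=1) / (data.size * h * np.sqrt(2 * np.pi))
    mask = fn >= gamma                    # keep only the random set A_n
    dx = grid[1] - grid[0]
    return -np.sum(fn[mask] * np.log(fn[mask])) * dx

# illustration on i.i.d. N(0, 1) data, whose true entropy is
# S(f) = 0.5 * log(2*pi*e) ~ 1.4189
rng = np.random.default_rng(0)
data = rng.standard_normal(2000)
n = data.size
h_n = (np.log(n) / n) ** 0.2              # order (n^{-1} log n)^{1/5}
gamma_n = 0.01                            # illustrative small threshold
grid = np.linspace(-5.0, 5.0, 1001)
S_n = integral_entropy_estimate(data, h_n, gamma_n, grid)
```

The truncation to $A_n$ discards the low-density region where $\log f_n$ is unstable, which is exactly the role the sets $A_n$ play in the theorems below.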
Theorem 2.1 Let $\{X_n : n \in \mathbb{N}\}$ be the linear process given in (2), and assume that it has short memory. Furthermore, assume that $S(f)$ is finite. If the bandwidth, kernel, and density conditions listed earlier are satisfied, then
$$\left(\frac{n\,\gamma_n^5}{\log n}\right)^{2/5} \big| S_n(f) - S(f) \big|$$
is bounded almost surely whenever the condition $\gamma_n \gg h_n$ is imposed on the sequence $\{\gamma_n\}_{n=1}^{\infty}$.

Remark 2.1 In this paper, we work on the entropy estimation for short memory linear processes by applying the integral method. It is interesting to know whether similar results hold for long memory linear processes. It is also interesting to know whether the resubstitution method works for dependent data such as linear processes. However, research in these directions is beyond the scope of this paper, and we leave it for future work.

Remark 2.2
In a wide range of disciplines, including finance, geology, and engineering, many time series may be modeled using a linear process. In such instances, our result provides a method for estimating the associated Shannon entropy. One example is the discriminatory data on the arrival phases of earthquakes and explosions captured at a seismic recording station. Another example is the data on returns on the New York Stock Exchange. See these and many other time series data sets in the book by Shumway and Stoffer [17] and in other books on time series.
Proofs of the Main Results

Lemma 3.1 If the conditions of Theorem 2.1 hold, then
$$\sup_{x \in \mathbb{R}} \big| f_n(x) - f(x) \big| = O\!\left(\left(\frac{\log n}{n}\right)^{2/5}\right)$$
almost surely.
Proof. This lemma follows from Theorem 2 of Wu et al. [24] (see their discussion immediately after the statement of Theorem 2 and the penultimate paragraph of their Section 4.1). Also, see the discussion of fundamental results in the appendix.

Lemma 3.2 If the conditions of Theorem 2.1 (or 2.2) hold, then
$$\sqrt{\frac{\log n}{n h_n}} + h_n^2 = O\!\left(\left(\frac{\log n}{n}\right)^{2/5}\right).$$
Proof. Because $h_n \asymp (n^{-1}\log n)^{1/5}$, there exists $C \in \mathbb{R}^+$ such that
$$\sqrt{\frac{\log n}{n h_n}} \le C \left(\frac{\log n}{n}\right)^{2/5} \quad \text{and} \quad h_n^2 \le C \left(\frac{\log n}{n}\right)^{2/5}.$$

Note. Our use of Lemma 3.2 in the proofs of Theorems 2.1 and 2.2 will be tacit.
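As a quick numerical sanity check of Lemma 3.2 (an illustration only): with the exact choice $h_n = (n^{-1}\log n)^{1/5}$, both $\sqrt{\log n/(n h_n)}$ and $h_n^2$ reduce algebraically to $(\log n/n)^{2/5}$, so their ratios to that rate are identically one.

```python
import numpy as np

n = np.array([1e3, 1e5, 1e7, 1e9])
h = (np.log(n) / n) ** 0.2            # bandwidth of order (n^{-1} log n)^{1/5}
rate = (np.log(n) / n) ** 0.4         # claimed common order (log n / n)^{2/5}

# both components of the uniform error bound collapse to the same rate
assert np.allclose(np.sqrt(np.log(n) / (n * h)), rate)
assert np.allclose(h ** 2, rate)
```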

Lemma 3.3
If ν is a finite signed measure that is absolutely continuous with respect to a measure µ, then corresponding to every positive number ε there is a positive number δ such that |ν|(E) < ε whenever E is a measurable set for which µ(E) < δ.
Proof. This is a basic result from measure theory. See, for example, Theorem B of Halmos [7] in section 30.
Proof of Theorem 2.1. We begin with the decomposition $S_n(f) - S(f) = I_{n,1} + I_{n,2}$. First, we consider $I_{n,1}$. Using the inequality $|\log z| \le z + \frac{1}{z}$ for $z \in \mathbb{R}^+$ and the fact that $f_n(x) \ge \gamma_n$ for all $x \in A_n$, we have $|\log f_n(x)| \le f_n(x) + \frac{1}{\gamma_n}$ on $A_n$. It follows that $I_{n,1} = O\!\left(\gamma_n^{-2}\left(\frac{\log n}{n}\right)^{2/5}\right)$ almost surely, since $f_n(x)$ integrates to unity over the real line.
Next, we consider $I_{n,2}$. Since the set over which we are integrating may be changed to $A_n \cap \{x : f(x) > 0\}$ without affecting the value of $I_{n,2}$, we may assume that $f$ is positive on $A_n$. Using the inequality $\log z \le |z - 1| + |z^{-1} - 1|$ for $z \in \mathbb{R}^+$, we notice that for all $x \in A_n$ we have
$$\big| \log f_n(x) - \log f(x) \big| \le \big| f_n(x) - f(x) \big| \left( \frac{1}{f(x)} + \frac{1}{f_n(x)} \right) \le (C + 1)\, \frac{|f_n(x) - f(x)|}{f_n(x)},$$
if we can justify the existence of $C \in \mathbb{R}^+$ with $f_n(x)/f(x) \le C$ on $A_n$. To that end, note that for all $x \in A_n$ we have
$$f(x) \ge f_n(x) - \sup_{x \in \mathbb{R}} |f_n(x) - f(x)| \ge \gamma_n - \sup_{x \in \mathbb{R}} |f_n(x) - f(x)|.$$
Taking the supremum over $A_n$, together with Lemma 3.1 and the condition $\gamma_n \gg h_n$, yields a bound on $f_n(x)/f(x)$ that is finite almost surely for all large $n$. This guarantees the existence we sought to establish. We continue with $I_{n,2} = O\!\left(\gamma_n^{-1}\left(\frac{\log n}{n}\right)^{2/5}\right)$ almost surely, since $f_n(x)$ integrates to unity over the real line.
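The two elementary inequalities invoked above, $|\log z| \le z + \frac{1}{z}$ and $\log z \le |z - 1| + |z^{-1} - 1|$ for $z \in \mathbb{R}^+$, are standard; a quick numerical confirmation on a grid of positive reals (illustrative only, not a substitute for the proofs):

```python
import numpy as np

z = np.linspace(1e-3, 50.0, 200_000)   # a grid of positive reals

# |log z| <= z + 1/z  (used to bound |log f_n| on A_n)
assert np.all(np.abs(np.log(z)) <= z + 1.0 / z)

# log z <= |z - 1| + |1/z - 1|  (used to compare log f_n with log f)
assert np.all(np.log(z) <= np.abs(z - 1.0) + np.abs(1.0 / z - 1.0))
```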
In view of (9), (10), and (12), we have shown that
$$\big| S_n(f) - S(f) \big| = O\!\left( \gamma_n^{-2} \left(\frac{\log n}{n}\right)^{2/5} \right) \quad \text{almost surely.}$$
Therefore,
$$\limsup_{n \to \infty} \left( \frac{n\,\gamma_n^5}{\log n} \right)^{2/5} \big| S_n(f) - S(f) \big|$$
is bounded by a quantity that is constant almost surely, by Lemma 3.1 and since $\gamma_n \to 0$.
Proof of Corollary 2.1. By the triangle inequality, $|S_n(f) - S(f)| \le J_{n,1} + J_{n,2}$. Since $J_{n,1} \to 0$ almost surely by Theorem 2.1, we only need to contend with $J_{n,2}$; that is, we need to show that $J_{n,2} \to 0$ almost surely as $n \to \infty$.
For any Borel measurable set $E$, consider the positive and negative parts of $-f \log f$, and define the signed measure $\nu(E) = -\int_E f(x) \log f(x)\,dx = \nu^+(E) - \nu^-(E)$. Since $|S(f)| < \infty$, both $\nu^+$ and $\nu^-$ are finite measures, and thus $\nu$ is a finite signed measure that is absolutely continuous with respect to $P$. Because of Lemma 3.3, it suffices for us to demonstrate that $P(A_n^c) \to 0$ almost surely. For any $x \in A_n^c$, we have $f_n(x) < \gamma_n$. By Lemma 3.1, there exists $C \in \mathbb{R}^+$ such that
$$f(x) \le f_n(x) + |f_n(x) - f(x)| < \gamma_n + C \left(\frac{\log n}{n}\right)^{2/5}$$
almost surely, and hence we have shown that $A_n^c \subseteq B_n$ almost surely, where $B_n = \left\{ x : f(x) < \gamma_n + C \left(\frac{\log n}{n}\right)^{2/5} \right\}$.
It is easy to see that $P(B_n) \to 0$ almost surely, since $\gamma_n + C \left(\frac{\log n}{n}\right)^{2/5} \to 0$ as $n \to \infty$.
Proof of Corollary 2.2. Note the decomposition $|S_n(f) - S(f)| \le M_{n,1} + M_{n,2}$. By Theorem 2.2, $M_{n,1} \to 0$. Now, let $W_n = M_{n,2}$ and let $W$ denote its (deterministic) majorant. Recall from (15) in the proof of Corollary 2.1 that $W_n \to 0$ almost surely. Because $|S(f)| < \infty$, it follows that $W < \infty$, and moreover, $|W_n| \le W$. Hence, $M_{n,2} \to 0$ in $\mathcal{L}^2$ by the dominated convergence theorem.

Appendix
In the paper [24], Wu et al. establish results that are very useful in the proof section. Here, we briefly survey their definitions and results, which show that the kernel density estimator for one-sided linear processes enjoys properties similar to the independent case; see Stute [21]. Their work identifies conditions under which the kernel density estimator enjoys strong uniform consistency for a wide class of time series, including the linear process in (2).
As is common in the analysis of time series, we allude to an independent and identically distributed collection $\{\varepsilon_i : i \in \mathbb{Z}\}$ of random variables, typically referred to as the innovations. Note that many time series models fit the form $X_n = J(\ldots, \varepsilon_{n-1}, \varepsilon_n)$, which regards $X_n$ as a system dependent on the innovations; here $J$ is some measurable function, referred to as the filter. In this context, we also need to define the sigma algebras $\mathcal{F}_n = \sigma\{\varepsilon_n, \varepsilon_{n-1}, \ldots\}$, where $n \in \mathbb{Z}$. In addition, let $\varepsilon_0'$ be an independent and identically distributed copy of $\varepsilon_0$ which is, of course, independent of all the $\varepsilon_i$. For $n \ge 0$, define $\mathcal{F}_n^* = \sigma\{\varepsilon_n, \varepsilon_{n-1}, \ldots, \varepsilon_1, \varepsilon_0', \varepsilon_{-1}, \ldots\}$, and for $n < 0$, put $\mathcal{F}_n^* = \mathcal{F}_n$. Define the $l$-step ahead conditional distribution by
$$F_l(x \mid \mathcal{F}_k) = P(X_{k+l} \le x \mid \mathcal{F}_k),$$
where $l \in \mathbb{N}$ and $k \in \mathbb{Z}$. When it exists, the $l$-step ahead conditional density is $f_l(x \mid \mathcal{F}_k) = \frac{\partial}{\partial x} F_l(x \mid \mathcal{F}_k)$. As Wu et al. [24] note, a sufficient condition for the existence of a marginal density of the system is that $f_1(x \mid \mathcal{F}_0)$ exists and is uniformly bounded almost surely by some $M \in \mathbb{R}^+$; we shall refer to this as the marginal condition. The conditional distributions and densities with respect to $\mathcal{F}_k^*$ are defined similarly. With this setup, the authors introduce measures of the dependence present in the system: for $k \ge 0$, a pointwise measure of the difference between $f_1(x \mid \mathcal{F}_k)$ and $f_1(x \mid \mathcal{F}_k^*)$, an $L^2$-integral measure of this difference over $\mathbb{R}$, and an overall measure of difference, denoted $\Theta(n)$. The distances on the derivatives, with overall measure $\Psi(n)$, are defined similarly. With this setup, we can now report the following result of Wu et al. [24, Theorem 2].
Theorem 4.1 Assume that, for some positive $r$ and $s$, $K \in C_r(\mathbb{R})$ is a bounded function with bounded support and $X_n \in L^s$. Further, assume the marginal condition, and assume that $\Theta(n) + \Psi(n) = O(n^{\alpha}\,\tilde{l}(n))$, where $\alpha \ge 1$ and $\tilde{l}$ is a slowly varying function. If $\log n = o(n h_n)$, then
$$\sup_{x \in \mathbb{R}} \big| f_n(x) - \mathbb{E} f_n(x) \big| = O\!\left( \sqrt{\frac{\log n}{n h_n}} + n^{-\frac{1}{2}}\, l(n) \right),$$
where $l(n)$ is another slowly varying function.
Now consider our particular case, when the filter is the linear process of (2). In view of our assumption that the innovations have finite variance, and because we assume the coefficients are square summable, $X_n \in L^2$. Moreover, we assume all of the bandwidth, kernel, and density conditions listed earlier, from which it easily follows that the marginal condition is satisfied. For the short memory linear process (under the aforementioned assumptions), Wu et al. [24] demonstrated that $\Theta(n) + \Psi(n) = O(n)$. Also, notice that condition B.1 implies that $\log n = o(n h_n)$. Therefore, the theorem of Wu et al. [24] applies to (2).
In addition, under the kernel conditions K, the well-known Taylor series argument shows that the bias satisfies $\sup_{x \in \mathbb{R}} \big| \mathbb{E} f_n(x) - f(x) \big| = O(h_n^2)$. Basic methods of differential calculus show that $\sqrt{\frac{\log n}{n h_n}} + h_n^2$ is minimized when $h_n$ satisfies B.1. Indeed, the optimal value of $h_n$ has the exact order of $\left(\frac{\log n}{n}\right)^{1/5}$.
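The calculus can be sketched explicitly: with $g(h) = \sqrt{\frac{\log n}{n h}} + h^2$, setting $g'(h) = -\frac{1}{2}\sqrt{\frac{\log n}{n}}\,h^{-3/2} + 2h = 0$ gives $h^{5/2} = \frac{1}{4}\sqrt{\frac{\log n}{n}}$, i.e. $h^* = 4^{-2/5}\left(\frac{\log n}{n}\right)^{1/5}$, which has the stated order. A numerical check (illustrative only):

```python
import numpy as np

def argmin_bandwidth(n, grid_size=100_000):
    """Minimize g(h) = sqrt(log n / (n h)) + h^2 over a grid of bandwidths."""
    h = np.linspace(1e-4, 1.0, grid_size)
    g = np.sqrt(np.log(n) / (n * h)) + h ** 2
    return h[np.argmin(g)]

# closed form from the first-order condition: h* = 4^{-2/5} (log n / n)^{1/5}
for n in (10**4, 10**6, 10**8):
    h_star = 4.0 ** (-0.4) * (np.log(n) / n) ** 0.2
    assert abs(argmin_bandwidth(n) - h_star) < 1e-3
```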