Abstract
In this paper, we estimate the Shannon entropy H(f) = −∫ f(x) log f(x) dx of a one-sided linear process with probability density function f(x). We employ the integral estimator H_n, which utilizes the standard kernel density estimator f_n(x) of f(x). We show that H_n converges to H(f) almost surely and in L² under reasonable conditions.
1. Introduction
Let f be the common probability density function of a sequence of identically distributed observations. The associated Shannon entropy
H(f) = −∫_ℝ f(x) log f(x) dx   (1)
of such an observation was first introduced by Shannon (1948). In his 1948 paper, Shannon utilized this tool in his mathematical investigation of the theory of communication. Today, entropy is widely applied in the fields of information theory, statistical classification, pattern recognition and so on, since it is a measure of the amount of uncertainty present in a probability distribution.
In the literature, several estimators of the Shannon entropy have been introduced; see Beirlant et al. (1997) for an overview. Many of these estimators have been studied in the case where the data are independent. Ahmad and Lin (1976) obtained results using the resubstitution estimator −(1/n) Σ_{i=1}^n log f_n(X_i) for independent data X_1, …, X_n, where f_n is the kernel density estimator. In particular, they showed consistency in the first and second mean under certain regularity conditions. Dmitriev and Tarasenko (1973) reported results for estimating functionals of a density and its derivatives, where the common density of the independent observations is assumed to have at least k derivatives. Plugging kernel density estimators (see their paper and references therein) into the arguments of H and integrating only over a symmetric interval determined by a sequence of a certain order, they provided a result for the estimation of the Shannon entropy using the estimator that Beirlant et al. (1997) refer to as the integral estimator. Their results give conditions for almost sure convergence.
Interestingly enough, Dmitriev and Tarasenko (1973) also provided (because their work is a more general investigation of functionals) a result for the estimation of the quadratic Rényi entropy −log ∫ f²(x) dx. Conditions are given specifically for the almost sure convergence of their estimator to the true value. The estimation of Rényi entropy in the dependent case is challenging. A dependent case is treated by Sang et al. (2018), who studied the estimation of the quadratic Rényi entropy for the one-sided linear process. Utilizing the Fourier transform along with the projection method, they demonstrated that the kernel entropy estimator satisfies a central limit theorem for short memory linear processes.
Studying the Shannon entropy of dependent data is also a challenging problem, and to the best of our knowledge, general results for the Shannon entropy estimation of regular time series data are still unknown. In this paper, we study the Shannon entropy of the one-sided linear process
X_n = Σ_{i=0}^∞ a_i ε_{n−i},   (2)
where the innovations ε_i are independent and identically distributed real-valued random variables on some probability space with mean zero and finite variance, and where the collection of real coefficients {a_i} satisfies Σ_{i=0}^∞ a_i² < ∞. Additionally, we will require that the common density of the innovations be bounded. The estimator we utilize employs the kernel method, which was first introduced by Rosenblatt (1956) and Parzen (1962). The kernel estimator will be denoted by
f_n(x) = (1/(n h_n)) Σ_{i=1}^n K((x − X_i)/h_n),   (3)
where the sequence {h_n} provides the bandwidths, and K is the kernel function, which satisfies ∫_ℝ K(u) du = 1. Typically, the kernel function is a probability density function.
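As a concrete illustration, the sketch below (ours; the Gaussian kernel, the geometric coefficients a_i = 2^{−i}, the truncation lag, and the bandwidth h_n = n^{−1/5} are all assumptions made for this example) simulates a truncated short memory linear process and evaluates its kernel density estimate on a grid.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_process(n, coeffs, rng):
    """Simulate X_t = sum_i coeffs[i] * eps_{t-i}, truncated at
    len(coeffs) lags, with i.i.d. standard normal innovations."""
    eps = rng.standard_normal(n + len(coeffs) - 1)
    return np.convolve(eps, coeffs, mode="valid")  # length n

def kde(x, data, h):
    """Kernel density estimator f_n(x) with a Gaussian kernel."""
    u = (x[:, None] - data[None, :]) / h
    return (np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)).mean(axis=1) / h

n = 5000
coeffs = 0.5 ** np.arange(25)        # sum |a_i| < infinity: short memory
X = linear_process(n, coeffs, rng)
h_n = n ** (-1 / 5)                  # an assumed, commonly used bandwidth order
grid = np.linspace(-5.0, 5.0, 1001)
f_n = kde(grid, X, h_n)
```

Since the kernel is a probability density, f_n is nonnegative and integrates to (approximately, on a finite grid) one.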
This method has proven to be successful in estimating probability density functions and their derivatives, regression functions, etc., in both the independent and dependent setting. For the independent setting, see the books (Devroye and Györfi (1985); Silverman (1986); Nadaraya (1989); Wand and Jones (1995); Schimek (2000); Scott (2015)) and the references therein. For the dependent setting, we refer the reader to (Tran (1992); Honda (2000); Wu and Mielniczuk (2002); Wu et al. (2010)). Bandwidth selection is an important issue in kernel density estimation, and there is a lot of research in this direction. See, e.g., Duin (1976); Rudemo (1982); Slaoui (2014, 2018).
A few remarks about notation and terms used in the paper follow. Let (u_n) and (v_n) be real-valued sequences. By u_n = o(v_n) we understand that u_n/v_n → 0, and u_n = O(v_n) means that |u_n| ≤ C|v_n| for some positive number C. Essentially, this is the standard Landau little-oh and big-oh notation. When we write u_n ≲ v_n, we mean u_n = O(v_n), and, as one might guess, u_n ≳ v_n means v_n = O(u_n). We also employ the notation u_n ≍ v_n to indicate that u_n ≲ v_n ≲ u_n. A function ℓ is referred to as slowly varying (at ∞) if it is positive and measurable on [A, ∞) for some A > 0 such that lim_{x→∞} ℓ(λx)/ℓ(x) = 1 holds for each λ > 0. The set of all functions which are Hölder continuous of some order r ∈ (0, 1] will be denoted as C^r. That is, for each g ∈ C^r there exists a constant C > 0 such that for all x, y, we have |g(x) − g(y)| ≤ C|x − y|^r, and when r = 1, we recognize this as the well-known Lipschitz condition. The notation L^p(E) with p ≥ 1 represents the set of all real-valued functions f defined on some measure space (E, 𝒜, μ) having the property that ∫_E |f|^p dμ < ∞. In the case that E ⊆ ℝ, and unless otherwise specified, the measure μ is tacitly understood to be Lebesgue measure and 𝒜 is assumed to contain the Borel sets. L^∞(E) refers to the set of real-valued functions defined on E which are bounded almost everywhere. Whenever the domain space of the function is understood, we may simply write L^p.
The following are bandwidth, kernel, and density conditions that we shall refer to throughout this paper:
- B.1
- ;
- K.1
- for some is bounded with bounded support;
- K.2
- ;
- D.1
- ;
- D.2
- ;
- D.3
- .
Notice that the bandwidth, kernel, and density conditions are prefixed using B, K, and D, respectively.
In this first section, we have provided an introduction to the problem, a survey of past research in this area, and the notation used throughout. The main results are reported in Section 2. In Section 3, we present the proofs of the main results. Finally, Appendix A introduces the reader to foundational results that will be required in the proofs of our main results.
2. Main Results
If {ε_i} is a sequence of independent and identically distributed random variables over a common probability space, with each ε_i ∈ L² and E[ε_i] = 0, and {a_i} is a sequence of real coefficients such that Σ_{i=0}^∞ a_i² < ∞, then the linear process given in (2) exists and is well-defined. For the case where the innovations have finite variance, we say that the process has short memory (short-range dependence) if Σ_{i=0}^∞ |a_i| < ∞ and Σ_{i=0}^∞ a_i ≠ 0, and long memory (long-range dependence) otherwise.
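The dichotomy can be checked numerically. In this sketch (our illustration; the coefficient choices are assumptions), geometrically decaying coefficients have convergent absolute partial sums, while a_i = (i + 1)^{−0.7} is square-summable yet not absolutely summable, so the corresponding process has long memory.

```python
import numpy as np

i = np.arange(200000)
a_short = 0.5 ** i            # sum |a_i| = 2 < infinity: short memory
a_long = (i + 1.0) ** -0.7    # sum a_i^2 < infinity, but sum |a_i| diverges

abs_partial_short = np.cumsum(np.abs(a_short))
abs_partial_long = np.cumsum(np.abs(a_long))

# Geometric partial sums stabilize at 1 / (1 - 0.5) = 2, while the
# polynomial partial sums keep growing (like N^{0.3}) without bound.
sq_sum_long = float(np.sum(a_long**2))  # finite, so the process exists
```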
Let f be the probability density function of the linear process {X_n} defined in (2). In this paper, we estimate the Shannon entropy of the linear process. To do this, we employ the integral estimator
H_n = −∫_{A_n} f_n(x) log f_n(x) dx,   (4)
where f_n is the standard kernel density estimator defined in (3). The (random) sets A_n are given by
A_n = {x ∈ ℝ : f_n(x) ≥ γ_n},
where (γ_n) is an appropriately defined sequence in (0, ∞) that converges to zero.
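A minimal numerical sketch of the integral estimator follows (ours, for illustration; the Gaussian kernel, the i.i.d. N(0, 1) sample standing in for the linear process, the bandwidth, and the threshold sequence, denoted gamma_n here, are all assumptions). The domain of integration is restricted to the random set where the kernel density estimate exceeds the vanishing threshold.

```python
import numpy as np

rng = np.random.default_rng(1)

def kde(x, data, h):
    """Gaussian-kernel density estimator f_n evaluated on a grid."""
    u = (x[:, None] - data[None, :]) / h
    return (np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)).mean(axis=1) / h

def integral_entropy(data, h, gamma_n, grid):
    """H_n = -integral over A_n of f_n log f_n, where
    A_n = {x : f_n(x) >= gamma_n}, approximated on a grid."""
    f_n = kde(grid, data, h)
    dx = grid[1] - grid[0]
    on_A_n = f_n >= gamma_n          # the (random) set A_n
    return float(-np.sum(f_n[on_A_n] * np.log(f_n[on_A_n])) * dx)

n = 4000
X = rng.standard_normal(n)           # i.i.d. stand-in for the linear process
H_n = integral_entropy(X, h=n ** (-1 / 5), gamma_n=1e-3,
                       grid=np.linspace(-6.0, 6.0, 1201))
# For N(0, 1), the true entropy is 0.5 * log(2 * pi * e) ~ 1.4189,
# so H_n should land near that value.
```

Restricting to A_n keeps log f_n bounded on the integration region, which is what makes the estimator well-behaved where the density estimate is small.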
Our estimator utilizes the kernel method of density estimation, and we will accordingly require adherence of the kernel to certain conditions. In addition, we impose some conditions on the bandwidths and on some of the densities of the problem. These conditions were listed in the previous section. Based on these conditions, let us consider the properties of the estimator (4). We proceed in a manner similar to the analysis done by Bouzebda and Elhattab (2011) for the independent case.
Theorem 1.
Let {X_n} be the linear process given in (2), and assume that it has short memory. Furthermore, assume that the entropy H(f) is finite. If the bandwidth, kernel, and density conditions listed earlier are satisfied, then
is bounded almost surely whenever the condition is imposed on the sequence .
Corollary 1.
If the conditions of Theorem 1 hold, then H_n → H(f) almost surely.
Theorem 2.
Let {X_n} be the linear process given in (2), and assume that it has short memory. Furthermore, assume that the entropy H(f) is finite. If the bandwidth, kernel, and density conditions listed earlier are satisfied, then
is bounded whenever the condition is imposed on the sequence .
Corollary 2.
If the conditions of Theorem 2 hold, then the mean squared error (MSE) satisfies E[(H_n − H(f))²] → 0 as n → ∞.
Remark 1.
In this paper, we work on entropy estimation for short memory linear processes by applying the integral method. It would be interesting to know whether similar results hold for long memory linear processes, and whether the resubstitution method works for dependent data such as linear processes. However, research in these directions is beyond the scope of this paper, and we leave it for future work.
Remark 2.
In a wide range of disciplines, including finance, geology, and engineering, many time series may be modeled as a linear process. In such instances, our result provides a method for estimating the associated Shannon entropy. One example is the discriminatory data on the arrival phases of earthquakes and explosions recorded at a seismic recording station. Another is data on returns from the New York Stock Exchange. See these and many other time series data sets in the book by Shumway and Stoffer (2011) and other books on time series.
3. Proofs
Lemma 1.
If the conditions of Theorem 1 (or Theorem 2) hold, then
almost surely.
Proof.
This lemma follows from Theorem 2 of Wu et al. (2010) (see their discussion immediately after the statement of Theorem 2 and in the penultimate paragraph of section 4.1). See also the discussion in the Appendix A on fundamental results. □
Lemma 2.
If the conditions of Theorem 1 (or Theorem 2) hold, then
Proof.
Because , there exists such that
for sufficiently large n. Therefore,
as , from which (8) follows. □
Note. Our use of Lemma 2 in the proofs of Theorems 1 and 2 will be tacit.
Lemma 3.
If ν is a finite signed measure that is absolutely continuous with respect to a measure μ, then corresponding to every positive number ε there is a positive number δ such that |ν|(E) < ε whenever E is a measurable set for which μ(E) < δ.
Proof.
This is a basic result from measure theory. See, for example, Theorem B of Halmos (1974) in section 30. □
Proof of Theorem 1.
We begin with the decomposition
where
and
First, we consider . Using the inequality
for , we notice that for all , we have
It follows that
since the kernel K integrates to unity over the real line.
Next, we consider . Since the set over which we are integrating may be changed to without affecting the value of , we may assume that f is positive on . Using the inequality
for , we notice that for all , we have
if we can justify the existence of . To that end, define
and note that for all , we have
Taking the supremum over x yields
by Lemma 1. Note that
since
This guarantees the existence we sought to establish. We continue with
since the kernel K integrates to unity over the real line.
Therefore,
where the last expression is constant almost surely by Lemma 1 and since . □
Proof of Corollary 1.
By the triangle inequality
where
and
Since almost surely by Theorem 1, we only need to contend with . That is, we need to show that
almost surely as .
For any Borel measurable set E, consider
and define the signed measure
Since , both and are finite measures, and thus, is a finite signed measure that is absolutely continuous with respect to P. Because of Lemma 3, it suffices for us to demonstrate that
almost surely. For any , we have . By Lemma 1, there exists such that almost surely, and hence, we have shown that almost surely, where
It is easy to see that
almost surely, since as . □
Proof of Theorem 2.
We start with
Recall inequality (11) in the proof of Theorem 1. Arguing in a similar manner as before, we can demonstrate the existence of so that
Notice also that
Therefore,
from which the result follows. □
Proof of Corollary 2.
Note the decomposition
By Theorem 2, . Now, let
and
Recall from (15) in the proof of Corollary 1 that almost surely. Because , it follows that , and moreover, . Hence,
and the Lebesgue Dominated Convergence Theorem guarantees that
thereby proving the corollary. □
Author Contributions
Conceptualization, T.F. and H.S.; Methodology, T.F. and H.S.; Formal Analysis, T.F. and H.S.; Investigation, T.F. and H.S.; Writing—Original Draft Preparation, T.F.; Writing—Review & Editing, H.S.; Supervision, H.S.; Funding Acquisition, H.S. Both authors have read and agreed to the published version of the manuscript.
Funding
This research is supported in part by the Simons Foundation Grant 586789.
Acknowledgments
The authors are grateful to the referees and Daniel J. Henderson for carefully reading the paper and for insightful suggestions that significantly improved the presentation of the paper. The research is supported in part by the Simons Foundation Grant 586789 and the College of Liberal Arts Faculty Grants for Research and Creative Achievement at the University of Mississippi.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A
Wu et al. (2010) establish results that are very useful for our proofs. Here, we briefly survey their definitions and results, which show that the kernel density estimator for one-sided linear processes enjoys properties similar to the independent case (see Stute 1982). Their work identifies conditions under which the kernel density estimator enjoys strong uniform consistency for a wide class of time series, including the linear process in (2).
As is common in the analysis of time series, we allude to an independent and identically distributed collection of random variables {ε_i}, typically referred to as the innovations. Note that many time series models fit the form
X_n = J(ε_n, ε_{n−1}, …),   (A1)
which regards the X_n as a system dependent on the innovations. Here, J is some measurable function, referred to as the filter. In this context, we also need to define the sigma algebras
where . In addition, let be an independent and identical copy of which is, of course, independent of all the . For , define
and for , put .
Define the l-step ahead conditional distribution by
where and . When it exists, the l-step ahead conditional density is
As Wu et al. (2010) note, a sufficient condition for the existence of a marginal density of (A1) is that exists and is uniformly bounded almost surely by some . We shall refer to this as the marginal condition. Similarly, , where if and if . Also, .
With this setup, the authors introduce the following measures of the dependence present in the system (A1). Now, for , define a pointwise measure of difference by
and an -integral measure of difference over by
Finally, define an overall measure of difference by
The distances on the derivatives are defined similarly, as given below.
With this setup, we can now report the following result of (Wu et al. 2010, Theorem 2).
Theorem A1.
Assume that, for some positive r and s, we have that is a bounded function with bounded support and that . Further, assume the marginal condition, and assume that , where and where is a slowly varying function. If , then
where is another slowly varying function.
Now consider our particular case when the filter is the linear process of (2). In view of our assumption that the innovations have finite variance and because we assume the coefficients are square-summable, . Moreover, we assume all of the bandwidth, kernel, and density conditions listed earlier, from which it easily follows that the marginal condition is satisfied. For the short memory linear process (under the aforementioned assumptions), Wu et al. (2010) demonstrated that . Also, notice that condition B.1 implies that . Therefore, the theorem of Wu et al. (2010) applies to (2).
In addition, the well-known Taylor series argument under the conditions K.2 and K.3, as well as D.3, yields
so, collectively, we see that
Basic methods of differential calculus show that is minimized when satisfies B.1. Indeed, the optimum value of has the exact order of .
References
- Ahmad, Ibrahim, and Pi-Erh Lin. 1976. A nonparametric estimation of the entropy for absolutely continuous distributions. IEEE Transactions on Information Theory 22: 372–75.
- Beirlant, Jan, Edward J. Dudewicz, László Györfi, and Edward C. van der Meulen. 1997. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences 6: 17–39.
- Bouzebda, Salim, and Issam Elhattab. 2011. Uniform-in-bandwidth consistency for kernel-type estimators of Shannon’s entropy. Electronic Journal of Statistics 5: 440–59.
- Devroye, Luc, and László Györfi. 1985. Nonparametric Density Estimation: The L1 View. New York: Wiley.
- Dmitriev, Yu G., and Felix P. Tarasenko. 1973. On the estimation of functions of the probability density and its derivatives. Theory of Probability and Its Applications 18: 628–33.
- Duin, Robert P. W. 1976. On the choice of smoothing parameters of Parzen estimators of probability density functions. IEEE Transactions on Computers C-25: 1175–79.
- Halmos, Paul R. 1974. Measure Theory. New York: Springer.
- Honda, Toshio. 2000. Nonparametric density estimation for a long-range dependent linear process. Annals of the Institute of Statistical Mathematics 52: 599–611.
- Nadaraya, Elizbar Akakevič. 1989. Nonparametric Estimation of Probability Densities and Regression Curves. Dordrecht: Kluwer Academic Publishers.
- Parzen, Emanuel. 1962. On estimation of a probability density function and mode. Annals of Mathematical Statistics 33: 1065–76.
- Rosenblatt, Murray. 1956. Remarks on some nonparametric estimates of a density function. Annals of Mathematical Statistics 27: 832–37.
- Rudemo, Mats. 1982. Empirical choice of histograms and kernel density estimators. Scandinavian Journal of Statistics 9: 65–78.
- Sang, Hailin, Yongli Sang, and Fangjun Xu. 2018. Kernel entropy estimation for linear processes. Journal of Time Series Analysis 39: 563–91.
- Schimek, Michael G. 2000. Smoothing and Regression: Approaches, Computation, and Application. Hoboken: John Wiley & Sons.
- Scott, David W. 2015. Multivariate Density Estimation: Theory, Practice, and Visualization, 2nd ed. Hoboken: John Wiley & Sons.
- Shannon, Claude E. 1948. A mathematical theory of communication. Bell System Technical Journal 27: 379–423.
- Shumway, Robert H., and David S. Stoffer. 2011. Time Series Analysis and Its Applications, 3rd ed. New York: Springer.
- Silverman, Bernard W. 1986. Density Estimation for Statistics and Data Analysis. London: Chapman and Hall.
- Slaoui, Yousri. 2014. Bandwidth selection for recursive kernel density estimators defined by stochastic approximation method. Journal of Probability and Statistics 2014: 739640.
- Slaoui, Yousri. 2018. Bias reduction in kernel density estimation. Journal of Nonparametric Statistics 30: 505–22.
- Stute, Winfried. 1982. A law of the logarithm for kernel density estimators. Annals of Probability 10: 414–22.
- Tran, Lanh Tat. 1992. Kernel density estimation for linear processes. Stochastic Processes and their Applications 41: 281–96.
- Wand, Matt P., and M. Chris Jones. 1995. Kernel Smoothing. London: Chapman and Hall.
- Wu, Wei Biao, Yinxiao Huang, and Yibi Huang. 2010. Kernel estimation for time series: An asymptotic theory. Stochastic Processes and their Applications 120: 2412–31.
- Wu, Wei Biao, and Jan Mielniczuk. 2002. Kernel density estimation for linear processes. Annals of Statistics 30: 1441–59.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).