Abstract
We propose a definition of entropy for stochastic processes. We provide a reproducing kernel Hilbert space model to estimate entropy from a random sample of realizations of a stochastic process, namely functional data, and introduce two approaches to estimate minimum entropy sets. These sets are relevant for detecting anomalous or outlier functional data. A numerical experiment illustrates the performance of the proposed method; in addition, we analyze mortality rate curves as an application of functional anomaly detection in a real-data context.
1. Introduction
The family of α-entropies, originally proposed by Rényi [1], plays an important role in information theory and statistics. Consider a random variable Z distributed according to a measure F that admits a probability density function $f_Z$. Then, for $\alpha \ge 0$ and $\alpha \neq 1$, the α-entropy of Z is computed as follows:

$$H_\alpha(Z) = \frac{1}{1-\alpha}\,\log \int f_Z^{\alpha}(z)\,dz = \frac{1}{1-\alpha}\,\log\,\mathbb{E}_F\!\left[f_Z^{\alpha-1}(Z)\right], \qquad (1)$$
where $\mathbb{E}_F$ stands for the expected value with respect to the measure F. Several renowned entropy measures in the statistical literature are particular cases of the family of α-entropies. For instance, when $\alpha = 0$, we obtain the Hartley entropy; when $\alpha \to 1$, then $H_\alpha(Z)$ converges to the Shannon entropy; and when $\alpha \to \infty$, then $H_\alpha(Z)$ converges to the Min-entropy measure. The contribution of this paper is two-fold. Firstly, we propose a natural definition of entropy for stochastic processes that extends the previous one, together with a suitable sample estimator for the case where only partial realizations of the process are observed, the typical framework when dealing with functional data. We also show that Minimal Entropy Sets (MES), as formally defined in Section 3, are useful to solve anomaly detection problems, a common task in almost all data analysis contexts.
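As a quick numerical check of Equation (1) (our illustration, not part of the original text), the expectation form lends itself to a Monte Carlo plug-in; the Shannon case is recovered in the limit $\alpha \to 1$:

```python
import numpy as np
from scipy.stats import norm

def renyi_entropy_mc(sample, pdf, alpha):
    """Monte Carlo version of Eq. (1): H_alpha(Z) = log(E_F[f_Z^(alpha-1)(Z)]) / (1 - alpha)."""
    if np.isclose(alpha, 1.0):                 # Shannon limit: H_1 = -E_F[log f_Z(Z)]
        return -np.mean(np.log(pdf(sample)))
    return np.log(np.mean(pdf(sample) ** (alpha - 1.0))) / (1.0 - alpha)

rng = np.random.default_rng(0)
z = rng.normal(size=200_000)                   # Z ~ N(0, 1)
print(renyi_entropy_mc(z, norm.pdf, alpha=1))  # ~ 0.5*log(2*pi*e) = 1.4189 (Shannon)
print(renyi_entropy_mc(z, norm.pdf, alpha=2))  # ~ 0.5*log(4*pi)   = 1.2655 (Rényi-2)
```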
The paper is structured as follows: In Section 2, we introduce a definition of entropy for a stochastic process and suitable sample estimators for this measure. In Section 3, we show how to estimate minimum-entropy sets of a stochastic process in order to discover atypical functional data in a sample. Section 4 illustrates the theory with simulations and examples, and Section 5 concludes the work.
2. Entropy of a Stochastic Process
In this section, we extend the definition of entropy to a stochastic process. For the sequel, let $(\Omega, \mathcal{F}, P)$ be a probability space, where $\mathcal{F}$ is the σ-algebra on Ω and P a σ-finite measure. We consider random elements (functions) in a metric space $(E, d)$. As usual in the case of functional data, the realizations of the random elements are assumed to lie in $C(T)$, the space of real continuous functions on a compact domain T endowed with the uniform metric.
The first step is to consider a suitable representation for the stochastic process. We make use of the well-known Karhunen–Loève expansion [2] (p. 25, Theorem 1.5). Let $X = \{X_t,\ t\in T\}$ be a centered (zero-mean) stochastic process with continuous covariance function $K(s,t)$; then, there exists an orthonormal basis $\{\varphi_j\}_{j\ge1}$ of $L^2(T)$ such that, for all $t \in T$:

$$X_t = \sum_{j=1}^{\infty} Z_j\,\varphi_j(t), \qquad (2)$$

where the sequence of random coefficients $Z_j = \int_T X_t\,\varphi_j(t)\,dt$ comprises zero-mean random variables with (co)variance $\mathbb{E}[Z_j Z_k] = \delta_{jk}\lambda_j$, $\delta_{jk}$ being the Kronecker delta and $\{\lambda_j\}_{j\ge1}$ the sequence of eigenvalues associated with the eigenfunctions $\{\varphi_j\}_{j\ge1}$ of $K$.
The equality in Equation (2) must be understood in the mean square sense, that is:

$$\lim_{d\to\infty}\ \mathbb{E}\!\left[\Big(X_t - \sum_{j=1}^{d} Z_j\,\varphi_j(t)\Big)^{2}\right] = 0,$$

uniformly in T. Therefore, we can always consider a δ-near representation such that, for all $\delta > 0$ arbitrarily small, there exists an integer D such that, for $d \ge D$, $\mathbb{E}[(X_t - \sum_{j=1}^{d} Z_j\varphi_j(t))^2] < \delta$. From this result, it is possible to establish a suitable way to approximate the entropy of a random element X according to the distribution of the “representation coefficients” $(Z_1,\dots,Z_d)$ obtained from the expansion in Equation (2).
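For intuition (our sketch, not part of the original text), the integrated mean-square error of the d-truncated expansion equals the tail sum of the eigenvalues, so D can be read off the spectrum; the geometric decay below is purely illustrative:

```python
import numpy as np

def truncation_level(eigvals, delta):
    """Smallest d with tail eigenvalue sum below delta; the integrated
    mean-square error of the d-truncated KL expansion equals this tail sum."""
    lam = np.asarray(eigvals, dtype=float)
    tail = lam.sum() - np.cumsum(lam)          # tail[d-1] = sum_{j>d} lambda_j
    hit = np.nonzero(tail < delta)[0]
    return int(hit[0]) + 1 if hit.size else lam.size

# geometric decay, typical of a strongly autocorrelated process (illustrative):
lam = 0.7 ** np.arange(1, 51)
print(truncation_level(lam, delta=1e-3))       # a small d suffices
```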
Definition 1 (d-truncated entropy for stochastic processes).
Let X be a centered stochastic process with a continuous covariance function. Consider the truncation $X_t^{(d)} = \sum_{j=1}^{d} Z_j\,\varphi_j(t)$ and the random vector $Z^{(d)} = (Z_1,\dots,Z_d)$; then, the d-truncated entropy of X is defined as $H^{(d)}_\alpha(X) = H_\alpha(Z^{(d)})$.
The “approximation error” when computing the entropy of the stochastic process X with Definition 1 decreases monotonically with the number of terms retained in the Karhunen–Loève expansion, at a rate that depends on the decay of the spectrum of the covariance function $K$. In general, the more autocorrelated the process is, the more quickly the eigenvalues of $K$ converge to zero. In practical functional data applications (see, for instance, the mortality-rate curves in Section 4), the autocorrelation is usually strong, and the truncation parameter d will be small when approximating the entropy of the process. The next example illustrates the definition.
Example 1.
[Gaussian process] When X is a Gaussian Process (GP), the coefficients $Z_j$ in the Karhunen–Loève expansion have the further property that they are independent and zero-mean normally distributed random variables. Therefore, the Shannon entropy ($\alpha \to 1$) of X can be approximated with the truncated version of the GP as follows:

$$H_1^{(d)}(X) = \frac{1}{2}\,\log\!\big((2\pi e)^d\,|\Sigma|\big), \qquad (3)$$

where Σ is the diagonal covariance matrix with elements $\lambda_j$ for $j = 1,\dots,d$.
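In code (a minimal sketch of Equation (3), ours):

```python
import numpy as np

def gp_shannon_entropy(eigvals):
    """Eq. (3): H_1 of the d-truncated GP with Sigma = diag(lambda_1, ..., lambda_d)."""
    lam = np.asarray(eigvals, dtype=float)
    return 0.5 * (lam.size * np.log(2 * np.pi * np.e) + np.log(lam).sum())

print(gp_shannon_entropy([1.0, 0.5, 0.25]))    # entropy of a 3-term truncation
```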
In practice, we can only observe some realizations of the stochastic process X, and these observations are sparsely registered. Therefore, to estimate the entropy of X from a random sample of discrete realizations of a stochastic process, a first task is the representation of these paths by means of continuous functions. To this end, we consider a reproducing kernel Hilbert space $\mathcal{H}_K$ of functions, associated with a positive definite and symmetric kernel function $K$.
Estimating Entropy in a Reproducing Kernel Hilbert Space
Most functional data analysis approaches for representing raw data suggest proceeding as follows: (i) choose an orthogonal basis of functions $\{\varphi_j\}_{j\ge1}$, where each $\varphi_j$ belongs to a general function space $\mathcal{H}$; and (ii) represent each functional datum by means of a linear combination in the span of $\{\varphi_j\}_{j\ge1}$ [3,4]. Our choice is to consider $\mathcal{H}$ as a Reproducing Kernel Hilbert Space (RKHS) of functions [5]. In this case, the elements in the spanning set are the eigenfunctions associated with the positive-definite and symmetric kernel function $K$ that spans $\mathcal{H}_K$ [5] (Moore–Aronszajn Theorem, p. 19).
In our setting, the functional representation problem can be framed as follows: we have available m discrete observations, that is, a realization path $x = (x(t_1),\dots,x(t_m))$ of the stochastic element X. We also assume that the discrete path, as is usual when dealing with real data, contains zero-mean measurement errors. Then, the functional data estimator, denoted onwards as $\hat{x}$, is obtained by solving the following regularization problem:

$$\hat{x} = \arg\min_{f\in\mathcal{H}_K}\ \frac{1}{m}\sum_{i=1}^{m} V\big(x(t_i),\,f(t_i)\big) + \gamma\,\|f\|^2_{\mathcal{H}_K}, \qquad (4)$$

where V is a strictly convex functional with respect to the second argument, $\gamma > 0$ is a regularization parameter, frequently chosen by cross-validation, and $\|f\|^2_{\mathcal{H}_K}$ is a regularization term. By the representer theorem [6,7] (Theorem 5.2, p. 91; Proposition 8, p. 51), the solution of the problem stated in Equation (4) exists, is unique and admits a representation of the form:

$$\hat{x}(t) = \sum_{i=1}^{m} \alpha_i\,K(t, t_i). \qquad (5)$$
In the particular case of a squared loss function $V(x(t_i), f(t_i)) = (x(t_i)-f(t_i))^2$ and considering $f(t) = \sum_{i=1}^{m}\alpha_i K(t,t_i)$, the coefficients of the linear combination in Equation (5) are obtained by solving the following linear system:

$$(\mathbf{K} + \gamma\,m\,\mathbf{I})\,\boldsymbol{\alpha} = \mathbf{x}, \qquad (6)$$

where $\boldsymbol{\alpha} = (\alpha_1,\dots,\alpha_m)^\top$, $\mathbf{x} = (x(t_1),\dots,x(t_m))^\top$, $\mathbf{I}$ is the identity matrix of order m and $\mathbf{K}$ is the Gram matrix with the kernel evaluations, $\mathbf{K}_{ij} = K(t_i,t_j)$, for $i,j = 1,\dots,m$. To relate the Karhunen–Loève expansion in Equation (2) to the RKHS representation, we make use of Mercer's theorem [2] (Lemma 1.3, p. 24): $K(s,t) = \sum_{j\ge1}\lambda_j\,\varphi_j(s)\,\varphi_j(t)$, where $\lambda_j$ is the eigenvalue associated with the orthonormal eigenfunction $\varphi_j$ for $j\ge1$; invoking the reproducing property, then:

$$\langle \varphi_j,\,K(\cdot,t)\rangle_{\mathcal{H}_K} = \varphi_j(t), \quad \text{for all } j\ge1 \text{ and } t\in T.$$
Therefore, following Equation (2), the expansion of $\hat{x}$ in the basis $\{\varphi_j\}_{j\ge1}$ delivers the representation coefficients that play the role of the $Z_j$'s, and the connection is clearly established. When working with discrete realizations of a stochastic process, we must solve two sequential tasks: first, we need to represent raw data as functional data, and later find a truncated representation of the function. To this end, when combining Equation (5) with Mercer's theorem and the reproducing property, we obtain:

$$\hat{x}(t) = \sum_{i=1}^{m}\alpha_i\,K(t,t_i) = \sum_{j\ge1}\Big(\lambda_j\sum_{i=1}^{m}\alpha_i\,\varphi_j(t_i)\Big)\varphi_j(t) = \sum_{j\ge1} z_j\,\varphi_j(t),$$

and now, $z_j = \lambda_j\sum_{i=1}^{m}\alpha_i\varphi_j(t_i)$ is the realization of the random variable $Z_j$ for $j\ge1$; see [8] for further details. For some kernel functions, for instance the Gaussian kernel, the associated sequence of eigen-pairs $(\lambda_j,\varphi_j)_{j\ge1}$ is known [9] (p. 10), and we can obtain an explicit value $z_j$ for all $j\ge1$. If not, let $(\hat{\lambda}_j, \hat{v}_j)$ be the j-th eigen-pair associated with the kernel matrix $\mathbf{K}$; then $\hat{\lambda}_j/m$ and $\sqrt{m}\,\hat{v}_j$ approximate $\lambda_j$ and the evaluations $(\varphi_j(t_1),\dots,\varphi_j(t_m))$, respectively, for $j = 1,\dots,m$.
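A compact sketch of this two-step computation (ours; the Gaussian kernel, grid and parameter values are illustrative choices, and the kernel-matrix eigen-pairs stand in for $(\lambda_j, \varphi_j)$ as described above):

```python
import numpy as np

def representation_coefficients(x, t, sigma, gamma, d):
    """Coefficients z_j of one discrete path, following Eqs. (5)-(6) and the
    kernel-matrix eigen-pair surrogates for (lambda_j, phi_j)."""
    m = t.size
    K = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2.0 * sigma ** 2))
    alpha = np.linalg.solve(K + gamma * m * np.eye(m), x)   # Eq. (6)
    w, V = np.linalg.eigh(K)                                # ascending order
    w, V = w[::-1][:d], V[:, ::-1][:, :d]                   # leading d eigen-pairs
    lam_hat = w / m                                         # lambda_j ~ w_j / m
    phi_hat = np.sqrt(m) * V                                # phi_j(t_i) ~ sqrt(m) * V[i, j]
    return lam_hat * (phi_hat.T @ alpha)                    # z_j = lambda_j * sum_i alpha_i phi_j(t_i)

t = np.linspace(0, 1, 50)
x = np.sin(2 * np.pi * t) + 0.1 * np.random.default_rng(1).normal(size=50)
print(representation_coefficients(x, t, sigma=0.1, gamma=1e-3, d=5))
```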
In practice, given a sample of n discrete paths (realizations) of the stochastic process X, say $x_i = (x_i(t_1),\dots,x_i(t_m))$ for $i = 1,\dots,n$, a suitable input to estimate the entropy in Definition 1 is the set of multivariate vectors $z_i^{(d)} = (z_{i1},\dots,z_{id})$ for $i = 1,\dots,n$, as formally proposed in the next definition.
Definition 2 (K-entropy estimation of a stochastic process).
Let $x_i = (x_i(t_1),\dots,x_i(t_m))$ for $i = 1,\dots,n$ be a discrete random sample of X, and let $(\hat{\lambda}_j, \hat{v}_j)_{j=1}^{d}$ be the leading eigen-pairs of the kernel matrix $\mathbf{K}$, where $\mathbf{K}_{ij} = K(t_i,t_j)$. Consider the corresponding finite dimensional representation $Z^{(d)} = \{z_i^{(d)}\}_{i=1}^{n}$, where $z_i^{(d)} = (z_{i1},\dots,z_{id})$ for $i = 1,\dots,n$ and the coordinates $z_{ij}$, for $j = 1,\dots,d$, are computed as above with the eigen-pairs of $\mathbf{K}$ replacing $(\lambda_j, \varphi_j)$. Then, the estimated kernel entropy of X is defined as $\hat{H}_{K,\alpha}^{(d)}(X) = \hat{H}_\alpha(Z^{(d)})$.
In Definition 2, $\hat{H}_\alpha$ denotes the estimated entropy using the (finite dimensional) representation coefficients $Z^{(d)}$. In Section 3, we formally introduce two approaches to estimate entropy departing from $Z^{(d)}$. The next example illustrates the estimation procedure in the context of the GPs in Example 1.
Illustration with Example 1:
Consider 100 realizations of a GP as follows: 50 curves from $X_t = \sum_{j}\theta_j\,\phi_j(t)$ and another 50 curves from $Y_t = \sum_{j}\vartheta_j\,\phi_j(t)$, where $\{\phi_j\}_j$ is a Fourier basis in $L^2[0,1]$, and $\theta_j$ and $\vartheta_j$ are independent normally distributed random variables (r.v.) for all j.
In Figure 1 (left), we illustrate the realizations of the stochastic processes: in black (“—”), the sample paths of X, and in red (“—”), the paths corresponding to Y. In Figure 1 (right), we show the distribution of the linear combination coefficients corresponding to these paths. Following Example 1, we estimate the covariance matrices $\hat{\Sigma}_X$ and $\hat{\Sigma}_Y$ using the respective coefficients and plug each covariance matrix into the Shannon entropy expression in Equation (3) to obtain the estimated entropies $\hat{H}_1(X)$ and $\hat{H}_1(Y)$, close to the respective true entropies $H_1(X)$ and $H_1(Y)$. We formally propose the estimation procedure in Algorithm 1.
Figure 1.
Gaussian processes realizations on the left and coefficients for entropy estimation on the right. The sizes of the balls on the right are proportional to the determinants of $\hat{\Sigma}_X$ (in black) and $\hat{\Sigma}_Y$ (in red).
| Algorithm 1: Estimation of $\hat{H}_{K,\alpha}^{(d)}(X)$ from a sample of random paths. |
[Algorithm 1 (rendered as an image in the original): Lines 2–8 fit each discrete path in the RKHS (Equation (6)) and compute its truncated representation coefficients $z_i^{(d)}$ from the eigen-pairs of $\mathbf{K}$; the final lines plug the coefficients $\{z_i^{(d)}\}_{i=1}^{n}$ into the chosen entropy estimator.]
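The following Python sketch is our reading of that pipeline (the Gaussian kernel, the Gaussian/Shannon plug-in of Example 1 and all parameter names are assumptions, not the paper's exact implementation):

```python
import numpy as np

def algorithm1_entropy(X, t, sigma, gamma, d):
    """Sketch of Algorithm 1: Lines 2-8 map each discrete path to its truncated
    coefficient vector z_i^(d); the final lines plug the coefficients into an
    entropy estimator (here, the Gaussian/Shannon plug-in of Example 1).
    X: (n, m) array of discretely observed paths on the common grid t."""
    n, m = X.shape
    K = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2.0 * sigma ** 2))
    A = np.linalg.solve(K + gamma * m * np.eye(m), X.T).T    # row i: alpha_i (Eq. (6))
    w, V = np.linalg.eigh(K)
    w, V = w[::-1][:d], V[:, ::-1][:, :d]                    # leading d eigen-pairs of K
    Z = (A @ (np.sqrt(m) * V)) * (w / m)                     # z_ij, as in Section 2
    _, logdet = np.linalg.slogdet(np.cov(Z, rowvar=False))
    H1 = 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)       # Eq. (3), sample covariance
    return H1, Z
```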
The choice of the kernel parameters in Algorithm 1 is made by cross-validation. This ensures that the curve fitting method is asymptotically optimal. Nonetheless, although the selection of the kernel parameters affects the scale of the estimated entropy, the center-outward ordering induced by $\hat{H}_{K,\alpha}^{(d)}$, as formally proposed in the next section, is unaffected. In the Supplementary Material, we present relevant experimental results to illustrate this property, which makes the method robust with respect to the selection of the kernel and regularization parameters.
3. Minimum Entropy for Anomaly Detection
Anomaly detection is a common task in almost all data analysis contexts. The unsupervised approach considers a sample of random elements where most instances follow a well-defined pattern and a small proportion, here denoted by ν, present an abnormal pattern. In recent works (see, for instance, [10,11,12,13]), the authors propose depth measures and related methods to deal with functional outliers. In this section, we propose a novel criterion to tackle the problem of anomaly detection with functional data using the ideas and concepts developed in Section 2. For a real-valued d-dimensional random vector Z that admits a continuous density function $f_Z$, define $H_\alpha(A) = \frac{1}{1-\alpha}\log\int_A f_Z^{\alpha}(z)\,dz$ to be the entropy of the Borel set A with respect to the measure induced by $f_Z$. Then, the ν-Minimal-Entropy Set (MES) is formally defined as:

$$MES_\nu = \arg\min_{A\in\mathcal{B}(\mathbb{R}^d)}\ \Big\{ H_\alpha(A)\ \ \text{s.t.}\ \ P(Z\in A) \ge 1-\nu \Big\}. \qquad (7)$$
The $MES_\nu$ is equivalent [14,15] to a $(1-\nu)$-High Density Set (HDS) [16], formally defined as $S_{1-\nu} = \{z : f_Z(z) \ge c_\nu\}$, where $c_\nu$ is the largest constant such that $P(Z\in S_{1-\nu}) \ge 1-\nu$. Therefore, the complement of the MES is a suitable set to define outlier data in the sample, considering $\hat{x}_i$ with coefficients $z_i^{(d)} \notin MES_\nu$ as an atypical realization of X. Next, we give two approaches to estimate the MES.
3.1. Parametric Approach
Given a random sample of n discrete random paths $x_i$ for $i = 1,\dots,n$, we transform this sample into d-dimensional vectors $z_i^{(d)}$ using the representation and truncation method proposed in this work, numerically implemented in Lines 2–8 of Algorithm 1. Assume further that $f_\theta$ is a suitable probability model for the random sample $\{z_i^{(d)}\}_{i=1}^{n}$; then, we estimate by Robust Maximum Likelihood (RML) the parameters θ. For instance, in this paper, we consider $f_\theta$ to be the normal density, and then the RML estimated parameters are $\hat{\theta} = (\hat{\mu}, \hat{\Sigma})$, the robust mean vector and covariance matrix, respectively. For details on robust estimation, we refer to [17]. After the estimation of the distribution parameters, the computation of $H_\alpha$ follows by plugging the estimated density into Equation (1). Moreover, for the normal model, the estimated set is defined through the following expression:

$$\widehat{MES}_\nu = \Big\{ z\in\mathbb{R}^d : (z-\hat{\mu})^\top\hat{\Sigma}^{-1}(z-\hat{\mu}) \le \chi^2_{d,1-\nu} \Big\},$$

where $\chi^2_{d,1-\nu}$ is the $(1-\nu)$-quantile of a Chi-square distribution with d degrees of freedom. Then, if the coefficient vector $z_i^{(d)}$, representing $\hat{x}_i$, lies outside this ellipsoid, we say that the functional datum is atypical. When the proportion of outliers in the sample is known a priori, the $(1-\nu)$-quantile of the Chi-square distribution can be replaced by the corresponding sample Mahalanobis distance quantile, as is the case in Section 4.1.
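A sketch of this parametric rule (ours; `MinCovDet` is one robust estimator of $(\hat{\mu}, \hat{\Sigma})$, used here for illustration and possibly different from the paper's exact RML choice):

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def parametric_mes_flags(Z, nu):
    """Flag coefficient vectors outside the estimated MES ellipsoid.
    Z: (n, d) representation coefficients; returns True for atypical curves."""
    n, d = Z.shape
    mcd = MinCovDet(random_state=0).fit(Z)      # robust location and covariance
    md2 = mcd.mahalanobis(Z)                    # squared robust Mahalanobis distances
    return md2 > chi2.ppf(1 - nu, df=d)         # outside the (1-nu) chi-square ellipsoid
```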
3.2. Non-Parametric Approach
The following are definitions to introduce further non-parametric estimation methods. For the random vector Z distributed according to $f_Z$, let $B_\delta(z)$ be the z-centered ball with radius $r_\delta(z)$ that fulfills the condition $P(Z\in B_\delta(z)) = \delta$; then, the δ-neighbors of the point z comprise the open set $B_\delta(z)$.
Definition 3 (δ-local α-entropy).
Let $Z\sim f_Z$, $\delta\in(0,1)$, and $\alpha\ge0$ with $\alpha\neq1$; the δ-local α-entropy of the r.v. Z at the point z is:

$$H_{\alpha,\delta}(z) = \frac{1}{1-\alpha}\,\log\int_{B_\delta(z)} f_Z^{\alpha}(u)\,du.$$
Under mild regularity conditions on $f_Z$, the local entropy measure is a suitable metric to characterize the degree of abnormality of every point in the support of $f_Z$. Several natural estimators of local entropy measures can be considered, for instance the (average) distance from the point to its k-th nearest neighbor. We estimate the MES by combining the estimated δ-local α-entropies. As in the parametric case, let $x_i$ for $i = 1,\dots,n$ be a random sample of n discrete random paths; we transform this sample into d-dimensional vectors $z_i^{(d)}$ following Lines 2–8 of Algorithm 1. Next, we estimate the local entropy for these data using the estimator $\gamma_k(z_i^{(d)})$, the (average) distance from $z_i^{(d)}$ to its k-th nearest neighbor [18], and then estimate $MES_\nu$ by solving the following optimization problem:

$$\min_{\rho,\,\xi}\ \ \nu\rho + \frac{1}{n}\sum_{i=1}^{n}\xi_i \quad \text{s.t.}\quad \xi_i \ge \gamma_k(z_i^{(d)}) - \rho,\ \ \xi_i \ge 0,\ \ i = 1,\dots,n. \qquad (8)$$
The solution to this problem, $\rho^*$, leads to the following decision function:

$$h(z) = \operatorname{sign}\big(\rho^* - \gamma_k(z)\big),$$

where $h(z_i^{(d)}) = +1$ if $z_i^{(d)}$ corresponds to the proportion $1-\nu$ of curves projected near the origin, that is, the set of curves that belongs to a low-entropy (high-density) set. The following theorem shows that, as the number of available curves increases, the estimation method asymptotically detects the proportion of curves belonging to the $MES_\nu$.
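Before stating the result, a sketch of the resulting decision rule (ours), exploiting the quantile characterization of $\rho^*$ established below rather than solving the linear program explicitly:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nonparametric_mes_flags(Z, nu, k=10):
    """Flag curves via the local-entropy surrogate gamma_k (here, the average
    distance to the k nearest neighbors); rho* is the (1-nu) empirical
    quantile of the gamma_k values. Returns True for atypical curves."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(Z)
    dist, _ = nn.kneighbors(Z)                  # column 0 is the point itself
    gamma = dist[:, 1:].mean(axis=1)            # gamma_k(z_i)
    rho = np.quantile(gamma, 1 - nu)
    return gamma > rho                          # h(z) = -1 (atypical) when gamma > rho*
```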
Theorem 1.
At the solution $\rho^*$ of the optimization problem stated in Equation (8), the following equality holds:

$$\frac{1}{n}\sum_{i=1}^{n} I\big(\gamma_k(z_i^{(d)}) - \rho^*\big) = \frac{\lfloor \nu n\rfloor}{n},$$

where $I(u) = 1$ if $u > 0$ and $I(u) = 0$ otherwise.
4. Experimental Section
The aim of this section is to illustrate the performance of the proposed methodology to detect abnormal observations in a sample of functional data. In what follows, for the representation of functional data, we consider the Gaussian kernel function $K(s,t) = \exp\!\big(-(s-t)^2/(2\sigma^2)\big)$. The kernel parameter σ and the regularization coefficient γ in Algorithm 1 were chosen through cross-validation.
4.1. Simulation Analysis
In a Monte Carlo study, we investigate the performance of the proposed method over three data configurations (Scenarios A, B and C). Specifically, we consider the following generating processes: a fraction $1-\nu$ of the curves are realizations of the following stochastic model:

$$x_i(t) = \sum_{j}\theta_{ij}\,\phi_j(t) + \varepsilon_i(t),$$

where $\theta_i$ is a normally-distributed multivariate random variable with mean μ and diagonal covariance matrix Σ, $\{\phi_j\}_j$ is a fixed basis of functions and the $\varepsilon_i$ are independent autocorrelated random error functions.
The remaining proportion of the data, ν, comprises outliers that contaminate the sample according to the following typical scenarios (see [19]):
- (A)
- Magnitude outliers: where $\theta_i$ is a normally-distributed multivariate r.v. with parameters $\mu_A$ and $\Sigma_A$.
- (B)
- Shape outliers: where $\theta_i$ is a normally-distributed multivariate r.v. with parameters $\mu_B$ and $\Sigma_B$.
- (C)
- A combination considering outliers from Scenario A jointly with outliers from Scenario B (a toy generator sketch follows this list).
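A toy generator in the spirit of these scenarios (ours; the basis, the mean shift, the variance inflation and the AR(1) error structure are all hypothetical placeholders, since the original configuration is not fully specified here):

```python
import numpy as np

def simulate_scenario(n=400, m=100, nu=0.05, scenario="A", seed=0):
    """Toy generator for Scenarios A-C; every numeric value is a hypothetical
    placeholder, not the paper's actual configuration."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0, 1, m)
    J = 4                                                   # Fourier terms (assumed)
    basis = np.array([np.sin(2 * np.pi * (j + 1) * t) for j in range(J)])  # (J, m)
    n_out = int(nu * n)
    theta = rng.normal(0.0, 1.0, size=(n, J))               # regular coefficients
    if scenario in ("A", "C"):                              # magnitude: mean shift
        theta[:n_out // (2 if scenario == "C" else 1)] += 2.0
    if scenario in ("B", "C"):                              # shape: inflated variance
        k = n_out // (2 if scenario == "C" else 1)
        theta[n_out - k:n_out] *= 3.0
    eps = np.zeros((n, m))                                  # AR(1) errors (assumed)
    innov = rng.normal(0.0, 0.1, size=(n, m))
    for i in range(1, m):
        eps[:, i] = 0.8 * eps[:, i - 1] + innov[:, i]
    labels = np.zeros(n, dtype=bool)
    labels[:n_out] = True                                   # True = outlier
    return t, theta @ basis + eps, labels
```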
To illustrate the generating process, in Figure 2, we show one instance of the simulated paths in Scenario C. We test our Parametric entropy (PA) and Non-Parametric entropy (NPA) methods against several well-known depth measures for functional anomaly detection, namely the Modified Band Depth (MBD), the H-Mode Depth (HMD), the Random Tukey Depth (RTD) and the Functional Spatial Depth (FSD) (see [10,11,12,13], respectively), already implemented in the R-package fda.usc [20]. For this experiment, the values of the parameter ν are assumed known in each scenario. With respect to the parameters σ and γ in Algorithm 1, in this simulation exercise, we chose them with a 10-fold cross-validation procedure using a single set of data, corresponding to the first instance of the simulations; the selected reference values remain fixed throughout the simulation exercise.
Figure 2.
(Left) Raw data: 400 curves corresponding to Scenario C. (Right) Functional data: in black (“—”), the sample of regular paths, and the abnormal curves in red (“—”).
Let P and N be the number of outlier and normal data in the sample, respectively, and let TP (True Positives) and TN (True Negatives) be the respective quantities detected by the different methods. In Table 1, we report the following average metrics: TPR = TP/P (True Positive Rate, or sensitivity), TNR = TN/N (True Negative Rate, or specificity) and the area under the ROC curve (aROC) of each method, obtained through the replications of the Monte Carlo study.
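These metrics can be computed as follows (our sketch; `score` is any anomaly score, e.g., the robust Mahalanobis distance or $\gamma_k$):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def detection_metrics(is_outlier, flagged, score):
    """TPR, TNR and aROC as reported in Table 1 (score: larger = more anomalous)."""
    P, N = is_outlier.sum(), (~is_outlier).sum()
    tpr = (flagged & is_outlier).sum() / P            # sensitivity
    tnr = (~flagged & ~is_outlier).sum() / N          # specificity
    aroc = roc_auc_score(is_outlier, score)
    return tpr, tnr, aroc
```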
Table 1.
Simulation analysis: scenarios and contamination percentages in columns; in rows, different methods with average sensitivities, specificities and areas under the ROC curve (aROC), the last on a scale of 10². The corresponding standard errors are reported in parentheses.
As can be seen, the PA and NPA entropy methods proposed in this article outperform the other recently proposed depth measures in the three scenarios considered in the experiments at one of the two contamination levels studied. At the remaining level, PA and NPA still outperform the other methods; however, the standard errors are too large to confirm a significant difference between the methods.
When we compare among the proposed methods, the parametric approach seems to be slightly (but consistently) more effective than the non-parametric approach in Scenario A. For Scenarios B and C, both methods provide similar results. It is important to remark that the PA method is especially adequate for Gaussian data, while the NPA method does not assume any distributional hypothesis on the data. In this sense, the simulation results show the robustness of the non-parametric approach even when competing with parametric methods designed for specific distributions.
4.2. Outliers in the Context of Mortality-Rate Curve Analysis
We consider the French mortality rates database, available in the R-package demography [21], to study age-specific male death rates on a logarithmic scale. In Figure 3 (left), each curve corresponds to one year from 1901–2006 (106 paths in total) and accounts for the number of deaths per 1000 of the mean population in the age group in question (from 0–101 years). As expected, for low-age cohorts (until approximately 12 years), the mortality rates present a decreasing trend and then start to grow until late ages, where all cohorts achieve a 100% mortality rate.
Figure 3.
French mortality data: On the left, the regular curves in black (“—”) and the detected outliers in red (“—”). On the right, the first two principal components of the kernel eigenfunctions; the area inside the dotted blue ellipse (- -) corresponds to the PA estimation of the $\widehat{MES}_\nu$ and the region inside the convex hull in blue (—) to the NPA estimation. The regular curves, represented with black dots (•), lie inside the $\widehat{MES}_\nu$, and the detected outliers, marked with a red asterisk (∗), lie outside of it.
For some years, the evolution pattern of mortality presents an atypical behavior, mostly coinciding with the first and second World Wars, jointly with the influenza pandemic episode that took place in 1919.
In this experiment, we do not know a priori the proportion of atypical curves. Therefore, after having conducted inference over a wide range of values for ν, as a way to assess the sensitivity and reliability of the inference when determining the number of abnormal curves, we decided to fix a single value of ν. For further details on the way to choose this parameter (and an extended sensitivity analysis on the values of ν), please refer to §3.2 in the Supplementary Material. In Figure 3 (left), we highlight in red the anomalous curves detected with both the entropy-PA and NPA methods, corresponding to the years 1914–1919 and 1940, 1942–1945, which match the periods when men (between 20 and 40 years old) participated in World Wars I and II. In Figure 3 (right), we use the first two principal components of the kernel eigenfunctions to project the representation coefficients $z_i^{(d)}$ in two dimensions. As can be seen, the points lying outside the $\widehat{MES}_\nu$, represented by the dotted blue ellipse when estimated with PA (- -) and by the convex hull with a continuous blue line (—) when estimated with NPA, correspond to the atypical curves in the sample.
5. Discussion
In this article, we propose a definition of entropy for stochastic processes. We provide a reproducing kernel Hilbert space model to estimate entropy from a random sample of realizations of a stochastic process, namely functional data, and introduce two approaches to estimate minimum entropy sets for functional anomaly detection.
In the experimental section, the Monte Carlo simulation illustrates the adequacy of the proposed method in the context of magnitude and shape outliers, outperforming other state-of-the-art methods for functional anomaly detection. In the study of French mortality rates, the parametric and non-parametric approaches for minimum entropy set estimation show their adequacy to capture anomalous curves, principally associated with the First and Second World Wars and the influenza episode in 1919.
Regardless of the results presented in the paper, how widely the method can be used in practice, especially with noisier data, is an open question. In this sense, as future work, we will consider testing the performance of the proposed method in other scenarios with different noise assumptions on the observations. Another natural extension for future work entails the study of the asymptotic properties of the estimators. The extension of the proposed method from stochastic processes to random fields, useful in several statistical and information science areas, seems straightforward, but a wide range of simulations and numerical experiments must be conducted in order to stress the performance of entropy methods in comparison to other techniques when dealing with abnormal fields. Another natural avenue for future work entails the study of the connections between the entropy of stochastic processes, as formally defined here, and the maximum entropy principle when estimating the governing parameters of Gaussian processes.
Supplementary Materials
The following are available online at www.mdpi.com/1099-4300/20/1/33/s1.
Acknowledgments
We thank the referees and the editor for constructive comments and insightful recommendations. This work has been supported by CONICET Argentina Project 20020150200110BA, the Spanish Ministry of Economy and Competitiveness Projects ECO2015–66593-P, GROMA(MTM2015-63710-P), PPI (RTC-2015-3580-7) and UNIKO(RTC-2015-3521-7) and the “methaodos.org” research group at URJC.
Author Contributions
All authors have contributed equally to the paper.
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| RML | Robust Maximum Likelihood. |
| MES and HDS | Minimum Entropy and High Density Sets, respectively. |
| PA and NPA | Parametric and Non-Parametric approaches. |
| MBD, HMD, RTD, FSD | Modified Band, H-Mode, Random Tukey and Functional Spatial Depths. |
Appendix A
Proof of Theorem 1.
Consider the following optimization problem:

$$\max_{\xi}\ \sum_{i=1}^{n}\xi_i\,\gamma_k(z_i^{(d)}) \quad \text{s.t.}\quad \sum_{i=1}^{n}\xi_i = \nu n,\ \ 0\le\xi_i\le1,\ \ i = 1,\dots,n. \qquad (A1)$$
For the sake of simplicity, consider first the case where $\nu n\in\mathbb{N}$. Let $\rho^*$ be the $(1-\nu)$-quantile of the sample $\{\gamma_k(z_i^{(d)})\}_{i=1}^{n}$. Then, it can be shown that $\xi_i = 1$ if $\gamma_k(z_i^{(d)}) > \rho^*$ and $\xi_i = 0$ if $\gamma_k(z_i^{(d)}) < \rho^*$ is a solution for the problem stated in Equation (A1). As a consequence:

$$\sum_{i=1}^{n} I\big(\gamma_k(z_i^{(d)}) - \rho^*\big) = \sum_{i=1}^{n}\xi_i.$$

From the constraint in Equation (A1), it holds that $\sum_{i=1}^{n}\xi_i = \nu n$, and then:

$$\frac{1}{n}\sum_{i=1}^{n} I\big(\gamma_k(z_i^{(d)}) - \rho^*\big) = \nu = \frac{\lfloor\nu n\rfloor}{n}.$$
For the case $\nu n\notin\mathbb{N}$, it holds that

$$\sum_{i=1}^{n}\xi_i = \lfloor\nu n\rfloor + \big(\nu n - \lfloor\nu n\rfloor\big),$$

where $\lfloor x\rfloor$ stands for the largest integer no greater than x. Therefore, the number of $\xi_i$'s equating to one is $\lfloor\nu n\rfloor$ and:

$$\frac{1}{n}\sum_{i=1}^{n} I\big(\gamma_k(z_i^{(d)}) - \rho^*\big) = \frac{\lfloor\nu n\rfloor}{n}.$$
Finally, we show that $\rho^*$ solves Problem (8). The dual problem of (A1) is:

$$\min_{\rho,\,u}\ \ \nu n\,\rho + \sum_{i=1}^{n} u_i \quad \text{s.t.}\quad u_i \ge \gamma_k(z_i^{(d)}) - \rho,\ \ u_i \ge 0,\ \ i = 1,\dots,n. \qquad (A2)$$

By the fundamental theorem of duality, the objective functions of the problems stated in Equations (A1) and (A2) take the same value at their solutions, and as a consequence, the dual is solved at $\rho = \rho^*$ (see [22]). Since Problem (A2) differs from Problem (8) just in the scaling of the objective function, it holds that $\rho^*$ is also the solution of Problem (8), which concludes the proof. ☐
References
- Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics; University of California Press: Berkeley, CA, USA, 1961; pp. 547–561. [Google Scholar]
- Bosq, D. Linear Processes in Function Spaces: Theory and Applications; Springer Science & Business Media: New York, NY, USA, 2012. [Google Scholar]
- Ramsay, J.O. Functional Data Analysis; Wiley: New York, NY, USA, 2006. [Google Scholar]
- Ferraty, F.; Vieu, P. Nonparametric Functional Data Analysis: Theory and Practice; Springer: New York, NY, USA, 2006. [Google Scholar]
- Berlinet, A.; Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics; Springer: New York, NY, USA, 2011. [Google Scholar]
- Kimeldorf, G.; Wahba, G. Some results on Tchebycheffian spline functions. J. Math. Anal. Appl. 1971, 33, 82–94. [Google Scholar] [CrossRef]
- Cucker, F.; Smale, S. On the mathematical foundations of learning. Bull. Am. Math. Soc. 2002, 39, 1–49. [Google Scholar] [CrossRef]
- Muñoz, A.; González, J. Representing functional data using support vector machines. Pattern Recognit. Lett. 2010, 31, 511–516. [Google Scholar]
- Zhu, H.; Williams, C.; Rohwer, R.; Morciniec, M. Gaussian Regression and Optimal Finite Dimensional Linear Models; Aston University: Birmingham, UK, 1997. [Google Scholar]
- López-Pintado, S.; Romo, J. On the concept of depth for functional data. J. Am. Stat. Assoc. 2009, 104, 718–734. [Google Scholar] [CrossRef]
- Cuevas, A.; Febrero, M.; Fraiman, R. Robust estimation and classification for functional data via projection-based depth notions. Comput. Stat. 2007, 22, 481–496. [Google Scholar] [CrossRef]
- Sguera, C.; Galeano, P.; Lillo, R. Spatial depth-based classification for functional data. Test 2014, 23, 725–750. [Google Scholar] [CrossRef]
- Cuesta-Albertos, J.A.; Nieto-Reyes, A. The random Tukey depth. Comput. Stat. Data Anal. 2008, 52, 4979–4988. [Google Scholar] [CrossRef]
- Hero, A. Geometric entropy minimization (GEM) for anomaly detection and localization. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 3–6 December 2007; pp. 585–592. [Google Scholar]
- Xie, T.; Nasrabadi, N.M.; Hero, A.O. Robust training on approximated minimal-entropy set. arXiv 2016, arXiv:1610.06806. [Google Scholar]
- Hyndman, R.J. Computing and graphing highest density regions. Am. Stat. 1996, 50, 120–126. [Google Scholar]
- Maronna, R.; Martin, R.; Yohai, V. Robust Statistics; John Wiley & Sons: Hoboken, NJ, USA, 2006. [Google Scholar]
- Beirlant, J.; Dudewicz, E.; Györfi, L.; Van der Meulen, E. Nonparametric entropy estimation: An overview. Int. J. Math. Stat. Sci. 1997, 6, 17–39. [Google Scholar]
- Cano, J.; Moguerza, J.M.; Psarakis, S.; Yannacopoulos, A.N. Using statistical shape theory for the monitoring of nonlinear profiles. Appl. Stoch. Models Bus. Ind. 2015, 31, 160–177. [Google Scholar] [CrossRef]
- Febrero-Bande, M.; De la Fuente, M.O. Statistical computing in functional data analysis: The R package fda.usc. J. Stat. Softw. 2012, 51, 1–28. [Google Scholar] [CrossRef]
- Hyndman, R.J. demography: Forecasting Mortality, Fertility, Migration and Population Data; R Package; R Foundation for Statistical Computing: Vienna, Austria, 2017. [Google Scholar]
- Muñoz, A.; Moguerza, J.M. Estimation of high-density regions using one-class neighbor machines. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 476–480. [Google Scholar] [CrossRef] [PubMed]
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
