Entropy Measures for Stochastic Processes with Applications in Functional Anomaly Detection

We propose a definition of entropy for stochastic processes. We provide a reproducing kernel Hilbert space model to estimate entropy from a random sample of realizations of a stochastic process, namely functional data, and introduce two approaches to estimate minimum entropy sets. These sets are relevant to detect anomalous or outlier functional data. A numerical experiment illustrates the performance of the proposed method; in addition, we conduct an analysis of mortality rate curves as an interesting application in a real-data context to explore functional anomaly detection.


The entropy measure of a stochastic process is a 'K-entropy', which means that the estimated entropy depends on the choice of a particular kernel. This raises a question: is the order of the sampled curves (from the deepest to the least deep) induced by the entropy measure invariant to changes in the kernel function? We show numerically that the order induced by the entropy does not depend on the kernel function (or its parameters) used to represent the functional data at hand. To illustrate this, we constructed an experiment considering Scenario A in Section 4.1 of the paper with $n = 1000$ and $\nu = 0.05$. As the aim of this section is to show the order invariance property, we consider two different kernel functions and different parameters, namely:

i) The Gaussian kernel function: $K_G(t_l, t_k) = e^{-\sigma \|t_l - t_k\|^2}$, with $\sigma = 5, 10, 15$.
ii) The spline kernel function:

The results, displayed in Figure

The aim of this experiment is to illustrate the performance of the proposed methodology when the atypical data cannot be inferred from particular extreme points in the curves, and under different assumptions about the noise in the observed data. To this aim, a fraction $1 - \nu = 90\%$ of $n = 400$ curves comprises realizations of the following stochastic model:

$$X_l(t) = \sin(t) + \cos(t + \varepsilon_l) + a_l + b_l t^2,$$

for $l = 1, \dots, (1 - \nu)n$, and $t \in [0, 2\pi]$, where the random coefficients $(\varepsilon_l, a_l, b_l)$ are independently and normally distributed with means $\mu_\varepsilon = 0$, $\mu_a = 5$ and $\mu_b = 1$, and variances $\sigma_\varepsilon^2 = \sigma_b^2 = 0.25$ and $\sigma_a^2 = 0.2$. The remaining proportion of the data comprises outliers that contaminate the sample according to the following stochastic model:

$$Y_l(t) = \sin(t) + \cos(t + \varepsilon_l) + \frac{1}{2}\left(\sin(2\pi t) + \cos(\pi t + \varepsilon_l)\right) + a_l + b_l t^2,$$

for $l = 1, \dots, n\nu$, and $t \in [0, 2\pi]$, where the random coefficients $(\varepsilon_l, a_l, b_l)$ are independently and normally distributed with the same means and variances as in the case of $X(t)$. In Figure 4, we show simulated raw data on the left and the corresponding functional data on the right. As in the paper, we use a Gaussian kernel and choose the parameters by cross-validation.
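The two generating models above can be simulated directly. The following is a minimal sketch, not the code used for the experiment: the function name `simulate_curves`, the grid size `n_grid` and the seed are illustrative assumptions, and the curves are evaluated on an equally spaced grid without any additional observation noise.

```python
import numpy as np

def simulate_curves(n=400, nu=0.10, n_grid=200, seed=0):
    """Draw (1 - nu) * n regular curves X_l and nu * n outlying curves Y_l.

    n_grid and seed are illustrative assumptions, not values from the text.
    """
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 2.0 * np.pi, n_grid)
    n_out = int(round(nu * n))
    n_reg = n - n_out

    def coefficients(size):
        # Independent normal coefficients; the text gives variances,
        # so standard deviations are their square roots.
        eps = rng.normal(0.0, np.sqrt(0.25), size=(size, 1))
        a = rng.normal(5.0, np.sqrt(0.2), size=(size, 1))
        b = rng.normal(1.0, np.sqrt(0.25), size=(size, 1))
        return eps, a, b

    # Regular curves: X_l(t) = sin(t) + cos(t + eps_l) + a_l + b_l t^2
    eps, a, b = coefficients(n_reg)
    X = np.sin(t) + np.cos(t + eps) + a + b * t**2

    # Outliers: the same model plus a higher-frequency perturbation.
    eps, a, b = coefficients(n_out)
    Y = (np.sin(t) + np.cos(t + eps)
         + 0.5 * (np.sin(2.0 * np.pi * t) + np.cos(np.pi * t + eps))
         + a + b * t**2)

    curves = np.vstack([X, Y])
    is_outlier = np.concatenate([np.zeros(n_reg, dtype=bool),
                                 np.ones(n_out, dtype=bool)])
    return t, curves, is_outlier

t, curves, is_outlier = simulate_curves()
```

Because the perturbation term is bounded by 1 in absolute value while $a_l + b_l t^2$ varies widely across curves, the outliers differ in shape rather than in level, which is why they cannot be found by looking at extreme points alone.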

In Figure

In this section, we present an extended analysis of the empirical exercise of outlier detection in the context of mortality rate curves. In Table 2, we present the full results of the anomaly detection exercise considering entropy-PA and entropy-NPA and the results obtained with other measures described in Section 4 for $\nu = \{0.50, 0.25, 0.15, 0.10, 0.05, 0.01\}$. In the first three scenarios, that is, when $\nu = \{0.50, 0.25, 0.15\}$, the results for the competitor measures show that only the HMD is able to capture almost all curves corresponding to the First and Second World War (except year 1941) and the influenza pandemic for a value of $\nu = 0.25$. As expected, an inappropriate value for $\nu$ increases the number of false positives in the analysis. A convenient criterion for choosing the value of $\nu$ is to consider the ratio: where $D_M(z_{[i]}, \hat{\mu}_z)$ represents the Mahalanobis distance, sorted in decreasing order, of the vector $z_{[i]}$ representing a curve in the sample (in the case of the non-parametric approach, we consider the sorted sequence of estimated local entropies). Using this criterion, in Section 4.2, we have decided to fix $\nu = 10\%$, since, as can be seen in Figure 6, the distributions of the estimated robust Mahalanobis distances (left) and the local entropies (right) show an elbow at Points 10 and 4, respectively, and this corresponds to a value of $\nu = 10\%$ in both cases.

Figure 6. Distribution of the estimated robust Mahalanobis distances (left) and local entropies (right) for the mortality rate dataset. The vertical red line (- - -) denotes the 'elbow' in the distribution of Mahalanobis distances and local entropies, respectively, and corresponds to $\nu = 10\%$ in both cases.
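The sorted sequence of Mahalanobis distances that underlies this criterion can be sketched as follows. This is only an illustration under stated assumptions: it uses the classical sample mean and covariance rather than the robust estimates shown in Figure 6, it does not reproduce the ratio itself, and the function name `sorted_mahalanobis` and the flagging rule (the top $\nu$ fraction of the sorted distances) are hypothetical.

```python
import numpy as np

def sorted_mahalanobis(Z, nu=0.10):
    """Sort curve representations by decreasing Mahalanobis distance.

    Z is an (n, d) array with one row per curve representation z_i.
    Assumption: classical (non-robust) mean and covariance estimates.
    Returns the ordering, the sorted distances and the flagged indices.
    """
    mu = Z.mean(axis=0)
    cov = np.cov(Z, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse guards against singular cov
    diff = Z - mu
    # Quadratic form diff_i' * cov_inv * diff_i for every row i.
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    dist = np.sqrt(np.maximum(d2, 0.0))
    order = np.argsort(dist)[::-1]       # indices from largest to smallest distance
    n_flag = int(np.ceil(nu * len(Z)))   # hypothetical rule: flag the top nu fraction
    return order, dist[order], order[:n_flag]

# Toy usage with synthetic data (not the mortality dataset).
rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 3))
Z[7] = 20.0  # plant one obvious outlier
order, sorted_dist, flagged = sorted_mahalanobis(Z, nu=0.10)
```

Plotting `sorted_dist` against its index gives a curve of the kind shown in Figure 6, in which an elbow can be looked for by eye.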
When $\nu = 10\%$, most of the competitor measures identify as anomalous curves the years that correspond to the First World War and the last years of the sample. Only the HMD is able to partially identify as outliers some years corresponding to the Second World War. Even though it is true that for the early 2000s, the mortality rates are the lowest ones, they present the same dynamic as the rest of the years of the sample, so they could be considered as false-positive identifications. The temporal dynamic implicit in the data shows that the mortality rate decreases systematically every year for all the cohorts. This means that a curve that is far from the "center" of the distribution is not necessarily an anomalous curve, but follows the natural dynamics of the process that generates the samples every year.

With respect to the proposed entropy methods, these are able to identify as anomalous curves those years corresponding to the First and Second World War, except for the year 1941. Additionally, the entropy methods are the only ones capable of identifying the year 1919 (influenza pandemic) as an outlying curve. Last, but not least, it is important to mention that for the NPA, the obtained results are robust with respect to the number of neighbors $k$ considered in the method.