Combining Entropy Measures for Anomaly Detection

The combination of different sources of information is a problem that arises in several situations, for instance, when data are analysed using different similarity measures. Often, each source of information is given as a similarity, distance, or kernel matrix. In this paper, we propose a new class of methods that produce, for anomaly detection purposes, a single Mercer kernel (acting as a similarity measure) from a set of local entropy kernels while avoiding the task of model selection. This kernel is used to build an embedding of the data in a manifold that allows the use of a (modified) one-class Support Vector Machine to detect outliers. We study several information combination schemes and their limiting behaviour as the sample size increases, within an Information Geometry context. In particular, we study the manifold of the given positive definite kernel matrices in order to obtain the desired kernel combination as an element of that manifold. The proposed methodology has been evaluated on several real and artificial problems.


Introduction
Usual Data Mining tasks, such as classification, regression and anomaly detection, are heavily dependent on the geometry of the underlying data space. Kernel Methods, such as Support Vector Machines (SVM), provide the control on the data space geometry through the use of a Mercer kernel function [1,2]. Such functions, defined in the next section, induce embeddings of the data in feature spaces where Mercer kernels act as inner products. The choice of the appropriate kernel, including its parameters, is a particular case of model selection problems.
For instance, when working with SVM, a delicate parameterization is needed; otherwise, solutions might be suboptimal. In other words, the choice of a suitable kernel function and its parameters will affect both the geometry of the data embedding and the success of the algorithms [3,4]. A typical way to proceed is by means of cross-validation procedures [5]. However, these parameter calibration strategies, although intuitive and simple from an applied point of view, have some important drawbacks. In particular, their computational burden becomes relevant in practice when the problem involves calibrating a medium to large number of parameters. An appealing alternative to model selection when working with SVM is to combine or merge different kernel functions into a single kernel [6,7].
Functional data [8] have the particularity of being intrinsically infinite dimensional. This peculiarity implies that classical procedures for multivariate data must be adapted or redesigned to cope with functional data. The statistical distribution of the data is a basic element in addressing outlier detection problems. Entropies are natural functions to use in anomaly detection problems, given that any definition of entropy should produce large values for scattered distributions and small values for concentrated distributions. In addition, statistical distributions are a particular case of functional data, and this is how entropy comes into play in this context.
In this paper, we present an alternative proposal to solving anomaly detection problems that avoids the selection of kernel hyperparameters. A novelty of this work is that the methodology is developed to deal with functional data. We will explore several kernel combination techniques, including some methods from Information Geometry that respect the geometry of the manifold that contains the Gram matrices associated with the Mercer kernels involved.
The paper is organized as follows: Section 2 describes the functional data analysis methods used to produce the data representations from kernels, as well as the minimum entropy method used in this paper for anomaly detection. Section 3 develops several methods to obtain kernel combinations for the task of outlier detection. Section 4 illustrates the theory with simulations and examples; and Section 5 concludes the work.

Reproducing Kernel Hilbert Spaces for Multivariate and Functional Data
Let $X$ be the "space" where the data live (a compact metric space). A Mercer kernel is a function $K : X \times X \to \mathbb{R}$ that is symmetric, continuous and such that, for all finite sets $S = \{x_1, \dots, x_n\} \subset X$, the matrix with entries $K(x_i, x_j)$, $i, j \in \{1, \dots, n\}$, is positive semidefinite. Often, we will use the term "kernel function" when referring to a Mercer kernel. Kernel functions admit expansions of the type $K(x, z) = \sum_i \phi_i(x)\phi_i(z) = \phi(x)^T \phi(z)$ for some $\phi : X \to \mathbb{R}^d$, where $d$ is usually large. In particular, $\phi(X)$ is some manifold embedded in $\mathbb{R}^d$ [9]. For $x \in X$, denote by $K_x$ the function $K_x : X \to \mathbb{R}$ given by $K_x(z) = K(x, z)$. There exists a unique Hilbert space $H_K$ of functions on $X$, made up of the span of the set $\{K_x \mid x \in X\}$, such that for all $f \in H_K$ and $x \in X$, $f(x) = \langle K_x, f \rangle_{H_K}$. The Hilbert space $H_K$ is said to be a Reproducing Kernel Hilbert Space (RKHS) [10]. Next, we describe the use of RKHS for data analysis, differentiating between the multivariate and functional cases.
In the multivariate case, we consider data sets $S = \{x_1, \dots, x_n\} \subset X$, where $X$ is a compact subset of $\mathbb{R}^D$. Consider the RKHS $H_K$ and the linear integral operator $L_K$ defined by $L_K(f) = \int_X K(\cdot, s) f(s)\,ds$. Since $X$ is compact and $K$ is continuous, $L_K$ has a countable sequence of eigenvalues $\{\lambda_j\}$ and eigenfunctions $\{\phi_j\}$, and $K$ can be expressed as $K(x, y) = \sum_j \lambda_j \phi_j(x)\phi_j(y)$, where the convergence is absolute and uniform (Mercer's theorem).
Consider the Gram matrix $K_S = (K(x_i, x_j))$, $i, j \in \{1, \dots, n\}$. This matrix is real, symmetric and positive semidefinite (by definition of $K$), and $K_S(x_i, x_j) = \phi(x_i)^T \phi(x_j)$, where $\phi(x_i) = (\sqrt{\lambda_j}\,\phi_j(x_i))_j$ is the mapping $\phi : X \to \mathbb{R}^d$. That is, $K$ acts as the standard scalar product in $\mathbb{R}^d$. Thus, the use of $K$ induces both a data transformation and a metric on the original data given by:
$$d_K(x_i, x_j)^2 = \|\phi(x_i) - \phi(x_j)\|^2 = K(x_i, x_i) + K(x_j, x_j) - 2K(x_i, x_j). \qquad (1)$$
Equation (1) shows that the choice of the kernel $K$ determines the geometry of the data set after the transformation $X \to \phi(X)$.
Now, we consider the functional data case, that is, the case where data are functions or, by generalization, infinite dimensional objects (such as images, for instance). Let $(\Omega, \mathcal{F}, P)$ be a probability space, where $\mathcal{F}$ is the $\sigma$-algebra in $\Omega$ and $P$ a $\sigma$-finite measure. We consider random elements (functions) $X(\omega, t) : \Omega \times T \to \mathbb{R}$ in a metric space $(T, \tau)$, where $T \subset \mathbb{R}$ is compact, and we assume $X(\omega, \cdot)$ to be continuous functions. We consider kernels $K_X(s, t) = K(X(\omega, s), X(\omega, t))$ (the classical choice is $K_X(s, t) = E(X(\omega, s)X(\omega, t))$). Then, there exists a basis $\{e_i\}_{i \ge 1}$ of $C(T)$ such that, for all $t \in T$,
$$X(\omega, t) = \sum_{i \ge 1} \xi_i(\omega)\, e_i(t) \qquad (2)$$
for appropriate coefficients $\xi_i(\omega)$, where the $e_i$ are the eigenfunctions associated with the integral operator of $K_X(s, t)$.
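As an illustration of the metric induced by a kernel, the following minimal sketch computes pairwise feature-space distances directly from a Gram matrix, assuming the usual identity $d_K(x, y)^2 = K(x, x) + K(y, y) - 2K(x, y)$. The function names and the Gaussian parameterization $\exp(-\|x - y\|^2 / (2\sigma^2))$ are assumptions made for this example, not notation taken from the paper.

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Gram matrix of the Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

def kernel_distance(K):
    """Pairwise feature-space distances d_K computed from a Gram matrix K."""
    diag = np.diag(K)
    sq = diag[:, None] + diag[None, :] - 2.0 * K
    return np.sqrt(np.maximum(sq, 0.0))  # clip tiny negatives caused by round-off

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K_lin = X @ X.T                   # linear kernel: phi is the identity map
D_lin = kernel_distance(K_lin)    # coincides with the Euclidean distances
D_rbf = kernel_distance(gaussian_gram(X, sigma=1.0))
```

With the linear kernel the induced distance is the ordinary Euclidean one, whereas the Gaussian kernel bounds every distance by $\sqrt{2}$, which shows how the kernel choice reshapes the geometry of the same sample.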
In real data analysis, we do not have theoretical random paths, or functional data described by mathematical equations, but finite samples from such processes. For instance, if we are considering normal distributions as the object of analysis, we will not know the true mean vectors and covariance matrices ($\mu$ and $\Sigma$), but a sample $X = \{x_i\} \in \mathbb{R}^n$ from which we will estimate the covariance matrix $S = \frac{1}{n} X X^T$. In the case of functions, $X$ will be a compact space or manifold in a Euclidean space, $Y = \mathbb{R}$, and sample curves $f_n$ will be available, identified with data sets $\{(x_i, y_i) \in X \times Y\}_{i=1}^n$. Let $K : X \times X \to \mathbb{R}$ be a Mercer kernel and $H_K$ its associated RKHS. Then, the coefficients in Equation (2) can be approximated by solving the following optimization problem [11]:
$$\arg\min_{f \in H_K} \frac{1}{n} \sum_{i=1}^n (f(x_i) - y_i)^2 + \gamma \|f\|_K^2, \qquad (3)$$
where $\gamma > 0$ and $\|f\|_K^2$ represents the squared norm of the function $f$ in $H_K$. The solution, which constitutes an example of Equation (2), is given by $f^*(x) = \sum_j \hat{\lambda}_j \phi_j(x)$, where the $\hat{\lambda}_j$ are the weights of the projection of the function corresponding to the sample $\{(x_i, y_i)\}$ onto the function space generated by the eigenfunctions of $L_K$.
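A regularized least-squares problem of this kind can be solved in closed form. The sketch below assumes the standard formulation $\min_f \frac{1}{n}\sum_i (f(x_i) - y_i)^2 + \gamma\|f\|_K^2$ and uses the representer theorem, by which the minimizer is $f^*(x) = \sum_i c_i K(x_i, x)$ with $c = (K_S + \gamma n I)^{-1} y$. The Gaussian kernel, its bandwidth and the test curve are illustrative assumptions.

```python
import numpy as np

def fit_rkhs_regularised(K, y, gamma):
    """Coefficients c such that f*(x) = sum_i c_i K(x_i, x)."""
    n = K.shape[0]
    return np.linalg.solve(K + gamma * n * np.eye(n), y)

def predict(K_new, c):
    """Evaluate f* at new points; K_new[a, i] = K(x_new_a, x_i)."""
    return K_new @ c

# One sampled curve: a noiseless sine observed on a grid (hypothetical data).
x = np.linspace(0.0, 1.0, 25)
y = np.sin(2 * np.pi * x)

sigma = 0.2  # assumed Gaussian bandwidth
K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * sigma**2))

c = fit_rkhs_regularised(K, y, gamma=1e-6)
fitted = predict(K, c)  # smoothed values at the observation points
```

The vector `c` (or, equivalently, its coordinates in the eigenbasis of the Gram matrix) provides a finite-dimensional representation of the curve, which is what makes multivariate tools applicable to functional data.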
Next, we use local entropies for anomaly detection through kernel combinations. For this preliminary work, we explore linear combinations and Karcher means, to validate the intuition that a mean more natural than the arithmetic mean will produce better practical results when positive definite matrices are involved.

Local Entropy Kernels
In order to link the metric induced by the kernel function and the underlying (empirical) density of the data, we propose local entropy kernels. Consider a measurable cover of $(\Omega, \mathcal{F}, P)$, the probability space where the random element of interest $X$ is defined, say $\{\Omega_i\}_{i \ge 1}$, where $\bigcup_{i \ge 1} \Omega_i = \Omega$ and $\Omega_i \cap \Omega_j = \emptyset$ for any $i \ne j$. We can define the $\alpha$-entropy [12] of $X$ as follows:
$$H_\alpha(X) = \frac{1}{1 - \alpha} \log\left(\sum_{i \ge 1} P(\Omega_i)\, P(\Omega_i)^{\alpha - 1}\right), \qquad (4)$$
for $\alpha \ge 0$ and $\alpha \ne 1$.
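To make the role of $\alpha$ concrete, the following sketch evaluates the $\alpha$-entropy of a discrete distribution, assuming the standard Rényi form $H_\alpha = \frac{1}{1-\alpha}\log\sum_i p_i^\alpha$ with natural logarithms. The special cases $\alpha = 1$ (Shannon) and $\alpha = \infty$ (min-entropy) are handled through their limits; the function name and the example probabilities are hypothetical.

```python
import math

def renyi_entropy(p, alpha):
    """Renyi alpha-entropy (natural log) of a discrete distribution p."""
    p = [q for q in p if q > 0.0]        # empty cells carry no mass
    if alpha == 1.0:                      # Shannon entropy as the limit alpha -> 1
        return -sum(q * math.log(q) for q in p)
    if math.isinf(alpha):                 # min-entropy as the limit alpha -> infinity
        return -math.log(max(p))
    return math.log(sum(q**alpha for q in p)) / (1.0 - alpha)

p = [0.5, 0.25, 0.125, 0.125]             # probabilities of the cells of a partition
hartley = renyi_entropy(p, 0.0)           # log of the number of non-empty cells
shannon = renyi_entropy(p, 1.0)
min_entropy = renyi_entropy(p, math.inf)
```

For a fixed distribution the entropy is non-increasing in $\alpha$, so the three values above are ordered from largest (Hartley) to smallest (min-entropy).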
The parameter $\alpha$ determines which member of the family of $\alpha$-entropies we are referring to. For instance, when $\alpha = 0$, $H_\alpha$ is the Hartley entropy; when $\alpha \to 1$, $H_\alpha$ converges to the Shannon entropy; and when $\alpha \to \infty$, $H_\alpha$ converges to the min-entropy measure. Let $S_\Omega$ be the collection of finite partitions of $\Omega$; for any subset $A = \bigcup_{i=1}^n A_i \in S_\Omega$, the entropy of $A$ can be computed. This paves the way to define the $\Delta$-local entropy [13] corresponding to any subset $\Delta \in \mathcal{F}_\Omega$.
Let $(X_1, \dots, X_n)$ be a random sample drawn i.i.d. from $P$. We would like to compute the local entropies of the corresponding random sets $\Delta_1, \dots, \Delta_n$, where $\Delta_i = \Omega \cap B(X_i^{-1}(\omega), r)$ and $B(X_i^{-1}(\omega), r) \subset \Omega$ is the open ball with center in $\omega$ and a (data-driven) small radius $r$. In practice, given a sample $S_n = (x_1, \dots, x_n)$, we compute the local entropy using the estimator $\hat{h}_\alpha(\Delta_i) = \bar{d}_k(x_i, S_n)/(1 - \alpha)$, where $\bar{d}_k(x_i, S_n)$ is the average distance from $x_i$ to its $k$ nearest neighbours. Notice that the locality parameter $k$ in $\bar{d}_k(x, S_n)$, which represents the number of neighbours taken into account to approximate the local entropy around $x$, is related to $r$ in $\Delta_x = \Omega \cap B(x^{-1}(\omega), r)$. We then consider $\varphi(x) = \hat{h}_\alpha(\Delta_x)$, with $\alpha = 0$, in order to define the local entropy kernel.
In the next section, we discuss how to avoid model selection problems. To this aim, a set of local entropy kernels is initially estimated from the data. Then, we estimate an average local entropy kernel that takes into account the particular geometry of the space of positive definite matrices. In this way, we obtain a unique low dimensional data representation, from which outliers are detected. This approach includes neither a model selection step nor a parameter estimation procedure.
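The nearest-neighbour estimator described above can be sketched as follows. The code assumes that $\bar{d}_k(x_i, S_n)$ is the mean Euclidean distance from $x_i$ to its $k$ nearest neighbours and that $\alpha = 0$ by default. The final rank-one matrix is only one plausible way to turn the scores $\varphi(x_i)$ into a kernel (it mirrors the form $K(x, y) = \varphi(x)\varphi(y)$) and should be read as an assumption of this example; the data are simulated for illustration.

```python
import numpy as np

def local_entropy(X, k, alpha=0.0):
    """Estimate h_alpha(Delta_i) = dbar_k(x_i, S_n) / (1 - alpha) for every x_i."""
    diffs = X[:, None, :] - X[None, :, :]
    D = np.sqrt(np.sum(diffs**2, axis=-1))       # pairwise Euclidean distances
    nearest = np.sort(D, axis=1)[:, 1:k + 1]     # drop the zero self-distance
    return nearest.mean(axis=1) / (1.0 - alpha)

rng = np.random.default_rng(0)
regular = rng.normal(0.0, 1.0, size=(200, 2))    # main bulk of the data
outlier = np.array([[8.0, 8.0]])                 # one isolated observation
X = np.vstack([regular, outlier])

h = local_entropy(X, k=10)       # large values flag sparsely populated regions
K_local = np.outer(h, h)         # assumed rank-one kernel phi(x_i) * phi(x_j)
```

Different values of `k` give different kernels, which is exactly the situation the combination schemes of the next section are meant to handle without selecting a single `k`.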

Kernel Combination for Anomaly Detection
Consider a data sample $S_n = \{x_1, \dots, x_n\} \subset X$, where the $x_i$ can be multivariate or functional data, and consider a set of $m$ Mercer kernels (or matrices) $\tilde{K}_1, \dots, \tilde{K}_m$ that induce $m$ different data embeddings $\phi_j$, $j = 1, \dots, m$. As stated in Equation (1), each of the kernels induces a kernel distance $d_{K_j}$ on the original data space $X$, corresponding to the Euclidean distance on the manifold $Z_j = \phi_j(X)$.
Next, we define a new set of transformations, suitable for anomaly detection, in line with the theory of Section 2.1 by:
$$\varphi_j(x) = d_{K_j}(\phi_j(x), \phi_j(S_n)), \quad j = 1, \dots, m. \qquad (8)$$
The corresponding kernels suitable for outlier detection are
$$K_j(x, y) = \varphi_j(x)\varphi_j(y). \qquad (9)$$
Now, kernel functions are functions of positive definite type, i.e., the empirical kernel matrix $K$, obtained by evaluating the kernel function on the set of $n$ training points, belongs to the cone of symmetric positive semidefinite matrices $P := \{K \in \mathbb{R}^{n \times n} \mid K = K^T, K \succeq 0\}$. Let $K_1, \dots, K_m$ be the empirical kernel matrices defined in Equation (9), all of them in $P$, and let $(w_1, \dots, w_m)^T$ be a suitable non-negative vector of combination parameters; then define the "fusion" kernel $K$ as $K(w_1, \dots, w_m) := w_1 K_1 + \dots + w_m K_m \succeq 0$.
In the context of SVM classification problems, the goal is to find the parameters $w_1, \dots, w_m$ that maximize the optimal margin. Instead, in anomaly detection, the goal is to estimate the parameters $w_1, \dots, w_m$ that produce a suitable data representation. This is achieved when the regular data within the sample, represented in the coordinate space provided by the fusion kernel $K$, have a reduced entropy (or, equivalently, are scarcely scattered) and the observations that are atypical in the sample are projected onto regions distant from that of the regular data.
Next we consider three particular combination schemes. The first is rather straightforward, the second proposes the mean in the manifold that contains the kernels, and the third is a weighting scheme that assigns the weights according to the use of appropriate choices of entropy functions.

Definition 1 (Multivariate sparsity measures).
Consider $m$ different sparsity measures $\phi_1, \dots, \phi_m$ and let $K_1, \dots, K_m$ be the corresponding set of Mercer kernels, where $K_i(x, y) = \phi_i^T(x)\phi_i(y)$. We define a multivariate sparsity measure by $\Phi = (\phi_1, \dots, \phi_m) : X \to \mathbb{R}^m$.
The corresponding kernel, evaluated at the sample $S$, will be
$$K_\Phi(x_i, x_j) = \Phi(x_i)^T \Phi(x_j) = \sum_{l=1}^m \phi_l^T(x_i)\phi_l(x_j) = \sum_{l=1}^m K_l(x_i, x_j). \qquad (10)$$
Thus, the kernel corresponding to a multivariate sparsity measure $\Phi = (\phi_1, \dots, \phi_m)$ is the sum of the univariate kernels $K_i$ associated with the $\phi_i$. This fact allows us to interpret linear combinations of kernels $\sum_i w_i K_i$ as coming from (weighted) multivariate sparsity measures.
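The identity between the kernel of a stacked measure and the sum of the univariate kernels is easy to check numerically. In this sketch the sparsity measures are assumed to be scalar valued, and their values at the sample points are random placeholders rather than real entropy estimates; the weights are likewise hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 3

# Hypothetical values phi_l(x_i) of m scalar sparsity measures at n sample points.
Phi = rng.uniform(0.1, 2.0, size=(n, m))

# Univariate kernels K_l(x_i, x_j) = phi_l(x_i) * phi_l(x_j).
kernels = [np.outer(Phi[:, l], Phi[:, l]) for l in range(m)]

K_sum = sum(kernels)     # sum of the univariate kernels
K_multi = Phi @ Phi.T    # kernel of the stacked measure (phi_1, ..., phi_m)

# A weighted combination corresponds to rescaling each coordinate by sqrt(w_l).
w = np.array([0.5, 0.3, 0.2])
K_weighted = sum(w_l * K_l for w_l, K_l in zip(w, kernels))
Phi_scaled = Phi * np.sqrt(w)
```

Because `K_weighted` equals `Phi_scaled @ Phi_scaled.T`, any non-negative linear combination of such kernels is again a positive semidefinite Gram matrix.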

Entropy Weighting
Definition 2 ($K$-entropy of a data set). Consider a Mercer kernel $K$ acting on a space $X$, a sample data set $S_n$ and the corresponding transformation $\phi : X \to \mathbb{R}^d$ induced by $K$, where $K(x, y) = \phi(x)^T \phi(y)$.
The $K$-entropy of $S_n$, denoted $E_K(S_n)$, is defined by Equation (11). In the context of outlier detection, consider $K_1, \dots, K_m$, obtained from sparsity measures. From Equation (9), if a point $x$ is an outlier, then it will be off the main bulk of data points and, thus, $\varphi_j(x) = d_{K_j}(\phi_j(x), \phi_j(S_n))$ will be large, and the same will be true for $K_j(x, x_i)$ for most $x_i \in S_n$. As a consequence, $E_{K_j}(S_n)$ will tend to be large. On the other hand, and following the same reasoning, if a particular kernel $K_j$ induces a representation not suitable for detecting the outliers, then $E_{K_j}(S_n)$ will be small. Thus, the measure defined in Equation (11) acts as a true entropy for matrices: if the data are very concentrated after the transformation induced by $K$, then the entropy of the data set (as measured by $E_K$) will be low.
We establish the entropy-weighting scheme by solving a semidefinite optimization problem in the weights $\lambda_1, \dots, \lambda_m$, subject to the constraints $\sum_{j=1}^m \lambda_j K_j \succeq 0$, $\sum_{j=1}^m \lambda_j = 1$, and $0 \le \lambda_j \le u_j$, where the $u_j \in (0, 1]$ are positive constants that may be associated with each kernel matrix $K_j$. We refer to [14] for a detailed description of the basics of semidefinite programming.

Theorem 1.
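Consider the previous semidefinite optimization problem. If $K_1, \dots, K_m \succeq 0$ and $u_j = E_{K_j}(S_n) / \sum_l E_{K_l}(S_n)$, then the solution to the optimization problem is given by $\lambda_j^* = E_{K_j}(S_n) / \sum_l E_{K_l}(S_n)$.
Proof. Given that $\lambda_j^* = E_{K_j}(S_n) / \sum_l E_{K_l}(S_n) \ge 0$ and $K_j \succeq 0$, the constraint $\sum_{j=1}^m \lambda_j^* K_j \succeq 0$ holds. In addition, $\sum_{j=1}^m \lambda_j^* = 1$. Since the upper bounds $u_j$ sum to one, the constraint $\sum_j \lambda_j = 1$ forces every $\lambda_j$ to reach its upper bound; hence the theorem holds and the solution is unique.
Thus, the entropy-weighting scheme will be the kernel
$$\sum_{j=1}^m \frac{E_{K_j}(S_n)}{\sum_{l=1}^m E_{K_l}(S_n)}\, K_j. \qquad (13)$$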
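Once the $K$-entropies are available, the weighting scheme reduces to a normalization followed by a weighted sum. The sketch below takes the entropy values as given inputs, since it does not attempt to reproduce the paper's definition of $E_K(S_n)$; the two matrices and the entropy values are hypothetical.

```python
import numpy as np

def entropy_weights(entropies):
    """Weights lambda_j = E_j / sum_l E_l from precomputed kernel entropies."""
    e = np.asarray(entropies, dtype=float)
    return e / e.sum()

def combine(kernels, weights):
    """Linear combination sum_j w_j K_j of kernel matrices."""
    return sum(w * K for w, K in zip(weights, kernels))

# Two hypothetical positive definite kernel matrices and their entropies.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
B = np.array([[1.0, 0.0], [0.0, 3.0]])
entropies = [1.0, 3.0]            # assumed values of E_{K_1}(S_n), E_{K_2}(S_n)

lam = entropy_weights(entropies)  # normalized weights, here (0.25, 0.75)
K_fused = combine([A, B], lam)
```

Kernels with a larger entropy, that is, those that spread the outliers away from the bulk of the data, receive proportionally more weight in the fused kernel.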

Karcher Mean
Next, we introduce the Karcher mean [15][16][17] of kernel matrices as an alternative approach to the linear combinations of matrices presented in Section 3.1. The Karcher mean preserves the particular Riemannian manifold in which the kernel matrices lie and constitutes a natural definition for the geometric mean of the matrices.
The set of positive definite square matrices $P$ is a Riemannian manifold, with inner product $\langle A, B \rangle_X = \mathrm{Tr}(X^{-1} A X^{-1} B)$ on the tangent space to $P$ at the point $X$. The distance between $A, B \in P$ is given by $d_P(A, B) = \|\log(A^{-1/2} B A^{-1/2})\|_F$. Given kernel matrices $K_1, \dots, K_m$, the Karcher mean, denoted onwards as $K$, is defined as the minimizer of the function $f(X) = \sum_{i=1}^m d_P(X, K_i)^2$, and it is the unique solution $X \in P$ of the matrix equation $\sum_{i=1}^m \log(K_i^{-1} X) = 0$.
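One common way to approximate the Karcher mean is a fixed-point iteration that averages the matrices in the tangent space at the current estimate and maps the average back to the manifold. The sketch below is such an iteration with unit step size, written with eigendecompositions only; it is not necessarily the solver used in the paper, it assumes strictly positive definite inputs (a small ridge may be needed for singular Gram matrices), and convergence is not guaranteed for very dispersed matrices.

```python
import numpy as np

def _sym_fun(A, fun):
    """Apply a scalar function to a symmetric matrix through its eigenvalues."""
    w, V = np.linalg.eigh(A)
    return (V * fun(w)) @ V.T

def riemannian_distance(A, B):
    """Affine-invariant distance ||log(A^{-1/2} B A^{-1/2})||_F."""
    A_inv_sqrt = _sym_fun(A, lambda w: 1.0 / np.sqrt(w))
    return np.linalg.norm(_sym_fun(A_inv_sqrt @ B @ A_inv_sqrt, np.log), "fro")

def karcher_mean(mats, tol=1e-10, max_iter=100):
    """Fixed-point iteration for the Karcher mean of positive definite matrices."""
    X = sum(mats) / len(mats)                      # start at the arithmetic mean
    for _ in range(max_iter):
        X_sqrt = _sym_fun(X, np.sqrt)
        X_inv_sqrt = _sym_fun(X, lambda w: 1.0 / np.sqrt(w))
        # Average of the matrices mapped to the tangent space at X.
        T = sum(_sym_fun(X_inv_sqrt @ K @ X_inv_sqrt, np.log) for K in mats) / len(mats)
        X = X_sqrt @ _sym_fun(T, np.exp) @ X_sqrt  # map the average back to P
        X = (X + X.T) / 2.0                        # remove round-off asymmetry
        if np.linalg.norm(T, "fro") < tol:         # tangent average vanished
            break
    return X

A = np.diag([1.0, 4.0])
B = np.diag([4.0, 9.0])
G = karcher_mean([A, B])   # commuting case: entrywise geometric mean diag(2, 6)
```

For two matrices the Karcher mean is the midpoint of the geodesic joining them, so `riemannian_distance(G, A)` and `riemannian_distance(G, B)` coincide.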

Experimental Section
In this section, we illustrate, with the aid of multiple numerical examples and real data sets, the performance of the proposed methodology when the goal is to detect abnormal observations in a sample. We consider a list of several kernel functions, namely: (i) the Gaussian kernel with parameter $\sigma$ defined on the grid of values $\sigma \in \{0.1^3, 0.1^2, 0.1, 1, 10, 50, 100, 500, 10^3\}$; (ii) the linear kernel $K_L(x_i, x_j) = \langle x_i, x_j \rangle$; and (iii) the second degree polynomial kernel $K_P(x_i, x_j) = (\langle x_i, x_j \rangle + 1)^2$. As explained in Section 1, the proposed combination methods can be considered as an alternative to model selection techniques for outlier detection purposes. Therefore, the results obtained are presented jointly with those of the single kernel methods. Our combination methods are denoted as: (i) the average kernel ($\bar{K}$), (ii) the kernel constructed using the Karcher mean of the single kernel functions ($K$) and (iii) the minimum entropy linear combination kernel or entropy kernel ($E_K$).
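The bank of single kernels listed above can be assembled in a few lines. In this sketch the Gaussian kernel is assumed to be parameterized as $\exp(-\|x_i - x_j\|^2 / (2\sigma^2))$, which the text does not specify, and the dictionary keys and the toy sample are hypothetical. The matrices below are the raw Gram matrices, before any transformation into outlier-oriented kernels.

```python
import numpy as np

SIGMAS = [0.1**3, 0.1**2, 0.1, 1, 10, 50, 100, 500, 10**3]

def kernel_bank(X):
    """Gram matrices of the Gaussian grid, the linear and the degree-2 polynomial kernels."""
    G = X @ X.T                                            # inner products <x_i, x_j>
    sq = np.diag(G)[:, None] + np.diag(G)[None, :] - 2.0 * G
    sq = np.maximum(sq, 0.0)                               # squared Euclidean distances
    bank = {f"gauss_{s:g}": np.exp(-sq / (2.0 * s**2)) for s in SIGMAS}
    bank["linear"] = G
    bank["poly2"] = (G + 1.0)**2
    return bank

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
bank = kernel_bank(X)
K_avg = sum(bank.values()) / len(bank)    # arithmetic average of the eleven kernels
```

The same dictionary can be passed to the other combination schemes, so that no single $\sigma$ has to be selected by cross-validation.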
For comparison purposes, we consider several alternative approaches for anomaly detection in both the multivariate and the functional data frameworks. In the multivariate case, we consider two well-known techniques from the field of machine learning: (i) LOF [18] and (ii) HiCS [19]. In the functional case, we test our proposals against three widely used depth measures: the Modified Band Depth (MBD) [20], the Modal Depth (HMD) [21] and the Random Tukey Depth (RTD) [22]. These depth measures induce an order on the functional data set that can be used to determine which observations (curves) are far from the deepest or central point and can be classified as outliers.
Each database contains a set of regular observations and has been contaminated with abnormal or outlying observations. Let P and N be the number of outlying and normal observations in the sample, respectively, and let TP (True Positives) and TN (True Negatives) be the number of outliers and normal observations, respectively, that are correctly identified by each method. In Tables 1 and 2, we report the following average metrics: TPR = TP/P (True Positive Rate or sensitivity) and TNR = TN/N (True Negative Rate or specificity). For the comparison with other techniques using real data sets, in Tables 4 and 5, we report the area under the ROC curve (AUC) for each experiment.
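The two rates can be computed directly from the true labels and the flags produced by a detector. The following sketch uses hypothetical labels and a hypothetical function name purely for illustration.

```python
def detection_rates(is_outlier, flagged):
    """Return (TPR, TNR) given true outlier labels and detector flags."""
    P = sum(is_outlier)                  # number of true outliers
    N = len(is_outlier) - P              # number of regular observations
    TP = sum(1 for t, f in zip(is_outlier, flagged) if t and f)
    TN = sum(1 for t, f in zip(is_outlier, flagged) if not t and not f)
    return TP / P, TN / N

truth   = [True, True, False, False, False, False, False, False]
flagged = [True, False, False, False, True, False, False, False]
tpr, tnr = detection_rates(truth, flagged)   # sensitivity and specificity
```

Here one of the two outliers is detected and one of the six regular observations is wrongly flagged, so the sensitivity is 1/2 and the specificity is 5/6.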

Synthetic Data
For the simulated experiment, we consider two synthetic data schemes. The first scheme has been built by generating a synthetic multivariate data set, while, for the second scheme, we have generated a synthetic functional data set.
Synthetic multivariate data: We consider a conditionally normal bivariate distribution model [23] for the regular data, while the outliers were sampled from three different standard Gaussian models. The sample size is $n = 1000$. The data for the experiment, illustrated in Figure 1, were obtained using a Gibbs sampler.

Synthetic functional data:
We consider random samples of Gaussian processes $\{x_1(t), \dots, x_n(t)\}$, with sizes 4000 and 2000, where a proportion $\nu = 0.1$, known a priori, presents an atypical pattern, and the remaining $n(1 - \nu)$ curves are considered the main data. We consider the following generating processes:
$$X_l(t) = \sum_{j=1}^{2} \xi_j \sin(j\pi t) + \varepsilon_l(t), \quad \text{for } l = 1, \dots, (1 - \nu)n,$$
$$Y_l(t) = \sum_{j=1}^{2} \zeta_j \sin(j\pi t) + \varepsilon_l(t), \quad \text{for } l = 1, \dots, \nu n,$$
where $t \in [0, 1]$, the $\varepsilon(t)$ are independent autocorrelated random error functions and $(\xi_1, \xi_2)$ is a normally distributed bivariate random variable (NDMRV) with mean $\mu_\xi = (1, 2)$ and diagonal covariance matrix $\Sigma_\xi = \mathrm{diag}(1, 1)$. To generate the outliers, we consider $(\zeta_1, \zeta_2)$ NDMRV with parameters $\mu_\zeta = (4, 5)$ and $\Sigma_\zeta = \Sigma_\xi$. The data are plotted in Figure 2.
Table 1 shows the results of the experiment using synthetic multivariate data. Best results are marked in bold. It can be observed that the proposed combination methods, namely the mean, the weighted entropy and the Karcher mean, perform as well as the best single kernel in terms of the TNR. With respect to the TPR, the best combination method is the one based on the Karcher mean.
Table 1. Percentage of TPR (sensitivity) and TNR (specificity) for synthetic multivariate data.
In Table 2, the results of the experiment using synthetic functional data are presented. In this case, two of the three proposals, the mean and the weighted entropy, are always able to perform as well as the best single kernel (the polynomial kernel) in terms of both the TNR and the TPR. The method based on the Karcher mean obtains good results with respect to the TNR measure.
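The functional generating scheme can be simulated as in the sketch below. The regular curves use coefficients with mean $(1, 2)$ and the atypical ones with mean $(4, 5)$, both with identity covariance, on the basis $\sin(\pi t), \sin(2\pi t)$. The text does not specify the autocorrelated error process, so a stationary AR(1) noise with hypothetical parameters `noise_sd` and `rho` is assumed here, as are the grid size and the function name.

```python
import numpy as np

def simulate_curves(n, nu, n_grid=50, noise_sd=0.1, rho=0.5, seed=0):
    """Simulate n curves on [0, 1]; the last round(nu * n) follow the outlier model."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n_grid)
    n_out = int(round(nu * n))
    is_outlier = np.zeros(n, dtype=bool)
    is_outlier[n - n_out:] = True

    # Coefficient means: (1, 2) for regular curves, (4, 5) for outliers; identity covariance.
    means = np.where(is_outlier[:, None], [4.0, 5.0], [1.0, 2.0])
    coef = means + rng.normal(size=(n, 2))
    basis = np.stack([np.sin(np.pi * t), np.sin(2 * np.pi * t)])

    # Assumed error model: stationary AR(1) noise along t, independent across curves.
    eps = np.empty((n, n_grid))
    eps[:, 0] = rng.normal(scale=noise_sd, size=n)
    innovation_sd = noise_sd * np.sqrt(1.0 - rho**2)
    for g in range(1, n_grid):
        eps[:, g] = rho * eps[:, g - 1] + rng.normal(scale=innovation_sd, size=n)

    return t, coef @ basis + eps, is_outlier

t, curves, is_outlier = simulate_curves(n=200, nu=0.1, n_grid=51)
```

Each row of `curves` is one discretized sample path; the Boolean vector `is_outlier` plays the role of the ground truth used to compute the TPR and TNR.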

Real Data
Regarding real data, we also differentiate between multivariate and functional data. To test and compare proposals using multivariate data, we consider six databases from the UCI machine learning repository [24], which are available and properly described in [25]. The testing and comparison of our proposals using functional data are carried out over two functional data sets: (i) Poblenou NOx Emissions (NOx). This data set contains the nitrogen oxide ($\mathrm{NO}_x$) emission levels measured every hour by a control station in Poblenou, Barcelona (Spain). The data are publicly available in the R package 'fda.usc' [26]. In the data set, working-day $\mathrm{NO}_x$ emissions, considered as regular data, can be distinguished from weekend $\mathrm{NO}_x$ emissions, considered as atypical data; (ii) Vertical Density Profiles (VDP). This data set contains 24 curves of Vertical Density Profiles, which come from the manufacture of engineered woodboards. Each one consists of 314 measurements taken 0.002 inches apart (see [27] for further details). In Table 3, we give the details about the sample size, the dimension and the percentage of outlying observations for each of the data sets. The NOx and VDP data sets are illustrated in Figure 3.
Table 3. Summary of the data sets.
Table 4 shows the results of the experiment using real multivariate data. It can be observed that the best overall method on average is the weighted entropy proposal. In particular, this method attains the best results for two of the six databases (Pima and Cardio), and for the rest of the sets its results are close to the best ones. Although the proposed methodologies seem to perform systematically better than the other machine learning approaches, it is not clear, in terms of the AUC, whether the difference is statistically significant for some databases (Glass, Breast Cancer, Breast Cancer Diagnostic and Pima).
In Table 5, the results of the experiment using real functional data are presented.
For the VDP data set, in terms of the AUC measure, the weighted entropy and the mean proposals perform as well as the best single kernels and the MBD. For the NOx data set, the best overall method is the one based on the calculation of the Karcher mean, followed closely by the MBD approach.

Robustness of the Karcher Mean
In this experiment, we explore the robustness of the proposed procedure in the context of the detection of atypical functional data. To this aim, we generate $n = 100$ independent sample paths from the following Gaussian stochastic model: $X(t) = \xi_1 \sin(t) + \xi_2 \cos(t)$ for $t \in [0, \pi]$, and
$$\begin{pmatrix} \xi_1 \\ \xi_2 \end{pmatrix} \sim N\left(\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix},\ \Sigma = \begin{pmatrix} 0.75 & -0.5 \\ -0.5 & 0.75 \end{pmatrix}\right), \qquad (14)$$
that is, $(\xi_1, \xi_2)$ follows a zero-mean bivariate normal distribution with covariance parameters $\sigma_{11} = \sigma_{22} = 0.75$ and $\sigma_{12} = \sigma_{21} = -0.5$. Using the representation techniques introduced in Section 2, we can represent these curves as points in $\mathbb{R}^2$ and, moreover, we can estimate (by Maximum Likelihood) a covariance matrix $\hat{\Sigma}$ using this data representation. We replicate the previous generating process 10 times, obtaining 10 covariance matrix estimates, namely $\hat{\Sigma}_i$ for $i = 1, \dots, 10$. Next, we construct the mean estimated covariance matrix as $\hat{\bar{\Sigma}} = \frac{1}{10} \sum_{i=1}^{10} \hat{\Sigma}_i$.
In Figure 4, each estimate $\hat{\Sigma}_i$ is represented by its 99th percentile ellipse, where $\chi^2_{2,0.99}$ is the value of a Chi-square distribution with two degrees of freedom that accumulates 0.99 probability, $\hat{\lambda}_{1,i}$ and $\hat{\lambda}_{2,i}$ are the estimated eigenvalues corresponding to each estimate $\hat{\Sigma}_i$, and $\hat{\theta}_i$ is the estimated rotation angle with respect to the '$x_1$' axis. In addition, in the same figure, the estimated mean $\hat{\bar{\Sigma}}$ (its corresponding ellipse) is shown in red ("---"), and the Karcher mean in blue ("---"). To introduce some anomaly in our data, in Figure 4-right, we added one ellipse constructed from an anomalous bivariate distribution whose covariance matrix has elements $\sigma_{11} = \sigma_{22} = 7.5$ and $\sigma_{12} = \sigma_{21} = -10$; this atypical covariance matrix corresponds to a different stochastic Gaussian model from the baseline introduced in Equation (14). It can be observed in Figure 4-left that the average covariance matrix and the Karcher mean of the covariance matrices generate similar 99th percentile ellipses.
Since the generated covariance matrices $\hat{\Sigma}_i$ are located in a small region within the cone of positive semidefinite matrices, such a region can be approximated by a linear subspace that contains the average covariance matrix. On the other hand, in Figure 4-right, the curvature of the cone is revealed by the difference in the dispersion of the anomalous covariance matrix, illustrated by the ellipse with a black dashed line. In this scenario, the Karcher mean of the covariance matrices generates a 99th percentile ellipse similar to that of the regular scenario (left panel), which shows the robustness of the Karcher mean in the presence of outliers. In contrast, in the contaminated scenario (right panel), the 99th percentile ellipse generated with the simple arithmetic mean of the covariance matrices changes radically with respect to the regular scenario. The robustness in the estimation of the covariance matrix suggests that the procedure proposed in this paper, based on the estimation of the Karcher mean in the cone of positive definite matrices, will be useful when solving atypical functional data identification problems.
Last but not least, the relevant aspect of this numerical example is that, by using the Karcher mean as an estimator of the center of the distribution of positive semidefinite matrices, we are minimizing the Riemannian distance, as defined in Section 3, and, as a consequence, the proposed method is able to identify the anomalous covariance matrix with respect to the pattern given by the rest of the distributions.
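The 99th percentile ellipses used in this experiment can be derived from a covariance matrix as sketched below. The code assumes the standard parameterization, in which the semi-axes are $\sqrt{\chi^2_{2,0.99}\,\lambda}$ for each eigenvalue $\lambda$ and the orientation is that of the leading eigenvector; it also uses the closed form $\chi^2_{2,p} = -2\log(1 - p)$, valid for two degrees of freedom. The function name is hypothetical.

```python
import math
import numpy as np

# Quantile of the chi-square distribution with 2 degrees of freedom (closed form).
CHI2_2_099 = -2.0 * math.log(1.0 - 0.99)

def ellipse_parameters(Sigma, chi2=CHI2_2_099):
    """Semi-axes and orientation (radians, in [0, pi)) of the probability ellipse."""
    w, V = np.linalg.eigh(Sigma)              # eigenvalues in ascending order
    major = math.sqrt(chi2 * w[1])
    minor = math.sqrt(chi2 * w[0])
    vx, vy = V[:, 1]                          # eigenvector of the largest eigenvalue
    theta = math.atan2(vy, vx) % math.pi      # angle with respect to the x1 axis
    return major, minor, theta

Sigma = np.array([[0.75, -0.5], [-0.5, 0.75]])   # baseline covariance of the experiment
major, minor, theta = ellipse_parameters(Sigma)
```

For the baseline covariance the eigenvalues are 1.25 and 0.25, so the ellipse is elongated along the direction at 135 degrees from the $x_1$ axis, reflecting the negative correlation between the two coefficients.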

Discussion
In this work, we have explored how to combine different sources of information for anomaly detection within the framework of entropy measures. We define entropies associated with the transformation induced by Mercer kernels, both for random variables and for data sets. We propose a new class of combination methods that generate a single Mercer kernel (acting as a similarity measure) for anomaly detection purposes from a set of entropy measures in the context of density estimation. In particular, three combination schemes have been proposed and analysed, namely: (i) an average of the kernel matrices; (ii) the mean in the manifold that contains the kernels; and (iii) a weighting scheme that assigns the weights as the solution of an optimization problem that seeks to maximize a particular kernel entropy. Such proposals, based on the idea of building the final combined kernel matrix within the same manifold where the kernel matrices to be combined live, seem to be the most successful ones on average.
An innovative application of this methodology is the use of the Karcher mean as part of a method to identify anomalous covariance matrices. The success of this proposal is due to the fact that the Karcher mean acts as an estimator of the center of the distribution of positive semidefinite matrices, while minimizing their Riemannian distance, which allows the identification of the outlying matrices with respect to the pattern given by such an estimator.
A relevant aspect for the applicability of the method to real problems is its complexity and cost in comparison with other alternatives. The proposals whose structure is based on a linear combination of kernel matrices have a very low computational cost, as they only require products of constants and sums of matrices. The proposal based on the Karcher mean has the typical drawback of any semidefinite programming problem, that is, the computational and memory costs are related to the size of the matrices involved. Current systems are not able to deal with large dense matrices, given that processing time and memory grow quasi-exponentially as the size of the matrices increases. See [28] for a discussion on these aspects and current trends to improve the performance of methods for the solution of semidefinite programming problems. Most applications for general dense matrices in semidefinite programming involve a few hundred data cases. Fortunately, in this particular application (outlier detection), we do not need to work with the full database to succeed. Due to the presence of statistical regularities, a few thousand data cases will usually be enough to capture all the relevant statistical aspects of the data set at hand.
Further research is needed, especially regarding the possibility of exploring other embeddings of the data. For instance, higher dimensional transformations specific to anomaly detection could be designed. In this regard, care should be taken with the scaling of such transformations, as dimensions with large magnitudes relative to the others may lead to suboptimal results. In this work, for multivariate data, we have compared the proposed methodologies with some multivariate outlier detection techniques. In the future, systematic experiments comparing with other well-known methodologies such as XGBOD [29], LODES [30], iForest [31] or MASS [32] are to be carried out. Regarding these multivariate techniques, another interesting research line is the extension of such methodologies to functional data analysis. In this regard, suitable multivariate representations of functional data similar to those in [2] should be explored.