Anomaly Detection in Multichannel Data Using Sparse Representation in RADWT Frames

We introduce a new methodology for anomaly detection (AD) in multichannel fast oscillating signals based on nonparametric penalized regression. Assuming the signals share similar shapes and characteristics, the estimation procedures are based on the use of the Rational-Dilation Wavelet Transform (RADWT), equipped with a tunable Q-factor able to provide sparse representations of functions with different oscillations persistence. Under the standard hypothesis of Gaussian additive noise, we model the signals by the RADWT and the anomalies as additive in each signal. Then we perform AD imposing a double penalty on the multiple regression model we obtained, promoting group sparsity both on the regression coefficients and on the anomalies. The first constraint preserves a common structure on the underlying signal components; the second one aims to identify the presence/absence of anomalies. Numerical experiments show the performance of the proposed method in different synthetic scenarios as well as in a real case.


Introduction
Anomaly detection (AD) is an important problem that has received much attention in recent years. It consists of establishing whether a given data deviates from nominal shape or form. It is not possible to establish a single mathematical framework to deal with AD because it depends on the type of application. There are a lot of surveys in the literature, some of them specific to certain applications and some others are broader and application-free. See, for example, the review by [1] regarding the Intrusion Detection Systems, or the paper by [2] providing an overview on AD, analysis, and prediction in Internet of Things (IoT) systems or the paper by [3] considering the application of AD techniques to aviation and how they improve the safety and service of flight operations, just to cite some. However, a milestone of the literature is the paper by Chandola et al. [4], where the authors offer a very good understanding of the subject, defining anomalies, describing its challenges, depicting a relevant taxonomy of the different techniques, and illustrating the various domains of applications. As well explained by the authors, the AD problem depends on the nature of input data (points, sequence, functions, graphs, objects of different nature), on the type of anomaly (point anomalies, contextual or behavioral anomalies, or their combination), on the availability of labeled data for training/validation of the AD techniques (leading to unsupervised AD and supervised AD), and on the type of output of AD (scores or label). Two more recent reviews are the ones by Pimentel et al. [5] and by Ahmed et al. [6], where the authors display a huge and comprehensive number of references on the topic. In particular, reference [5] is devoted to novelty detection, where the distinction with the word anomaly is that the former considers, as normal, the data after their detection. Another important and very recent paper is the one by Thudumu et al. [7], which reviews AD in the context of big data where a curse of dimensionality occurs, bringing the failure of different methodologies.
We can broadly divide the AD techniques into four categories as very well represented by the taxonomy of Figure 3 in [6]: classification techniques, statistical techniques, clustering techniques, and information theory techniques.
In this paper, we focus on statistical techniques, whose recipe is to fit a statistical model to the given data and then check if some of them do not fit this model. We think that a good survey on statistical methods is given by Rousseeuw and Hubert in [8], where they consider robust regression techniques, signal processing techniques, and robust principal component analysis (PCA) techniques.
Our paper is placed in the context of statistical signal processing techniques, where the data are indeed time series, signals, or functions. It is worth observing that signal processing is synonymous with functional data analysis in more classical statistics nomenclature, as well as outliers and anomalies. In particular, we can distinguish between isolated and persistent outliers, see [9]. The isolated outliers insist on a small part of the signal, while the persistent outliers insist on most of their domain. It is important to stress that the literature on AD in functional data is quite recent and is mainly devoted to univariate functions.
In this paper, we deal with the AD problem for multichannel (i.e., multivariate) data affected by persistent outliers and approach it from the perspective of penalized multiple regression framework. We model the nominal shape as K multiple signals from K channels sharing a joint sparse representation into a special dictionary, and then we add a possible anomaly term of any form and shape to each signal component. This framework represents situations where we expect similar shapes for the components of the signal, but some components can undergo anomaly behavior for some unpredictable reason. In particular, we consider nominal signals with a fast oscillating characteristic that can be well represented only by using an appropriate dictionary, namely a RADWT, as done in the context of multiple regression in [10]. A RADWT is a modern and fast computational tool for analyzing signals which are a mixture of oscillatory and nonoscillatory transient behaviors. Such kinds of signals are not periodic and may describe many physiological and physical signals (for example, speech, stock-market, biomedical EEG, etc.) which could not be represented sparsely in the classic orthogonal bases such as Fourier, wavelet, etc. The request of joint sparse representation formalizes the hypothesis that the signal components are expected to have the same spectral characteristics while preserving numerical differences. The idea we develop in our work is the following: a linear model is applied to simultaneously estimate the signal by using its sparse representation in this special dictionary and detect potential anomalies modeled as residuals in this sparse representation.
A similar idea has been already applied in the literature under different model assumptions, for example in [11][12][13]. Specifically, in [11] the authors propose to learn the dictionary by using a set of nominal signals through PCA, as well as in [12] the authors learn the dictionary by using a training set of "normal" nominal signals, i.e., not affected by anomalies, under the specific hypothesis that they are multivariate mixtures of discrete and continuous signals and applying their sparse coding algorithm to select the training signals having the highest residuals. Subsequently, in both papers, the authors detect anomalies for a new signal occurrence by evaluating the residual of its sparse representation in this dictionary. On the other hand, in [13], the authors face the univariate AD problem considering a linear regression model where the univariate signal is expressed as a linear combination of columns of a given design matrix without the sparseness hypothesis. The authors propose detecting anomalies by simultaneously estimating both the nominal signal and its residual using nonconvex penalized regression with the constraint on the outliers vector.
Our paper stands in between these recent proposals [11][12][13] for two reasons. First, we use a dictionary for sparsely representing functions (as in [11,12]) but without requiring a training set of "normal" nominal signals for dictionary learning because we rely on an innovative instrument such as the RADWT transform to deal with fast oscillating signals. Second, we look for anomalies by simultaneously estimating both the nominal signal and its residual (as in [13]) but imposing a double constraint on the regression coefficients and the anomalies, taking into account that few RADWT coefficients are necessary to represent the signals. To ensure that the residuals can preserve information about the outliers/anomalies, we must use robust regression techniques. As we will clarify later on, a good AD undergoes a good robust regression analysis, because the procedure aims simultaneously to detect the anomalies and to estimate the signal's components, the success of the first goal ensuring the success of the second and vice versa. Hence, our paper can be also considered as part of the literature on robust regression in the high-dimensional setting, see [14,15]. In this perspective, we also refer to [16], from which we were inspired to use robust losses, such as the Huber loss and the skipped mean loss.
The remainder of the paper is organized as follows. Section 2 describes the data model with the working hypothesis. Section 3 presents and discusses the proposed AD procedure within the paradigm of group-lasso regression, enlightening the connections with other existing procedures. Section 4 describes some important implementation issues. Section 5 shows numerical experiments on synthetic and real data and finally, the last section discusses conclusions.

The Data Model and Problem Setting
We will consider hereafter a functional dataset consisting of K curves f (k) (t), k = 1, . . . , K observed on a set of equispaced grid points t 1 , . . . , t n and resulting in observed K dimensional response vectors y (1) , y (2) , . . .,y (K) . The observed curves are composed by nominal signals f (1) , f (2) ,... f (K) which may have some functional properties to which are eventually added anomaly signals a (1) , a (2) ,..., a (K) which may appear as functional outliers a k (t), k = 1, . . . , K. Once the data is discretized our data model becomes (2) . . .
where, for each k = 1, · · · , K, f (k) ∈ R n×1 is the underlying unknown nominal signal, a (k) ∈ R n×1 is the potential anomaly, and ε (k) ∼ N(0, σ 2 I n ) is the additive white noise. Note that if the Gaussian noise in some channel is correlated with known covariance structure, we can easily manage this situation reparametrizing the data. The mathematical hypothesis is that, while the K components have a nominal behavior which results in a joint sparse representation into a RADWT dictionary, the anomalies can be of any shape, so they have no sparse representation in the same dictionary, and they are independent of each other, so an anomaly can be detected in any, some, or all components.
We do not hypothesize functions f (k) (t) belong to some functional Sobolev space H s p,q [a, b] as it is usually done in functional nonparametric regression setting, instead we let these functions be much more general and we restrict our attention to their finitedimensional representation. Since many physiological and physical signals exhibit a mixture of oscillatory and nonoscillatory behaviors (for example, speech, stock-market, biomedical, EEG, etc.), we suppose that each component is a "high resonance" signal or a "low resonance" signal or a sum of both types of signal. By a high-resonance component, we mean a signal consisting of multiple simultaneous sustained oscillations; in contrast, by a low-resonance component, we mean a signal consisting of nonoscillatory transients of unspecified shape and duration. We stress that the high and low resonance components of a signal can not be extracted from its high and low frequencies components in a time-scale decomposition, but they can be well represented by a high-Q factor RADWT and a low-Q factor RADWT respectively, see [17]. The RADWT is a normalized tight dictionary of L 2 (R) defined as ( q p ) k/2 ψ ( q p ) k t + sp q l k,l∈Z where ψ is a wavelet function and (p, q, s) is a triplet of parameters that gives the time-scale characteristic of the dictionary. In particular, the ratio q/p > 1 is closely related to the scale (or frequency) dilatation factor, the parameter s is closely related to the time dilatation factor, and p s(q−p) is the redundant factor. The Q-factor depends on these parameters although there is not an explicit formula, in particular setting the dilatation factor q/p between 1 and 2 and s > 1 gives a RADWT with high Qfactor, while setting s = 1 we obtain a low Q-factor RADWT with time-scale characteristic similar to the dyadic wavelet transform. In particular, when q = 2, p = 1 and s = 1, the dictionary reduces to the classical wavelet basis. Given a finite energy signal x of length n and J ∈ N levels of decomposition, the RADWT transform is obtained by a sequence of proper downsampling operations and fast Fourier transforms; it ends up with np J q J scaling coefficients (low-pass filtering) and np j q j s wavelet coefficients (high-pass filtering) at each level j = 1, ..J. See [18] for details on fast analysis and synthesis schemes.
In the following, we will use these signal processing results in order to formulate our working setup. Let Ψ ∈ R n×d1 be the finite matrix representation of the low Q-factor analysis filter, where d1 indicates the cardinality of the low Q-factor RADWT, and let Φ ∈ R n×d2 be the finite matrix representation of the high Q-factor analysis filter (the synthesis operators being just the transpose matrices), where d2 indicates the cardinality of the high Q-factor RADWT. Let us define the dictionary W = [Ψ Φ] ∈ R n×d , with d = d1 + d2, then our working hypothesis is the following: (H1) signals f (k) have a jointly sparse representation in W, i.e., setting f (k) = W β (k) , (1) , β (2) , · · · , β (K) ∈ R d×K has a minimal number of rows different from zero. This means considering the following constraint where B 2,0 counts the number of non zero rows, T is the a priori parameter indicating the expected degree of sparsity level, and I(a) is the indicator function: It is worth observing that the role of dictionary matrix W could be played by any other dictionary in which the nominal signals have a sparse joint representation. What we will expose in the following is not related to the nature of the dictionary W and can therefore be considered a valid method for other types of signals as long as there is a dictionary that sparsely represents them. For example, W could be made up of only Φ if it is known that the f signals have only a high resonance component, or it could be made up of the union of other types of basis, frames, or dictionaries. However, when Ψ and Φ are the finite representation of RADWT, they constitute thigh frames, i.e., Ψ Ψ = I n and Φ Φ = I n , so W W = 2I n and then, once we obtain the estimate of the coefficients it is straightforward to obtain the estimate of the signal. When the choice of the dictionary change, Ψ Ψ = I n and Φ Φ = I n is not necessarily satisfied, and hence signal reconstruction is not obvious.
which turns out to be a linear multiple regression model with a special common design matrix. A first and somewhat naive approach, but sub-optimal, would consist of treating each channel separately, ignoring the underlying common structure. This is the reason why such kind of problem is reformulated in terms of a unique regression problem in the following form: (1) y (2) . . . (1) β (2) . . . (2) . . . (1) ε (2) . . .
with obvious correspondence between elements of the two expressions. So, y is a column vector of n · K response variables, X a design matrix of dimension n · K × K · d, β an unknown regression coefficients column vector of length K · d, a is an unknown anomalies column vector of length n · K, and, finally without loss of generality, we let ε be a n · Kvariate Gaussian random column vector with zero mean and covariance matrix σ 2 I n·K .
Our regression problem (5) has d · K + n · K regression parameters and only n · K data point, so it clearly falls into the class of high-dimensional regression problems. Under the working hypothesis (H1), we expect β to be group sparse, i.e., for many j = 1, ..., d we expect that β (k) j = 0, for all k = 1, .., K. This provides the following nonoverlapping group structure for the whole vector with group of size K. Let us define the following l 2,1 norm with β G j denoting the reduction of vector β to the subset of index G j . The problem we are dealing with is not a classic regression problem, because our interest is to simultaneously estimate the vector a, more specifically the a (1) , .., a (K) anomalies that compose it, and the β vector. For example, if an anomaly is present in the first channel, the vector a (1) , representing the residual of the linear regression of y (1) on W will have a l 2 -norm much higher than the residual in another channel with no anomaly. While the anomalies vectors are not connected, because the anomalies can appear in any, some, or all channels, the underlying signals are not independent since by hypothesis they share the same spectral characteristic. For these reasons, we propose solving the following convex minimization problem where λ 1 > 0 is a regularization parameter that controls the group sparsity of vector β, while λ 2 > 0 is a regularization parameter that controls the number of expected anomalies. Both the regularization parameters represent a trade-off between the fit and the complexity of the model. Being λ 2 fixed, λ 1 → 0 implies a non sparse β, generalizing the model proposed in [13] to the multichannel setting, on the contrary λ 1 → ∞ implies a sparser β vector, increasing the energy of the anomalies. On the other hand, being λ 1 fixed, λ 2 → 0 detects anomalies in each channel, while λ 2 → ∞ brings to a model without anomalies. Their choice is a delicate point and we will discuss it in the Implementation, Section 4. Finally, for each k ∈ {1, 2, ..., K} we detect an anomaly if the obtained estimate â (k) is the solution of the optimization problem (7).
Moreover, the intensity of the detected anomaly, measured as â (k) 2 , can be considered as a score for the anomaly and treated as an incipient level of fault according to the specific application which defines an alarm threshold.
From a statistical perspective, the problem given in Equation (7) falls into the class of group lasso regression problem applied to the augmented parameter vector β = (β t ; a t ) t and augmented design matrix X = [X I n·K ]. Indeed, the problem in Equation (7) can be rewritten asˆ If the unknown vector β is sparse, under appropriate hypothesis on the design matrix X, consistency of the proposed estimator is guaranteed, see [19][20][21]. From a numerical perspective, problem (7) is jointly convex in β and a, and its simple form suggests that we can apply an alternating optimization given initial estimates for β and a. Of course, we need an estimate for β that takes into account the potential presence of the anomaly, i.e., robust.
Given an initial robust estimate for β, i.e.,β 0 , we defineâ 0 = y − Xβ 0 , then at each iteration the procedure solves the following two sub-problems, given a, estimate β given β, estimate â a = argmin a∈R Kn Iteration stops when a given number of maximum iteration is reached or when the relative error on the anomaly vector a is below a fixed threshold. Let us give a closer look at the two sub-problems. Function F 1 (a) = F 1 a (1) , ..., a (K) is separable in each channel, i.e., is the current residual of the regression of y (k) on W; so the solution of sub-problem (9) is obtained by the multivariate-Soft thresholding rulê where the directional vector v/||v|| plays the rule of sign() function in classical one dimensional setting, see [22]. In Equation (10), the Soft thresholding operator (.) + acts on the vector r (k) by shortening it towards 0 by an amount λ 2 if its norm is greater than λ 2 , by setting it to zero if its norm is less than λ 2 . For the seek of completeness, we sketch below how to derive the above statement. Vector in Equation (10) is the solution of the following convex problem Considering the case a (k) = 0 and setting the derivative with respect to a (k) to zero, we obtain from which the following two equations derive: by solving the second equation in a (k) and substituting into the first we get, Now, consider the case a (k) = 0. We need to impose that the 0 vector belongs to the pseudogradient evaluated at a (k) , i.e., there exists a vector v ∈ R n×1 , v = 0 and v 2 ≤ 1 such that −r (k) + λ 2 v = 0. This is possible only if r (k) 2 ≤ λ 2 . Finally, combining both cases into a single equation, we get exactly the expression in Equation (10).
Let us now consider sub-problem (8). It is possible to prove that minimization of F 2 (β) = F(β,â), whereâ is the solution of Equation (9), i.e., vectorsâ (k) given by Equation (10), is equivalent to solve the following Huber's cost functional with parameter λ 2 plus a group lasso penalty. i.e., Here the mutivariate Huber cost function is given by the following rule: for any v ∈ R n×1 . For the sake of completeness, we sketch below how to derive the above statement. Let < λ 2 and I c be its complement.
In robust regression, i.e., regression in presence of outliers, it is well known that Huber loss is less sensitive to outliers than quadratic loss because it combines the squared loss and the absolute loss in an adaptive way. Intuitively, when the data points are not too big, Huber's loss function penalizes like the squared loss; otherwise, it penalizes like the absolute loss. However, it is not resistant to high leverage outliers (outliers in the predictors) because it never rejects gross outliers that have moderate or high leverage. Its breakdown point is 1/n. Indeed, according to [23], a convex criterion is inherently incompatible with robustness.
The proposed procedure is summarized in Algorithm 1.

Possible Improvements
In this section we explore the possibility to improve our estimator, adopting a different type of threshold operator. We are inspired by [13], where the authors deal with the outliers detection problem in the case of one channel. In the paper they propose a two step procedure: after a proper initialization for β, in the first step, they apply a Soft thresholding operator to the residual, in the second step they use Ordinary Least Squares (OLS) to update vector β. The main difference with our procedure is in the second step, where we perform a group lasso regression instead of an OLS to face the multitask regression. Although in [13] the authors deal with the outliers detection, looking for pointwise anomalies, the philosophy of their estimator is comparable to ours. Moreover, they note that in the presence of multiple outliers, the Soft threshold is not able to deal with masking and swamping effects, and propose to replace Soft-thresholding with Hard-thresholding obtaining a great improvement. Then they generalize the idea and introduce an entire class of methods, namely Θ − IPOD which uses a general Θ thresholding operator instead of the Soft thresholding operator, discovering important results and outstanding performance. In this direction of research, we explore the possibility to improve our estimator, substituting the first step of our procedure, i.e., the multivariate Soft thresholding operator expressed in Equation (10), by an appropriate Θ operator as done by the authors of [13].
Specifically, we substitute Equation (10), by the following Hard-thresholding operator where I(·) is the indicator function defined in Equation (3). We stress that substituting Equation (13) into the first step of our procedure, the second step becomes equivalent to a skipped-mean loss penalized with a group lasso penalty. We sketch below how to derive it.
where I = k :â (k) = 0 = k : r (k) 2 ≤ λ 2 and I c its complement. Hence, substituting Soft threshold with Hard threshold automatically makes the second step equivalent to the following penalized skipped-mean losŝ with loss defined as: for any v ∈ R n×1 . Note that skipped-mean loss is not-convex and is a special case of Hampel loss, that belongs to the family of redescending M-estimators, see [24], with high breakdown point, close to 0.5. This means that is robust to outliers. For a summary, see Appendix 1 in [16]. Figure 1 shows a comparison between quadratic loss, Huber loss ρ H λ 2 (v), and skippedmean loss ρ SM λ 2 (v). We can conclude that, when using Hard threshold instead of Soft threshold, we solve the following problem instead of the problem in Equation (7) (β,â) ∈ argmin β∈R Kd , a∈R Kn with a 2,0 = ∑ K k=1 I a (k) = 0 , the l 0 grouped norm. Of course, the problem in Equation (16) is not convex, because the complexity penalty is not convex; therefore, we can only aspire to find one of the possible local minima of the objective function. Nonconvex functions may possess local optima that are not global optima, and our iterative method may terminate undesirably in one of these local optima. From a statistical perspective, although theoretical results for nonconvex penalties have been studied in [14,15] proving that all local optima are essentially as good as a global optimum, we cannot apply those results because our penalty does not satisfy their hypothesis. Therefore, our finding for this second procedure is purely empirical and we include the side condition β 1 + a 1 < R to guarantee the existence of at least one local/global optima. In addition, we will require R ≥ β 0 1 + a 0 1 so that the true regression vector β 0 and the true anomaly vector a 0 , that satisfy Equation (4), are feasible.
The proposed procedure is summarized in Algorithm 2.

Implementation
In this section, we describe all the issues that make the proposed procedure effective: namely the initialization step, the numerical algorithm, and the choice of the regularization parameters λ 1 and λ 2 .
As described in Section 3, we need to initialize the two step procedure by a robust estimator of parameters β which gives a robust estimator of X β to plug into expression of Equation (9). First, we note that vector (X β) t = W β (1) t , ..., W β (K) t , hence to initialize our procedure is enough to establish a robust estimator for each of the signal component f (k) = W β (k) . In our procedure, under the prior knowledge that the number of channels with anomaly are less then the half number of channels, we initialize f (k) by the median of the data channels, namely median y (1) , ..., y (K) As regards the numerical algorithm we note that the first step described in Equation (9) has a closed-form solution given in Equation (10) and Equation (13) in the case of Soft and Hard threshold respectively; the second step in Equation (8) requires the solution of a linear regression problem with a group Lasso penalty. Hence for the solution of the second step we employed the grpreg R package that implements the efficient Group Descendent Algorithm presented in [25,26]. This algorithm works groupwise by using the separability of the model (5) in terms of a group of variables, i.e., it updates each group of variables freezing the other groups to their current value until convergence. The updating of each group of variables is performed through a multivariate soft-thresholding operator, under the assumption of "orthonormal groups". We stress that the "orthonormal group" property refers to the condition X t G j X G j = I, with X G j denoting the reduction of design matrix X to columns of the subset of index G j . When this condition is not satisfied the grpreg automatically orthonormalizes the design matrix, but this practice leads to a slight modification of the l 1 /l 2 -norm contained in the penalty, as pointed out in [27,28]. However, this is not our case, because the design matrix defined in Equation (5) satisfies the "orthonormal groups" property by construction, and hence, we can take complete advantage of the Group Descendent Algorithm implemented in the grpreg package.
As in any penalized regression approach, the choice of the regularization parameter is really strategic. In particular, in the proposed procedure the regularization parameter has two components, namely λ 1 and λ 2 . The latter controlling the degree of regularization in the regression model on parameter β, the last controlling the threshold over which the residual of the previous regression is considered not acceptable. Under the data model described in Equation (1), the expected size of the residual r (k) , in absence of anomaly in channel k, is given by σ √ n, hence it is natural to fix λ 2 =σ √ n whereσ is an estimator of the noise variance. In particular, we adoptσ = median r (1) ) 2 , ..., r (K) 2 / √ n where r (k) is the residual after having initialized f (k) as described above. The choice of the regularization λ 1 is more demanding; we avoided CV criterion since it is quite computationally heavy and in principle when applied on the (y, X) data, a large prediction error can be due to both a suboptimal β as well as to the presence of an anomaly. Hence we suggest tuning parameter λ 1 by the BIC criterion, which is much less computationally demanding and easily modifiable to our setting. In particular, we adopted the following formula where d f (λ 1 ) is evaluated according to formula (22) in [26] and RSS(λ 1 ) = y − Xβ −â 2 2 , withβ andâ estimates obtained for λ 1 . The grid of λ 1 values has been obtained as in Section 3.5 of [26].

Simulations and Real Examples
To show the performance of the proposed methodology, we performed numerical experiments both in a simulation setting and in a real dataset case.

Synthetic Data
We generated data according to the model in Equation (1) with some known functions f k (t) and some known anomalies and evaluated the performance of the proposed techniques in terms of signals reconstruction and anomalies detection.
Since the proposed method deals with nominal signals with fast oscillating characteristics, we choose the HiSine and the TwoChirp signals (represented in Figures 2 and 3) among the classical signals employed in the literature. Both can be well represented by using an appropriate RADWT. In particular, a RADWT with Q-factor almost 5 (p = 8, q = 9, s = 3 and j = 10) is appropriate for their sparse representation.
As regards the anomalies, there are no assumptions on their shape, i.e., they can be of any duration and shape and they can appear in any of the signal components, so we decided to model them by the following formula to represent situations where the mean of the signal undergoes to a shift during a specific interval of time (t 1 , t 2 ); where the shift M = h · max 1≤i≤n,k y (k) (t i ) with h = 0.5 or 2, to mimic situation where the anomaly is weak or strong with respect to the underlying signals. We generated data according to model (1), building K signals, f (1) , . . . , f (K) of length n = 256 and for such a choice the W matrix has dimension n × d = 256 × 695. Each signal has been generated randomizing the original signal f (HiSine or Twochirp) by the formula f (k) = f · u, where u is a vector of i.i.d. variables with distribution Unif(0, 1). These generations guarantee that the true underlying signals f (1) , . . . , f (K) have similar shape and the same sparsity patterns. According to the model in Equation (1), we added noise to each channel as ε (k) ∼ N(0, σ 2 I) with σ 2 related to the variance of the true underlying signal realizing three different type of SNR = (0.5, 1, 6) to mimic situations of different severity of noise. For completeness, here is the SNR expression Finally, we added anomalies a (k) to some channels as in Formula (17).
To sketch, we have considered two settings: • f = HiSine, K = 3, a (1) as in Formula (17) with (t 1 , t 2 ) = (78, 177); • f = TwoChirp, K = 5, a (1) as in Formula (17) with (t 1 , t 2 ) = (78, 177) and a (4) as in Formula (17) with (t 1 , t 2 ) = (157, 256).  To summarize, for each of these two settings (f = HiSine and f = TwoChirp), we analyzed two different levels of anomalies (h = 0.5 and h = 2 in Formula (17)) and three different levels of SNR (SNR = 0.5, 1 and 6). These choices are arbitrary but they were dictated by the idea of understanding a sort of breakdown point of our procedure. Naturally, we expect that the weaker the anomalies and the stronger the noise, the worse the problem. Finally, we evaluated performance by computing the MSE for each signal's component as withf (k) estimate of f (k) , and by computing the relative number of (falsely) detected anomalies (FPR) as well as the relative number of not detected anomalies (FNR), namely A being the true number of channel with anomalies. In addition, we evaluated the intensity of the estimated anomalies to compare with the intensity of the simulated ones and to analyze the detection capabilities of both our proposals as To be robust to the particular realization in generating synthetic data (and corresponding noise), each experiment was run several times. In particular, we ran 10 instances for each of the 12 considered cases (2 (function settings) × 2 (levels of anomaly) × 3 (SNRs)). For the sake of exposition, in the following we show separately results obtained for each function's settings. Specifically, Figures 4-9 and Tables 1 and 2 Tables 2 and 4). This is a consequence of the masking effect: the weaker the anomaly the easier the noise, and the anomalies can mask each other. However, there is an interesting difference between the two procedures. In particular, for the Hard thresholding, we observe that FNR is always zero, while there can be some FPR > 0 in the more difficult setting, i.e., when the anomalies are weaker and the level of noise is higher. This means that the Hard thresholding procedure always detects the true anomalies, being prone to some false positives detection when stressing the problem setting. This observation is relevant from the practical point of view because in such extreme cases one can always adopt a post-processing rule to classify the detected anomalies, being confident enough not to miss any anomaly. On the contrary, the Soft thresholding is much more conservative, hence when stressing the setting, i.e., when the anomalies are weaker (h = 0.5) and the level of noise higher, the Soft thresholding procedure is more prone to false negatives detection. Indeed, for the Soft we always get FPR = 0, while stressing the setting we get some FNR > 0 (the opposite of Hard). This is not a drawback, rather a different way of reacting to the difficulty of the problem. However, from a practical point of view, when using Soft instead of Hard it can be a disadvantage to have false-negative detection because when the result undergoes post-processing the anomalies not detected are not further analyzed. However, although our two procedures are not the definitive answer to the AD problem, it is important from a practical point of view to be aware that the Hard thresholding tends to select more anomalies (and therefore to have some false positives) while the Soft threshold tends to select fewer anomalies (and therefore to have some false negatives). In addition, Figures 9 and 15 show the boxplots of the intensity of the simulated anomalies and of the estimated anomalies by both Hard and Soft thresholding. We can note the good performance of the Hard thresholding procedure with respect to the Soft thresholding one, both in the case of weak anomalies and strong anomalies, with a better intensities estimate in the last case.
As regards the estimation performance, we observe that for both procedures, Hard and Soft, and both settings, f = HiSine, and f = TwoChirp, it decreases when increasing the level of noise (i.e., decreasing the SNR). In fact, the MSE significantly decreases when moving from the left to the right in all box-plot shown in Figure 4 for the first setting f = HiSine and in Figure 10 for the second setting f = TwoChirp; panels (a)-(c) refer to Hard thresholding for h = 0.5 and h = 2, respectively, and panels (b) and (d) refer to Soft thresholding for h = 0.5 and h = 2, respectively. Interestingly, in line with what we observed on the detection performance results, we have significant differences in terms of MSE performance on the anomalous channels between the Hard and the Soft. See box-plots in positions 1, 4, 7, corresponding to the first channel for the HiSine signal and box-plots in positions 1, 4, 6, 9, 11, 14, corresponding to the first and the fourth channels for the TwoChirp signal. The anomalous channels always have a better MSE for the Hard than for the Soft. This is a direct consequence of our double-step procedure: when we correctly detect the anomaly (in the first step), we correctly adjust the residual (in the second step) and hence we obtain a better estimation. On the channels without anomaly, detection performance is comparable.

Real Data
To illustrate the performance of our procedures on a real example, we considered the problem of detecting anomalies for signals recording measurements on a water pump. The signals represent the behavior of the pump in both normal and abnormal conditions and a complete description of it is given in [29]. The dataset is part of a repository managed by the Delft University of Technology and is freely available at http://homepage.tudelft. nl/n9d04/occ/index.html (accessed on the 18 February 2021) (dataset 541: Delft pump AR app). Each signal has a length n = 160 and has been obtained by first fitting an AutoRegressive model of order 32 (AR(32)) to each of 5 vibration measurements and then by combining the coefficients in a single vector. This dataset has been previously analyzed in [30]. In that paper, the authors proposed a data domain description, named the Support Vector Data Description (SVDD) to find the finest representation of the data such that the normal target signals are optimally clusterized and can be distinguished as best as possible from the abnormal ones. Their approach has a good performance and in some sense, it is another perspective for looking at the detection of anomalies. However, the data representation proposed in [30] does not take into account the longitudinal shape of the data, which are ordered coefficients; hence the method proposed in [30] is expected to be invariant under permutation of the data. On the other hand, our analysis is truly functional, since we treat the data as functions/signals; moreover, the dataset satisfies the hypothesis of our model. Since the underlying signals have a fast oscillating characteristic, they share a similar shape which presupposes the same sparsity pattern into the RADWT dictionary and the signals measured under abnormal conditions can be modeled as the signals measured under normal conditions plus some shift in the first part of the interval.
We considered 20 signals, 12 nominal and 8 with the anomaly, randomly chosen from the whole dataset, see Figure 16. We applied both procedures, Hard and Soft thresholding, obtaining reconstruction shown in Figure 17, panels (a) and (b). In the same figure, panels (c) and (d), we also display the retrieved residuals which represent anomalies. While the Soft thresholding has no false positives neither false negatives, the Hard thresholding has 2 false positives (channel 10 and 11), with the norm of the order of the estimated variance 0.0035, see Table 5. However, the Hard thresholding estimates better the intensity of the anomalies (which represent residuals for the regression on β), permitting a better signals reconstruction. This does not happen for the Soft thresholding, which results in over-smoothed signal reconstruction. Finally, we can conclude that the Hard thresholding procedure has a better performance although it has two false positives, which could be easily detected by standard post-processing.

Conclusions
In this paper, we presented two procedures for anomaly detection in multichannel signals under a structural hypothesis on the underlying signals covering some specific real-life situations. In particular, we dealt with fast oscillating signals under the assumption that they share a common specific sparsity pattern in a given dictionary. The construction of the dictionary leverages on a complete filter bank (RADWT) of L 2 (R) which guarantees a perfect reconstruction property and a tunable Q-factor. We based the two procedures on group penalized regression methods and we implemented them by a two-step iterative algorithm. The two methods differ in the second step iteration, the first applying Soft thresholding to the residual vectors, the second one applying Hard thresholding. Moreover, we made connections between them and the robust regression literature in a high dimensional setting obtaining interesting interpretations. From the computational point of view, we observed that the Hard thresholding procedure reveals some advantages to the Soft thresholding procedure, confirming some previous results on nonconvex robust regression.