A Pattern Dictionary Method for Anomaly Detection

In this paper, we propose a compression-based anomaly detection method for time series and sequence data using a pattern dictionary. The proposed method is capable of learning complex patterns in a training data sequence, using these learned patterns to detect potentially anomalous patterns in a test data sequence. The proposed pattern dictionary method uses a measure of complexity of the test sequence as an anomaly score that can be used to perform stand-alone anomaly detection. We also show that when combined with a universal source coder, the proposed pattern dictionary yields a powerful atypicality detector that is equally applicable to anomaly detection. The pattern dictionary-based atypicality detector uses an anomaly score defined as the difference between the complexity of the test sequence data encoded by the trained pattern dictionary (typical) encoder and the universal (atypical) encoder, respectively. We consider two complexity measures: the number of parsed phrases in the sequence, and the length of the encoded sequence (codelength). Specializing to a particular type of universal encoder, the Tree-Structured Lempel–Ziv (LZ78), we obtain a novel non-asymptotic upper bound, in terms of the Lambert W function, on the number of distinct phrases resulting from the LZ78 parser. This non-asymptotic bound determines the range of anomaly score. As a concrete application, we illustrate the pattern dictionary framework for constructing a baseline of health against which anomalous deviations can be detected.


Introduction
Anomaly detection and outlier detection are used for detecting data samples that are inconsistent with normal data samples. Early methods did not take the sequential structure of the data into consideration [1]. However, many real world applications involve data collected as a sequence or time series. In such data, anomalous samples are better characterized as subsequences of time series. Anomaly detection is a challenging task due to the uncertain nature of anomalies. Anomaly detection in time series and sequence data is particularly difficult since both length and occurrence frequency of potentially anomalous subsequences are unknown. Additionally, algorithmic computational complexity can be a challenge, especially for streaming data with large alphabet sizes.
In this paper, we propose a universal nonparametric model-free anomaly detection method for time series and sequence data based on a pattern dictionary (PD). Given training and test data sequences, a pattern dictionary is created from the sets of all the patterns in the training data. This dictionary is then used to sequentially parse and compress (in a lossless manner) the test data sequence. Subsequently, we interpret the number of parsed phrases or the codelength of the test data as anomaly scores. The smaller the number of parsed phrases or the shorter the compressed codelength of the test data, the more similarity between training and test data patterns. This sequential parsing and lossless compression procedure leads to detection of anomalous test sequences and their potential anomalous patterns (subsequences).
The proposed pattern dictionary method has the following properties: (i) it is nonparametric since it does not rely on a family of parametric distributions; (ii) it is universal in the sense that the detection criterion does not require any prior modeling of the anomalies or nominal data; (iii) it is non-Bayesian as the detection criterion is model-free; and (iv) as it depends on data compression, data discretization is required prior to building the dictionary. While the proposed pattern dictionary can be used as a stand-alone anomaly detection method (Pattern Dictionary for Detection (PDD)), we show how it can be utilized in the atypicality framework [2,3] for more general data discovery problems. This results in a method we call PDA (Pattern Dictionary based Atypicality), in which the proposed pattern dictionary is contrasted against a universal source coder which is the Tree-Structured Lempel-Ziv (LZ78) [4,5]. We use the LZ78 as the universal encoder since its compression procedure is similar to our proposed pattern dictionary, and it is (asymptotically) optimal [4,5].
The main contributions of this paper are as follows. First, we propose the pattern dictionary method for anomaly detection and characterize its properties. We show in Theorem 1 that using a multi-level dictionary that separates the patterns by their depth results in a shorter average indexing codelength in comparison to a uni-level dictionary that uses a uniform indexing approach. Second, we develop novel non-asymptotic lower and upper bounds of the LZ78 parser in Theorem 2 and further analyze its non-asymptotic properties. We demonstrate that the non-asymptotic upper bound on the number of distinct phrases resulting from the LZ78 parsing of an |X |-ary sequence of length l can be explicitly expressed by the Lambert W function [6]. To the best of our knowledge, such characterization has not previously appeared in the literature. Then, we show in Lemma 1 that the achieved non-asymptotic upper bound on the number of distinct phrases resulting from the LZ78 parsing converges to the optimal upper bound l log l of the LZ78 parser as l → ∞. Third, we show how the pattern dictionary and LZ78 can be used together in an atypicality detection framework. We demonstrate that the achieved non-asymptotic lower and upper bounds on both LZ78 and pattern dictionary determine the range of the anomaly score. Consequently, we show how these bounds can be used to analyze the effect of dictionary depth on the anomaly score. Furthermore, the bounds are used to set the anomaly detection threshold. Finally, we compare our proposed methods with the competing methods, including nearest neighbors-based similarity [7], threshold sequence time-delay embedding [8][9][10][11], and compression-based dissimilarity measure [12][13][14][15]15,16], that are designed for anomaly detection in sequence data and time series. We conclude our paper with an experiment that details how the proposed framework can be used to construct a baseline of health against which anomalous deviations are detected.
The paper is organized as follows. In Section 2, we briefly review the relevant literature in anomaly detection (readers who are familiar with anomaly detection can skip this section). Section 3 introduces the detection framework and the notation used in this paper. Section 4 presents our proposed pattern dictionary method and its properties. In Section 5, we show how the proposed pattern dictionary can be used in an atypicality framework alongside LZ78, and we analyze the non-asymptotic properties of the LZ78 parser. Section 6 presents experiments that illustrate the proposed pattern dictionary anomaly detection procedure. Finally, Section 7 concludes our paper.

Related Works
Anomaly detection has a vast literature. Anomaly detection procedures can be categorized into parametric and nonparametric methods. Parametric methods rely on a family of parametric distributions to model the normal data. The slippage problem [17], change detection [18][19][20][21], concept drift detection [19][20][21][22], minimax quickest change detection (MQCD) [23][24][25], and transient detection [26][27][28][29] are examples of parametric anomaly detection problems. The main difference between our proposed pattern dictionary method and the aforementioned techniques is that our method is a model-free nonparametric method. The main drawback of the parametric anomaly detection procedure is that it is difficult to accurately specify the parametric distribution for the data under investigation.
Nonparametric anomaly detection approaches do not assume any explicit parameterized model for the data distributions. An example is an adaptive nonparametric anomaly detection approach called geometric entropy minimization (GEM) [30,31] that is based on the minimal covering properties of K-point entropic graphs constructed on N training samples from a nominal probability distribution. The main difference between GEM-based methods and our proposed pattern dictionary is that former techniques are designed to detect outliers and cannot easily incorporate the temporal information regarding anomaly in a data stream. Another nonparametric detection method is sequential nonparametric testing that considers data as online stream and addresses the growing data storage problem by sequentially testing every new data samples [32,33]. A key difference between sequential nonparametric testing and our proposed pattern dictionary method is that our method is based on coding theory instead of statistical decision theory.
Information theory and universal source coding have been used previously in anomaly detection [34][35][36][37][38][39][40][41][42][43][44][45]. The detection criteria in these approaches are based on comparing metrics such as complexity or similarity distances that depend on entropy rate. An issue with these approaches is that there are many completely dissimilar sources with the same entropy rate, reducing outlier sensitivity. Another related problem is universal outlier detection [46,47]. In these works, different levels of knowledge about nominal and outlier distributions and number of outliers are incorporated. Unlike these methods, our proposed pattern dictionary approach does not require any prior knowledge about outliers and anomalies. In [48], a measure of empirical informational divergence between two individual sequences generated from two finite-order, finite-alphabet, stationary Markov processes is introduced and used for a simple universal classification. While the parsing procedure used in [48] is similar to the pattern dictionary used in this paper, there are important differences. The empirical measure proposed in [48] is a stand alone score function that is designed for two-class classification, while our measure is a direct byproduct of the LZ78 encoding algorithm designed for single-class classification, i.e., anomaly detection. In addition, the theoretical convergence of the empirical measure to the relative entropy between the class conditioned distributions, shown in [48], is only guaranteed when the sequences satisfy the finite-order Markov property, a condition that may be difficult to satisfy in practice. In [2,3], an information theoretic data discovery framework called atypicality has been introduced in which the detection criterion is based on a descriptive codelength comparison of an optimum encoder or a training-based fixed source coder, namely a data-dependent source coder introduced in [2]) with a universal source coder. In this paper, we show how our proposed pattern dictionary method can be used as a training-based fixed source coder in an atypicality framework.
Anomaly and outlier detection for time series has also been extensively studied [49]. Various time series modeling techniques such as regression [50], auto regression [51], auto regression moving average [52], auto regressive integrated moving average [53], support vector regression [54], and Kalman filters [55] have been used to detect anomalous observations by comparing the estimated residuals to a threshold. Many of these methods depend on a statistical assumption on the residuals, e.g., an assumption of Gaussian distribution, while the pattern dictionary method is model-free.
The proposed pattern dictionary method is closely related to the anomaly detection methods that are designed for sequence data. Many of these methods are focused on specific applications. For instance, detection of mutations in DNA sequences [7,56], detection of cyberattacks in computer network [57], and detection of irregular behaviors in online banking [58] are all application-specific examples of anomaly detection for discrete sequences. In the recent years, multiple sequence data anomaly detection methods have been developed specifically for graphs [59], dynamic networks [60], and social networks [61]. Chandola et al. [34] summarized many anomaly detection methods for discrete sequences and identified three general approaches to this problem. These anomaly detection formulations are unique in the way that anomalies are defined, but similar in their reliance on comparison between a test (sub)sequence and normal sequences in the training data. For example, kernel-based techniques such as nearest neighbor-based similarity (NNS) [7] are designed to detect anomalous sequences that are dissimilar to the training data. As another example, threshold sequence time-delay embedding (t-STIDE) [8][9][10][11] is established to detect anomalous sequences that contain subsequences with anomalous occurrence frequencies. The compression-based dissimilarity measure (CDM) is proposed for discord detection [12][13][14][15]15,16] to detect anomalous subsequences within a long sequence. Chandola et al. [34] also showed how various techniques developed for one problem formulation can be changed and applied to other problem formulations. While our pattern dictionary method shares similarity with NNS, CDM, and t-STIDE, our proposed method is generally applicable to any of the categories of anomaly detection identified in [34]. Furthermore, our detection criterion does not depend on the specific type of anomaly. Note that while CDM is also a compression-based method, its anomaly score is based on a dissimilarity measure that might fail to detect atypical subsequences [2]. For instance, using CDM method, a binary i.i.d. uniform training sequence is equally dissimilar to another binary i.i.d. uniform test sequence or to a test sequence drawn from some other distribution. In Section 6, the detection performance of our proposed pattern dictionary method is compared with NNS, CDM, t-STIDE, and the Ziv-Merhav method of [48].
It is worth mentioning that since the proposed pattern dictionary method is based on lossless source coding, it requires discretization of time series prior to deployment. In fact, many anomaly detection approaches require discretization of continuous data prior to applying inference techniques [62][63][64][65]. Note that discretization is also a requirement in other problem settings such as continuous optimization in genetic algorithms [66], image pattern recognition [67], and nonparametric histogram matching over codebooks in computer vision [68].

Framework and Notation
In the anomaly detection literature for sequence data and time series, the following three general formulations are considered [34]: (i) an entire test sequence is anomalous if it is notably different from normal training sequences; (ii) a subsequence within a long test sequence is anomalous if it is notably different from other subsequences in the same test sequence or the subsequences in a given training sequence; and (iii) a given test subsequence or pattern is anomalous if its occurrence frequency in a test sequence is notably different from its occurrence frequency in a normal training sequence. In this paper, we consider a unified formulation in which we determine if a (sub)sequence is anomalous with respect to a training sequence (or training sequence database) if any of the aforementioned three conditions are met. In other words, given a training sequence or a training sequence database, a test sequence is anomalous if it is significantly different from training sequences, or it contains a subsequence that is significantly different from subsequences in the training sequence, or it contains a subsequence whose occurrence frequency is significantly different from its occurrence frequency in the training data.

Notation
We use x to denote a sequence and x m n to denote a subsequence of x: x m n = {x i , i = n, n + 1, . . . , m}, and x l represents a sequence of length l, i.e., {x n , n = 1, . . . , l}. X denotes a finite set, and D represents a dictionary of subsequences. Throughout this paper: • All logarithms are base 2 unless otherwise is indicated. • In the encoding process, we always adhere to lossless compression and strict decodability at the decoder. • While adhering to strict decodability, we only care about the codelength, not the codes themselves.

Pattern Dictionary: Design and Properties
Consider a long sequence, called the training data, {x n , n = 1, . . . , L} of length L drawn from a finite alphabet X . The goal is to learn the patterns (subsequences) of this sequence by creating a dictionary that contains all distinct patterns of maximum length (depth) D max L that are embedded in the sequence. We call this dictionary a pattern dictionary D with the maximum depth D max and the set of observed patterns S D x L 1 . Since the pattern dictionary is going to be used as a training-based fixed source coder (a data-dependent source coder as defined in [2]), an efficient structure for the pattern representation that minimizes the indexing codelength is of interest. The simplest approach is to consider all the patterns of length 1 ≤ d ≤ D max in one set S D and use a uniform indexing approach. This approach is called a uni-level dictionary. Another approach is to separate all the patterns by their depth (pattern length) and arrange them in D max sets S D , which we call a multi-level dictionary. In the following sections, we show that the latter results in a shorter average indexing codelength. It is worth mentioning that since a multi-level dictionary results in a depthdependent indexing codelength, the average over the depth is considered. A relevant question is if the average of indexing codelength over all the patterns independent of depth should be used as an alternative. Since such pattern dictionaries are used to sequentially parse test data, patterns at smaller depth are more likely to be matched, even if they are anomalous. Thus, the average of indexing codelength over depth can better differentiate depth-dependent anomalies.

A Special Case
Suppose all the possible patterns of depth d ≤ D max exist in the training sequence {x n , n = 1, . . . , L}. That is, the cardinality of S (d) Then, the total number of patterns is Hence, a uni-level dictionary results in a uniform indexing codelength of On the other hand, a multi-level dictionary requires a two-stage description of index. The first stage is the index of the depth d (using log D max bits), and the second stage is the index of the pattern among all the patterns with the same depth (using d log(|X |) bits). This two-stage description of the index leads to a non-uniform indexing of codelength: the minimum indexing codelength occurring for the patterns of depth d = 1 equals to L multi min =log D max + log(|X |) bits, while the maximum indexing codelength occurring for the patterns of depth d = D max equals to L multi max =log D max + D max log(|X |) bits. Thus, the average indexing codelength of a multi-level dictionary is given by Figures 1 and 2 graphically compare the indexing codelength between a uni-level dictionary and a multi-level dictionary for a fixed alphabet size and a fixed D max , respectively. As seen, the average indexing codelength of a multi-level dictionary results in a shorter indexing codelength.

The General Case
Given the training sequence {x n , n = 1, . . . , L}, suppose there are a d = S (d) D ≤ |X | d patterns of depth d ≤ D max (a 1 patterns of depth one, a 2 patterns of depth two, etc.). The following Theorem 1 shows that the average indexing codelength using a multi-level dictionary is always less than the indexing codelength of a uni-level dictionary.
in a training sequence of length L D max . Let L uni and L multi be the indexing codelength of a uni-level dictionary and the average indexing codelength of a multi-level dictionary, respectively. Then, (1) L multi ≤L uni ; and Proof. Since L D max , clearly 0 < a 1 ≤ a 2 ≤ · · · ≤ a D max . Using a uni-level dictionary, the indexing codelength is where A D max (a 1 + a 2 + · · · + a D max )/D max is the arithmetic mean of a 1 , a 2 , . . . , a D max . Using a multi-level dictionary the average indexing codelength is is the geometric mean of a 1 , a 2 , . . . , a D max . Hence, the comparison between L uni and L multi comes down to comparing the arithmetic mean and the geometric mean of a 1 , a 2 , . . . , a D max . Thus, A D max ≥ G D max , which established the first part of the theorem. For the second part of the theorem, we use lower and upper bounds on Theorem 1 shows that a multi-level dictionary gives shorter average indexing codelength than a uni-level dictionary. log D max + log a d is the indexing codelength for patterns of depth d, where a d is the total number of observed patterns of the depth d. In order to reduce the indexing codelength even further, the patterns of the same length in each set S (d) D can be ordered according to their relative frequency (empirical probability) in the training sequence. This allows Huffman or Shannon-Fano-Elias source coding [4] to be used to assign prefix codes to patterns in each set S (d) D separately. In this case, for any pattern D , the indexing codelength becomes where L (d) D x d 1 is the codelength assigned to the pattern x d 1 based on its empirical probability using a Huffman or Shannon-Fano-Elias encoder. If such encoders are used, the codelength (1) is optimal ([4] Theorem 5.8.1). Since the whole purpose of creating a pattern dictionary is to learn the patterns in the training data, assigning the shorter codelength to the more frequent patterns and assigning longer codelength to the less frequent patterns in any pattern set S (d) D will improve the efficiency of the coded representation.

Example 2.
Suppose the alphabet is X = {A, B, C, D} and the training sequence is x = ABACADABBACCADDABABACADAB. Table 1 shows the dictionary with D max = 3 created by the patterns inside the training sequence, and the codelength assigned for each pattern using Huffman coding. Pr

Pattern Dictionary for Detection (PDD)
Suppose we want to sequentially compress a test sequence x l 1 = {x n , n = 1, . . . , l} using a trained pattern dictionary D with maximum depth D max < l. The encoder parses the test sequence x l 1 into c phrases, denote the set of the parsed phrases using pattern dictionary D. The parsing process begins with setting v 1 = 1 and finding the largest This results in the first phrase This type of cross-parsing was first introduced in [48] in order to estimate an empirical relative entropy between two individual sequences that are independent realizations of two finite-order, finite-alphabet and stationary Markov processes. Here, we do not impose such an assumption on the sources generating the sequences. Algorithm 1 summarizes the procedure of the proposed pattern dictionary (PD) parser. After parsing the whole test sequence x l 1 into c phrases,

Algorithm 1 Pattern Dictionary (PD) Parser
Require: Pattern Dictionary D, Test Sequence x l For detection purposes, on a test sequence x l 1 , either the number of parsed phrases or the codelength can be used as anomaly scores with respect to the trained pattern dictionary D. In other words, for any test sequence x l 1 and given a pattern dictionary, if the number of parsed phrases S D x l 1 or the codelength L x l 1 in Equation (2) are greater than a certain threshold, then x l 1 is declared to be anomalous. While the proposed pattern dictionary technique can be used as a stand-alone anomaly detection technique, below we show how it can be used for atypicality detection [2,3] as a training-based fixed source coder (data-dependent encoder).

Pattern Dictionary-Based Atypicality (PDA)
In [2,3], an atypicality framework was introduced as a data discovery and anomaly detection framework that is based on a central definition: "a sequence (or subsequence) is atypical if it can be described (coded) with fewer bits in itself rather than using the (optimum) code for typical sequences". In this framework, detection is based on the comparison of a lossless descriptive codelength between an optimum encoder (if the typical model is known) or a training-based fixed source coder (if the typical model is unknown, but training data are available) and a universal source coder in order to detect atypical subsequences in the data [2,3]. In this section, we apply our proposed pattern dictionary as a training-based fixed source coder (typical encoder) in an atypicality framework. We call it pattern dictionary-based atypicality (PDA) method.
The pattern dictionary-based source coder can be considered as a generalization of the Context Tree [70][71][72] based fixed source coder that was used in [2] for discrete data. The universal source coder (atypical encoder) used here is the Tree-Structured Lempel-Ziv (LZ78) [4,5]. The primary reason for choosing LZ78 as the universal encoder is that its sequential parsing procedure is similar to the proposed pattern dictionary described in Section 4, and it is (asymptotically) optimal [4,5]. One might ask why do we even need to compare descriptive codelengths of a training-based (or optimum) encoder with a universal encoder for data discovery purposes when, as alluded to in the end of last section, a training-based fixed source coder can be a stand-alone anomaly detector. The necessity of such concurrent comparison is articulated in [2]. In fact, such a codelength comparison enables the atypicality framework to go beyond the detection of anomalies and outliers, extending to the detection of rare parts of data that might have a data structure of interest to the practitioner.
We give an example to provide further intuition for why anomaly detection can benefit from our framework that compares the outputs of a typical encoder and an atypical encoder. Consider an i.i.d. binary sequence of length L with P(X = 1) = p in which there is embedded an anomalous subsequence of length l L with P(X = 1) = p = p that we would like to detect. If p = 1 2 and p = 1, the typical encoder cannot catch the anomaly while the atypical encoder can. On the other hand, if p = 1 3 and p = 2 3 , the typical encoder identifies the anomaly while an atypical encoder fails to do so (since the entropy for p = 1 3 and p = 2 3 is the same). Note that in both cases, our framework would catch the anomaly since it uses the difference between the descriptive codelengths of these two encoders.
Recall that in Section 4, we supposed that a test sequence x l 1 has been parsed using a trained pattern dictionary D with maximum depth D max < l. This parsing results in S D x l 1 parsed phrases. Using Equation (2), the typical codelength of the sequence x l 1 is given by For the atypical encoder, the LZ78 algorithm results in a distinct parsing of the test sequence x l 1 . Let S LZ x l 1 denote the set of parsed phrases in the LZ78 parsing of x l 1 . As such, the resulting atypical codelength is [4,5] L A x l 1 = S LZ x l 1 log S LZ x l 1 + 1 .
Since L x l 1 using both LZ78 and the pattern dictionary depends on the number of parsed phrases, we investigate the possible range and properties of S D x l 1 − S LZ x l 1 . While the LZ78 encoder is a well-known compression method which is asymptotically optimal [4,5], its non-asymptotic behavior is not well understood. In the next section, we establish a novel non-asymptotic property of an LZ78 parser, and then compare it with the pattern dictionary parser.

Lempel-Ziv Parser
We start this section with a theorem that establishes the non-asymptotic lower and upper bounds on the number of distinct phrases in a sequence parsed by LZ78.
Proof. First, we establish the upper bound. Note that the number of parsed distinct phrases c(l) is maximized when all the phrases are as short as possible. Define M |X | and let l k be the sum of the lengths of all distinct strings of length less than or equal to k. Then, Since l = l k occurs when all the phrases are of length ≤ k, We conclude that the parsing ends up with c(l k ) phrases of length ≤ k and l−l k k+1 phrases of length k + 1. Therefore, ( We now bound the size of k for a given sequence of length l by setting l = l k . Define where k = αk − 1. The last equation can be solved using the Lambert W function [6]. Since all the involved numbers are real and for M > 1 and l ≥ 2, we have β α M −1−1/α ln M ≥ 0 > − 1 e , it follows that where W(.) is the Lambert W function. Using equation (3), we write To prove the lower bound, note that the number of parsed distinct phrases c(l) is minimized when the sequence of length l consists of only one symbol that repeats. Let l k be the sum of the lengths of all such distinct strings of length less than or equal to k. Then, Thus, given a sequence of length l by enforcing l = k(k+1) 2 , we obtain the lower bound. Figure 3 illustrates the lower and upper bounds established in Theorem 2 against the sequence length for various alphabet sizes. Note that the lower bound on the number of distinct phrases is independent of the alphabet size.
While numerical experiments are not a substitute for the mathematical proof of Theorem 2 provided above, the reader may find it useful to understand the theorem in terms of a simple example. In Figures 4-6, we compare the theoretical bound with numerical results of simulation for binary i.i.d. sequences. In these experiments, for each value of P(X = 1), a thousand binary sequences are generated; then, the number of distinct phrases resulting from LZ78 parsing of each sequence is calculated, and hence, the average, minimum, and maximum of these counts are found and represented by error bars.
where the logarithm is base |X | and l = min 1,

Proof. The proof is similar to the proof in ([4] Lemma 13.5.3) or ([74] Theorem 2). Let
M |X |. In Theorem 2, we defined l k as the sum of the lengths of all distinct strings of length less than or equal to k, and we showed that for any given l such that l k ≤ l < l k+1 , we have c(l) ≤ c(l k ) + l−l k k+1 ≤ l k− 1 M−1 . Next, we bound the size of k. As such, we have l ≥ l k ≥ M k or, equivalently, k ≤ log l where the logarithm is base M. Additionally, Next, we analyze the properties of the number of distinct phrases c(l) resulting from LZ78-parsing of an |X |-ary sequence x l 1 = {x n , n = 1, . . . , l} when l is fixed. The error bar representation in Figure 4 shows the variation of c(l) when l is fixed. A possible explanation for such variations is that the statistical distribution of the pseudorandomly generated data are different from the theoretical distribution of the generating source. To elucidate this possibility, we enforce the exact matching of the source probability mass function and the empirical probability mass function of the generated data. Figure 5 represents the number of distinct phrases c(l) resulting from LZ78-parsing of a binary sequence of fixed length where the characteristic of the generating source and the generated data matches. As seen, there is still some variation around the average value of c(l). We can specify a distribution-dependent bound on c(l) when both l and the distribution of the source are fixed.
In ( [75] Theorem 1), for sequences generated from a memoryless source, c(l) is assumed to be a random variable with the following mean and variance: where h = − ∑ a∈X p a log p a is the entropy rate, and h 2 = ∑ a∈X p a log 2 p a with p a being the probability of symbol a ∈ X . Note that the approximations (4) are asymptotic as l → ∞. Below, we obtain a finite sample characterization of c(l). Consider an |X |-ary sequence x l 1 = {x n , n = 1, . . . , l} with fixed length l generated from a source with the probability mass function p(x). Here, the notations x l 1 and x l are used interchangeably. Let c(l, p) denote the number of distinct phrases resulting from LZ78-parsing of the sequence x l 1 of length l and the generating probability mass function is defined by p(x). In order to find a distribution-dependent bound on the number of distinct phrases in LZ78-based parsing of x l 1 , we note that since the generating distribution is not necessarily uniform, all the strings x n for n < l ∞ do not necessarily appear as parsed phrases. For instance, consider the binary case with P(X = 1) = 0.9. Then, it is very unlikely to have a string with multiple consecutive zeros in any parsing of a realization of the finite sequence x l . As such, using the Asymptotic Equipartition Properties (AEP) ( [4] Chapter 3) or Non-asymptotic Equipartition Properties (NEP) [76], we define the typical set A (n) with respect to p(x) as the set of subsequences x n ∈ X n of x l 1 with the property where h is the entropy. Then, we have therefore, A (n) ≤ 2 n(h+ ) . Let l k be the sum of the lengths of all the distinct strings x n in the set A (n) of length less than or equal to k. We write, where m 2 h+ . Therefore, l = 1 (m−1) 2 ((m − 1)k − 1)m k+1 + m can be solved for k which leads into an upper bound for c(l, p) as follows where α = m − 1 and β = (m − 1) 2 l − m. Therefore, the dependency of the c(l, p) upper bound on the distribution is only through the entropy. Figure 6 depicts the upper bound on c(l, p) for = 0.1.  Figure 6. Simulation of the probability-dependent upper bound c(l, p) for binary sequences of fixed length l = 100 with various probability parameters P(X = 1). For every P(X = 1), one thousand binary sequences of length l are generated. Error bars represent the maximum, minimum, and average number of distinct phrases.

Pattern Dictionary Parser versus LZ78 Parser
Given an |X |-ary sequence x l 1 = {x n , n = 1, . . . , l}, let c T (l) be the number of parsed phrases of x l 1 when the typical encoder (pattern dictionary with D max ) is used, and c A (l) be the number of parsed phrases of x l 1 when the atypical encoder (LZ78) is used. Clearly, The above bounds have asymptotic and non-asymptotic implications. The asymptotic analysis of the bounds in (5) suggests that as l → ∞, for a dictionary with fixed D max , we have l D max ≤ c T (l) − c A (l) ≤ l. This inequality implies the asymptotic dominance of the parser using a typical encoder. This is to be expected due to the asymptotic optimality of LZ78. However, the above inequality also implies a more interesting result: if D max > log l log(l ln|X |) as l → ∞, then c T (l) can be smaller than c A (l). The non-asymptotic behavior of the bounds in (5) is more relevant to the anomaly detection problem. These bounds suggest that for a fixed l and |X |, increasing D max has a vanishing effect on the possible range of the anomaly score. Additionally, the achieved bounds on c T (l) − c A (l) provide the range of values of the anomaly score. This facilitates the search for a data-dependent threshold for anomaly detection, as the search can be restricted to this range.

Atypicality Criterion for Detection of Anomalous Subsequences
Consider the problem of finding the atypical (anomalous) subsequences of a long sequence with respect to a trained pattern dictionary D. Suppose we are looking for an infrequent anomalous subsequence x n+l−1 n = {x n , n = n, . . . , n + l − 1} embedded in a test sequence {x n , n = 1, . . . , L} from the finite alphabet X . Using Equation (2) where log * (l) + τ is an additive penalty for not knowing in advance the start and end points of the anomalous sequence [2,3], and log * (l) = log l + log log l + . . . where the sum continues as long as the argument to the outer log is positive. Let L A = L A − τ. We propose the following atypicality criterion for detection of an anomalous subsequence: where τ can be treated as an anomaly detection threshold. In practice, τ can be set to ensure a false positive constraint, e.g., using bootstrap estimation of the quantiles in the training data.

Experiment
In this section, we illustrate the proposed pattern dictionary anomaly detection on a synthetic time series, known as Mackey-Glass [77], as well as on a real-world time series of physiological signals. In both experiments, first, the real-valued samples are discretized using a uniform quantizer [78], and then, anomaly detection methods are applied.

Anomaly Detection in Mackey-Glass Time Series
In this section, we illustrate the proposed anomaly detection method for the case of a chaotic Mackey-Glass (MG) time series that has an anomalous segment grafted into the middle of the sequence. MG time series are generated from a nonlinear time delay differential equation. The MG model was originally introduced to represent the appearance of complex dynamic in physiological control systems [77]. The nonlinear differential equation is of the form 1+x 10 (t−δ) , t ≥ 0, where a, b and δ are constants. For the training data, we generated 3000 samples of the MG time series with a = 0.2, b = 0.1, and δ = 17. For the test data, we normalized and embedded 500 samples of the MG time series with a = 0.4, b = 0.2, and δ = 17 inside 1000 samples of a MG time series generated from the same source as the training data, resulting in a test sequence of length 1500. Figure 7 shows a realization of the training data and the test data. The anomaly detection performance of our proposed pattern dictionary is evaluated. To illustrate the effect of the model parameter, i.e., the maximum depth D max , on the detection and compression performance of the pattern dictionary, we run two experiments. First, we use a 30-fold cross-validation on the training data (resulting in 30 sequences of length 100) and calculate the number of distinct parsed phrases against D max . Second, we train a pattern dictionary with various D max using the training data and then evaluate the sensitivity of detector of the anomalous subsequences in the test data using Equation (6) with τ = 0. In this experiment, the detection sensitivity (true positive rate) is defined as the ratio of number of samples correctly identified as anomalous over the total number of anomalous samples. Figure 8 illustrates the result of both experiments. As seen, after some point, increasing D max has diminishing effect on both detection sensitivity and the number of distinct parsed phrases. Note that this behavior is to be expected as it was suggested by the bounds in (5).
Next, we compare anomaly detection performance of our proposed pattern dictionary methods, PDD and PDA, with the nearest neighbors-based similarity (NNS) technique [7], the compression-based dissimilarity measure (CDM) method [12][13][14], Ziv-Merhav method (ZM) [48], and the threshold Sequence Time-Delay Embedding (t-STIDE) technique [8][9][10][11]. In this experiment, a window of length 100 is slid over the test data and each method measures the anomaly score (as described below) of the current subsequence with respect to the training data. The anomaly is detected when the score exceeds a threshold, determined to ensure a specified false positive rate. In the following, we compute AUC (area under the curve) of the ROC (receiver operating characteristic) and Precision-Recall curves as performance measures. In the following, we provide details of the implementation. Pattern Dictionary for Detection (PDD) First, the training data are used to create a pattern dictionary with D max = 40, as described in Section 4. Then, for each subsequence x 100 (the sliding window of length 100) of the test data, the anomaly score is computed as the codelength L x 100 of Equation (2) described in Section 4.3.
Pattern Dictionary Based Atypicality (PDA) Similar to PDD, first the training data are used to create a pattern dictionary with D max = 40, as described in Section 4. Then, for each subsequence x 100 of the test data, the anomaly score is the atypicality measure described in Section 5, i.e., L T x 100 − L A x 100 , the difference between the compression codelength of the test subsequence using typical encoder (pattern dictionary) and atypical encoder (LZ78).
Ziv-Merhav Method (ZM) [48] In this method, a cross-parsing procedure is used in which for each subsequence x 100 of the test data, the anomaly score is computed as the number of the distinct phrases of x 100 with respect to the training data.
Nearest Neighbors-Based Similarity (NNS) [7] In this method, a list S of all the subsequence of length 100 (the length of the sliding window) of the training data is created. Then, for each subsequence x 100 of the test data, the distance between x 100 and all the subsequences in the list S is calculated. Finally, the anomaly score of x 100 is its distance to the nearest neighbor in the list S.
Compression-Based Dissimilarity Measure (CDM) [12][13][14] In this method, given the training data x train , for each subsequence x 100 of the test data the anomaly score is where C(y, x) represents concatenation of sequences y and z, and L(x) is the size of the compressed version of the sequence x using any standard compression algorithm. The CDM anomaly score is close to 1 if the two sequence are not related, and smaller than one if the sequences are related.
Threshold Sequence Time-Delay Embedding (t-STIDE) [8][9][10][11] In this method, given l < 100, for each sub-subsequence x l of the subsequence x 100 of the test data, the likelihood score of x l is the normalized frequency of its occurrence in the training data, and the anomaly score of x 100 is one minus the average likelihood score of all its sub-subsequences of length l. In this experiment, various values of l are tested and the best performance is reported.
We compare the detection performance of the aforementioned methods by generating 200 test data sequences with different anomaly segments (the anomalous MG segments have different initializations in each test dataset). The detection results of comparisons are reported in Table 2. As seen, our proposed PDD and PDA methods outperform the rest, with ZM and CDM coming in third place. The effect of alphabet size of the quantized data (the resolution parameter of the uniform quantizer [78]) on anomaly detection performance is summarized in Table 3. Table 3 shows that our proposed PDD and PDA methods outperform in all three cases of data resolution. Since the parsing procedure of our proposed PD-based methods and the ZM method [48] are similar, it is of interest to compare the running time of these two methods. While the cross-parsing procedure of the ZM method was introduced as an on the fly process [48], we can also consider another implementation similar to our proposed PD by creating a codebook of all the subsequences of the training data prior to the parsing procedure. As such, in order to compare the running time of the dictionary/codebook creation and parsing procedure of our PD-based methods with the aforementioned two implementations of the ZM method, we use the same MG training data of length 3000, one test dataset of length 1500 while a sliding window of length 100 is slid over it for anomaly score calculation, and the PD-based method with D max = 40. Note that since a sliding window of length 100 over the test data is considered, for the codebook-based implementation of ZM, all the subsequences of the training data up to length 100 are extracted which make its codebook creation process significantly faster. Table 4 summarizes the running time comparison. As it can be seen, our PD-based method is faster in both dictionary/codebook creation and parsing process. Table 3. Comparison of anomaly detection methods for different cases of data resolutions: high resolution corresponds to an alphabet size of 90, medium resolution corresponds to an alphabet size of 45, and low resolution corresponds to an alphabet size of 10. In this table, µ ± σ representation is used where µ is the mean and σ is the standard deviation. The proposed PDA method achieves overall best performance (bold entries of table).

Infection Detection Using Physiological Signals
Finally, we apply the proposed pattern dictionary method to detect unusual patterns in physiological signals of two human subjects after exposure to a pathogen while only one of these subjects became symptomatically ill. The time series data were collected in a human viral challenge study that was performed in 2018 at the University of Virginia under a DARPA grant. Consented volunteers were recruited into this study following an IRB-approved protocol and the data was processed and analyzed at Duke University and the University of Michigan. The challenge study design and data collection protocols are described in [79]. Volunteers' skin temperature and heart rate were recorded by a wearable device (Empatica E4) over three consecutive days before and five consecutive days after exposure to a strain of human Rhinovirus (RV) pathogen. During this period, the wearable time series were continuously recorded while biospecimens (viral load) were collected daily. The infection status can be clinically detected by biospecimen samples, but in practice, the collection process of these types of biosamples can be invasive and costly. As such, here, we apply the proposed anomaly detection framework to the measured two-dimensional heart rate and temperature time series to detect unusual patterns after exposure with respect to the normal (healthy) baseline patterns.
In the preprocessing phase, we followed the wearable data preprocessing procedure described in [80]. Specifically, we first downsample the time series to one sample per minute by averaging. Then, we apply an outlier detection procedure to remove technical noise, e.g., sensor contact loss. After preprocessing, the two-dimensional space of temperature and heart rate time series is discretized using a two-dimensional uniform quantizer [78] with step size of 5 for heart rate and 0.5 for temperature, resulting in one-dimensional discrete sequence data. The first three days of data are used as the training data, and the PDA methods with maximum depth D max = 30 are used to learn the patterns in the training data. In order to detect anomalous patterns of the test data (the last five days), we used the result of Section 5.3 and the atypicality criterion of Equation (6), which requires choosing the threshold τ. While this threshold can be chosen freely, we selected it using cross-validation on the training data. Leave-one-out cross-validation over the training data generates an empirical null distribution of the PDA anomaly score function L T − L A . The threshold τ was chosen as the upper 99% quantile of this distribution. Figure 9 illustrates the result of anomaly detection on one subject who became infected as measured by viral shedding as shown in Figure 9C. All the anomalous patterns occur when the subject was shedding the virus. Figure 10 also depicts the result of anomaly detection on one subject who had a mild infection with a low level of viral shedding, as shown in Figure 10C. Note that in this case, no anomalous patterns were detected.   Figure 10. Anomaly detection using the proposed PDA method for a subject who had a mild infection with low level of viral shedding based on heart rate and temperature data collected from a wearable wrist sensor. Note that no anomaly has been detected: (a) heart rate, (b) temperature, and (c) infection level.

Conclusions
In this paper, we have developed a universal nonparametric model-free anomaly detection method for time series and sequence data using a pattern dictionary. We proved that using a multi-level dictionary that separates the patterns by their depth results in a shorter average indexing codelength in comparison to a uni-level dictionary that uses a uniform indexing approach. We illustrated that the proposed pattern dictionary method can be used as a stand-alone anomaly detector, or integrated with Tree-Structured Lempel-Ziv (LZ78) and incorporated into an atypicality framework. We developed novel nonasymptotic lower and upper bounds of the LZ78 parser and demonstrated that the nonasymptotic upper bound on the number of distinct phrases resulting from LZ78-parsing of an |X |-ary sequence can be explicitly derived in terms of the Lambert W function, an important theoretical result that is not trivial. We showed that the achieved non-asymptotic bounds on LZ78 and pattern dictionary determine the range of the anomaly score and the anomaly detection threshold. We also presented an empirical study in which the pattern dictionary approach is used to detect anomalies in physiological time series. In the future work, we will investigate the generalization of the context tree weighting methods to the general discrete case, using the pattern dictionary since the pattern dictionary handles sparsity well and is computationally less expensive when the alphabet size is large.  Institutional Review Board Statement: University of Michigan ethical review and approval were waived since only de-identified data artifacts from a previously approved human challenge study were made available to the co-authors for the analysis reported in Section 6.2 Information concerning provenance of the data, i.e., the human rhinovirus (RV) challenge study experiment and its IRB approved protocol, was reported in [79].

Informed Consent Statement:
The de-identified data used for our analysis in Section 6.2 came from a human rhinovirus (RV) challenge study in which informed consent was obtained from all subjects involved in the study. See the [79] for details.