Detecting Pattern Anomalies in Hydrological Time Series with Weighted Probabilistic Su ﬃ x Trees

: Anomalous patterns are common phenomena in time series datasets. The presence of anomalous patterns in hydrological data may represent some anomalous hydrometeorological events that are signiﬁcantly di ﬀ erent from others and induce a bias in the decision-making process related to design, operation and management of water resources. Hence, it is necessary to extract those “anomalous” knowledge that can provide valuable and useful information for future hydrological analysis and forecasting from hydrological data. This paper focuses on the problem of detecting anomalous patterns from hydrological time series data, and proposes an e ﬀ ective and accurate anomalous pattern detection approach, TFSAX_wPST , which combines the advantages of the Trend Feature Symbolic Aggregate approximation ( TFSAX ) and weighted Probabilistic Su ﬃ x Tree ( wPST ). Experiments with di ﬀ erent hydrological real-world time series are reported, and the results indicate that the proposed methods are fast and can correctly detect anomalous patterns for hydrological time series analysis, and thus promote the deep analysis and continuous utilization of hydrological time series data.


Introduction
In the era of Big Data, new satellite, space, airborne, shipborne and ground-based remote sensing systems, as well as Internet of Things (IoT) devices, are ubiquitous, producing data rapidly and continuously, which lead to hydrological time series being acquired at a breathless pace, both in size and variety [1,2]. However, due to measurement/manual operation errors, instrument failure, changes in natural laws caused by human activities or hydrological evolution, there is a large number of "anomalous" data in hydrological time series. Undoubtedly, those "anomalous" data will significantly affect the models related to flood forecasting and hydrological analysis, and lead to potentially incomplete or inaccurate results [3]. Therefore, detecting those "anomalies" in hydrological datasets is becoming an important and urgent task for hydrology and information researchers [4].
Anomalies are individuals that behave in an unexpected way or feature abnormal properties [5]. According to the literature [6,7], anomalies in time series can be divided into point anomalies and pattern anomalies, and the problem of finding those unexpected individual points or patterns is referred to as anomaly detection. For hydrological time series data, many researchers have proposed different anomaly detection algorithms from different application aspects to address "anomalous" variables [8][9][10][11][12][13][14][15]. However, those methods pay more attention to detect point anomalies to improve Those approaches have a lower fitting error and higher accuracy in the task of anomalous hydrological time series patterns mining, but how to choose a proper number (N) of feature points and the anomaly determination threshold will affect the detection accuracy and results of the algorithm. Wu et.al [35] used the quantile perturbation method (QPM) to reveal rainfall time series anomalies and changes over the Yellow River Basin due to the fragile ecosystem and rainfall-related disasters. The QPM method is a tool for analyzing extreme values and effective for the identification and analysis of extreme meteorological events. However, it is relatively weak for the detection of ordinary anomalous events.
For detecting pattern anomalies in time series, some existing methods in the literature reduce the problem to a point anomaly detection problem before solving it [11]. In some other methods, pattern anomalies are detected by using different machine learning and data mining approaches [3]. Markov Models (MM) [36] and their variants [29,31] are the popular machine learning approaches extensively used for pattern anomalies detection in time series. In the next section, we briefly study PST-based anomaly detection approaches.

PST-Based Anomaly Detection
The Markov model is a powerful finite state machine and widely used in sequence modeling. The Markov approaches are used in several studies to solve anomaly detection problems with the idea that an odd behavior might be represented not only by a single observation, but also by a series of consecutive observations [36]. The Probabilistic Suffix Tree (PST) is a compact representation of the Variable Order Markov Model (VMM) and uses a suffix tree as its storage structure. It originally comes from Probabilistic Suffix Automata (PSA) [37] and is believed to have a more memory efficient representation than the PSA. Hence, it has been used in several domains as an efficient approach for classifying sequences [38,39]. Figure 1 is an example of a PST corresponding to string s 1 : abbabbabaaba over the alphabet Σ = {a, b} and tree depth L = 2. In PST, each edge is labelled by a unique symbol σ in Σ. Each node has at most two (|Σ|) children and records a string representing a path from the node to the root. The node also records a probability distribution vector corresponding to the conditional probabilities of seeing a symbol right after the label string in the dataset [40]. PST models the normal behave using the maximum likelihood criterion likelihood ratio. For a given sequence S and its PST T, the total likelihood-ratio of the observations can be expressed mathematically as L = Pr (S|T). If the probability of the observation sequence given the model has the largest likelihood ratio (or exceeding a certain preset threshold θ), then an anomaly is detected [29,41]. 3 of 23 hydrological time series patterns mining, but how to choose a proper number (N) of feature points and the anomaly determination threshold will affect the detection accuracy and results of the algorithm. Wu et.al [35] used the quantile perturbation method (QPM) to reveal rainfall time series anomalies and changes over the Yellow River Basin due to the fragile ecosystem and rainfall-related disasters. The QPM method is a tool for analyzing extreme values and effective for the identification and analysis of extreme meteorological events. However, it is relatively weak for the detection of ordinary anomalous events.
For detecting pattern anomalies in time series, some existing methods in the literature reduce the problem to a point anomaly detection problem before solving it [11]. In some other methods, pattern anomalies are detected by using different machine learning and data mining approaches [3]. Markov Models (MM) [36] and their variants [29,31] are the popular machine learning approaches extensively used for pattern anomalies detection in time series. In the next section, we briefly study PST-based anomaly detection approaches.

PST-Based Anomaly Detection
The Markov model is a powerful finite state machine and widely used in sequence modeling. The Markov approaches are used in several studies to solve anomaly detection problems with the idea that an odd behavior might be represented not only by a single observation, but also by a series of consecutive observations [36]. The Probabilistic Suffix Tree (PST) is a compact representation of the Variable Order Markov Model (VMM) and uses a suffix tree as its storage structure. It originally comes from Probabilistic Suffix Automata (PSA) [37] and is believed to have a more memory efficient representation than the PSA. Hence, it has been used in several domains as an efficient approach for classifying sequences [38,39]. Figure 1 is an example of a PST corresponding to string s1: abbabbabaaba over the alphabet Σ = {a, b} and tree depth L = 2. In PST, each edge is labelled by a unique symbol σ in Σ. Each node has at most two (|Σ|) children and records a string representing a path from the node to the root. The node also records a probability distribution vector corresponding to the conditional probabilities of seeing a symbol right after the label string in the dataset [40]. PST models the normal behave using the maximum likelihood criterion likelihood ratio. For a given sequence S and its PST T, the total likelihood-ratio of the observations can be expressed mathematically as L = Pr (S|T). If the probability of the observation sequence given the model has the largest likelihood ratio (or exceeding a certain preset threshold θ), then an anomaly is detected [29,41]. However, PST is a sequence statistical model based on VMM. It inherits the shortcomings, such as losing important sequence information and reducing detection accuracy of the VMM in anomaly detection tasks [42]. For subsequence A: abbabbabaaba, and B: abbabaaaaaaa, the probability that seeing a right after event ab in subsequence A (PA(a|ab)) is equal to that of subsequence B (PB(a|ab)). However, the frequency of event ab occurring in A (PA(ab) = 4/11) is higher than that in B (PB(ab) = 2/11). Hence, if it only uses probability to represent the sequence for anomaly detection tasks, it may lead to an erroneous analysis result.
Therefore, we propose a novel wPST model to better descript and accurately distinguish different time series sequences; thus, we give here a formal definition for pattern anomaly based on wPST to define the detection boundary and detection target for our anomaly detection algorithm. However, PST is a sequence statistical model based on VMM. It inherits the shortcomings, such as losing important sequence information and reducing detection accuracy of the VMM in anomaly detection tasks [42]. For subsequence A: abbabbabaaba, and B: abbabaaaaaaa, the probability that seeing a right after event ab in subsequence A (P A (a|ab)) is equal to that of subsequence B (P B (a|ab)). However, the frequency of event ab occurring in A (P A (ab) = 4/11) is higher than that in B (P B (ab) = 2/11). Hence, if it only uses probability to represent the sequence for anomaly detection tasks, it may lead to an erroneous analysis result. Therefore, we propose a novel wPST model to better descript and accurately distinguish different time series sequences; thus, we give here a formal definition for pattern anomaly based on wPST to define the detection boundary and detection target for our anomaly detection algorithm.

A Novel Time Series Anomaly Detection Approach TFSAX_wPST
In this section, we propose a novel time series pattern anomaly detection approach, TFSAX_ wPST. Firstly, we propose a novel wPST model as the structure to store symbol sequences and give a formal definition for a hydrological time series pattern anomaly based on wPST. Then, we conduct a novel time series anomaly detection approach TFSAX_wPST to detect pattern anomalies within a given hydrological time series. TFSAX_wPST can be performed as in Figure 2.

A Novel Time Series Anomaly Detection Approach TFSAX_wPST
In this section, we propose a novel time series pattern anomaly detection approach, TFSAX_ wPST. Firstly, we propose a novel wPST model as the structure to store symbol sequences and give a formal definition for a hydrological time series pattern anomaly based on wPST. Then, we conduct a novel time series anomaly detection approach TFSAX_wPST to detect pattern anomalies within a given hydrological time series. TFSAX_wPST can be performed as in Figure 2. (1) Anomaly Definition: As the basis of anomaly detection, the anomaly definition determines the object of the detection algorithm, the accuracy and interpretability of detection results. Therefore, we give a pattern anomaly definition based on our wPST model.
(2) Anomaly Detection: Based on the wPST model and our previous research work, TFSAX, we propose a novel TFSAX_wPST algorithm to detect those patterns that meet our definition within given time series.

Time Series Pattern Anomaly Based wPST Model
The wPST model is an improvement of the PST model. It increases the model frequency weight of the subsequence corresponding to a node to distinguish different sequences accurately. For a given sequence, its wPST model can be defined as follows: where wi is the frequency weighting of the subsequence σ1, σ2…σi − 1. Figure 3 shows the wPST model of the sequence s1 in Figure 1. Compared to PST, each node in the figure stores the conditional probability distribution vector of the subsequent symbol as well as the frequency weight corresponding subsequence, and thus can better present the feature information of the sequence.  (1) Anomaly Definition: As the basis of anomaly detection, the anomaly definition determines the object of the detection algorithm, the accuracy and interpretability of detection results. Therefore, we give a pattern anomaly definition based on our wPST model.
(2) Anomaly Detection: Based on the wPST model and our previous research work, TFSAX, we propose a novel TFSAX_wPST algorithm to detect those patterns that meet our definition within given time series.

Time Series Pattern Anomaly Based wPST Model
The wPST model is an improvement of the PST model. It increases the model frequency weight of the subsequence corresponding to a node to distinguish different sequences accurately. For a given sequence, its wPST model can be defined as follows: where w i is the frequency weighting of the subsequence σ 1 , σ 2 . . . σ i − 1 . Figure 3 shows the wPST model of the sequence s 1 in Figure 1. Compared to PST, each node in the figure stores the conditional probability distribution vector of the subsequent symbol as well as the frequency weight corresponding subsequence, and thus can better present the feature information of the sequence.
where wi is the frequency weighting of the subsequence σ1, σ2…σi − 1. Figure 3 shows the wPST model of the sequence s1 in Figure 1. Compared to PST, each node in the figure stores the conditional probability distribution vector of the subsequent symbol as well as the frequency weight corresponding subsequence, and thus can better present the feature information of the sequence.  The representation of a TSP is a trend feature over a period of time, such as a rising, a stable or a falling subsequence. Moreover, a time series TS of length n can be treated as a plurality of subsequences of length m (m <<= n); each subsequence has its own trend feature as well as the overall trend features of the original sequence. Therefore, the pattern features between different subsequences are closely related to the temporal trend of the sequence. For these reasons, this paper gives the definition of the hydrological time series anomaly subsequence and time series pattern anomalies based on the physical mechanism as below.

Definition 1. Time series anomalous subsequence.
Given a time series TS, the event sequence set is Σ, the subsequence s, and the subsequent events σ of s (σ = suffix(s) & σ∈Σ). Let Pr min and MinCt represent the predefined minimum occurrence probability and minimum occurrence number for the conditional occurrence probability of σ under the condition of s, respectively. If the conditional occurrence probability of σ under the condition of s is satisfied: (1) Pr (σ|s) < Pr min and (2) occ_num(σ) ≤ MinCt, then, it can define sσ to be an anomalous subsequence on time series TS.
As it can be seen from Definition (1), for a given time series TS, the smaller the probability of sσ occurring under the condition of s occurring, the higher the anomaly probability of sσ is. Therefore, the top k subsequences with the smallest occurrence probability (or occurrence number) are the top_k anomaly subsequences.

Definition 2. Time series anomaly pattern.
The time series pattern anomaly is a pattern consisting of one or a series of consecutive anomalous subsequences within a given time series.

TFSAX Representation
TFSAX is our previous work to extend the SAX representation. TFSAX employs the sequence mean feature and trend feature to represent the time series, and thus overcomes the shortcomings of SAX that only uses the mean values to describe the original time series. It can be obtained as follows: Step 1: Normalization. Transform the original time series TS into the normalized time series TS with a mean of 0 and standard deviation of 1.
Step 2: Dimensionality reduction. Use the PAA (Piecewise Aggregate Approximation) approach [43] to divide time series TS into w equal-sized segments, then extract the mean feature and trend feature of each segment. Step 3: Discretization. According to the breakpoints lookup table, choose alphabet cardinality, obtain the Trend Feature Symbolic Representation of the original time series and discretize TS into symbols, denoted by TS.
For more detailed information about TFSAX, please refer to our previous research work [23].

wPST Construction
The construction of wPST starts from a subsequence with a single element. It first initializes an empty wPST containing only one root node, and then iterates through all possible subsequences, with the length varying from 1 to L. For each to be checked subsequence s, its occurrence time should be greater than MinCt; then, continue to search the subsequent symbol σ (σ∈Σ) of s and count the occurrence times of sσ and σs to determine whether σ needs to be constructed in wPST; otherwise, discriminate all the subsequences, beginning with s, as being pattern anomalies, and stops searching.
To improve the algorithm's efficiency, this approach uses the hash map data structure at each level to search and update the information of each node before and after a segment in the sequential database. For example, assume that we are at level 2 and the alphabet is {a, b}. Then, without pruning, the hash.keys at level 2 are all the possible 2 combinations of the alphabet:{aa, ab, ba, bb}. These combinations are lexicographically ordered, and the orders are stored at hash.values. Thus, the hash.values are {0, 1, 2, 3}. Now, the hash.key, hash.value combination is used as an index to the arrays A before and A after . Moreover, The arrays' size is the size of the alphabet, and the value of each element of A before is the current count of σs , where s is the element of hash.keys and σ is a character in the alphabet. Similarly, A after will store the count of s σ. Thus, we can update all the counts at each level of the tree after one scan. After a level of the wPST is constructed, the current hash map is destroyed and a new hash map for the next level is initialized. For example, assuming that we have a sequential database consisting of one sequence {abba}, in one scan we can update the counts of ab→ b, a← bb, bb→ a and b← ba. The formal description of constructing wPST is shown in Algorithm 1.

Input:
Sequence S, Maximum depth H Output: wPST T 1.
While k≤ H Do 5.

Candidate Anomalies Pattern Set Generation
Theoretically, the number of entries in the hash map on the Lth level of wPST is |Σ| L − 1 without pruning while wPST is constructed. Therefore, the total complexity of this implementation is O(NmL) [29]. Thus, we can prune the wPST by using Pr min or MinCt, which only increases the number of nodes exponentially at first a few levels and then decreases and converges to some constant C. However, using the Pr min or MinCt to perform the pruning operation during the wPST construction process may result in the loss of the anomalous subsequence. In order to solve the above problem, this paper proposes a strategy to put the sequence corresponding to the node whose occurrence number is less than MinCt or occurrence probability is less than Pr min into the candidate pattern anomalies set, and then analyzes and mines the candidate set to obtain pattern anomalies that meets the user's requirements.
During the wPST construction process, each node of the wPST stores the occurrence number of the string traversing from the root to this node, the occurrence probability of the node and the probability vector of the subsequent nodes. Hence, it only needs to analyze the node to determine if the sequence is a pattern anomaly or not during the wPST tree construction process; that is, if the sequence whose occurrence number is less than MinCt or the occurrence probability is less than Pr min , then it puts the node corresponding to the sequence and all its descendant nodes into the candidate pattern anomalies set. The formal description of the candidate anomaly mining algorithm Mine_Candidate_ Anomaly is shown in Algorithm 2.
Return caps

Pattern Anomalies Verification
Generally speaking, pattern anomalies have a higher probability coming from the candidate pattern anomalies set caps. However, there may be some special pattern anomalies that are not in the cpas; in addition, the cpas may also have partially redundant pattern anomalies. Hence, it is necessary to mine and analyze the cpas to obtain the pattern anomalies. The pattern anomalies mining mainly include: (1) Pattern filtering: for pattern s 1 corresponding to node u and pattern s 2 corresponding to node v in the cpas, if pattern s 2 is a substring of the pattern s 1 , add pattern s 1 to the pattern anomalies set pas.
(2) Pattern merging: for pattern s 1 corresponding to node u and pattern s 2 corresponding to node v in the cpas, if pattern s 1 and pattern s 2 have the longest common substring s 3 ; furthermore, s 3 is the true suffix of pattern s 1 and pattern s 2, then merge pattern s 1 ; (s 2 -s 3 ) becomes the new pattern s and is added to the pattern anomalies set pas; else add s 3 to pas, where -in (s 2 -s 3 ) means deleting pattern s 3 from s 2 .
(3) Pattern expanding: for each pattern σ i s i corresponding to node u i (1 ≤ i ≤ |Σ|) and its parents node u in wPST, if pattern sσ corresponding to node u does not include in caps but all σ i s i is included in caps, prune the parent node u corresponding to pattern sσ from wPST and add sσ to the pattern anomalies set pas.
(4) Pattern verifying: for each pattern sσ in cpas, if there exists an alphabet σ ∈Σ, make the occurrence number of sσ be equal to the occurrence number of sσσ ; that is, the probability that the event σ occurs after the event sσ is 1. Although the probability of event sσ is lower than MinCt, the occurrence of sσ represents the occurrence of a high confidence event sσσ . Therefore, sσ cannot be simply treated as a pattern anomaly and should be deeply verified and analyzed.
(5) Pattern sorting: the probability of sequences corresponding to nodes in different levels of wPST is different. Generally, if the symbol sequence has the same occurrence number, the closer a node is to the root, the higher the probability it is to be an anomalous pattern. Thus, for pattern s 1 corresponding to node u 1 and pattern s 2 corresponding to node u 2 in the pas, if the occurrence number of s 1 equals the occurrence number of s 2 and the node u 1 is closer to root than u 2 , it seems that s 1 has a higher probability to be an anomalous pattern than s 2 . Therefore, the top-k anomalous patterns can be gained by using this rule to sort the patterns in pas.
The formal description of the pattern anomalies mining process is shown in Algorithm 3.
Return aps

Algorithm Analysis
TFSAX_wPST can be divided into four parts: time series symbolization TFSAX, wPST construction, candidate pattern anomalies generation and pattern anomalies verification. For TFSAX, it has been proven to have a slightly more time complexity than SAX, but can achieve better symbolization. For the second part, it prunes the wPST by using Pr min or MinCt, thus the number of nodes only increases exponentially at first a few levels and then decreases and converges to some constant C [29]. Therefore, the total cost of constructing the wPST is approximately equal to O(NmL) where N is the total length of S, m is the average length of the sequence of S, αis a fixed integer, which depends upon the pruning parameters (usually less than 4), and C is a constant. Since the probability of pattern anomalies is small, the number of nodes included in the candidate pattern anomalies set is far less than |Σ| L −1 . Therefore, the time complexity required for candidate pattern anomalies generation and pattern anomalies mining will be much lower than that of wPST construction. Hence, the time complexity of TFSAX_wPST is mainly concentrated on TFSAX representation and wPST construction. Theoretically, the performance and efficiency of our algorithm are effectively improved compared to PST-based methods.

Case Studies
In this section, we conduct a set of experiments to show the accuracy and feasibility of our new approach. Here we choose different datasets (the NWIS dataset and Poyang Lake dataset) for experiments to prove the generality of our model.  Figure 4). This station is a typical hydrological station in the southern United States. Every year from July to October, the water level and discharge gradually decrease due to the influence of the Atlantic monsoon and will fall to the lowest value in September or October. With the increase of precipitation from November to June of the following year, the water level and discharge begin to rise and will reach the highest level from January to February. According to historical data, its monthly mean water level varies from 6.4 feet (September) to 9.3 feet (February) and its monthly mean discharge varies from 76 ft 3 /s (October) to 452 ft 3 /s (February). 9 of 23 begin to rise and will reach the highest level from January to February. According to historical data, its monthly mean water level varies from 6.4 feet (September) to 9.3 feet (February) and its monthly mean discharge varies from 76 ft 3 /s (October) to 452 ft 3 /s (February). Therefore, we used hourly water level, discharge and rainfall from 15 November 2010 to 15 November 2013 provided by the NWIS, USGS (NWIS: https://waterdata.usgs.gov/nwis/inventory/?site_no=02214075&agency_cd=USGS&amp), to verify the feasibility and effectiveness of this algorithm. The original water level data is shown in Figure 5. It should be mentioned that we used data quality control methods in the literature [15] and the point outlier detection method in the literature [13] to perform data quality control and point outlier detection on the original data set, so as to provide high-quality data for pattern anomalies detection.

TFSAX Representation
As can be seen from Figure 5, the water level data of Station 02214075 is smooth overall, but there are still some local "extreme" patterns that are obviously inconsistent with other patterns. In order to discover those "interesting" information in the series, we first use TFSAX to transform hydrological time series into symbolic sequence representation. In this experiment, we discretize the daily monitoring data (including 24 monitoring records with an interval of 1 h) into a mean symbol and a trend feature symbol representation.Therefore, the experimental time series will be divided into 1096 sequence segments (the total number of days from 15 November 2010 to 15 November 2013); that is, w = 1096. According to the TFSAX, the mean and trend feature of the given time series from  Therefore, we used hourly water level, discharge and rainfall from 15 November 2010 to 15 November 2013 provided by the NWIS, USGS (NWIS: https://waterdata.usgs.gov/nwis/inventory/ ?site_no=02214075&agency_cd=USGS&amp), to verify the feasibility and effectiveness of this algorithm. The original water level data is shown in Figure 5. It should be mentioned that we used data quality control methods in the literature [15] and the point outlier detection method in the literature [13] to perform data quality control and point outlier detection on the original data set, so as to provide high-quality data for pattern anomalies detection.    Figure 5. It should be mentioned that we used data quality control methods in the literature [15] and the point outlier detection method in the literature [13] to perform data quality control and point outlier

TFSAX Representation
As can be seen from Figure 5, the water level data of Station 02214075 is smooth overall, but there are still some local "extreme" patterns that are obviously inconsistent with other patterns. In order to discover those "interesting" information in the series, we first use TFSAX to transform hydrological time series into symbolic sequence representation. In this experiment, we discretize the daily monitoring data (including 24 monitoring records with an interval of 1 h) into a mean symbol and a trend feature symbol representation. Therefore, the experimental time series will be divided into 1096 sequence segments (the total number of days from 15 November 2010 to 15 November 2013); that is, w = 1096. According to the TFSAX, the mean and trend feature of the given time series from 15 November 2010 to 15 November 2013 can be represented by 5 and 7 symbols. That means the number of mean symbols is α = 5, and the number of trend feature symbols is α = 7. The character set represented by the mean and trend feature and their corresponding physical meanings are shown in Tables 1 and 2. After TFSAX symbolic representation, the original water level time series containing 1096 × 24 records will be symbolized into a symbol sequence containing 1096 symbols. The TFSAX representation of the water level time series is shown in Table 3. Table 3. The TFSAX representation of the daily mean water level of station 02214075.

Symbol Frequency Symbol Frequency Symbol Frequency Symbol Frequency Symbol Frequency
As shown in Table 3, we can find some pattern anomalies. For example, pattern E b means the water level is in state E (high water level between 13.48 and 16.56 feet) and the trend feature is in state b (the water level drops rapidly, and the trend feature angle is −45 • -30 • ) is a rare pattern in the time series. It will be added to the candidate pattern anomaly set according to TFSAX_wPST. In order to analyze the symbolized sequence, we used the wPST construction algorithm Build_wPST to construct the wPST for the sequences shown in Table 3. For the convenience of description, it uses A d with the constraint of the depth of tree L ≤ 3 to illustrate the construction of wPST. The constructed wPST is shown in Figure 6.   Figure 6. wPST tree representations for event Ad.

Detection Results and Analysis
From Figure 6, it can be inferred that the normal subsequent stages of state Ad should be Ac, Ad and Ae. Hence, it may indicate an anomalous event occurred if state Af or Bg appears right after state Ad. Here we use the algorithm Mine_Candidate_Anomaly and Mine_Anomaly to detect those patterns that meet the anomaly pattern definition in Definition (2).
In this experiment, we set parameters Prmin = 0.01 and MinCt = 5. When wPST is constructed, any node whose occurrence probability is less than Prmin or occurrence number is less than MinCt will be pruned from wPST. Moreover, the sequences corresponding to those nodes and all of its descendant nodes will be put into the candidate pattern anomalies set caps. For example, the node AfAd and all its descendant nodes will be pruned from the wPST shown in Figure 6, and all the sequences that contain patterns AdAf (e.g., AdAdAf) will be put into the caps.
After caps is generated, we will validate and analyze the patterns in it to determine the final pattern anomalies. Take AdBg for instance: we checked and analyzed the original data shown in Figure  5 and find that the pattern AdBgCg corresponds to the anomalous rain event from 15 August 2013 to 17 August 2013 in the Echeconnee Creek basin. On August 15, 16 and 17, the precipitation of this station was 1.41 in, 0.98 in and 1.45 in, respectively. As a result, the water level sharply rose 2.22 ft, 2.79 ft, 1.27 ft and 1.14 ft on 15 August-18 August, and the water level state represented by TFSAX changed drastically from Ad to Bg and then to Cg. Our method can quickly and accurately detect the pattern corresponding to this time series as an anomalous pattern. Similarly, the algorithm can also detect pattern anomalies, such as AcAe, AcAg, AeBd and AfBg, in a given time series.
The pattern anomalies detected by TFSAX_wPST on the Echeconnee Creek water level time series data set are shown in Table 4. The analysis of the results and corresponding events verifies that our method can effectively detect anomalous patterns, and thus provides high-quality data and knowledge support for subsequent hydrological analysis and application.

Detection Results and Analysis
From Figure 6, it can be inferred that the normal subsequent stages of state A d should be A c , A d and A e . Hence, it may indicate an anomalous event occurred if state A f or B g appears right after state A d . Here we use the algorithm Mine_Candidate_Anomaly and Mine_Anomaly to detect those patterns that meet the anomaly pattern definition in Definition (2).
In this experiment, we set parameters Pr min = 0.01 and MinCt = 5. When wPST is constructed, any node whose occurrence probability is less than Pr min or occurrence number is less than MinCt will be pruned from wPST. Moreover, the sequences corresponding to those nodes and all of its descendant nodes will be put into the candidate pattern anomalies set caps. For example, the node A f A d and all its descendant nodes will be pruned from the wPST shown in Figure 6, and all the sequences that contain patterns A d A f (e.g., A d A d A f ) will be put into the caps.
After caps is generated, we will validate and analyze the patterns in it to determine the final pattern anomalies. Take A d B g for instance: we checked and analyzed the original data shown in Figure 5 and find that the pattern A d B g C g corresponds to the anomalous rain event from 15 August 2013 to 17 August 2013 in the Echeconnee Creek basin. On August 15, 16 and 17, the precipitation of this station was 1.41 in, 0.98 in and 1.45 in, respectively. As a result, the water level sharply rose 2.22 ft, 2.79 ft, 1.27 ft and 1.14 ft on 15 August-18 August, and the water level state represented by TFSAX changed drastically from A d to B g and then to C g . Our method can quickly and accurately detect the pattern corresponding to this time series as an anomalous pattern. Similarly, the algorithm can also detect pattern anomalies, such as A c A e , A c A g , A e B d and A f B g , in a given time series.
The pattern anomalies detected by TFSAX_wPST on the Echeconnee Creek water level time series data set are shown in Table 4. The analysis of the results and corresponding events verifies that our method can effectively detect anomalous patterns, and thus provides high-quality data and knowledge support for subsequent hydrological analysis and application.

Research Area
Poyang Lake (Figure 7), the largest freshwater lake in China, is an important reservoir lake and an important international wetland in the mainstream of the Yangtze River. It is located on the south bank of the middle reaches of the Yangtze River and north of Jiangxi Province. The catchment has a subtropical wet climate characterized by an annual mean precipitation of 1680 mm and an annual mean evaporation of 1200 mm. Poyang Lake receives water flows from five rivers: Ganjiang, Fuhe, Xinjiang, Raohe and Xiushui, and exchanges water with the Yangtze River. Lake storage and lake level variation is controlled by catchment discharges and interactions with the Yangtze River [44]. From April to June each year, the lake experiences large water level fluctuations in response to the catchment's annual cycle of precipitation. From July to September, it is affected by the backflushing or backwatering of the Yangtze River to maintain high water levels. In the wet season (April to September), the water level rises and the lake coverage expands, covering an area of roughly 170 km from the north to the south and 17 km from the east to the west. The lake shrinks to little more than a river during the dry season (October to March), exposing extensive floodplains and wetland areas that support migrating waterfowls and a variety of invertebrate species [45].

Research Area
Poyang Lake (Figure 7), the largest freshwater lake in China, is an important reservoir lake and an important international wetland in the mainstream of the Yangtze River. It is located on the south bank of the middle reaches of the Yangtze River and north of Jiangxi Province. The catchment has a subtropical wet climate characterized by an annual mean precipitation of 1680 mm and an annual mean evaporation of 1200 mm. Poyang Lake receives water flows from five rivers: Ganjiang, Fuhe, Xinjiang, Raohe and Xiushui, and exchanges water with the Yangtze River. Lake storage and lake level variation is controlled by catchment discharges and interactions with the Yangtze River [44]. From April to June each year, the lake experiences large water level fluctuations in response to the catchment's annual cycle of precipitation. From July to September, it is affected by the backflushing or backwatering of the Yangtze River to maintain high water levels. In the wet season (April to September), the water level rises and the lake coverage expands, covering an area of roughly 170 km from the north to the south and 17 km from the east to the west. The lake shrinks to little more than a river during the dry season (October to March), exposing extensive floodplains and wetland areas that support migrating waterfowls and a variety of invertebrate species [45]. According to historical data, the multi-annual mean water level of Poyang Lake is 14.01 m; the monthly mean water level is highest in July (17.59 m) and lowest in January (10.52 m); and the highest water level appeared on 31 July 1998 (22.59 m), and the lowest appeared on 6 February 1963 (5.90 m). The Xingzi gauging station is the representative hydrological station of Poyang Lake and situated in the northern arm of the Lake at about 39 km from the Yangtze River. Typically, when the water level of Xingzi is below 11 m, it means that Poyang Lake has entered the dry season. Meanwhile, if the water level of Xingzi Station is above 19 m, it indicates that the water level of the Poyang Lake exceeds the warning line and is entering the flood season. The monthly mean lake water level at the Xingzi station from 1953 to 2009 is shown in Figure 8. We also used data quality control methods from the literature [15] and the point outlier detection method from the literature [13] to perform data quality control and point outlier detection on the original data set, so as to provide high-quality data for pattern anomalies detection. According to historical data, the multi-annual mean water level of Poyang Lake is 14.01 m; the monthly mean water level is highest in July (17.59 m) and lowest in January (10.52 m); and the highest water level appeared on 31 July 1998 (22.59 m), and the lowest appeared on 6 February 1963 (5.90 m). The Xingzi gauging station is the representative hydrological station of Poyang Lake and situated in the northern arm of the Lake at about 39 km from the Yangtze River. Typically, when the water level of Xingzi is below 11 m, it means that Poyang Lake has entered the dry season. Meanwhile, if the water level of Xingzi Station is above 19 m, it indicates that the water level of the Poyang Lake exceeds the warning line and is entering the flood season. The monthly mean lake water level at the Xingzi station from 1953 to 2009 is shown in Figure 8. We also used data quality control methods from the literature [15] and the point outlier detection method from the literature [13] to perform data quality control and point outlier detection on the original data set, so as to provide high-quality data for pattern anomalies detection.

TFSAX Representation
The monthly mean water level data of Xingzi Station is smooth overall in Figure 8, but there are still some local "extreme" patterns that are obviously inconsistent with other patterns. In order to discover those "interesting" patterns in this series, we first use TFSAX to transform the monthly mean water level data into a symbolic sequence representation.
In this experiment, we discretize the monthly statistics data (including 30 or 31 records with an interval of 1 day) into a mean symbol and a trend feature symbol representation.Therefore, the experimental time series will be divided into 684 sequence segments (the total number of month from January 1953 to December 2009); that is, w = 684. According to the TFSAX, both the mean and trend feature of the given time series from January 1953 to December 2009 would be represented by 5 symbols. That means the number of mean symbols is α = 5, and the number of trend feature symbols is α′ = 5. The character set represented by the mean and trend feature and their corresponding physical meanings are shown in Tables 5 and 6.

TFSAX Representation
The monthly mean water level data of Xingzi Station is smooth overall in Figure 8, but there are still some local "extreme" patterns that are obviously inconsistent with other patterns. In order to discover those "interesting" patterns in this series, we first use TFSAX to transform the monthly mean water level data into a symbolic sequence representation.
In this experiment, we discretize the monthly statistics data (including 30 or 31 records with an interval of 1 day) into a mean symbol and a trend feature symbol representation. Therefore, the experimental time series will be divided into 684 sequence segments (the total number of month from January 1953 to December 2009); that is, w = 684. According to the TFSAX, both the mean and trend feature of the given time series from January 1953 to December 2009 would be represented by 5 symbols. That means the number of mean symbols is α = 5, and the number of trend feature symbols is α = 5. The character set represented by the mean and trend feature and their corresponding physical meanings are shown in Tables 5 and 6. After the TFSAX symbolic representation, the original water level time series containing 684 × 30 records will be symbolized into a symbol sequence containing 684 symbols. The TFSAX representation of the water level time series is shown in Table 7.

wPST Construction
We can find some pattern anomalies in Table 7. For example, pattern B a means when the water level is in state B (dry season, water level is 8-11 m), the trend feature is in state a (the water level drops rapidly, and the trend feature angle is −90 • -−30 • ). It will be a rare pattern in the time series if the subsequent state of B a is C (normal, water level is between 11 and 15 m) and the subsequent trend feature of B a is e (water level rises rapidly, the trend feature angle is 30 • -90 • ). It will be added to the candidate pattern anomaly set according to TFSAX_wPST.
In order to analyze the symbolized sequence, we used the algorithm Build_wPST to construct the wPST for the sequences shown in Table 7. For the convenience of description, it uses B d under the constraint that the depth of tree L ≤ 3 to illustrate the construction of wPST. The constructed wPST is shown in Figure 9.

of 23
After the TFSAX symbolic representation, the original water level time series containing 684*30 records will be symbolized into a symbol sequence containing 684 symbols. The TFSAX representation of the water level time series is shown in Table 7.

wPST Construction
We can find some pattern anomalies in Table 7. For example, pattern Ba means when the water level is in state B (dry season, water level is 8-11 m), the trend feature is in state a (the water level drops rapidly, and the trend feature angle is −90°-−30°). It will be a rare pattern in the time series if the subsequent state of Ba is C (normal, water level is between 11 and 15 m) and the subsequent trend feature of Ba is e (water level rises rapidly, the trend feature angle is 30°-90°). It will be added to the candidate pattern anomaly set according to TFSAX_wPST.
In order to analyze the symbolized sequence, we used the algorithm Build_wPST to construct the wPST for the sequences shown in Table 7. For the convenience of description, it uses Bd under the constraint that the depth of tree L ≤ 3 to illustrate the construction of wPST. The constructed wPST is shown in Figure 9.

Detection Results and Analysis
From Figure 9, it can be inferred that state Bd (dry season, water level rises slowly) means the water level of Poyang Lake starts to rise slowly and its subsequent patterns is most likely to be Bd, Be and Ce. So, it may indicate that an anomalous event occurred if states Bb or Bc appears right after state Bd. In order to detect those patterns that meet the anomaly pattern definition in Definition (2), we set parameters Prmin = 0.02 and MinCt = 4.
When wPST is constructed, any node whose occurrence probability is less than Prmin or occurrence number is less than MinCt will be pruned from wPST. Moreover, the sequences corresponding to those nodes and all of its descendant nodes will be put into the candidate pattern anomalies set caps. For example, the node BcBd and all its descendant nodes will be pruned from the wPST shown in Figure 9. Meanwhile, all the sequences that contain pattern BdBc (e.g., BdBcBe) will be put into the caps.

Detection Results and Analysis
From Figure 9, it can be inferred that state B d (dry season, water level rises slowly) means the water level of Poyang Lake starts to rise slowly and its subsequent patterns is most likely to be B d , B e and C e . So, it may indicate that an anomalous event occurred if states B b or B c appears right after state B d . In order to detect those patterns that meet the anomaly pattern definition in Definition (2), we set parameters Pr min = 0.02 and MinCt = 4.
When wPST is constructed, any node whose occurrence probability is less than Pr min or occurrence number is less than MinCt will be pruned from wPST. Moreover, the sequences corresponding to those nodes and all of its descendant nodes will be put into the candidate pattern anomalies set caps. For example, the node B c B d and all its descendant nodes will be pruned from the wPST shown in Figure 9. Meanwhile, all the sequences that contain pattern B d B c (e.g., B d B c B e ) will be put into the caps.
After caps is generated, we will validate and analyze the patterns in it to determine the final pattern anomalies. Take pattern B d C e C c for instance: we checked and analyzed the original data shown in The pattern anomaly results detected by TFSAX_wPST on the Xingzi water level time series data set are shown in Table 8. The analysis of the results and corresponding events verifies that our method can effectively detect anomalous patterns, and thus provides high-quality data and knowledge support for subsequent hydrological analysis and application.

Analysis and Discussion
In order to verify the accuracy and efficiency of our method, we conducted three sets of comparative experiments. We first compared the construction efficiency and detection accuracy of the wPST and PST models. Then we compared the detection result of our algorithm with other different algorithms. Lastly, we compared the time complexity of our algorithm with other hydrological time series pattern anomaly detection algorithms. The following performance metrics-True Positive Rate (TPR), True Negative Rate (TNR), False Positive Rate (FPR), False Negative Rate (FNR), Accuracy, Precision, Recall and F1-score and Area Under the Curve (AUC) [46]-were used to evaluate the different approaches.

Construction Algorithm Comparison
Here, we first compared the construction efficiency between the wPST model and the traditional PST model on the Poyang Lake daily water level dataset. In this experiment, the performance is measured by the negative log-likelihood of the normal patterns given the observation of the anomalous patterns. Specifically, we constructed both wPST and PST models from the experimental data with Markov orders 1, 2, 3, 4 and 5. For each wPST/PST model, we calculate the negative log-likelihood P (s|T) of the experiment sequence s based on the given wPST/PST model T. The larger the negative log-likelihood value is, the more dissimilar are the compared sequences. We expect the dissimilarity between the anomalous patterns and the normal patterns to grow as the memory order grows.
Our results are summarized in Table 9. The empirical results indicate that the sizes of the wPST model are much smaller than that of the PST model as the order increases. For example, the 5th order PST model uses 138 states to characterize the experimental dataset, while the 5th order wPST model only uses 84 states. The negative log-likelihood is the same between a sequence given a wPST model and a PST model with the same order, since we eliminate nodes that have the same probabilities as their parent nodes when constructing the wPST models. Therefore, we prefer a wPST model over a PST model because it is purely data-driven, flexible, and takes less space. Note that the PST models can be pruned to remove some low probability nodes [29]; which will lead to information loss. Unlike PST, our approach prunes the low probability nodes and puts the sequence corresponding to those nodes and all its descendant nodes into the candidate pattern anomalies set caps, which can improve the accuracy and reduce the false detection rate of our algorithm. Table 10 shows the confusion matrix obtained when adjusting the threshold MinCt. Based on the detection performances, the FPR for the wPST model on Poyang Lake dataset is 3.8% when MinCt = 5, which has the best tradeoff between FPR and TPR. As a comparison, the FPR for the PST model on Poyang Lake dataset is 21.9% when MinCt = 5, which has the best tradeoff between FPR and TPR. In addition, the miss rates for the PST and wPST models are 14.3% and 2.3%, respectively. The miss rates and false alarm rates are both relatively low for the wPST model. The detection results show that our proposed TFSAX_wPST algorithm is able to detect anomalies with a higher performance than that of the PST model.

Anomaly Detection Results Comparison
We compared the detection results of our algorithm with PST-based [41], HMM-based [31], OCSVM [47], FP_SAX-based [33], and Distance-based [34] algorithms on the same datasets. The detection results computed by the different algorithms on the Poyang Lake dataset are presented in Table 11. All results reported were averaged over 10 runs of both the representation learning and detection models. The comparison results are displayed in the receiver operating characteristic (ROC) [48] curves shown in Figure 10. By convention, the ROC curve displays sensitivity (TPR) on the vertical axis against the complement of specificity (1 − specificity or FPR) on the horizontal axis. The ROC curve then demonstrates the characteristic reciprocal relationship between sensitivity and specificity, expressed as a tradeoff between the TPR and FPR. This configuration of the curve also facilitates calculation of the area beneath it as a summary index of the overall test performance. Therefore, the larger the area under the ROC curve, the better the performance of the technique is.   Figure 10 reveals the AUC obtained from the different algorithms. For experimental datasets, the AUC of the proposed algorithm are satisfactory and stable. These facts support the idea that our algorithm can effectively and accurately detect pattern anomalies and get better performances than that of OCSVM, HMM-based, PST-based, FP_SAX-based and Distance-based algorithms. This result is expected since the trend feature of the series is taken into account during the time series symbolization process TFSAX. Furthermore, we propose the improved probability suffix tree wPST to store the symbol sequence after TFSAX symbolization. Meanwhile, putting the symbol sequences pruned during the construction of wPST into caps rather than discarding them directly avoids information loss, which will improve the performance and efficiency of the algorithm.

Computational Complexity Comparison
One important aspect of anomaly detection is efficiency. In the hydrological field, it is important to ensure that the pattern anomalies are computed in a short amount of time and with a minimum delay. In this section, we compare the execution time of the TFSAX_wPST, FP_SAX-based [33] and Distance-based [34] algorithms along with the increase of the length of sequences on the Poyang Lake daily water level. The comparison results are shown in Table 12 and Figure 11.  Figure 10. AUC for the different algorithms. Figure 10 reveals the AUC obtained from the different algorithms. For experimental datasets, the AUC of the proposed algorithm are satisfactory and stable. These facts support the idea that our algorithm can effectively and accurately detect pattern anomalies and get better performances than that of OCSVM, HMM-based, PST-based, FP_SAX based and Distance-based algorithms. This result is expected since the trend feature of the series is taken into account during the time series symbolization process TFSAX. Furthermore, we propose the improved probability suffix tree wPST to store the symbol sequence after TFSAX symbolization. Meanwhile, putting the symbol sequences pruned during the construction of wPST into caps rather than discarding them directly avoids information loss, which will improve the performance and efficiency of the algorithm.

Computational Complexity Comparison
One important aspect of anomaly detection is efficiency. In the hydrological field, it is important to ensure that the pattern anomalies are computed in a short amount of time and with a minimum delay. In this section, we compare the execution time of the TFSAX_wPST, FP_SAX-based [33] and Distance-based [34] algorithms along with the increase of the length of sequences on the Poyang Lake daily water level. The comparison results are shown in Table 12 and Figure 11.  From Table 12 and Figure 11, we can see that the execution time of TFSAX_wPST is obviously less than that of the FP_SAX-based method [33] and Distance-based method [34]. The main reason is that the FP_SAX-based method and Distance-based methods need to measure distance between patterns and result in relatively high time complexity. As discussed in Section 3.3, the time complexity of our approach is mainly concentrated on TFSAX representation and wPST construction. The time of TFSAX symbolization is slightly better than that of FP_SAX, but our algorithm does not need to calculate the distance between patterns, so the time complexity is greatly improved.

Conclusions
In this paper we have conducted in-depth research on time series anomaly patterns and their detection algorithms; particularly, a detailed analysis of the framework, advantages and disadvantages, as well as an improvement strategy for the wPST-based approach. Combining with the field of hydrology, we proposed an effective and accurate anomaly pattern detection approach TFSAX_wPST for hydrological time series. At present, it mainly uses a distance-based approach to detect anomalous patterns in hydrological time series; however, the time complexity to calculate the distance between each pattern is very high. In this work, we combined symbolization (TFSAX) of time series with the VMM model (wPST). Then, a new approach that is suitable for hydrological time series anomalous pattern detection is put forward, which makes the detection results accurate and efficient.
There are some parts that remain to be improved in the future. Firstly, in the candidate pattern anomalies mining step, the threshold Pr min or MinCt to prune the wPST is based on the experience of previous experiments. In the future we should consider a more scientific way of evaluation, which achieves the optimal value of Pr min or MinCt. Secondly, compared to the fixed-length segmentation method TFSAX, how to use variable-length segmentation to represent time series for hydrological feature extraction is a more meaningful and interesting question. Finally, our approach mainly analyzes univariate time series anomalous pattern detection; therefore, how to apply this approach to detect multivariate hydrological time series anomalous patterns is a topic for future research.