# Detecting Pattern Anomalies in Hydrological Time Series with Weighted Probabilistic Suffix Trees


## Abstract


## 1. Introduction

## 2. Related Work

#### 2.1. Time Series Pattern Anomaly Detection

A time series TS can be represented as a sequence of patterns:

$TS = \langle (m_{1}, t_{1}), (m_{2}, t_{2}), \ldots, (m_{i}, t_{i}), \ldots, (m_{N}, t_{N}) \rangle$, $i = 1, 2, \ldots, N$

Here $(m_{i}, t_{i})$ indicates that the pattern of TS is $m_{1}$ during $0$–$t_{1}$, $m_{2}$ during $t_{1}$–$t_{2}$, and so on, up to $m_{N}$ during $t_{N-1}$–$t_{N}$.

Pattern anomaly detection then looks for the patterns $(m_{i}, t_{i})$ in a given time series TS that meet the definition of a pattern anomaly, which depends on the application area and purpose [32].

#### 2.2. PST-Based Anomaly Detection

Consider the example sequence abbabbabaaba over the alphabet Σ = {a, b} with tree depth L = 2 (Figure 1). In a PST, each edge is labelled by a single symbol σ in Σ. Each node has at most |Σ| (here, two) children and records a string that represents the path from the node to the root. The node also records a probability distribution vector holding the conditional probabilities of seeing each symbol right after the label string in the dataset [40]. A PST models normal behavior using the maximum likelihood criterion. For a given sequence S and its PST T, the total likelihood of the observations can be expressed mathematically as L = Pr(S|T). If the likelihood ratio of the observation sequence given the model is the largest (or exceeds a preset threshold θ), then an anomaly is detected [29,41].
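The conditional probabilities stored in the nodes, and the likelihood Pr(S|T) built from them, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes unsmoothed maximum-likelihood estimates and backs off to the longest suffix context that was actually observed; the function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def next_symbol_counts(train, context):
    """Count the symbols that directly follow each occurrence of `context`."""
    k = len(context)
    counts = Counter()
    for i in range(len(train) - k):
        if train[i:i + k] == context:
            counts[train[i + k]] += 1
    return counts

def cond_prob(train, context, symbol):
    """Unsmoothed estimate of Pr(symbol | context); 0 if the context is never followed."""
    counts = next_symbol_counts(train, context)
    total = sum(counts.values())
    return Fraction(counts[symbol], total) if total else Fraction(0)

def sequence_likelihood(train, seq, depth):
    """Pr(seq | T) as a product of conditionals over contexts of length <= depth."""
    likelihood = Fraction(1)
    for i, symbol in enumerate(seq):
        context = seq[max(0, i - depth):i]
        # Back off to shorter suffixes until the context has been observed.
        while context and not next_symbol_counts(train, context):
            context = context[1:]
        likelihood *= cond_prob(train, context, symbol)
    return likelihood

S1 = "abbabbabaaba"  # example sequence, alphabet {a, b}, depth 2
print(cond_prob(S1, "ab", "a"))
print(sequence_likelihood(S1, "aba", depth=2))
```

Exact fractions are used so that the estimates can be compared with hand-computed counts; a practical implementation would store the counts once in the tree instead of rescanning the training sequence.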

The conditional probability of a following ab in subsequence A, $P_{A}(a|ab)$, is equal to that in subsequence B, $P_{B}(a|ab)$. However, the event ab occurs more frequently in A ($P_{A}(ab) = 4/11$) than in B ($P_{B}(ab) = 2/11$). Hence, representing a sequence by conditional probabilities alone may lead to erroneous anomaly detection results.

## 3. A Novel Time Series Anomaly Detection Approach TFSAX_wPST

#### 3.1. Time Series Pattern Anomaly Based wPST Model

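The wPST $T$ assigns to a sequence $s$ the weighted probability

$P^{T}(\sigma_{i} \mid \sigma_{1}\sigma_{2}\ldots\sigma_{i-1}) \times w_{i}$

where $w_{i}$ is the frequency weighting of the subsequence $\sigma_{1}\sigma_{2}\ldots\sigma_{i-1}$.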

The wPST is illustrated for the same example sequence as in Figure 1. Compared to the PST, each node stores the conditional probability distribution vector of the subsequent symbol as well as the frequency weight of the corresponding subsequence, and thus better represents the feature information of the sequence.
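To make the extra information carried by a wPST node concrete, the sketch below stores a frequency weight next to the next-symbol distribution. It rests on several assumptions: the weight is taken as the number of occurrences divided by the number of length-k windows (len − k + 1), sequence A is assumed to be the example sequence abbabbabaaba (which yields 4/11 for ab), and sequence B is a hypothetical sequence constructed so that ab occurs twice. `WNode` and `make_node` are illustrative names only.

```python
from dataclasses import dataclass
from fractions import Fraction

@dataclass
class WNode:
    label: str        # context string s
    weight: Fraction  # frequency weight of s in the sequence
    probs: dict       # next-symbol distribution Pr(sigma | s)

def make_node(seq, s, alphabet):
    k = len(s)
    windows = len(seq) - k + 1  # assumption: one window per start position
    starts = [i for i in range(windows) if seq[i:i + k] == s]
    followers = [seq[i + k] for i in starts if i + k < len(seq)]
    probs = {
        a: Fraction(followers.count(a), len(followers)) if followers else Fraction(0)
        for a in alphabet
    }
    return WNode(s, Fraction(len(starts), windows), probs)

A = "abbabbabaaba"  # assumed to be sequence A
B = "abaabbbbbbbb"  # hypothetical sequence B
node_a = make_node(A, "ab", "ab")
node_b = make_node(B, "ab", "ab")

# Same conditional probability, different frequency weight.
print(node_a.probs["a"], node_a.weight)
print(node_b.probs["a"], node_b.weight)
```

Under these assumptions both nodes report the same Pr(a|ab), and only the weight separates the two sequences, which is the distinction the weighted model is meant to capture.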

**Definition 1.** $Pr_{min}$ and MinCt represent the predefined minimum occurrence probability and minimum occurrence number for the conditional occurrence probability of σ under the condition of s, respectively. If the conditional occurrence probability of σ under the condition of s satisfies:

- (1) Pr(σ|s) < $Pr_{min}$, and
- (2) occ_num(σ) ≤ MinCt,

**Definition 2.**

#### 3.2. TFSAX_wPST Algorithm

#### 3.2.1. TFSAX Representation

#### 3.2.2. wPST Construction

Each key of the hash map is associated with two arrays, $A_{before}$ and $A_{after}$. The size of each array is the size of the alphabet, and each element of $A_{before}$ holds the current count of σs′, where s′ is an element of hash.keys and σ is a character in the alphabet. Similarly, $A_{after}$ stores the count of s′σ. Thus, all the counts at each level of the tree can be updated in one scan. After a level of the wPST is constructed, the current hash map is destroyed and a new hash map for the next level is initialized. For example, for a sequential database consisting of the single sequence {abba}, one scan updates the counts of ab→b, a←bb, bb→a and b←ba. The formal description of constructing the wPST is shown in Algorithm 1.
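The one-scan update of the two count arrays can be sketched as below for a single level k. This is an illustration under assumptions: the hash map keys are created on the fly rather than taken from a precomputed candidate set, and `scan_level` is a hypothetical name.

```python
from collections import defaultdict

def scan_level(seq, k, alphabet):
    """One pass over seq that fills A_before and A_after for every length-k key."""
    index = {a: j for j, a in enumerate(alphabet)}
    a_before = defaultdict(lambda: [0] * len(alphabet))
    a_after = defaultdict(lambda: [0] * len(alphabet))
    for i in range(len(seq) - k + 1):
        s = seq[i:i + k]
        if i > 0:                      # symbol preceding s: count of sigma s
            a_before[s][index[seq[i - 1]]] += 1
        if i + k < len(seq):           # symbol following s: count of s sigma
            a_after[s][index[seq[i + k]]] += 1
    return dict(a_before), dict(a_after)

before, after = scan_level("abba", 2, "ab")
print(before)  # counts for a<-bb and b<-ba
print(after)   # counts for ab->b and bb->a
```

Each array is indexed by alphabet position, so `after["ab"][1]` is the number of times b follows ab.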

**Algorithm 1** wPST construction. Build_wPST(S, H)

Input: Sequence S, Maximum depth H

Output: wPST T

1. Initialize: T ← root; k = 0;
2. k = 1, S_{1} ← {σ | σ∈Σ ∧ occ_count(σ) > 0}
3. HM_{1} ← HASHMAP(S_{1})
4. While k ≤ H Do
5. ForEach (s′∈S_{k})
6. A_{before}[|Σ|], A_{after}[|Σ|] ← 0;
7. For i = 1 to len(S) − k + 1
8. ForEach (s_{[i,i + k−1]} ∈ S)
9. If s_{[i,i + k−1]} ∈ HM_{k}.keys then
10. Update(occ_times(s_{[i,i + k−1]}));
11. ForEach (σ∈Σ || σ′∈Σ)
12. If (s[i + k] = σ) then Update(A_{after}(s[i + k]));
13. If (s[i−1] = σ′) then Update(A_{before}(s[i−1]));
14. ForEach (s′∈S_{k})
15. T.Add(represent(u, s′));
16. w(represent(u, s′)) = occ_times(s′)/(len(S) − k);
17. ForEach (σ∈Σ)
18. compute Pr(σ|s′) using A_{after};
19. smooth Pr(σ|s′);
20. Mine_Candidate_Anomaly(T, MinCt, Pr_{min});
21. HM_{k + 1} ← HASHMAP(S_{k + 1}); k ← k + 1;
22. Return T
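A compact, runnable sketch of the level-by-level construction is given below. It is not a line-by-line transcription of Algorithm 1: smoothing and the call to Mine_Candidate_Anomaly are omitted, a single input sequence is assumed, and the weight divides by the number of length-k windows (len(S) − k + 1), which is an assumption that reproduces the 4/11 value quoted for ab in Section 2.2, whereas line 16 of the listing divides by len(S) − k.

```python
from collections import Counter, defaultdict
from fractions import Fraction

def build_wpst(seq, max_depth):
    """Return {context: (weight, {symbol: Pr(symbol | context)})}."""
    alphabet = sorted(set(seq))
    tree = {}
    for k in range(1, max_depth + 1):
        windows = len(seq) - k + 1
        if windows <= 0:
            break
        occ = Counter()               # occurrence count of each length-k context
        after = defaultdict(Counter)  # per-level map, rebuilt for every k
        for i in range(windows):      # one scan of the sequence per level
            s = seq[i:i + k]
            occ[s] += 1
            if i + k < len(seq):
                after[s][seq[i + k]] += 1
        for s, n in occ.items():
            total = sum(after[s].values())
            probs = {
                a: Fraction(after[s][a], total) if total else Fraction(0)
                for a in alphabet
            }
            tree[s] = (Fraction(n, windows), probs)
    return tree

tree = build_wpst("abbabbabaaba", max_depth=2)
for label, (weight, probs) in sorted(tree.items()):
    print(label, weight, probs)
```

The per-level dictionaries play the role of the hash map that is discarded and re-initialized after each level.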

#### 3.2.3. Candidate Anomalies Pattern Set Generation

Without pruning, the number of nodes grows to the order of $|\Sigma|^{L-1}$ while the wPST is constructed. Therefore, the total complexity of this implementation is $O(NmL) + O(L \times |\Sigma|^{L-1})$ [29]. The wPST can be pruned using $Pr_{min}$ or MinCt, so that the number of nodes increases exponentially only in the first few levels and then decreases and converges to some constant C. However, pruning with $Pr_{min}$ or MinCt during wPST construction may discard anomalous subsequences. To solve this problem, this paper proposes a strategy that puts the sequence corresponding to any node whose occurrence number is less than MinCt or whose occurrence probability is less than $Pr_{min}$ into the candidate pattern anomalies set, and then analyzes and mines the candidate set to obtain the pattern anomalies that meet the user’s requirements.

If the occurrence number of a sequence is less than MinCt or its occurrence probability is less than $Pr_{min}$, the algorithm puts the node corresponding to the sequence and all its descendant nodes into the candidate pattern anomalies set. The formal description of the candidate anomaly mining algorithm Mine_Candidate_Anomaly is shown in Algorithm 2.

**Algorithm 2** Candidate anomaly pattern mining. Mine_Candidate_Anomaly(wPST T, int MinCt, real Pr_{min})

Input: wPST T, MinCt, Pr_{min}

Output: candidate pattern anomaly set caps

1. Initialize: caps ← ∅
2. ForEach represent(u, X) ∈ T
3. occ_times(u).Cal(); Pr(suffix(u)).Cal();
4. If (occ_times(u) < MinCt || Pr(u) < Pr_{min})
5. caps.Add(represent(u, X));
6. caps.Add(descendants(represent(u, X)));
7. T.Prune(represent(u, X));
8. T.Prune(descendants(represent(u, X)));
9. Return caps
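The candidate mining step can be sketched on a flat dictionary representation of the tree. The representation, the toy counts and the function name are assumptions made for illustration; the flagging condition uses "or", as in line 4 of Algorithm 2, and descendants are taken to be the longer labels that end in the flagged label, as is usual for suffix trees.

```python
def mine_candidate_anomaly(tree, min_ct, pr_min):
    """tree maps a node label to (occurrence count, occurrence probability)."""
    flagged = [s for s, (occ, pr) in tree.items() if occ < min_ct or pr < pr_min]
    caps = set()
    for s in flagged:
        caps.add(s)
        # Descendants of s: longer labels that have s as a suffix.
        caps.update(t for t in tree if t != s and t.endswith(s))
    pruned = {s: v for s, v in tree.items() if s not in caps}
    return caps, pruned

# Hypothetical node statistics, not taken from the case studies.
toy_tree = {
    "a": (50, 0.50), "b": (48, 0.48), "c": (2, 0.02),
    "ab": (20, 0.20), "ac": (1, 0.01), "bb": (3, 0.03),
}
caps, pruned = mine_candidate_anomaly(toy_tree, min_ct=5, pr_min=0.01)
print(sorted(caps))
print(sorted(pruned))
```

The function returns both the candidate set and the pruned tree, so that rare sequences are kept for later verification instead of being lost when the tree is pruned.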

#### 3.2.4. Pattern Anomalies Verification

For a pattern $s_{1}$ corresponding to node u and a pattern $s_{2}$ corresponding to node v in caps, if pattern $s_{2}$ is a substring of pattern $s_{1}$, add pattern $s_{1}$ to the pattern anomalies set pas.

For a pattern $s_{1}$ corresponding to node u and a pattern $s_{2}$ corresponding to node v in caps, if $s_{1}$ and $s_{2}$ have a longest common substring $s_{3}$ and, furthermore, $s_{3}$ is a proper suffix of both $s_{1}$ and $s_{2}$, then merge $s_{1}$ and $(s_{2} - s_{3})$ into a new pattern s′ and add s′ to the pattern anomalies set pas; otherwise add $s_{3}$ to pas. Here ′−′ in $(s_{2} - s_{3})$ means deleting pattern $s_{3}$ from $s_{2}$.

For patterns $\sigma_{i}s_{i}$ corresponding to nodes $u_{i}$ ($1 \le i \le |\Sigma|$) and their parent node u in the wPST, if the pattern sσ corresponding to node u is not included in caps but all $\sigma_{i}s_{i}$ are included in caps, prune the parent node u corresponding to pattern sσ from the wPST and add sσ to the pattern anomalies set pas.

For a pattern $s_{1}$ corresponding to node $u_{1}$ and a pattern $s_{2}$ corresponding to node $u_{2}$ in pas, if the occurrence number of $s_{1}$ equals the occurrence number of $s_{2}$ and node $u_{1}$ is closer to the root than $u_{2}$, then $s_{1}$ is considered more likely to be an anomalous pattern than $s_{2}$. Therefore, the top-k anomalous patterns can be obtained by using this rule to sort the patterns in pas.

**Algorithm 3** Anomaly pattern mining. Mine_Anomaly(CAPS caps)

Input: candidate pattern anomaly set caps

Output: pattern anomaly set pas

1. Initialize: pas ← ∅
2. Pattern_Filter(caps);
3. Pattern_Merge(caps);
4. Pattern_Extend(caps);
5. Pattern_Valid(pas);
6. Pattern_Sort(pas);
7. Return pas
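Two of the verification rules described above, the substring-based filter and the final ranking, can be sketched as follows. The sketch makes assumptions that the text does not spell out: patterns are plain strings with one character per symbol, a pattern does not count as a substring of itself, and the ranking orders patterns by ascending occurrence number before applying the stated tie-break on node depth. The function names and toy data are hypothetical.

```python
def pattern_filter(caps):
    """Keep s1 when some other candidate s2 is a substring of s1."""
    return {s1 for s1 in caps if any(s2 != s1 and s2 in s1 for s2 in caps)}

def pattern_sort(pas, occ, depth):
    """Rank patterns; on equal occurrence numbers, nodes closer to the root come first."""
    return sorted(pas, key=lambda s: (occ[s], depth[s], s))

# Hypothetical candidate set and statistics.
toy_caps = {"ac", "dac", "bb", "abba", "q"}
pas = pattern_filter(toy_caps)
occ = {"dac": 2, "abba": 2}
depth = {s: len(s) for s in pas}   # node depth taken as label length
ranked = pattern_sort(pas, occ, depth)
top_k = ranked[:1]
print(ranked, top_k)
```

Taking a slice of the ranked list yields the top-k anomalous patterns.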

#### 3.3. Algorithm Analysis

The wPST is pruned using $Pr_{min}$ or MinCt, so the number of nodes increases exponentially only in the first few levels and then decreases and converges to some constant C [29]. Therefore, the total cost of constructing the wPST is approximately $O(NmL) + O(L \times |\Sigma|^{\alpha}) + O(LC)$, where N is the total length of S, m is the average length of the sequences in S, α is a fixed integer that depends on the pruning parameters (usually less than 4), and C is a constant. Since pattern anomalies are rare, the number of nodes in the candidate pattern anomalies set is far less than $|\Sigma|^{L-1}$. Therefore, the time required for candidate pattern anomaly generation and pattern anomaly mining is much lower than that of wPST construction, and the time complexity of TFSAX_wPST is dominated by TFSAX representation and wPST construction. In theory, pruning therefore makes the algorithm more efficient than PST-based methods that build the unpruned tree.

## 4. Case Studies

#### 4.1. NWIS Dataset

#### 4.1.1. Research Area

… ft³/s (October) to 452 ft³/s (February).

#### 4.1.2. TFSAX Representation

#### 4.1.3. wPST Construction

For example, $E_{b}$, which means that the water level is in state E (high water level, between 13.48 and 16.56 feet) and the trend feature is in state b (the water level drops rapidly, with a trend feature angle of −45° to −30°), is a rare pattern in the time series. It will be added to the candidate pattern anomaly set by TFSAX_wPST. To analyze the symbolized sequence, we used the wPST construction algorithm Build_wPST to construct the wPST for the sequences shown in Table 3. For convenience of description, we use $A_{d}$ with the tree depth constrained to L ≤ 3 to illustrate the construction of the wPST. The constructed wPST is shown in Figure 6.
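The mapping from a water level and a trend angle to a combined symbol such as E_b can be sketched as below. The breakpoints follow the NWIS symbol tables of this article, with state b assumed to span −45° to −30°; how the trend angle itself is computed is not covered here, so it is taken as an input. Boundary handling and the function name are assumptions.

```python
from bisect import bisect_left

LEVEL_UPPER = [7.45, 9.43, 10.39, 13.44]  # upper bounds of states A, B, C, D (ft)
LEVEL_STATES = "ABCDE"
ANGLE_UPPER = [-45, -30, -5, 5, 30, 45]   # upper bounds of states a..f (degrees)
TREND_STATES = "abcdefg"

def tfsax_symbol(level_ft, angle_deg):
    """Combine the water-level state and the trend state into one symbol."""
    level = LEVEL_STATES[bisect_left(LEVEL_UPPER, level_ft)]
    trend = TREND_STATES[bisect_left(ANGLE_UPPER, angle_deg)]
    return f"{level}_{trend}"

print(tfsax_symbol(14.0, -40.0))  # high water level, dropping rapidly
print(tfsax_symbol(6.0, 0.0))     # low water level, stable
```

A value that falls exactly on a breakpoint is assigned to the lower state in this sketch.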

#### 4.1.4. Detection Results and Analysis

The states that follow $A_{d}$ should be $A_{c}$, $A_{d}$ and $A_{e}$. Hence, it may indicate that an anomalous event occurred if state $A_{f}$ or $B_{g}$ appears right after state $A_{d}$. Here we use the algorithms Mine_Candidate_Anomaly and Mine_Anomaly to detect the patterns that meet the anomaly pattern definition in Definition (2).

The parameters are set to $Pr_{min}$ = 0.01 and MinCt = 5. When the wPST is constructed, any node whose occurrence probability is less than $Pr_{min}$ or whose occurrence number is less than MinCt is pruned from the wPST. Moreover, the sequences corresponding to those nodes and all of their descendant nodes are put into the candidate pattern anomalies set caps. For example, the node $A_{f}A_{d}$ and all its descendant nodes are pruned from the wPST shown in Figure 6, and all the sequences that contain the pattern $A_{d}A_{f}$ (e.g., $A_{d}A_{d}A_{f}$) are put into caps.
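As a rough illustration of how these thresholds act, the sketch below applies them to the single-symbol frequencies listed in the NWIS symbol frequency table of this article. It treats the relative frequency of a symbol as its occurrence probability, which is an assumption; the actual algorithm applies the test to nodes of the wPST, not only to single symbols.

```python
# Single-symbol frequencies from the NWIS symbol frequency table.
FREQ = {
    "E_b": 1, "E_e": 1, "E_c": 1, "A_b": 2, "E_f": 2, "D_e": 3,
    "C_c": 3, "C_f": 3, "E_g": 3, "A_g": 4, "D_b": 4, "C_b": 4,
    "E_a": 5, "D_c": 5, "D_f": 6, "B_a": 6, "C_a": 9, "C_g": 9,
    "B_f": 11, "A_f": 12, "B_b": 13, "D_g": 13, "D_a": 7, "B_g": 8,
    "B_d": 33, "B_e": 37, "A_e": 103, "B_c": 125, "A_c": 149, "A_d": 495,
}
PR_MIN, MIN_CT = 0.01, 5

total = sum(FREQ.values())
below_count = {s for s, n in FREQ.items() if n < MIN_CT}
below_prob = {s for s, n in FREQ.items() if n / total < PR_MIN}
candidates = below_count | below_prob  # "or" condition, as in the text

print(total, len(below_count), len(candidates))
```

With these values every symbol that fails the count test also fails the probability test, so the probability threshold is the binding one at the first level.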

Take the pattern $A_{d}B_{g}$ for instance: we checked the original data shown in Figure 5 and found that the pattern $A_{d}B_{g}C_{g}$ corresponds to the anomalous rain event from 15 August 2013 to 17 August 2013 in the Echeconnee Creek basin. On 15, 16 and 17 August, the precipitation at this station was 1.41 in, 0.98 in and 1.45 in, respectively. As a result, the water level rose sharply by 2.22 ft, 2.79 ft, 1.27 ft and 1.14 ft on 15–18 August, and the water level state represented by TFSAX changed drastically from $A_{d}$ to $B_{g}$ and then to $C_{g}$. Our method detects the pattern corresponding to this part of the time series as an anomalous pattern. Similarly, the algorithm also detects pattern anomalies such as $A_{c}A_{e}$, $A_{c}A_{g}$, $A_{e}B_{d}$ and $A_{f}B_{g}$ in the given time series.

#### 4.2. Poyang Lake Data Set

#### 4.2.1. Research Area

#### 4.2.2. TFSAX Representation

#### 4.2.3. wPST Construction

For example, $B_{a}$ means that the water level is in state B (dry season, water level of 8–11 m) and the trend feature is in state a (the water level drops rapidly, with a trend feature angle of −90° to −30°). It is a rare pattern in the time series if the subsequent state of $B_{a}$ is C (normal, water level between 11 and 15 m) and the subsequent trend feature of $B_{a}$ is e (the water level rises rapidly, with a trend feature angle of 30° to 90°). Such a pattern will be added to the candidate pattern anomaly set by TFSAX_wPST.

For convenience of description, we use $B_{d}$ under the constraint that the tree depth L ≤ 3 to illustrate the construction of the wPST. The constructed wPST is shown in Figure 9.

#### 4.2.4. Detection Results and Analysis

State $B_{d}$ (dry season, water level rises slowly) means that the water level of Poyang Lake starts to rise slowly, and its subsequent patterns are most likely to be $B_{d}$, $B_{e}$ and $C_{e}$. So, it may indicate that an anomalous event occurred if state $B_{b}$ or $B_{c}$ appears right after state $B_{d}$. To detect the patterns that meet the anomaly pattern definition in Definition (2), we set the parameters $Pr_{min}$ = 0.02 and MinCt = 4.

When the wPST is constructed, any node whose occurrence probability is less than $Pr_{min}$ or whose occurrence number is less than MinCt is pruned from the wPST. Moreover, the sequences corresponding to those nodes and all of their descendant nodes are put into the candidate pattern anomalies set caps. For example, the node $B_{c}B_{d}$ and all its descendant nodes are pruned from the wPST shown in Figure 9. Meanwhile, all the sequences that contain the pattern $B_{d}B_{c}$ (e.g., $B_{d}B_{c}B_{e}$) are put into caps.

Take the pattern $B_{d}C_{e}C_{c}$ for instance: we checked the original data shown in Figure 8, which show that the pattern $B_{d}C_{e}C_{c}$ corresponds to the flood event from April to August 1974 in the Xingzi water level time series. Due to the influence of upstream inflow from Ganjiang, Fuhe, Xinjiang, Raohe and Xiushui during the rainy season, the monthly mean water level at Xingzi Station soared from 10.13 m in April 1974 to 14.21 m in May and dropped slightly to 14.05 m in June; then, in July, it rose to 18.41 m (the highest water level was 20.1 m). Our method detects the pattern corresponding to this part of the time series as an anomalous pattern. Similarly, our algorithm also detects other pattern anomalies, such as $B_{d}B_{e}B_{b}$, corresponding to drought events at Poyang Lake from September 2006 to May 2007, and $B_{d}B_{c}C_{e}$, corresponding to drought events at Poyang Lake from December 2007 to January 2008.

#### 4.3. Analysis and Discussion

#### 4.3.1. Construction Algorithm Comparison

#### 4.3.2. Anomaly Detection Results Comparison

#### 4.3.3. Computational Complexity Comparison

## 5. Conclusions

First, the choice of $Pr_{min}$ or MinCt for pruning the wPST is based on experience from previous experiments. Future work should consider a more principled evaluation method that yields the optimal value of $Pr_{min}$ or MinCt. Secondly, compared to the fixed-length segmentation used by TFSAX, how to use variable-length segmentation to represent time series for hydrological feature extraction is a more meaningful and interesting question. Finally, our approach mainly addresses anomalous pattern detection in univariate time series; how to apply it to multivariate hydrological time series is a topic for future research.

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## References

1. Chen, L.; Wang, L. Recent advance in earth observation big data for hydrology. Big Earth Data **2018**, 2, 86–107.
2. Guo, H.; Wang, L.; Chen, F.; Liang, D. Scientific big data and digital earth. Chin. Sci. Bull. **2014**, 59, 5066–5073.
3. Azimi, S.; Moghaddam, M.A.; Monfared, S.A. Anomaly Detection and Reliability Analysis of Groundwater by Crude Monte Carlo and Importance Sampling Approaches. Water Resour. Manag. **2018**, 32, 4447–4467.
4. Rougé, C.; Ge, Y.; Cai, X. Detecting gradual and abrupt changes in hydrological records. Adv. Water Resour. **2013**, 53, 33–44.
5. Hawkins, D.M. Identification of Outliers; Chapman and Hall: London, UK, 1980.
6. Chandola, V.; Banerjee, A.; Kumar, V. Anomaly Detection: A Survey. ACM Comput. Surv. CSUR **2009**, 41, 1–58.
7. Gupta, M.; Gao, J.; Aggarwal, C.; Han, J. Outlier detection for temporal data. Synth. Lect. Data Min. Knowl. Discov. **2014**, 5, 1–129.
8. USGS. Interagency Advisory Committee on Water Data. In Guidelines for Determining Flood Flow Frequency: Bulletin 17 B; U.S. Geological Survey, Office of Water Data Coordination: Reston, VA, USA, 1982.
9. Stedinger, J.R.; Griffis, V.W. Flood frequency analysis in the united states: Time to update. J. Hydrol. Eng. **2008**, 13, 199–204.
10. Chebana, F.; Daboniang, S.; Ouarda, T.B. Exploratory functional flood frequency analysis and outlier detection. Water Resour. Res. **2012**, 48, 1–20.
11. Sarraf, A.P. Flood outlier detection using PCA and effect of how to deal with them in regional flood frequency analysis via L-moment method. Water Resour. **2015**, 42, 448–459.
12. Amin, M.T.; Rizwan, M.; Alazba, A.A. Comparison of mixed distribution with EV1 and GEV components for analyzing hydrologic data containing outlier. Environ. Earth Sci. **2015**, 73, 1369–1375.
13. Yu, Y.; Zhu, Y.; Li, S.; Wan, D. Time series outlier detection based on sliding window prediction. Math. Probl. Eng. **2014**.
14. Ng, W.W.; Panu, U.S.; Lennox, W.C. Chaos based analytical techniques for daily extreme hydrological observations. J. Hydrol. **2007**, 342, 17–41.
15. Zhao, Q.; Zhu, Y.; Wan, D.; Yu, Y.; Cheng, X. Research on the Data-Driven quality control method of hydrological time series data. Water **2018**, 10, 1712.
16. Nyeko-Ogiramoi, P.; Willems, P.; Ngirane-Katashaya, G. Trend and variability in observed hydrometeorological extremes in the Lake Victoria basin. J. Hydrol. **2013**, 489, 56–73.
17. Wang, C.; Zhao, Z.; Gong, L.; Zhu, L.; Liu, Z.; Cheng, X. A distributed anomaly detection system for in-vehicle network using HTM. IEEE Access **2018**, 6, 9091–9098.
18. Van Vlasselaer, V.; Bravo, C.; Caelen, O.; Eliassi-Rad, T.; Akoglu, L.; Snoeck, M.; Baesens, B. APATE: A novel approach for automated credit card transaction fraud detection using network-based extensions. Decis. Support Syst. **2015**, 75, 38–48.
19. Golmohammadi, K.; Zaiane, O.R. Time series contextual anomaly detection for detecting market manipulation in stock market. In Proceedings of the International Conference on Data Science and Advanced Analytics (DSAA), Paris, France, 19–21 October 2015; pp. 1–10.
20. Sultani, W.; Chen, C.; Shah, M. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6479–6488.
21. Keogh, E.; Lin, J.; Fu, A. HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence. In Proceedings of the IEEE International Conference on Data Mining, Houston, TX, USA, 27–30 November 2005; IEEE Computer Society: Washington, DC, USA, 2005; pp. 226–233.
22. Candelieri, A. Clustering and support vector regression for water demand forecasting and anomaly detection. Water **2017**, 9, 224.
23. Yu, Y.; Zhu, Y.; Wan, D.; Liu, H.; Zhao, Q. A Novel Symbolic Aggregate Approximation for Time Series. In Proceedings of the 13th International Conference on Ubiquitous Information Management and Communication, IMCOM 2019, Phuket, Thailand, 4–6 January 2019; Springer: Cham, Switzerland, 2019; pp. 805–822.
24. Ding, Z.; Fei, M. An anomaly detection approach based on isolation forest algorithm for streaming data using sliding window. IFAC Proc. Vol. **2013**, 46, 12–17.
25. Budalakoti, S.; Srivastava, A.N.; Otey, M.E. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. **2009**, 39, 101–113.
26. Safin, A.M.; Burnaev, E. Conformal kernel expected similarity for anomaly detection in time-series data. Adv. Syst. Sci. Appl. **2017**, 17, 22–33.
27. Chandola, V.; Banerjee, A.; Kumar, V. Anomaly detection for discrete sequences: A survey. IEEE Trans. Knowl. Data Eng. **2010**, 24, 823–839.
28. Keogh, E.; Lonardi, S.; Chiu, B.Y. Finding surprising patterns in a time series database in linear time and space. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, AB, Canada, 23–26 July 2002; pp. 550–556.
29. Sun, P.; Chawla, S.; Arunasalam, B. Mining for Outliers in Sequential Databases. In Proceedings of the SIAM International Conference on Data Mining, Bethesda, MD, USA, 20–22 April 2006; pp. 94–105.
30. Klerx, T.; Anderka, M.; Büning, H.K.; Priesterjahn, S. Model-based anomaly detection for discrete event systems. In Proceedings of the International Conference on Tools with Artificial Intelligence, Limassol, Cyprus, 10–12 November 2014; pp. 665–672.
31. Zohrevand, Z.; Glasser, U.; Shahir, H.Y.; Tayebi, M.A.; Costanzo, R. Hidden Markov based anomaly detection for water supply systems. In Proceedings of the International Conference on Big Data, Washington, DC, USA, 5–8 December 2016; pp. 1551–1560.
32. Pimentel, M.A.F.; Clifton, D.A.; Clifton, L.; Tarassenko, L. A review of novelty detection. Signal Process. **2014**, 99, 215–490.
33. Wan, D.; Xiao, Y.; Zhang, P.; Feng, J.; Zhu, Y.; Liu, Q. Hydrological time series anomaly mining based on symbolization and distance measure. In Proceedings of the 2014 IEEE International Congress on Big Data, Beijing, China, 27 June–2 July 2014; pp. 339–346.
34. Zhang, P.; Leung, H.; Xiao, Y.; Feng, J.; Wan, D.; Li, W.; Leung, H. A New Symbolization and Distance Measure Based Anomaly Mining Approach for Hydrological Time Series. Int. J. Web Serv. Res. **2016**, 13, 26–45.
35. Wu, H.; Li, X.; Qian, H. Detection of Anomalies and Changes of Rainfall in the Yellow River Basin, China, through Two Graphical Methods. Water **2018**, 10, 15.
36. Ye, N. A markov chain model of temporal behavior for anomaly detection. In Proceedings of the 2000 IEEE Systems, Man, and Cybernetics Information Assurance and Security Workshop, West Point, NY, USA, 6–7 June 2000; Volume 166, p. 169.
37. Ron, D.; Singer, Y.; Tishby, N. The power of amnesia: Learning probabilistic automata with variable memory length. Mach. Learn. **1996**, 25, 117–149.
38. Bejerano, G.; Yona, G. Variations on probabilistic suffix trees: Statistical modeling and prediction of protein families. Bioinformatics **2001**, 17, 23–43.
39. Yang, J.; Wang, W. CLUSEQ: Efficient and effective sequence clustering. In Proceedings of the 19th International Conference on Data Engineering, Bangalore, India, 5–8 March 2003; pp. 101–112.
40. Kholidy, H.A.; Yousof, A.M.; Erradi, A.; Abdelwahed, S.; Ali, H.A. A Finite Context Intrusion Prediction Model for Cloud Systems with a Probabilistic Suffix Tree. In Proceedings of the 2014 European Modelling Symposium, Pisa, Italy, 21–23 October 2014; pp. 526–531.
41. Li, Y.; Thomason, M.; Parker, L.E. Detecting time-related changes in Wireless Sensor Networks using symbol compression and Probabilistic Suffix Trees. In Proceedings of the 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, Taipei, China, 18–22 October 2010; pp. 2946–2951.
42. Farahani, I.V.; Chien, A.; King, R.E.; Kay, M.G.; Klenz, B. Time Series Anomaly Detection from a Markov Chain Perspective. In Proceedings of the 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), Boca Raton, FL, USA, 16–19 December 2019; pp. 1000–1007.
43. Keogh, E.; Chakrabarti, K.; Pazzani, M.; Mehrotra, S. Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases. Knowl. Inf. Syst. **2000**, 3, 263–286.
44. Hu, Q.; Feng, S.; Guo, H.; Chen, G.; Jiang, T. Interactions of the Yangtze river flow and hydrologic processes of the Poyang Lake, China. J. Hydrol. **2007**, 347, 90–100.
45. Li, X.; Zhang, Q.; Ye, X. Dry/wet conditions monitoring based on TRMM rainfall data and its reliability validation over Poyang Lake Basin, China. Water **2013**, 5, 1848–1864.
46. Han, J.; Jian, P.; Micheline, K. Data Mining: Concepts and Techniques, 3rd ed.; Morgan Kaufmann Publishers Inc.: Burlington, MA, USA, 2011.
47. Ghafoori, Z.; Erfani, S.M.; Rajasegarar, S.; Karunasekera, S.; Leckie, C.A. Anomaly Detection in Non-stationary Data: Ensemble based Self-Adaptive OCSVM. In Proceedings of the International Joint Conference on Neural Networks, Vancouver, BC, Canada, 25–29 July 2016.
48. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. **2006**, 27, 861–874.

**Figure 2.** Framework of the Trend Feature Symbolic Aggregate approximation and weighted PST approach (TFSAX_wPST).

Symbol | Meaning (Water Level) |
---|---|
A | 5.46 ft–7.45 ft |
B | 7.46 ft–9.43 ft |
C | 9.46 ft–10.39 ft |
D | 10.48 ft–13.44 ft |
E | 13.48 ft–16.56 ft |

Symbol | Trend Feature | Meaning |
---|---|---|
a | (−90°–−45°) | water level drops sharply |
b | (−45°–−30°) | water level drops rapidly |
c | (−30°–−5°) | water level drops slowly |
d | (−5°–5°) | water level remains stable |
e | (5°–30°) | water level rises slowly |
f | (30°–45°) | water level rises rapidly |
g | (45°–90°) | water level rises sharply |

Symbol | Frequency | Symbol | Frequency | Symbol | Frequency | Symbol | Frequency | Symbol | Frequency |
---|---|---|---|---|---|---|---|---|---|
E_{b} | 1 | C_{c} | 3 | E_{a} | 5 | B_{f} | 11 | B_{d} | 33 |
E_{e} | 1 | C_{f} | 3 | D_{c} | 5 | A_{f} | 12 | B_{e} | 37 |
E_{c} | 1 | E_{g} | 3 | D_{f} | 6 | B_{b} | 13 | A_{e} | 103 |
A_{b} | 2 | A_{g} | 4 | B_{a} | 6 | D_{g} | 13 | B_{c} | 125 |
E_{f} | 2 | D_{b} | 4 | C_{a} | 9 | D_{a} | 7 | A_{c} | 149 |
D_{e} | 3 | C_{b} | 4 | C_{g} | 9 | B_{g} | 8 | A_{d} | 495 |

Pattern | Subsequence | Corresponding Event Description |
---|---|---|
B_{g}C_{c}B_{a} | 1 Dec 2010–3 Dec 2010 | Daily water level is +1.88, −0.3, −1.19 feet, respectively. |
B_{g}D_{e}D_{f}D_{g}E_{f}E_{a}D_{a}C_{b} | 2 Feb 2011–9 Feb 2011 | Daily water level is +1.68, +0.24, +1.33/2, +1.53, +1.39/2, −1.53, −2.42, −0.8 feet, respectively. |
B_{f}C_{g}D_{e}C_{a}B_{c} | 9 Mar 2011–12 Mar 2011 | Daily water level is +2.13, +0.36, −2.35, −0.53 feet, respectively. |
C_{g}D_{g}D_{b}D_{c}D_{f}D_{b}D_{a}B_{b}B_{c}B_{g}C_{f}C_{a} | 27 Mar 2011–7 Apr 2011 | Daily water level is +3.38, +1.6, −0.86, −0.31, +0.9, −0.65, −1.74, −0.7, −0.39, +1.09, +0.61, −1.25 feet, respectively. |
B_{g}B_{c}B_{b} | 23 Sep 2011–25 Sep 2011 | Daily water level is +2.34/2, −0.51, −0.72 feet, respectively. |
A_{g}B_{g} | 21 Jan 2012–22 Jan 2012 | Daily water level is +1.3, +1.1 feet, respectively. |
A_{f}C_{g}D_{f}D_{a}B_{c}B_{b} | 18 Feb 2012–23 Feb 2012 | Daily water level is +0.67, +2.67, +0.79, −2.1, −0.99 feet, respectively. |
B_{f}B_{g}B_{a}B_{c} | 26 Dec 2012–28 Dec 2012 | Daily water level is +0.65, +1.49, −1.55 feet, respectively. |
A_{g}C_{g}C_{a}B_{a}C_{g}D_{g}E_{e}E_{c}E_{a}D_{a}C_{a}B_{c} | 7 Feb 2013–18 Feb 2013 | Daily water level is +1.87, +1.78, −1.06, −1.14, +3.19, +3.44, +0.26, −0.32, −1.59, −2.04, −1.47, −0.57 feet, respectively. |
B_{g}D_{g}E_{g}E_{a}D_{c}D_{c}E_{f}E_{a}D_{a}C_{a}B_{c} | 22 Feb 2013–2 Mar 2013 | Daily water level is +1.14, +2.86, +1.85, −1.14, −0.2, +1.31, −1.24, −2.1, −1.15, −0.46 feet, respectively. |
C_{g}D_{g}E_{a}D_{a}C_{b} | 24 Mar 2013–28 Mar 2013 | Daily water level is +3.25, 3.18/2, −1.62, −2.66, −0.88 feet, respectively. |
B_{g}D_{g}D_{a}B_{b}B_{c} B_{g}D_{g}E_{g}D_{a}D_{a}B_{b} | 29 Apr 2013–9 May 2013 | Daily water level is +2.04, +1.91, −1.98, −0.88, −0.31, +1.05, +2.55, +2.09/2, −1.31, −2.68, −0.9 feet, respectively. |
C_{g}E_{g}D_{a}B_{a} | 23 May 2013–26 May 2013 | Daily water level is +4.21, +2.1/2, −3.97, −1.41 feet, respectively. |
B_{g}D_{c}B_{f}D_{g}D_{f}D_{b}D_{a}B_{c}D_{g}D_{f}D_{a} | 3 Jun 2013–13 Jun 2013 | Daily water level is +3.03, −0.24, +0.78, +2.17, +0.57, −0.8, −1.7, 0.35, +2.42, +0.6, −3.46 feet, respectively. |
B_{f}C_{f}D_{a}D_{e}C_{a}D_{g}D_{a}B_{a} | 3 Jul 2013–10 Jul 2013 | Daily water level is +0.52, +0.46, −1.25, +0.25, −1.2, +1.61, −1.16, −1.13 feet, respectively. |
C_{g}D_{g}D_{f}D_{a}B_{a} | 12 Jul 2013–16 Jul 2013 | Daily water level is +1.97, +1.2, +0.57, −2.76, −1.11 feet, respectively. |
A_{g}B_{g}C_{a}B_{b} | 31 Jul 2013–3 Aug 2013 | Daily water level is +1.01, +2.43, −1.52, −0.92 feet, respectively. |
B_{g}C_{g}D_{g}D_{g}E_{b} | 15 Aug 2013–19 Aug 2013 | Daily water level is +2.22, +2.79, +1.27, +1.14 feet, respectively. |

Symbol | Meaning (Water Level) |
---|---|
A | 7.28 m–8 m |
B | 8.01 m–10.99 m |
C | 11.03 m–15 m |
D | 15.04 m–19 m |
E | 19.01 m–21.96 m |

Symbol | Trend Feature | Meaning |
---|---|---|
a | (−90°–−30°) | water level drops rapidly |
b | (−30°–−5°) | water level drops slowly |
c | (−5°–5°) | water level remains stable |
d | (5°–30°) | water level rises slowly |
e | (30°–90°) | water level rises rapidly |

Symbol | Frequency | Symbol | Frequency | Symbol | Frequency | Symbol | Frequency |
---|---|---|---|---|---|---|---|
A_{a} | 12 | B_{c} | 5 | C_{d} | 28 | D_{e} | 82 |
A_{b} | 9 | B_{d} | 31 | C_{e} | 101 | E_{a} | 4 |
A_{c} | 1 | B_{e} | 37 | D_{a} | 50 | E_{b} | 3 |
A_{d} | 4 | C_{a} | 89 | D_{b} | 31 | E_{d} | 4 |
B_{a} | 73 | C_{b} | 24 | D_{c} | 9 | E_{e} | 16 |
B_{b} | 29 | C_{c} | 12 | D_{d} | 30 | | |

Pattern | Subsequence | Corresponding Event Description |
---|---|---|
D_{e}E_{e}E_{e}E_{b}E_{a}D_{a}D_{a}C_{a} | Mar 1954–Dec 1954 | Extraordinary floods in the Yangtze River Basin, monthly mean water levels are 17.07, 19.84, 21.47, 21.23, 20.24, 18.5, 15.93, 11.48 m |
A_{a}A_{b} | Dec 1958–Feb 1959 | Extreme drought season, monthly mean water levels are 7.95, 7.48, 11.16 m |
A_{a}A_{b} | Jan 1963–Oct 1963 | Drought year, monthly mean water levels are 7.89, 7.45, 8.01, 8.94, 14.99, 14.59, 14.51, 15.45, 15.99, 14.9 m, highest water level occurred in September |
A_{a}A_{b} | Dec 1971–Mar 1972 | Extreme drought event, monthly mean water levels are 7.9, 7.68, 9.4, 9.83 m |
A_{a}A_{b} | Dec 1979–Feb 1980 | Extreme drought event, monthly mean water levels are 7.96, 7.76, 8.63, 12.4 m |
A_{a}A_{d} | Jan 1965–Apr 1965 | Extreme drought event, monthly mean water levels are 7.81, 8, 8.62, 12.31 m |
A_{a}A_{d} | Jan 1968–Apr 1968 | Extreme drought event, monthly mean water levels are 7.68, 7.9, 9.1, 13.68 m |
A_{a}A_{d} | Dec 2007–Mar 2008 | Extreme drought event, monthly mean water levels are 7.54, 7.72, 8.5, 8.62 m |
D_{b}D_{e}E_{e}E_{b}D_{a}C_{a} | Jul 1980–Oct 1980 | Flood events, monthly mean water levels are 18.03, 19.41, 19.19, 16.26 m |
E_{e}E_{a}D_{a}D_{c}C_{a} | Jul 1983–Oct 1983 | Flood events, monthly mean water levels are 20.85, 19.22, 17.9, 17.77 m |
C_{c}E_{e}D_{a} | Jun 1968–Aug 1968 | Flood events, monthly mean water levels are 14.53, 19.27, 17.71 m |
C_{d}C_{b}C_{a}C_{a}C_{e}C_{d}B_{a}B_{d} | Jun 1972–Feb 1973 | Drought year with a gentle overall trend, monthly mean water levels are 14.68, 14.55, 13.64, 11.73, 13.25, 13.45, 10.53, 10.98, 10.81 m |
B_{a}B_{b}A_{b}B_{e} | Dec 1986–Feb 1987 | Drought season, monthly mean water levels are 8.84, 8.06, 7.66, 8.34 m |
D_{e}E_{e}E_{d}E_{a}D_{a}B_{a} | Jun 1998–Nov 1998 | Extreme flood event, monthly mean water levels are 17.12, 21.4, 21.96, 20.17, 15.77, 10.98 m |
D_{e}E_{e}E_{a}E_{b}D_{a}C_{a} | Jun 1999–Nov 1999 | Flood events, monthly mean water levels are 16.69, 21.12, 19.63, 19.36, 15.63, 12.89 m |
C_{a}B_{a}B_{a}A_{b}A_{b}B_{e} | Nov 2003–Mar 2004 | Extreme drought event, monthly mean water levels are 8.5, 7.65, 7.28, 9.38 m |
E_{e}E_{d} | Jul 1996–Oct 1996 | Flood events, monthly mean water levels are 19.46, 19.97 m |
B_{d}C_{e}C_{c}D_{e} | Apr 1974–Jun 1974 | No rainy season, monthly mean water levels are 10.13, 14.21, 14.05, 15.6 m |
B_{a}B_{a}B_{b}B_{c}B_{a}B_{d}B_{e}B_{b}B_{e} | Sep 2006–May 2007 | Extreme drought year, monthly mean water levels are 10.72, 9.29, 9.05, 9.02, 8.06, 8.28, 10.6, 10.4, 10.98 m |

Approach | Order | Number of Nodes | −Log-Likelihood |
---|---|---|---|
PST-based | 1 | 10 | −0.0152 |
PST-based | 2 | 46 | −0.0108 |
PST-based | 5 | 138 | −0.0068 |
wPST-based | 1 | 10 | −0.0152 |
wPST-based | 2 | 41 | −0.0108 |
wPST-based | 5 | 84 | −0.0068 |

Approach | MinCt | FNR (Miss Rate) | FPR (False Alarm) |
---|---|---|---|
PST-based (order = 5) | 1 | 25.2% | 64.9% |
PST-based (order = 5) | 2 | 22.7% | 42.5% |
PST-based (order = 5) | 5 | 14.3% | 21.9% |
PST-based (order = 5) | 10 | 12.5% | 37.6% |
wPST-based (order = 5) | 1 | 12.6% | 25.6% |
wPST-based (order = 5) | 2 | 8.4% | 10.9% |
wPST-based (order = 5) | 5 | 2.3% | 3.8% |
wPST-based (order = 5) | 10 | 5.3% | 8.4% |

Metric | PST-based | HMM-based | OCSVM | FP_SAX-based | Distance-based | TFSAX_wPST |
---|---|---|---|---|---|---|
Accuracy | 0.912 | 0.928 | 0.874 | 0.936 | 0.947 | 0.976 |
Precision | 0.926 | 0.922 | 0.896 | 0.927 | 0.935 | 0.964 |
Recall | 0.925 | 0.932 | 0.902 | 0.944 | 0.951 | 0.969 |
F_{1}-score | 0.926 | 0.927 | 0.918 | 0.935 | 0.943 | 0.966 |
AUC | 0.924 | 0.931 | 0.915 | 0.938 | 0.949 | 0.971 |

Num | Time | Sequence Lengths | Total Length | FP_SAX-based | Distance-based | TFSAX_wPST |
---|---|---|---|---|---|---|
1 | Jul–Aug | 62 | 3534 | 0.372 s | 0.324 s | 0.264 s |
2 | Jun–Aug | 92 | 5244 | 0.719 s | 0.708 s | 0.532 s |
3 | Jun–Sep | 122 | 6954 | 1.133 s | 1.012 s | 0.793 s |
4 | Jun–Oct | 153 | 8721 | 1.471 s | 1.274 s | 0.946 s |
5 | May–Oct | 184 | 10,488 | 2.641 s | 2.416 s | 1.387 s |
6 | Apr–Nov | 244 | 13,908 | 3.325 s | 2.782 s | 1.764 s |

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Yu, Y.; Wan, D.; Zhao, Q.; Liu, H.
Detecting Pattern Anomalies in Hydrological Time Series with Weighted Probabilistic Suffix Trees. *Water* **2020**, *12*, 1464.
https://doi.org/10.3390/w12051464
