# Unsupervised Bayesian Nonparametric Approach with Incremental Similarity Tracking of Unlabeled Water Demand Time Series for Anomaly Detection

## Abstract

## 1. Introduction

- Results from unsupervised methods may have low accuracy in practice because anomalies are rare by definition, unexpected, and dependent on context such as the season (summer or winter) and the day type (weekday or weekend) [15]. Water demand is not stationary but follows a periodic pattern [14], so the definition of an anomaly changes over time.
- Unsupervised methods may not explain clearly why a demand value is anomalous, and hence their results may not be trusted [15].

- Preliminary real-time anomaly detection that examines the hourly time step, the rate of change, and the shape of the trend simultaneously, using a minimal amount of historical data (one month of data in this paper);
- Eliminating the need to choose an optimal number of clusters and providing a subtle way to “reserve” an empty cluster for anomalies through the application of the Bayesian nonparametric (BNP) approach.
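
As a hedged illustration of how a BNP prior can "reserve" room for an unseen cluster, consider the Chinese restaurant process view of a Dirichlet process: a new point joins an existing cluster with prior probability proportional to that cluster's size and opens a new cluster with probability proportional to a concentration parameter. The sketch below shows only this prior, not the inference procedure used in this paper; the function name, the cluster sizes, and the value `alpha = 1.0` are assumptions chosen for illustration.

```python
def crp_probabilities(cluster_sizes, alpha):
    """Prior probabilities under a Chinese restaurant process.

    Returns the probability of joining each existing cluster and the
    probability of opening a new (currently empty) cluster.
    """
    n = sum(cluster_sizes)
    existing = [size / (n + alpha) for size in cluster_sizes]
    new_cluster = alpha / (n + alpha)
    return existing, new_cluster


# Hypothetical state: 30 points already assigned to three clusters.
existing, new_cluster = crp_probabilities([20, 9, 1], alpha=1.0)
print(existing, new_cluster)
```

Because the probability of a new cluster never reaches zero, the number of clusters does not have to be fixed in advance.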

## 2. Water Demand Data Description

## 3. Proposed Approach

- Similarity in time: to cluster series that vary in a similar way at each time step;
- Similarity in change: to cluster series by how similarly they vary from one time step to the next;
- Similarity in shape of the trend: to cluster series with common shapes together.

- Similarity in time: to cluster points that are relatively similar at each time step;
- Similarity in change: to cluster points at each time step by how similarly they vary from one time step to the next;
- Similarity in shape of the trend: to compare the incoming points with a reference shape for online anomalous trend detection.
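
A minimal sketch of the first two notions, assuming each hourly point is described by its demand value (similarity in time) and its first difference from the previous hour (similarity in change). The backward-difference approximation, the zero placeholder for the first hour, the function name, and the sample values are assumptions introduced here for illustration.

```python
def time_and_change_features(series):
    """Pair each value with its change from the previous time step.

    The first time step has no predecessor, so its change is set to 0.0
    (an assumption made for this sketch).
    """
    features = []
    for t, value in enumerate(series):
        change = value - series[t - 1] if t > 0 else 0.0
        features.append((value, change))
    return features


# Hypothetical hourly demand values.
features = time_and_change_features([0.40, 0.35, 0.50])
print(features)
```

Points at the same hour across days can then be clustered on these two coordinates, while the shape of the trend is handled separately by comparison with a reference series.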

#### 3.1. Data Preparation

#### 3.2. Dirichlet Process Mixture Model

#### 3.3. Incremental Similarity Tracking Using Time Warp Edit Distance

1. Among the weekday and weekend series deemed to follow a normal trend, determine the median, 20th percentile, and 80th percentile for each hour;
2. From the 20th and 80th percentiles, compute the interquartile range, taken here as the difference between the two percentiles;
3. Calculate the lower and upper bound for each hour as follows:
   $$\mathrm{a}.\hspace{1em}\mathit{LowerBound}=\mathit{20thPercentile}-1.5\times \mathit{InterquartileRange}$$
   $$\mathrm{b}.\hspace{1em}\mathit{UpperBound}=\mathit{80thPercentile}+1.5\times \mathit{InterquartileRange}$$
4. Form a reference series using the medians found at each hour;
5. Form a lower bound series using the lower bounds calculated at each hour;
6. Form an upper bound series using the upper bounds calculated at each hour;
7. Compute the similarity between the weekday reference series and the weekday lower bound series at each time of the day:
    - a. Do for n ← 1:24;
    - b. If n = 1;
    - c. Calculate the Euclidean distance between the first point of the reference series and the first point of the lower bound series;
    - d. Else if n > 1;
    - e. Z-score normalize the first n points of the reference series and of the lower bound series, respectively. Subsequently, compute the similarity between these two partial series using the TWED;
    - f. End if;
    - g. End for;
    - h. At the end of the for loop, there are 24 points, each representing the level of similarity at a different time of the day. Concatenate the points to form the weekday similarity matrix, M1.
8. Using a similar procedure, calculate the similarity between the reference series and the upper bound series to obtain the second similarity matrix, M2;
9. Take the mean of M1 and M2 at each time of the day to obtain the maximum allowable weekday dissimilarity vector of size 24 × 1. This takes the dissimilarity between the reference series and both the lower and upper bound series into account;
10. Repeat Steps 7 to 9 to find the maximum allowable weekend dissimilarity vector;
11. For every new day, starting with the data collected at 01:00, perform Steps 7a to 7g to calculate the similarity between the new day and the reference series. If the new day is a weekday, the weekday reference series should be used; otherwise, the weekend reference series should be used;
12. Find all points in the new day whose similarity value is higher than the corresponding value in the maximum allowable dissimilarity vector. Such points are considered anomalies.
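
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the percentile interpolation rule, the TWED stiffness `nu` and penalty `lam` values, the use of the sample index as the timestamp, the zero-variance handling in `zscore`, and all function names are assumptions introduced here. The toy input uses four hypothetical hourly values per day instead of 24 to keep the example short.

```python
import math


def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100] (interpolation rule assumed)."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def hourly_bounds(days):
    """Build the reference (median), lower bound and upper bound series."""
    reference, lower, upper = [], [], []
    for hour_values in zip(*days):
        p20, p50, p80 = (percentile(hour_values, q) for q in (20, 50, 80))
        spread = p80 - p20  # difference between the two percentiles
        reference.append(p50)
        lower.append(p20 - 1.5 * spread)
        upper.append(p80 + 1.5 * spread)
    return reference, lower, upper


def zscore(xs):
    """Z-score normalization; a constant series maps to zeros (assumption)."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [0.0] * len(xs) if std == 0 else [(x - mean) / std for x in xs]


def twed(a, b, nu=0.001, lam=1.0):
    """Time warp edit distance for equally spaced series (timestamp = index)."""
    a, b = [0.0] + list(a), [0.0] + list(b)  # pad with a zero sample at time 0
    n, m = len(a), len(b)
    dp = [[math.inf] * m for _ in range(n)]
    dp[0][0] = 0.0
    for i in range(1, n):
        for j in range(1, m):
            delete_a = dp[i - 1][j] + abs(a[i] - a[i - 1]) + nu + lam
            delete_b = dp[i][j - 1] + abs(b[j] - b[j - 1]) + nu + lam
            match = (dp[i - 1][j - 1] + abs(a[i] - b[j])
                     + abs(a[i - 1] - b[j - 1]) + 2 * nu * abs(i - j))
            dp[i][j] = min(delete_a, delete_b, match)
    return dp[n - 1][m - 1]


def incremental_similarity(series, reference):
    """Similarity of the first n points to the reference, for n = 1..len(series)."""
    out = []
    for n in range(1, len(series) + 1):
        if n == 1:
            out.append(abs(series[0] - reference[0]))  # Euclidean distance of scalars
        else:
            out.append(twed(zscore(series[:n]), zscore(reference[:n])))
    return out


def max_allowable(reference, lower, upper):
    """Mean of the lower-bound and upper-bound similarity vectors (M1, M2)."""
    m1 = incremental_similarity(lower, reference)
    m2 = incremental_similarity(upper, reference)
    return [(x + y) / 2 for x, y in zip(m1, m2)]


def flag_anomalies(new_day, reference, threshold):
    """Return 1-based hours whose similarity value exceeds the allowed maximum."""
    sims = incremental_similarity(new_day, reference)
    return [h for h, (s, t) in enumerate(zip(sims, threshold), start=1) if s > t]


# Hypothetical "normal" days with four hourly readings each.
normal_days = [
    [0.30, 0.28, 0.35, 0.50],
    [0.32, 0.27, 0.36, 0.55],
    [0.29, 0.30, 0.33, 0.48],
    [0.31, 0.29, 0.37, 0.52],
]
reference, lower, upper = hourly_bounds(normal_days)
threshold = max_allowable(reference, lower, upper)
new_day = [0.31, 0.29, 0.80, 0.20]  # hypothetical incoming day
print(flag_anomalies(new_day, reference, threshold))
```

Because each prefix is re-normalized before the distance is computed, the comparison tracks the shape of the day so far rather than its absolute level, and a day identical to the reference series is never flagged.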

#### 3.4. Rationale

## 4. Results and Discussion

## 5. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References

1. Chan, T.K.; Chin, C.S.; Zhong, X. Review of current technologies and proposed intelligent methodologies for water distributed network leakage detection. IEEE Access **2018**, 6, 78846–78867.
2. Romano, M.; Kapelan, Z. Adaptive water demand forecasting for near real-time management of smart water distribution systems. Environ. Model. Soft. **2014**, 60, 265–276.
3. Cheifetz, N.; Noumir, Z.; Samé, A.; Sandraz, A.C.; Féliers, C.; Heim, V. Modeling and clustering water demand patterns from real-world smart meter data. Drink. Water Eng. Sci. **2017**, 10, 75–82.
4. McKenna, S.A.; Fusco, F.; Eck, B.J. Water Demand Pattern Classification from Smart Meter Data. In Proceedings of the 12th International Conference on Computer Control for the Water Industry (CCWI2013), Perugia, Italy, 2–4 September 2013.
5. Noiva, K.; Fernandez, J.E.; Wescoat, J.L., Jr. Cluster analysis of urban water supply and demand: Toward large-scale comparative sustainability planning. Sustain. Cities Soc. **2016**, 27, 484–496.
6. Padulano, R.; Giudice, G.D.; Giugni, M.; Fontana, N.; Uberti, G.S.D. Identification of annual water demand patterns in the city of Naples. Proceedings **2018**, 2, 587.
7. Bennett, C.; Stewart, R.A.; Beal, C.D. ANN-based residential water end-use demand forecasting model. Expert Syst. Appl. **2013**, 40, 1014–1023.
8. Nasseri, M.; Moeini, A.; Tabesh, M. Forecasting monthly urban water demand using extended Kalman filter and genetic programming. Expert Syst. Appl. **2011**, 38, 7387–7395.
9. Herrera, M.; Torgo, L.; Izquierdo, J.; Pérez-Garcia, R. Predictive models for forecasting hourly urban water demand. J. Hydrol. **2010**, 387, 141–150.
10. Avni, N.; Fishbain, B.; Shamir, U. Water consumption patterns as a basis for water demand modeling. Water Resour. Res. **2015**, 51, 8165–8181.
11. Candelieri, A. Clustering and support vector regression for water demand forecasting and anomaly detection. Water **2017**, 9, 224.
12. Liu, J.; Cheng, W.; Zhang, T. Principal factor analysis for forecasting diurnal water-demand pattern using combined rough-set and fuzzy-clustering technique. J. Water Resour. Plan. Manag. **2013**, 139, 23–33.
13. Wu, Y.; Liu, S.; Wu, X.; Liu, Y.; Guan, Y. Burst detection in district metering area using a data driven clustering algorithm. Water Res. **2016**, 100, 28–37.
14. Wu, Y.; Liu, S.; Smith, K.; Wang, X. Using correlation between data from multiple monitoring sensors to detect bursts in water distribution systems. J. Water Resour. Plan. Manag. **2018**, 144, 1–10.
15. Patabendige, S.; Cardell-Oliver, R.; Wang, R.; Liu, W. Detection and interpretation of anomalous water use for nonresidential customers. Environ. Model. Soft. **2018**, 100, 291–301.
16. Gershman, S.J.; Blei, D.M. A tutorial on Bayesian nonparametric models. J. Math. Psychol. **2012**, 56, 1–12.
17. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X.; Simoudis, E.; Han, J.; Fayyad, U.M. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, USA, 2–4 August 1996.
18. Orbanz, P.; Teh, Y.W. Bayesian nonparametric model. In Encyclopedia of Machine Learning and Data Mining; Sammut, C., Webb, G.I., Eds.; Springer: Berlin, Germany, 2017; pp. 1–14.
19. Ahmed, M.E.; Song, J.B.; Han, Z.; Suh, D.Y. Sensing-Transmission edifice using Bayesian nonparametric traffic clustering in cognitive radio networks. IEEE Trans. Mob. Comput. **2013**, 13, 2141–2155.
20. Hu, W.; Li, X.; Tian, G.; Maybank, S.; Zhang, Z. An incremental DPMM-based method for trajectory clustering, modeling, and retrieval. IEEE Trans. Pattern Anal. Mach. Intell. **2013**, 35, 1051–1065.
21. Zuanetti, D.A.; Muller, P.; Zhu, Y.; Yang, S.; Ji, Y. Bayesian nonparametric clustering for large data sets. Stat. Comput. **2019**, 29, 203–215.
22. Chen, J.; Boccelli, D.L. Real-time forecasting and visualization toolkit for multi-seasonal time series. Environ. Model. Soft. **2018**, 105, 244–256.
23. Ye, G.; Fenner, R.A. Weighted least squares with expectation-maximization algorithm for burst detection in U.K. water distribution systems. J. Water Resour. Plan. Manag. **2014**, 140, 417–424.
24. Zhang, X.; Liu, J.; Du, Y.; Lv, T. A novel clustering method on time series data. Expert Syst. Appl. **2011**, 38, 11891–11900.
25. Mounce, S.R.; Mounce, R.B.; Boxall, J.B. Novelty detection for time series data analysis in water distribution systems using support vector machines. J. Hydroinf. **2011**, 13, 672–686.
26. Teh, Y.W. Dirichlet process. In Encyclopedia of Machine Learning and Data Mining; Sammut, C., Webb, G.I., Eds.; Springer: Berlin, Germany, 2017.
27. Neal, R.M. Markov chain sampling methods for Dirichlet process mixture models. J. Comput. Graph. Stat. **2000**, 9, 249–265.
28. Pitman, J. Combinatorial Stochastic Processes; Springer: Berlin, Germany, 2006.
29. Marteau, P.F. Time warp edit distance with stiffness adjustment for time series matching. IEEE Trans. Pattern Anal. Mach. Intell. **2009**, 31, 306–318.
30. Lin, J.; Keogh, E.; Lonardi, S.; Chiu, B. A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD ’03), San Diego, CA, USA, 13 June 2003.

**Figure 14.** Original data points (blue star) with imputed anomalies (red star) and missed detection (black circle).

| Time | Data Points Collected for Comparison | Similarity with Lower Bound | Similarity with Upper Bound | Mean/Max Allowable Dissimilarity |
|---|---|---|---|---|
| 01:00 | 01:00 | 0.132 | 0.121 | 0.127 |
| 02:00 | 01:00–02:00 | 0.000 | 0.000 | 0.000 |
| 03:00 | 01:00–03:00 | 1.278 | 0.768 | 1.023 |
| 04:00 | 01:00–04:00 | 2.705 | 1.253 | 1.979 |
| 05:00 | 01:00–05:00 | 5.245 | 1.535 | 3.390 |
| 06:00 | 01:00–06:00 | 8.382 | 2.152 | 5.267 |
| 07:00 | 01:00–07:00 | 11.434 | 2.732 | 7.083 |
| 08:00 | 01:00–08:00 | 13.648 | 3.352 | 8.500 |
| 09:00 | 01:00–09:00 | 14.778 | 3.869 | 9.323 |
| 10:00 | 01:00–10:00 | 15.763 | 4.246 | 10.005 |
| 11:00 | 01:00–11:00 | 16.210 | 4.301 | 10.255 |
| 12:00 | 01:00–12:00 | 17.439 | 4.232 | 10.835 |
| 13:00 | 01:00–13:00 | 19.514 | 4.455 | 11.984 |
| 14:00 | 01:00–14:00 | 19.848 | 4.922 | 12.385 |
| 15:00 | 01:00–15:00 | 19.777 | 5.293 | 12.535 |
| 16:00 | 01:00–16:00 | 19.726 | 5.620 | 12.673 |
| 17:00 | 01:00–17:00 | 19.977 | 6.107 | 13.042 |
| 18:00 | 01:00–18:00 | 21.080 | 6.677 | 13.878 |
| 19:00 | 01:00–19:00 | 22.343 | 7.094 | 14.719 |
| 20:00 | 01:00–20:00 | 23.690 | 7.213 | 15.452 |
| 21:00 | 01:00–21:00 | 21.897 | 7.609 | 14.753 |
| 22:00 | 01:00–22:00 | 22.762 | 8.547 | 15.655 |
| 23:00 | 01:00–23:00 | 24.089 | 9.263 | 16.676 |
| 24:00 | 01:00–24:00 | 25.994 | 9.402 | 17.698 |
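
As a worked check of how the last column of the table above is obtained and used, the sketch below averages the lower-bound and upper-bound similarity values for a few rows copied from the table and then applies the result as a threshold. The similarity values for the incoming day are hypothetical and serve only to show the comparison.

```python
# (lower-bound similarity, upper-bound similarity, tabulated mean) from the table
rows = {
    "03:00": (1.278, 0.768, 1.023),
    "04:00": (2.705, 1.253, 1.979),
    "05:00": (5.245, 1.535, 3.390),
    "24:00": (25.994, 9.402, 17.698),
}

# Maximum allowable dissimilarity = mean of the two similarity values.
threshold = {hour: (m1 + m2) / 2 for hour, (m1, m2, _) in rows.items()}
for hour, (_, _, tabulated) in rows.items():
    assert abs(threshold[hour] - tabulated) < 1e-9

# Hypothetical similarity values for an incoming day (not from the paper).
new_day = {"03:00": 0.900, "04:00": 2.500, "05:00": 3.100}
anomalies = [hour for hour, value in new_day.items() if value > threshold[hour]]
print(anomalies)  # ['04:00']
```

Only the 04:00 point exceeds its allowed maximum (2.500 > 1.979), so it alone is flagged.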

| Time | Data Points Used for Comparison | Similarity with Lower Bound | Similarity with Upper Bound | Mean/Max Allowable Dissimilarity |
|---|---|---|---|---|
| 01:00 | 01:00 | 0.171 | 0.201 | 0.186 |
| 02:00 | 01:00–02:00 | 4.243 | 0.000 | 2.121 |
| 03:00 | 01:00–03:00 | 7.299 | 0.788 | 4.043 |
| 04:00 | 01:00–04:00 | 9.417 | 1.066 | 5.241 |
| 05:00 | 01:00–05:00 | 12.303 | 1.314 | 6.809 |
| 06:00 | 01:00–06:00 | 15.135 | 2.473 | 8.804 |
| 07:00 | 01:00–07:00 | 17.011 | 4.773 | 10.892 |
| 08:00 | 01:00–08:00 | 18.186 | 6.513 | 12.349 |
| 09:00 | 01:00–09:00 | 15.332 | 10.081 | 12.706 |
| 10:00 | 01:00–10:00 | 11.948 | 13.785 | 12.867 |
| 11:00 | 01:00–11:00 | 10.256 | 16.312 | 13.284 |
| 12:00 | 01:00–12:00 | 9.415 | 18.780 | 14.098 |
| 13:00 | 01:00–13:00 | 9.494 | 20.881 | 15.187 |
| 14:00 | 01:00–14:00 | 10.064 | 22.350 | 16.207 |
| 15:00 | 01:00–15:00 | 11.088 | 23.184 | 17.136 |
| 16:00 | 01:00–16:00 | 12.326 | 23.616 | 17.971 |
| 17:00 | 01:00–17:00 | 13.581 | 24.527 | 19.054 |
| 18:00 | 01:00–18:00 | 14.583 | 26.270 | 20.426 |
| 19:00 | 01:00–19:00 | 15.369 | 27.239 | 21.304 |
| 20:00 | 01:00–20:00 | 16.850 | 27.191 | 22.021 |
| 21:00 | 01:00–21:00 | 17.409 | 27.353 | 22.381 |
| 22:00 | 01:00–22:00 | 18.879 | 28.326 | 23.603 |
| 23:00 | 01:00–23:00 | 19.750 | 29.570 | 24.660 |
| 24:00 | 01:00–24:00 | 21.236 | 30.313 | 25.775 |

| No | Month | Day of Week | Date | Time | Water Demand (Mega Cubic Meter) | First Derivative (Mega Cubic Meter) |
|---|---|---|---|---|---|---|
| 1 | 4 | 3 | 25 | 09:00 | 0.486 | −0.1574 |
| 2 | 4 | 1 | 30 | 23:00 | 0.5882 | −0.1246 |
| 3 | 6 | 2 | 26 | 23:00 | 0.4013 | −0.0725 |
| 5 | 8 | 1 | 6 | 24:00 | 0.3740 | −0.1602 |
| 6 | 10 | 1 | 8 | 24:00 | 0.2828 | −0.0927 |
| 4 | 7 | 7 | 8 | 07:00 | 162.261 | −0.436 |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Chan, T.K.; Chin, C.S.
Unsupervised Bayesian Nonparametric Approach with Incremental Similarity Tracking of Unlabeled Water Demand Time Series for Anomaly Detection. *Water* **2019**, *11*, 2066.
https://doi.org/10.3390/w11102066