# SIMIT: Subjectively Interesting Motifs in Time Series


## Abstract


## 1. Introduction

- Novel definitions of motifs as probabilistic patterns (Section 3).
- A quantification of their Subjective Interestingness (SI), based on how much information a user gains when observing this pattern (Section 4).
- A relaxation of the exact setting and an algorithm to efficiently mine the most interesting subsequence patterns for a user (Section 5).
- Several speedup techniques that result in a computationally more efficient algorithm (Section 6).
- Empirical evaluation of this algorithm on one synthetic dataset and two real-world datasets, to investigate its ability to encode the user’s prior beliefs and identify interesting subsequence patterns (Section 7).

## 2. Related Work

## 3. Motifs and Motif Templates

#### 3.1. Motif

#### 3.2. Motif Template

**Definition 1.** A motif template is a probability distribution over the space of motif instances, i.e., $\mathbb{R}^{l}$.

**Definition 2.** A mean-variance motif template is a multivariate Gaussian distribution $\mathcal{N}(\boldsymbol{\mu},\Sigma)$ over the space of motif instances. $\Sigma$ is a diagonal matrix with the standard deviations on its main diagonal and zeros elsewhere. Hence, this distribution can essentially be parameterized by a tuple $(\boldsymbol{\mu},\boldsymbol{\sigma})$, where $\boldsymbol{\mu}$ is a vector of means and $\boldsymbol{\sigma}$ is a vector of standard deviations, both of length $l$.
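As a rough illustration of Definition 2, the sketch below estimates a mean-variance template from a handful of instances and evaluates the log-density of a candidate instance. It assumes that the $l$ coordinates are independent Gaussians with means $\boldsymbol{\mu}$ and standard deviations $\boldsymbol{\sigma}$; the function names are hypothetical and are not taken from the paper.

```python
import numpy as np


def fit_mean_variance_template(instances):
    """Estimate (mu, sigma) from an array of shape (k, l) holding k motif instances."""
    instances = np.asarray(instances, dtype=float)
    mu = instances.mean(axis=0)
    sigma = instances.std(axis=0)  # population standard deviation per position
    return mu, sigma


def log_density(x, mu, sigma):
    """Log-density of x under independent Gaussians N(mu_j, sigma_j^2).

    Assumes every sigma_j is strictly positive.
    """
    x = np.asarray(x, dtype=float)
    z = (x - mu) / sigma
    return float(
        -0.5 * np.sum(z ** 2)
        - np.sum(np.log(sigma))
        - 0.5 * x.size * np.log(2.0 * np.pi)
    )


# Toy usage with two instances of length l = 2 (illustrative values only).
mu, sigma = fit_mean_variance_template([[0.0, 0.0], [2.0, 4.0]])
score = log_density([1.0, 2.0], mu, sigma)
```

Instances that lie close to $\boldsymbol{\mu}$ relative to $\boldsymbol{\sigma}$ receive a higher log-density, which is the sense in which the template describes a family of similar subsequences.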

## 4. Formalizing the Subjective Interestingness

#### 4.1. The Background Distribution

#### 4.1.1. The Initial Background Distribution

**Problem 1.**

#### 4.1.2. Updating the Background Distribution

#### 4.2. A Remark about No Independence Assumption

**Remark 1.**

#### 4.3. The Subjective Interestingness Measure

#### 4.4. Finding the Most Subjectively Interesting Motif Template

## 5. Method

- Model the user’s prior belief by the initial background distribution;
- Seed by finding a small set of instances that optimizes Objective 2;
- Grow that set by adding an instance that optimizes Objective 1 and iterate.
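The seed-and-grow control flow of the last two steps can be sketched as follows. This is only a skeleton under stated assumptions: Objectives 1 and 2 are not reproduced here, so they are passed in as callables, the background distribution of the first step is not modeled, and the stopping rule (a fixed `max_size`) is a placeholder. The toy objective in the usage part, which rewards low variance across instances, is a stand-in and not the SI measure.

```python
from itertools import combinations

import numpy as np


def seed_and_grow(n_candidates, seed_objective, grow_objective, k, max_size):
    """Pick the best k-subset as seed, then greedily add one instance at a time.

    seed_objective(indices) and grow_objective(motif, candidate) are
    hypothetical stand-ins for Objective 2 and Objective 1, respectively.
    """
    # Seeding: exhaustive search over k-subsets (only feasible for tiny inputs).
    seed = max(combinations(range(n_candidates), k), key=seed_objective)
    motif = list(seed)
    # Growing: add the candidate with the highest score until max_size is reached.
    while len(motif) < max_size:
        remaining = [c for c in range(n_candidates) if c not in motif]
        if not remaining:
            break
        motif.append(max(remaining, key=lambda c: grow_objective(motif, c)))
    return motif


# Toy usage: five candidate instances of length 2 (illustrative values only).
windows = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.1], [9.0, 1.0]])


def tightness(indices):
    """Toy objective: negative total variance across the selected instances."""
    return -float(windows[list(indices)].var(axis=0).sum())


motif = seed_and_grow(
    len(windows), tightness, lambda m, c: tightness(m + [c]), k=2, max_size=3
)
```

The exhaustive seeding step is what makes a naive implementation expensive, which is why it is natural to look for bounds and pruning at that stage.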

**Remark 2.**

#### 5.1. Step 2: Finding a Seed Motif ${\mathbb{T}}^{(0)}$ with k Instances

**Problem 2.**

#### 5.2. Step 3: Greedily Searching for a New Instance

## 6. Speedup Techniques

#### 6.1. Speeding Up Step 2

#### 6.1.1. Strategy 1: Bounding Objective 2 and Finding the Submatrix with the Maximal Sum

**Problem 3.**

#### 6.1.2. Strategy 2: Pruning

#### 6.2. Speeding Up Step 3

## 7. Experiments

#### 7.1. Data

**Synthetic time series:** We synthesized a time series of length 15,000. The series contained two kinds of motifs, whose prototypes were two subsequence instances taken from the Trace dataset in the UCR archive [21]. Both instances had length 275 but belonged to different classes. Subsequences for each motif were generated by sampling from a Gaussian distribution whose mean was the corresponding instance and whose variance was small (0.01). There were 12 subsequences for each motif. The remainder of the series was standard Gaussian noise, which made up the major part of the whole series. The data synthesis process is described in more detail in Procedure 1 in Appendix B.

**MIT-BIH arrhythmia ECG recording:** This dataset is Recording #205 of the MIT-BIH Arrhythmia Database [22]. The recording was created by digitizing the ECG signals at 360 samples per second. We experimented on a 20 s segment (7200 samples) that included normal heartbeats and ventricular tachycardia beats.

**Belgium Power Load Data:** This dataset was taken from Open Power System Data [23]. The primary source of these data was the ENTSO-E Data Portal/Power Statistics [24]. Open Power System Data resampled and merged the original data into a large CSV file with hourly resolution. The part we experimented on records the total load in Belgium during the year 2007, for a total length of $24\times 365=8760$.

#### 7.2. Pruning and Scalability

#### 7.3. Results

#### 7.3.1. Synthetic Data

#### 7.3.2. ECG Time-Series

#### 7.3.3. Belgium Power Load Data

## 8. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Appendix A. Solving Problem 1

**Problem A1.**

## Appendix B. Pseudocode for Generating the Synthetic Data

**Procedure 1:** Synthetic time series generation.

**Input:** Trace Instance 1, Trace Instance 2

**Output:** A synthesized time series $\widehat{\mathbf{x}}$

1. $n\leftarrow 15000$ // The length of the synthesized time series
2. $l\leftarrow 275$ // The length of each subsequence in a motif whose prototype is taken from Trace Instance 1 or 2
3. $\mathbf{S}\leftarrow$ An $l\times l$ diagonal matrix with each diagonal entry equal to $0.001$
4. $\mathbb{Q}_{\mathrm{prototype}1}\leftarrow$ The set containing the beginning indices of the 12 subsequences for Prototype 1
5. $\mathbb{Q}_{\mathrm{prototype}2}\leftarrow$ The set containing the beginning indices of the 12 subsequences for Prototype 2
6. $\mathbb{Q}_{\mathrm{others}}\leftarrow$ The set containing the indices that are not covered by subsequences for Prototype 1 or 2
7. // Generating subsequences for Prototype 1 by sampling
   - **for** $i\in \mathbb{Q}_{\mathrm{prototype}1}$ **do** $\widehat{\mathbf{x}}_{i,l}\sim \mathcal{N}(\text{Trace Instance 1},\mathbf{S})$
8. // Generating subsequences for Prototype 2 by sampling
   - **for** $i\in \mathbb{Q}_{\mathrm{prototype}2}$ **do** $\widehat{\mathbf{x}}_{i,l}\sim \mathcal{N}(\text{Trace Instance 2},\mathbf{S})$
9. // Making the remainder standard Gaussian noise
   - **for** $i\in \mathbb{Q}_{\mathrm{others}}$ **do** $\widehat{\mathbf{x}}_{i,1}\sim \mathcal{N}(0,1)$
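Procedure 1 could be written in Python roughly as below. This is a sketch under stated assumptions: the two Trace instances are not included here, so a sine wave and a linear ramp of length 275 serve as hypothetical stand-in prototypes, and the start indices are chosen arbitrarily so that the 24 subsequences do not overlap. The variance is exposed as a parameter because Section 7.1 states 0.01 while Procedure 1 lists 0.001.

```python
import numpy as np


def synthesize(prototype1, prototype2, starts1, starts2,
               n=15000, variance=0.001, seed=0):
    """Plant noisy copies of two prototypes into standard Gaussian noise."""
    rng = np.random.default_rng(seed)
    prototype1 = np.asarray(prototype1, dtype=float)
    prototype2 = np.asarray(prototype2, dtype=float)
    l = len(prototype1)
    assert len(prototype2) == l, "both prototypes must have the same length"

    # Start from standard Gaussian noise everywhere (the 'others' indices).
    x = rng.standard_normal(n)

    # Overwrite each motif occurrence with a sample around its prototype.
    std = np.sqrt(variance)
    for prototype, starts in ((prototype1, starts1), (prototype2, starts2)):
        for i in starts:
            assert i + l <= n, "subsequence must fit inside the series"
            x[i:i + l] = rng.normal(loc=prototype, scale=std)
    return x


# Hypothetical stand-ins for Trace Instance 1 and 2 (not the real data).
proto1 = np.sin(np.linspace(0.0, 2.0 * np.pi, 275))
proto2 = np.linspace(-1.0, 1.0, 275)

# Arbitrary, non-overlapping start indices: 12 occurrences per prototype.
starts1 = [300 + 1200 * j for j in range(12)]
starts2 = [900 + 1200 * j for j in range(12)]

series = synthesize(proto1, proto2, starts1, starts2)
```

With these settings the planted subsequences cover $24\times 275=6600$ of the 15,000 positions, so the majority of the series remains noise, as described in Section 7.1.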

## References

1. Mueen, A. Time series motif discovery: dimensions and applications. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. **2014**, 4, 152–159.
2. Mueen, A.; Keogh, E.J.; Zhu, Q.; Cash, S.; Westover, M.B. Exact Discovery of Time Series Motifs; SDM: Sparks, NV, USA, 2009.
3. Yeh, C.M.; Zhu, Y.; Ulanova, L.; Begum, N.; Ding, Y.; Dau, H.A.; Silva, D.F.; Mueen, A.; Keogh, E. Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View That Includes Motifs, Discords and Shapelets. In Proceedings of the 2016 IEEE 16th International Conference on Data Mining (ICDM), Barcelona, Spain, 12–15 December 2016; pp. 1317–1322.
4. Mueen, A.; Chavoshi, N. Enumeration of time series motifs of all lengths. Knowl. Inf. Syst. **2015**, 45, 105–132.
5. Lin, J.; Keogh, E.; Lonardi, S.; Patel, P. Finding Motifs in Time Series. In Proceedings of the ACM SIGKDD, Edmonton, AB, Canada, 23–26 July 2002; pp. 53–68.
6. Chiu, B.; Keogh, E.; Lonardi, S. Probabilistic Discovery of Time Series Motifs. In Proceedings of the ACM SIGKDD, Washington, DC, USA, 24–27 August 2003; pp. 493–498.
7. Rakthanmanon, T.; Campana, B.J.L.; Mueen, A.; Batista, G.E.A.P.A.; Westover, M.B.; Zhu, Q.; Zakaria, J.; Keogh, E. Searching and mining trillions of time series subsequences under dynamic time warping. In Proceedings of the ACM SIGKDD, Beijing, China, 12–16 August 2012; pp. 262–270.
8. Yoon, C.E.; O’Reilly, O.; Bergen, K.J.; Beroza, G.C. Earthquake detection through computationally efficient similarity search. Sci. Adv. **2015**, 1, e1501057.
9. Senin, P.; Lin, J.; Wang, X.; Oates, T.; Gandh, S.; Boedihardjo, A.P.; Chen, C.; Frankenstein, S. GrammarViz 3.0: Interactive Discovery of Variable-Length Time Series Patterns. ACM TKDD **2018**, 12, 10:1–10:28.
10. Linardi, M.; Zhu, Y.; Palpanas, T.; Keogh, E. Matrix Profile X: VALMOD - Scalable Discovery of Variable-Length Motifs in Data Series. In Proceedings of the SIGMOD, Houston, TX, USA, 10–15 June 2018; pp. 1053–1066.
11. Yeh, C.M.; Kavantzas, N.; Keogh, E. Matrix Profile VI: Meaningful Multidimensional Motif Discovery. In Proceedings of the IEEE ICDM, New Orleans, LA, USA, 18–21 November 2017; pp. 565–574.
12. Mueen, A.; Keogh, E. Online Discovery and Maintenance of Time Series Motifs. In Proceedings of the ACM SIGKDD, Washington, DC, USA, 24–28 July 2010; pp. 1089–1098.
13. Lin, J.; Li, Y. Finding approximate frequent patterns in streaming medical data. In Proceedings of the IEEE International Symposium on CBMS, Perth, Australia, 12–15 October 2010; pp. 13–18.
14. Keogh, E.; Wei, L.; Xi, X.; Lee, S.; Vlachos, M. LB_Keogh Supports Exact Indexing of Shapes under Rotation Invariance with Arbitrary Representations and Distance Measures. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, 12–15 September 2006; pp. 882–893.
15. Torkamani, S.; Lohweg, V. Survey on time series motif discovery. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. **2017**, 7, e1199.
16. De Bie, T. An information-theoretic framework for data mining. In Proceedings of the ACM SIGKDD, San Diego, CA, USA, 21–24 August 2011; pp. 564–572.
17. De Bie, T. Subjective interestingness in exploratory data mining. In Proceedings of the IDA, London, UK, 17–19 October 2013; pp. 19–31.
18. De Bie, T. Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min. Knowl. Discov. **2011**, 23, 407–446.
19. Jensen, J.L.W.V. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Math. **1906**, 30, 175–193.
20. Google. Google Optimization Tools (OR-Tools). Available online: https://github.com/google/or-tools (accessed on 15 January 2018).
21. Chen, Y.; Keogh, E.; Hu, B.; Begum, N.; Bagnall, A.; Mueen, A.; Batista, G. The UCR Time Series Classification Archive. 2015. Available online: www.cs.ucr.edu/~eamonn/time_series_data/ (accessed on 15 January 2018).
22. Moody, G.B.; Mark, R.G. The Impact of the MIT-BIH Arrhythmia Database. IEEE Eng. Med. Biol. Mag. **2001**, 20, 45–50.
23. Open Power System Data. Data Package Time Series. Version 2018-03-13. 2018. Available online: https://data.open-power-system-data.org/time_series/ (accessed on 15 March 2018).
24. ENTSO-E. Detailed Hourly Load Data for All Countries 2006–2015. Available online: https://www.entsoe.eu/data/data-portal/ (accessed on 15 March 2018).
25. Contreras-Reyes, J.E. Renyi entropy and complexity measure for skew-Gaussian distributions and related families. Phys. A Stat. Mech. Appl. **2015**, 433, 84–91.
26. Liu, C. Nonsymmetric entropy and maximum nonsymmetric entropy principle. Chaos Solitons Fractals **2009**, 40, 2469–2474.
27. Potra, F.A.; Wright, S.J. Interior-point methods. J. Comput. Appl. Math. **2000**, 124, 281–302.

| n | l | Time (s) | n | l | Time (s) | n | l | Time (s) |
|---|---|---|---|---|---|---|---|---|
| 1800 | 100 | 9.96 | 3600 | 100 | 50.12 | 7200 | 100 | 369.92 |
| 7200 | 25 | 328.09 | 7200 | 50 | 350.65 | 7200 | 100 | 369.92 |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Deng, J.; Lijffijt, J.; Kang, B.; De Bie, T. SIMIT: Subjectively Interesting Motifs in Time Series. *Entropy* **2019**, *21*, 566.
https://doi.org/10.3390/e21060566
