SIMIT: Subjectively Interesting Motifs in Time Series

Numerical time series data are pervasive, originating from sources as diverse as wearable devices, medical equipment, to sensors in industrial plants. In many cases, time series contain interesting information in terms of subsequences that recur in approximate form, so-called motifs. Major open challenges in this area include how one can formalize the interestingness of such motifs and how the most interesting ones can be found. We introduce a novel approach that tackles these issues. We formalize the notion of such subsequence patterns in an intuitive manner and present an information-theoretic approach for quantifying their interestingness with respect to any prior expectation a user may have about the time series. The resulting interestingness measure is thus a subjective measure, enabling a user to find motifs that are truly interesting to them. Although finding the best motif appears computationally intractable, we develop relaxations and a branch-and-bound approach implemented in a constraint programming solver. As shown in experiments on synthetic data and two real-world datasets, this enables us to mine interesting patterns in small or mid-sized time series.


Introduction
There exists a myriad of data mining methods for time series data, ranging from fully-automated change detection, classification, and prediction methods, to exploratory techniques such as clustering and motif detection. Change and motif detection are related in the sense that local patterns (motifs) and longer-running changes in the profile of the time series need to be evaluated against a prior that specifies what the expected profile is, typically in the form of a probability distribution.
Prior work on time series motif detection tended to evaluate a motif's interestingness by assessing its significance against some objectively-chosen prior distribution for the time series (either explicitly or implicitly). The result is that the most "interesting" motifs found are often trivial, implied by the user's prior expectations. In contrast to this, we introduce an approach to identify recurring subsequence patterns that are subjectively interesting, i.e., interesting when contrasted with the user's prior expectations. A recurring subsequence is a subsequence that is found at several positions within the time series with some variation, and this will be called a motif.
To achieve this, we define subsequence patterns as local probabilistic models. The subjective interestingness of a subsequence pattern is then defined in terms of the amount of information (in an information-theoretic sense) contained in this local model, when contrasted with a background distribution that represents the user's expectations. Initially, the background distribution is computed as the distribution of maximum entropy subject to any prior user expectations as constraints, such as constraints on the expected mean, variance, and co-variance between neighboring points in the time series. Upon revealing the presence of a subsequence pattern, the background distribution is updated to account for this new knowledge, such that it continues to represent the (now updated) expectations of the user as subsequence patterns are revealed throughout an iterative analysis. The amount of information gained by the time series can be computed by contrasting the prior distribution and the updated distribution.
To find the most informative motifs and outliers efficiently, we develop relaxations and propose an effective search algorithm implemented in a constraint programming solver. Together with an additional heuristic pruning technique, this enables one to mine subsequence patterns relatively efficiently.
Our specific contributions are: -Novel definitions of motifs as probabilistic patterns (Section 3). -A quantification of their Subjective Interestingness (SI), based on how much information a user gains when observing this pattern (Section 4). - A relaxation of the exact setting and an algorithm to efficiently mine the most interesting subsequence patterns for a user (Section 5). -Several speedup techniques that result in a computationally more efficient algorithm (Section 6). -Empirical evaluation of this algorithm on one synthetic dataset and two real-world datasets, to investigate its ability to encode the user's prior beliefs and identify interesting subsequence patterns (Section 7).

Related Work
Time series motifs usually hint at useful information about seasonal or temporal associations between events, and detecting such patterns can be very useful in practice. A myriad of techniques for motif discovery have been proposed. These can be categorized from different perspectives, starting with the definition of the interestingness measure for a motif. In general, two main aspects for judging the interestingness of a motif exist in the literature, namely the similarity among instances and the support (i.e., the number of instances in a motif) [1]. More specifically, one prioritizes a motif whose instances exhibit maximum similarity, or even more strictly, defining a motif as the most similar subsequence pair (e.g., [2][3][4]); whereas the second prioritizes one with the highest support given a minimum similarity between all instances of a motif (e.g., [5,6]).
For existing work adopting either similarity-based or support-based interestingness, the similarity measure plays a key role in the motif discovery algorithms, and typical ones include Euclidean distance and dynamic time warping. Regarding the massive computational cost, some efforts have been made to represent time series in low dimensional space. Examples of such representations include Symbolic Aggregate Approximation (SAX), DFT, and random projections. A review of motif discovery algorithms based on their similarity measure and representation was provided by Mueen [1].
In addition to these aspects, there exist several challenging issues in this pattern discovery problem, including scalability [3,7,8], the detection of motifs with various lengths [9,10], multi-dimensional time series [11], coping with streaming data [12,13], and handling distortions [14]. For a more comprehensive review of existing publications regarding these issues, we refer the interested readers to Torkamani and Lohweg [15].
Our work explores a new aspect, shedding light on the essence of the interestingness for a motif, which we believe depends on a user's prior knowledge. Previous measures that prioritized either the similarity or support were all objective. However, for a user with prior information about the time series (a common situation), the resulting motifs may be trivial. Hence, we propose a novel subjective interestingness measure, which enables ones to identify motifs that contradict their prior expectations and are truly interesting to them. Additionally, the information-theoretic view that we take immediately provides a balance between the similarity and numerosity for a set of subsequences to form a motif.

Motifs and Motif Templates
We denote a time series asx (x 1 , . . . ,x n ) ∈ R n , i.e., an ordered collection of n real numbersx i ∈ R, where i ∈ [n] = [1, . . . , n]. We writex i,l forx for the subsequence of length l ≤ n − i + 1 starting from position i. That is,x i,l (x i , . . . ,x i+l−1 ) ∈ R l . By sliding a window of size l alongx and extracting each subsequence, we can obtain a set containing all the subsequences of length l. We denote this set as S l , i.e., S l = {x i,l |i = 1, 2, . . . , n − l + 1}. Note that hatted symbols represent empirical values, and their non-hatted equivalents are used to denote the respective random variables.

Motif
A motif of length l denoted by T l is a subset of S l containing more than two non-overlapping subsequences. That is, T l ⊆ S l , |T l | ≥ 2, |i − j| ≥ l, ∀x i,l ,x j,l ∈ T, and i = j.
Each subsequence in a motif is said to be an instance of the motif. As we focus on identifying motifs of a fixed length (i.e., l), we write T for T l in the rest of the paper for convenience. Not every motif is equally interesting. The criterion by which we judge the quality of a motif is explained below.
The index set of a motif T is denoted as I T , i.e., I T = {i|x i,l ∈ T}.

Motif Template
Our general aim is to find subjectively interesting "motifs". However, what one typically means is not actually a set of subsequences that are similar, but a general subsequence pattern that is recurring in a time series. To avoid working with a set of subsequences, one could use a single exemplar. Here, we introduce a probabilistic local model as the target object, the motif template, instead.

Definition 1 (Motif template).
A motif template is a probability distribution over the space of motif instances, i.e., R l .
More concretely, we propose a template where we capture the mean and variance statistics of instances and call this a mean-variance motif template. We deem the roles played by these two statistics essential, as the mean serves as a figure about the motif shape, and the variance tells the extent of the similarity among these instances. A typical choice of model is a multivariate Gaussian distribution parameterized by the mean and variance statistics.
It is in principle straightforward to also use covariance statistics, but such a model has O(l 2 ) parameters and is not interpretable. Thus, we define a mean-variance motif template as: Definition 2 (Mean-variance motif template). A mean-variance motif template is a multivariate Gaussian distribution N (µ, Σ) over the space of motif instances. Σ is the diagonal matrix with the values of standard deviations as the main diagonal and zero elsewhere. Hence, this distribution can be essentially parameterized by a tuple (µ, σ), where µ is a vector of means and σ is a vector of standard deviations, both of length l.
In this paper, we take µ, σ as the maximum likelihood parameters over the set of instances in a motif. We denote the parameter tuple for the motif template learned from the motif T as (µ T , σ T ). That is, Examples are given in Figures 1-3.

Formalizing the Subjective Interestingness
Previous motif discovery work tended to quantify the interestingness in an objective way (see Section 2). For a data analyst with prior knowledge about the time series, which we believe is common, the discovered patterns may be trivial to the end user and could be easily implied. To preempt this, we propose to use a more flexible subjective measure of interestingness.

The Background Distribution
We follow the so-called FORSIED (an acronym for "Formalizing Subjective Interestingness in Exploratory Data mining") framework [16,17] to quantify the subjective interestingness of a motif. The basic procedure is that a background distribution is defined over the space of all possible datasets, which here would be all possible realizations of a time series x. Since x ∈ R n , the background distribution is defined by a probability density function p. The background distribution essentially encodes the beliefs and expectations of the user about the data. More specifically, it assigns a probability density to each possible data value according to how tenable the user thinks this value is. It was argued that a good choice for the background distribution is the maximum entropy distribution subject to constraints that capture the user's prior expectations about the data.

The Initial Background Distribution
We wish to define constraints and compute a maximal entropy distribution such that these constraints are preserved in expectation. For the initial background distribution, we consider three kinds of constraints. They respectively express the user's prior knowledge about the mean and the variance of each data point, as well as the first order difference in x. Notice that these expectation values can be anything; here, we equate them to the empirical values. With these three constraints, the initial background distribution is the solution to Problem 1 as stated as follows: and 1 is an n-dimensional vector with all the entries as one.
The solution to Problem 1 is a multivariate Gaussian distribution parameterized by an n-dimensional mean vector m and an n × n covariance matrix V. The values of m and V can be derived by applying the Lagrange multiplier method. We further improve the computation efficiency by using the property that maximizing the entropy and maximizing the likelihood are the dual of each other in the class of exponential form distributions [18]. The computation details are given in Appendix A.

Updating the Background Distribution
Once a motif template along with its instances is identified and shown to the user, the user's belief state changes, and the background distribution needs to be updated. The background distributions p for all prior belief types discussed in this paper are essentially multivariate Gaussian distributions each of which is parametrized by m and V. As mentioned, the motif template is also described by a multivariate Gaussian distribution, N (µ T , Σ T ). To make the updated background distribution reflect the user's newly-acquired knowledge, we simply set the blocks of current m and V corresponding to the subsequence instances equal to µ T and Σ T and the off-diagonal elements of V corresponding to instances equal to zero. We denote the background distribution having incorporated T as p T .

Remark 1. We do not assume independence between time points.
While in the local motif model (i.e., the mean-variance motif template), time points are indeed independently distributed (see Section 3.2), this is not the case for the model of the whole time series x (indeed, the full covariance matrix is not necessarily diagonal). Moreover, it is important to realize that the background distribution is a model for the user's belief state; it is not a model for the stochastic source of the data. In other words, if the background distribution does not exhibit a certain dependency, this does not mean that the data may not come from a stochastic source that exhibits this dependency. It only means that the user whose belief state is modeled by this background distribution is not yet aware of it. As the covariance matrix is not diagonal, it is indeed the case that updating the expected value even for a single point in the time series can ripple across the sequences and modify the expected values throughout.

The Subjective Interestingness Measure
Intuitively, a good motif is one whose instances are strongly similar to each other and together account for a considerable portion on the whole time series. Consider such a good motif T. If all instances are similar to each other, it directly follows that the values of µ T are similar to those of each instance, and the diagonal entries of Σ T are small. After revealing the motif to the user, the background distribution is updated to be p T . Since the parameters of p T consist of µ T and Σ T , the new background distribution p T will thus be a more accurate model for the time series. More precisely, the probability of the data under p T is larger. To quantify the amount of information gained by the motif, we can compare this probability to the one under the previous background distribution p. The more strongly they differ, the more this motif enhances with the user's beliefs about the data.
Mathematically, we define the Information Content (IC) of a motif as the difference between the log probability for the whole time seriesx under p T and that under p: The rationale is that minus the log probability of the data represents the number of bits of information the data contain with respect to the probability distribution, so this difference corresponds to the amount of information (in bits) the user has gained by seeing the motif.
Note that the expected value of IC(T) with respect to p T (x) takes the same form as the Kullback-Leibler divergence, but this does not mean that IC and KL-divergence are equivalent concepts. The KL-divergence measures the difference between two probability distributions, but here, the p T (x) and p(x) in the definition of IC(T) are probabilities rather than distributions.

Finding the Most Subjectively Interesting Motif Template
Now, we can formalize our goal of finding the most interesting motif in a time series as an optimization problem with the following objective: Objective 1 accounts for the probability of the whole of the data. This probability depends on the parameter updating of p (i.e., m and V ) from incorporating subsequences and can thus embody the quality of the choice for template instances. Note that the key changes of m and V only take place on the part of their entries that represent instances in T. That means, the rise in the probability of the whole of the data is mostly related to the probability of those instances in T. Based on this observation, we propose a relaxed version of Objective 1, which only depends on the probability of instances in T. This objective is similar to Objective 1, but is more straightforward to optimize efficiently.

Method
In this work, we adopted a greedy search algorithm to identify the most interesting motif. The general idea is to first seed T by finding a small set of k instances according to Objective 2 and then greedily grow that set using Objective 1.
The algorithm consists of three major steps: 1. Model the user's prior belief by the initial background distribution; 2. Seed by finding a small set of instances that optimizes Objective 2; 3. Grow that set by adding an instance that optimizes Objective 1 and iterate.

Remark 2.
Although the three basic steps are for finding a single motif (i.e., the most interesting one to the user), our algorithm is not limited to that. A new search for another motif can be triggered by running Step 2 and Step 3 again based on an updated background distribution, the one that has already incorporated the user's knowledge of the previous motif.
How we compute the initial background distribution (i.e., Step 1) is described in the above (see Section 4.1.1). In the following, we go into more details of Steps 2 and 3.

5.1.
Step 2: Finding a Seed Motif T (0) with k Instances The search starts by finding k non-overlapping optimal instances that constitute a seed set for T. We denote such a seed set by T (0) . The most subjectively interesting T (0) is identified by optimizing Objective 2. This problem can be formulated as: where The superscript of a vector or matrix symbol is used to denote the corresponding entry. Using the expression for the multivariate Gaussian distribution, we can write Equation (6) as: log N x i,l |m (i:i+l−1) , V (i:i+l−1,i:i+l−1) Note that the parts related to the choice of instances in T (0) are underbraced and numbered respectively as I and I I. By taking a closer look, we can see that part I I is essentially the sum of all the individual negative log probabilities ofx i,l under p, and the values for parameters m and V are not subject to which instances to incorporate. This allows gaining some computational benefits by simply pre-computing each log probability. Nevertheless, I expresses a mutual relationship among all the instances in T (0) , due to its being in the summation form for the logarithm of a summation. Pre-computation is not trivial, which makes the search for optimal instances computationally demanding, reaching O(n k k 2 ). We thus adopted a strategy to mitigate a certain factor of this time complexity, as well as a heuristic to prune the search space. A detailed description is provided in Section 6.

Step 3: Greedily Searching for a New Instance
The algorithm then continues to search for a new subsequence that optimizes Objective 1. The search stops when no new subsequence exists such that incorporating it can increase the probability of the time series under the background distribution, i.e., i ∈ To gain some speedup, we pruned subsequences that posed little potential according to a heuristic (see Section 6).

Speedup Techniques
In this section, we describe some speedup techniques applied to Step 2 (Section 6.1) and Step 3 (Section 6.2). Recall only terms I and I I in the objective of Problem 2 (i.e., Equation (7)) are affected by the chosen instances for T (0) , and the term I makes the search computationally expensive. To mitigate the time complexity, we considered optimizing a relaxed objective of Problem 2 based on bounding the term I. Via applying Jensen's inequality [19], the term I can be upper bounded by a summation form taken from all the instance pairs: Substituting Equation (8) into Equation (7) yields: Finding the maximal value for the objective (Equation (9)) is essentially the same as maximizing term I I I+ term I I. Then, we constructed a matrixM, with rows and columns representing subsequence candidates, the ith diagonal entryM i,i being the part of the term I I inside the summation (Equation (11) below) and the entry at the ith row and jth columnM i,j being the part of the term I I I inside the outer summation (Equation (12) below).
Solving Problem 2 corresponds to finding the upper triangular matrix insideM with the maximum sum, as expressed in the following problem: |i − j| ≥ l, ∀i, j ∈ I T (0) and i = j.
The fourth set of constraints (Equation (14)) is to ensure instances in T (0) are non-overlapping with each other. This speedup technique enables us to compute the matrixM in advance and then do the search using Constraint Programming (CP). The time complexity of a relaxed Problem 2 is O(n k ), a factor of k 2 less than Problem 2. Clearly, it still appears intractable for real-world applications. To counter this, we deliberately reduced the search space so that each element of T (0) was constrained to be in a pruned range, denoted by PrunedSubsequenceSet (Equation (13)). The way we constructed PrunedSubsequenceSet is described in the following.

Strategy 2: Pruning
The exhaustive search for a solution to the relaxed Problem 2 is still computationally demanding for a largeM. We thus adopted a heuristic strategy so that the search was in a considerably reduced space, but the quality of the found motifs was guaranteed.
It appears that an off-diagonal entry at the ith row and jth columnM i,j models a sort of similarity between the subsequencex i,l andx j,l . As the transition property of the similarity suggests, ifM i,v and M j,v are large, then so isM i,j . We can deduce that all the entries inM mapped from I T (0) should have a relatively larger value. Hence, we can deliberately perform the search in a pruned range of subsequences whose indices corresponding to largest entries inM. Specifically, we fixed the first instance to be a certain subsequence and searched the others among subsequences corresponding to the largest 1% entries at a row ofM corresponding to this instance (i.e., pruning factor = 99%). To find the globally optimal k instances for T (0) , we fixed the first instance to be each possible subsequence and solved the relaxed Problem 2 each time. The final solution should be the one that leads to the maximal objective value.

Speeding Up Step 3
In Step 3, the exhaustive search for a new optimal instance requires checking the result of Objective 1's value for incorporating every possible subsequence, which is apparently time consuming for largex. Clearly, incorporating subsequences that bear strong similarity with instances in T (0) can result in a high value for Objective 1. As the off-diagonal entries inM encode a similarity between subsequence pairs, we applied a heuristic pruning strategy based on entries inM to reduce the search space.
Assume we are in the stage of having incorporated all the k instances in T (0) . Let us denote the currentM asM k . The new optimal instance must be among those that can produce a relatively large value of the objective for Problem 3, but based onM k+1 , whose entry at the ith row and jth column (i = j) is j,l ) 2 } = k−1 kM k i,j (recall thatM k i,j is computed by Equation (12)). The objective for Problem 3 is in the form of summing some entries ofM k+1 that correspond to instances in T (0) and the new subsequence (e.g.,x r,l ). Thus, the potential ofx r,l (i.e., Potential(x r,l )) can be captured by how much the value of the objective for Problem 3 (Equation (10)) increases if incorporatingx r,l : The algorithm ranks all possible subsequences in descending order according to their potential values. Then, the search is implemented in a greatly reduced space (i.e., among those in the top 1% of the rank). Let us denote the optimal subsequence that leads to the highest Objective 1 value by T k+1 . We first check whether the probability of the time series increases under the new background distribution. If so, we include T k+1 in T. Nevertheless, for incorporating the next subsequence, a further check is performed. First, we update the search domain, as well as the potential rank by deleting all the subsequences overlapped with T k+1 . We do Step 3 to identify the new optimal one. If this subsequence is still among the top three of the potential rank and incorporating it did not trigger the stop condition, we make it the (k + 2)th instance of T. Otherwise, we recompute the potential and rank all the subsequences again, according toM k+2 , the one considering T k+1 as an incorporated instance. Then, Step 3 is done again among an updated search domain. However, there might occur a situation where the new optimal one is still not ranked among the top three. In this case, if the stop condition is not reached, we make it an instance of T anyway. By this lazy greedy strategy, the search space is significantly reduced, while a good quality of the incorporated instance is ensured to a satisfiable extent.

Experiments
This section describes the evaluation of our proposed algorithm on a synthetic and two real-world datasets. In the following, we first describe the datasets (Section 7.1). Second, we empirically analyze how the pruning percentage in the initial set selection affects the quality of the result in terms of the SI (Section 7.2). Then, we discuss the properties of the discovered motifs in each dataset to assess their validity (Section 7.3).
All experiments were conducted on a PC with Ubuntu OS, Intel(R) Core(TM) i7-7700K 4.20-GHz CPUs, and 32 GB of RAM. The main algorithm was implemented in MATLAB R2016b. The step of identifying the initial motif template was coded in Python 3.5, in which the open source software OR-Tools 6.10 [20] developed by Google was used as the constraint programming solver. All the computer codes are available at https://bitbucket.org/ghentdatascience/simit-public/src.

Data
• Synthetic time series: We synthesized a time series of length 15,000. This series included 2 sorts of motif trends, and their prototypes were taken from 2 subsequence instances in the UCRTrace Data [21]. Both instances were of the same length as 275, but belonged to different classes. Subsequences for each motif were generated by sampling from a Gaussian distribution with the mean as the corresponding instance and a reasonably small variance as 0.01. There were in total 12 subsequences for each motif. The remaining were standard Gaussian noises, and they constituted a major part of the whole series. More details about the data synthesizing process are described in the pseudocode Procedure 1 in Appendix B.
• MIT-BIHarrhythmia ECG recording: This dataset was Recording #205 in the MIT-BIH Arrhythmia DataBase [22]. This recoding was created from digitizing the ECG signals at 360 samples per second.
We chose a part of 20 s (7200 samples) to experiment on that included normal heartbeats and ventricular tachycardia beats.
• Belgium Power Load Data: This dataset was taken from Open Power System Data [23]. The primary source of these data was ENTSO-E Data Portal/Power Statistics [24]. Open Power System Data then resampled and merged the original data in a large CSV file with hourly resolution. The part we selected to experiment on recorded the total load in Belgium during the year 2007, for a total length of 24 × 365 = 8760.

Pruning and Scalability
For all the experiments, we first identified an initial motif T (0) with k = 4 instances. As mentioned above, our algorithm searches among a space pruned in a particular heuristic way to gain some relative amount of efficiency. The effects of pruning in the initial set identification were tested on the synthetic time series, for which the correct answers were known. The results indicated that the optimal one was still found even with the heaviest pruning (99.9%). Therefore, we used 99% pruning in the experiments on the real-world datasets. The scalability of our algorithm, with respect to the length of the motif template and to the length of the time series, was evaluated on the ECG recording. Table 1 shows that the length of the motif template did not influence the computational cost that much, but the influence of the time series length was more than quadratic.

Synthetic Data
In this experiment, we specified the length of the motif instance the same as the length of the subsequence synthesized by sampling (i.e., l = 275). As expected, our algorithm identified two motifs embedded in this synthetic time series, the result of which is shown in Figure 1. The whole time series is plotted in Figure 1a, and subsequences incorporated into the first motif set are exactly those sampled from the same Gaussian distribution (in red). Figure 1b illustrates the first motif template by plotting the mean of all the instances incorporated into this motif, as well as the error bars indicating the variance of each point. Our algorithm also correctly identified the second motif (marked in green in Figure 1a). We modeled the user's knowledge about the first motif by triggering the new search on a new original background distribution, the one that took into account of all the instances for the first motif. The second motif is displayed in Figure 1c.

ECG Time-Series
We analyzed the ECG data by identifying motifs with length 100, corresponding to a duration of 0.28 s. In this fairly short recording (see Figure 2a), our algorithm identified three motifs. The first two motifs corresponded to normal heartbeats (highlighted with red and green; templates shown in Figure 2b,c). We can see that their shapes mostly coincide, with a horizontal shift. Normal heart beats are deemed to be similar to each other, but within each one, there may exist a particular subsection that bears more similarity than other subsections. Since the motif length was set to be less than a period of a normal heart beat, our algorithm was prone to regarding those subsections that bore the similarity to a different extent to be in different motif sets. Another motif identified by the algorithm lied in the area of the ventricular tachycardia (pink sections). The instances did not cover all the ventricular tachycardia heart beats, but the small error bars in Figure 2d indicate that these instances were uncannily similar to each other, and the reason why other ventricular tachycardia subsequences lost the membership for this motif set was their smaller similarity.

Belgium Power Load Data
We analyzed these data searching for motifs of length 24 (one day). The first four motifs discovered by our algorithm are displayed in Figure 3. The first motif covered many weekdays, except for Fridays, during cold seasons (highlighted with red in Figure 3a). All these 24-h periods started at 15:00. Note that not all the Mondays-Thursdays during these months were identified as the motif instance, for example those blue sections both at the very beginning and the end of this whole series. The reason could be that they corresponded to holidays rather than normal workdays. As for other workdays in winter excluding Friday that did not belong to this motif, these are very interesting for energy analysts to analyze their reason. After modeling the user's knowledge about this motif, our algorithm then identified the second motif, corresponding to Monday-Thursday as well, but during hot seasons (highlighted with green in Figure 3a). Most days in July were not instances of this motif. This might be due to them being during summer holiday time (a noticeable blue and pink section that divides the green section in Figure 3a). Actually, part of these days (i.e., Monday-Thursday in the last two weeks of July) constituted the third motif (pink sections in Figure 3a). The first 3 motifs were all related to normal workdays excluding Friday, but in different temperature conditions. It seems that power consumption in hot seasons is less regular than that during cold seasons, as the normal workday pattern relating to cold periods was identified first (i.e., the first motif). This phenomenon could be very interesting for energy analysts to investigate. By incorporating these 3 motifs into the user's belief model, our algorithm identified the fourth motif, corresponding to some Sundays from the middle of April to the beginning of October (black sections in Figure 3a). All the instances belonging to the same motif corresponded to daytime starting at exactly the same hour, and they were strongly similar to each other, as reflected by the small error bars in the illustration of each motif template (Figure 3b-d).

Conclusions
We proposed a new methodology for motif discovery called SIMIT (for Subjectively Interesting Motifs in Time Series) and a concrete implementation for a specific type of motifs where the interestingness score can incorporate prior beliefs, and hence are subjectively interesting. We developed a relaxation of this interestingness score with bounds that can be optimized relatively efficiently using constraint programming. An empirical evaluation demonstrates the potential of the proposed approach.
For future work, it would be useful to develop a motif template that incorporates a form of time warping. Secondly, the length of the subsequences considered is currently a parameter and could be optimized as well. To make this possible, further speedup techniques should be developed. In contrast to motifs, outliers are subsequences that are unusual and nonrecurring in a time series. Identifying subjective outliers can also be interesting. Moreover, the proposed motif templates are based on multivariate Gaussian distributions. An extension to multivariate non-Gaussian distributions [25] with the use of non-symmetrical entropy [26] seems promising. Finally, an extension towards multivariate time series is useful.

Appendix A. Solving Problem 1
The solution of Problem 1 follows directly from applying the method of Lagrange multipliers. Let us use Lagrange multipliers λ 1 , λ 2 , λ 3 for constraints (Equations (2)-(4)), respectively, λ for the vector containing all Lagrange multipliers. The Lagrangian is formulated as: Differentiating Equation (A1) w.r.t. p(x) and renormalizing by dropping the dx factor yields: Equating Equation (A2) to zero and solving for p(x) gives: where Z is the normalization variable, as the implicit normalization constraint p(x)dx = 1 can be imposed constructively by setting Z to be: By expanding quadratic terms in Equation (A3) and reorganizing, we obtain: 2λ 3 x i x i+1 + λ 2 nm 2 .
After some algebra: where m =m1 V −1 1,1 = V −1 n,n = λ 2 + λ 3 , V −1 i,i = λ 2 + 2λ 3 for i = 2, . . . , n − 1 V −1 i,j = −λ 3 for |i − j| = 1, V −1 i,j = 0 for |i − j| > 2, As is clearly seen from Equation (A4), p(x) is essentially a multivariate Gaussian distribution with the mean vector m and the covariance matrix V. Note that the inverse of the covariance matrix V −1 , known as the precision matrix, in our case is a tridiagonal matrix whose nonzero elements can be determined by λ 2 and λ 3 . As maximizing the entropy and maximizing the likelihood are the dual of each other in the class of exponential families [18], the optimal values of the Lagrange multipliers λ 2 , λ 3 can be found by maximizing the likelihood over the observations L(x|λ). This problem can be formulated as: Problem A1 is convex and can be solved by using standard techniques for convex optimization (e.g., the interior point method [27]).