Transitional SAX Representation for Knowledge Discovery for Time Series

Numerous dimensionality-reducing representations of time series have been proposed in data mining and have proved useful, especially for handling high volumes of time series data. Among them, widely used representations such as symbolic aggregate approximation and piecewise aggregate approximation focus on the local averages of a time series. To compensate for this limitation, several attempts have been made to include trend information. However, the included trend information is quite simple, leading to great information loss. Such information is hardly extendable, so raising its level of detail is difficult. In this paper, we propose a new symbolic representation method called transitional symbolic aggregate approximation, which incorporates transitional information into symbolic aggregate approximations. We show that the proposed method, whose distance measure lower-bounds the Euclidean distance, preserves meaningful information, including dynamic trend transitions in segmented time series, while still reducing dimensionality. We also show that the method is advantageous in terms of interpretability and that it outperforms existing symbolic representation methods in time-series classification tasks.


Introduction
Most of the real-world applications, such as financial assessment, weather monitoring, medical data examination, and multimedia systems generate huge amounts of time-series data daily. One of the main characteristics of time series data is high-dimensionality, which leads to the development of efficient data representation techniques that not only reduce the high dimensionality but also preserve the meaningful characteristics. In addition, a desirable distance measure for the reduced time series representation needs to be defined carefully for various data-mining tasks, such as indexing, searching, classification, clustering, motif discovery, anomaly detection, and rule discovery.
Some well-known dimensionality-reducing representations for time series are the discrete Fourier transform (DFT) [1], discrete wavelet transform (DWT) [2], discrete cosine transform (DCT) [3], singular value decomposition (SVD) [4], piecewise aggregate approximation (PAA) [5], adaptive piecewise constant approximation (APCA) [6], and symbolic aggregate approximation (SAX) [7]. Most of the above-mentioned techniques, except SAX, produce real-valued representations, which are more expensive in storage and computation than symbolic representations for high-dimensional time series data. The SAX method transforms a real-valued time series into a symbolic string in two main steps: (1) transforming the original time series into its PAA, and (2) converting the PAA values into alphabetic symbols under the assumption that the normalized data follow a normal distribution. Symbolic representations make it possible to use the many available string-based algorithms and diverse data structures in time series mining tasks. In addition, the distance measure associated with SAX lower-bounds popular distance measures defined on the original data. Owing to its storage efficiency, time efficiency, and answer-set correctness (no false dismissals), SAX has been widely used in applications such as semantic sensor networks [8], mobile data management [9], and data visualization tools [10].
Though SAX is widely adopted for its simplicity and efficiency, it incurs considerable information loss. The traditional SAX method removes trend and shape information from a time series, assuming that any portion of a time series contains intermingled up-and-down trends; it uses only the averaged values of subsequences. Notably, SAX discretization does not guarantee equally probable symbols owing to its intermediate PAA [11]. Because PAA is applied before the symbolic mapping, the distribution of the data is altered and its standard deviation shrinks. This shrinkage makes the symbol distribution deviate from the target distribution and degrades the symbolic representation. Recently, researchers have improved SAX representations and the associated distance measures in various ways to compensate for the information loss. The original SAX representation has been integrated with, for example, a modified lookup table and a regression slope. We review some of these improvements to the SAX representation and its distance measures below.
Genetic-algorithm SAX (GASAX) was proposed to determine breakpoints using a genetic algorithm (GA) [12]. The objective of the GA is to find a nearly optimal configuration of breakpoints that gives the best fitness. The authors argued that the normality assumption oversimplifies the problem of SAX representations and may result in high error when performing time-series mining tasks. Although GASAX works well on both normalized and non-normalized time series data, it requires suitable control parameters for its operators and fails to include trend information. Extended SAX (ESAX) [13] enhanced SAX by adding two new points, the maximum and the minimum, to the original SAX representation. Using financial time-series data, the research showed that ESAX representations are more precise than those of SAX without losing the symbolic nature of the original SAX. On the other hand, the storage cost of ESAX is triple that of the original SAX, since it stores the maximum and minimum along with the sample mean for each segment. Since SAX representations have low accuracy when distinguishing time series with similar average values but different trends, several attempts were made to qualitatively define a few trends, such as slight up/down and substantial up/down. Sun et al. defined a SAX-based trend distance (SAX-TD) quantitatively by using the starting and ending points of a segment and improved the original SAX distance [14].
Yin et al. proposed trend feature symbolic approximation (TFSA) using a two-step segmentation technique for rapid segmentation in long time series data [15]. TFSA, satisfying a lower bound criterion, showed better segmentation and classification accuracy. Malinowski et al. also represented a time series as a sequence of symbols consisting of the average and trend for each segment [16]. Basically, it is an application of linear regression to time series sub-segments, and symbols take into account information on the sample averages and slope values. This method, called 1d-SAX, improved retrieval performance, while the compression ratio remained similar to the original SAX.
In this paper, we propose a new symbolic representation method that incorporates transitional information of values over time, so that the direction in which the values behind a symbolic aggregate approximation move can be tracked easily. We aim to capture important patterns in a systematic and meaningful fashion and append them to a piecewise representation method for time series, such as SAX or PAA. We chose to combine the proposed method with SAX because of its popularity and performance. Since both SAX and PAA suffer from low classification accuracy due to a high level of information compression, that is, information loss, the proposed representation improves classification while preserving interpretability.
The remaining part of this article is organized as follows: Section 2 contains the background of SAX. Section 3 describes the proposed approach for improving SAX with trend information. Section 4 shows experimental designs and results to verify the performance and interpretability of the proposed representation. Section 5 concludes the research with future research directions.

Preliminary: PAA and SAX
In this section, we briefly review SAX. SAX is a time series representation method based on the piecewise aggregate approximation (PAA) of time-series subsequences. Given a time series $X := \{x(t)\}_{t=1,\ldots,N}$, PAA divides $X$ into $n$ equally sized segments, $X_p := \{x(t)\}_{t=K(p-1)+1,\ldots,Kp}$, $p = 1, \ldots, n$, where $N$ is divisible by $n$ ($n \ll N$) and $K = N/n$. It evaluates the local average of the $p$-th segment $X_p$:

$$\bar{x}_p = \frac{1}{K} \sum_{t=K(p-1)+1}^{Kp} x(t).$$

Next, the method transforms $X$ into a representation vector $\{\bar{x}_p\}_{p=1,\ldots,n}$, an efficient dimensionality reduction from $N$ to $n$ that weakens the influence of noise in $x(t)$. The SAX method maps $\bar{x}_p$ to a symbol in consideration of the value space of $X$. For the mapping, it further divides the value range, or value space, of the segments $X_p$ into several non-uniform regions under the normality assumption and assigns a symbol to each region.
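The two steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the reference SAX implementation: the function names `paa`, `sax_breakpoints`, and `sax_word` are hypothetical, the series is z-normalised with the population standard deviation, and letters are assigned from `a` upward starting at the lowest region.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def paa(x, n):
    """Piecewise aggregate approximation: local averages of n equal-sized segments."""
    N = len(x)
    if N % n != 0:
        raise ValueError("N must be divisible by n")
    K = N // n  # number of points per segment
    return [mean(x[K * p:K * (p + 1)]) for p in range(n)]


def sax_breakpoints(alpha):
    """Interior breakpoints cutting N(0, 1) into alpha equiprobable regions."""
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf(k / alpha) for k in range(1, alpha)]


def sax_word(x, n, alpha):
    """Z-normalise x (assumes a non-constant series), apply PAA, map averages to letters."""
    mu, sigma = mean(x), pstdev(x)
    z = [(v - mu) / sigma for v in x]
    cuts = sax_breakpoints(alpha)
    # bisect_right places a value equal to a breakpoint in the upper region,
    # matching half-open regions of the form [lower, upper).
    return "".join(chr(ord("a") + bisect_right(cuts, v)) for v in paa(z, n))
```

For example, `sax_word([0, 0, 0, 0, 10, 10, 10, 10], n=2, alpha=3)` maps the low first half to the lowest letter and the high second half to the highest one.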

The Proposed Method, Transitional SAX
Starting with the definition and transition of value spaces in detail, we introduce the proposed method.

Transitional Information in Sub Value Spaces
We first assume that the values $x(t)$ in time series $X$ follow a normal distribution after normalization and detrending, as widely adopted in the literature [7,17]. Note that we use the original time series $X$ rather than a segment $X_p$ to increase the validity of the assumption. We divide the value space into regions of equal probability. To describe the regions in detail, we define a sub value space to be an interval $S_k = [y_{k-1}, y_k)$, $k = 1, \ldots, \alpha$, such that $\Phi(y_k) - \Phi(y_{k-1}) = 1/\alpha$, in which $\Phi$ is the cumulative distribution function of a normal distribution with the sample average and the sample variance of $x(t)$. The parameter $\alpha$ is the number of sub value spaces. In the following experiments, we show how to set $\alpha$ through a cross-validation procedure with training datasets. Observe that $y_0$ is the minimum of all values $x(t)$ and $y_\alpha$ is the maximum. Given the sub value spaces $S_k$, various feature reduction and extraction approaches are possible for expressing the time series $X$. For example, the SAX method assigns a symbol to a sub value space, reducing a numerical piecewise approximation to a symbol. In this paper, we aim to include transitional trend information by extracting transition counts. For a segment $X_p$, we count the number of transitions from sub value space $S_i$ to $S_j$, denoted by $\gamma_{i,j}$ with $i, j = 1, \ldots, \alpha$, as follows:

$$\gamma_{i,j} = \sum_{t=K(p-1)+1}^{Kp-1} I\big(x(t) \in S_i \text{ and } x(t+1) \in S_j\big),$$

where $I(A)$ is 1 if the relation $A$ is true, and 0 otherwise. Let us define $\bar{x}^{(i)}$ to be the average of $x(t)$ in sub value space $S_i$:

$$\bar{x}^{(i)} = \frac{\sum_{t} x(t)\, I\big(x(t) \in S_i\big)}{\sum_{t} I\big(x(t) \in S_i\big)}.$$

Applying all combinations of sub value spaces $S_1, S_2, \ldots, S_\alpha$, we form a transition matrix $\gamma = [\gamma_{i,j}]$ of size $\alpha \times \alpha$, in which element $\gamma_{i,j}$ is the number of transitions from $S_i$ to $S_j$. The transition matrix $\gamma$ lets us state terms relating to trend. We call the collection of $\gamma_{i,j}$ in which $|i - j|$ is constant a trend. In particular, $\sum_{i=1}^{\alpha} \gamma_{i,i}$ defines a sojourn trend as the sum of the individual pieces of sojourn information $\gamma_{i,i}$.
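The construction of equiprobable sub value spaces and the counting of transitions within one segment can be sketched as follows. This is an illustration under assumptions: the helper names are hypothetical, only the interior boundaries $y_1, \ldots, y_{\alpha-1}$ are stored, indices are 0-based (row `i` stands for $S_{i+1}$), and only steps between consecutive points inside the same segment are counted.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, stdev


def sub_value_space_cuts(x, alpha):
    """Interior boundaries y_1..y_{alpha-1} under a normal fit to the whole series x."""
    fitted = NormalDist(mean(x), stdev(x))  # sample average and sample std of x(t)
    return [fitted.inv_cdf(k / alpha) for k in range(1, alpha)]


def region_index(value, cuts):
    """0-based index of the half-open sub value space [y_{k-1}, y_k) containing value."""
    return bisect_right(cuts, value)


def transition_matrix(segment, cuts):
    """gamma[i][j]: number of steps t -> t+1 in the segment going from region i to region j."""
    alpha = len(cuts) + 1
    gamma = [[0] * alpha for _ in range(alpha)]
    regions = [region_index(v, cuts) for v in segment]
    for current, following in zip(regions, regions[1:]):
        gamma[current][following] += 1
    return gamma


# Toy segment with hand-picked boundaries (hypothetical values, three regions).
gamma = transition_matrix([-1, -1, 0, 1, 1, 0], cuts=[-0.5, 0.5])
sojourn_trend = sum(gamma[i][i] for i in range(len(gamma)))
```

A segment of six points yields five counted steps in total, two of which are sojourns in this toy case.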
In Figure 1, for instance, $\alpha$ is set to 3, and three sub value spaces, $S_1$, $S_2$, and $S_3$, exist. For the first segment $X_1$, the transitional information of values moving between sub value spaces is stored in $\gamma = [[4, 0, 0], [1, 16, 1], [0, 1, 40]]$. While all transitional elements $\gamma_{i,j}$ are worthwhile, we focus on one-step upward and downward transitional information together with sojourn information. For example, $\gamma_{2,1}$ (= 1) represents the frequency of one-step upward transitions from the sub value space $S_2$, and $\gamma_{2,3}$ (= 1) represents that of one-step downward transitions from $S_2$. The diagonal elements, $\gamma_{1,1}$ (= 4), $\gamma_{2,2}$ (= 16), and $\gamma_{3,3}$ (= 40), represent the sojourn trends in the three sub value spaces, respectively. If the sum of one-step upward transitions, $\gamma_{2,1} + \gamma_{3,2}$, is zero, no upward trend exists and the trend is possibly downward or steady. Observe that $0 \le \gamma_{i,j} \le K - 1$, where $K = N/n$ is the number of points in a segment, since the most continuous transition pattern is to remain in one sub value space. We apply the above-mentioned transitional information to SAX and call the proposed approach transitional SAX. The overall algorithm is summarized in Algorithm 1. As shown there, the output for a time series $X$ of size $N$ is a new representation $V = [W, Z]$ of size $n(1 + \alpha^2)$, since each segment produces one symbol for the local average plus $\alpha^2$ transition counts. Unless $N$ is large enough, meaning a long time series, we recommend using the one-step upward and downward transitions, $\alpha - 1$ counts each, plus the sojourn transitions instead of all $\alpha^2$ counts, which brings the dimensionality of $V$ to $n(3\alpha - 1)$. In contrast with SAX, which is able to handle incremental data, the proposed transitional SAX is not fully online but segment-wise online, since it is able to append the SAX word and the transition vector of a segment to its representation $V$. One needs to avoid a very large n to prevent possible delay in online usage. Conversely, a very small n hardly captures transitional movements in the sub value spaces.
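The reduced feature set recommended above (sojourn counts plus one-step transitions) can be read off the transition matrix as its main diagonal and its two adjacent off-diagonals. The sketch below uses the first-segment matrix quoted in the text; the function name `one_step_features` and the ordering of the output vector are assumptions made for illustration, and the two off-diagonals are labelled by index direction only, since which one counts as upward depends on how the sub value spaces are ordered in Figure 1.

```python
def one_step_features(gamma):
    """Sojourn counts followed by the two kinds of one-step transition counts."""
    alpha = len(gamma)
    sojourn = [gamma[i][i] for i in range(alpha)]              # S_i -> S_i
    to_higher_index = [gamma[i][i + 1] for i in range(alpha - 1)]  # S_i -> S_{i+1}
    to_lower_index = [gamma[i + 1][i] for i in range(alpha - 1)]   # S_{i+1} -> S_i
    return sojourn + to_higher_index + to_lower_index


gamma = [[4, 0, 0], [1, 16, 1], [0, 1, 40]]  # first-segment example from the text
features = one_step_features(gamma)

# One symbol plus (3 * alpha - 2) counts per segment gives 3 * alpha - 1 entries.
entries_per_segment = 1 + len(features)
```

With $\alpha = 3$ this gives 8 entries per segment instead of the 10 needed when all $\alpha^2$ counts are kept.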
As shown in line 12 of Algorithm 1, we use letter symbols $\alpha_{k_1}$ and $\alpha_{k_2}$ for local averages $\bar{x}_{p_1} \in S_{k_1}$ and $\bar{x}_{p_2} \in S_{k_2}$, respectively, and their distance is given by

$$\mathrm{dist}(\alpha_{k_1}, \alpha_{k_2}) = \begin{cases} 0, & |k_1 - k_2| \le 1, \\ y_{\max(k_1,k_2)-1} - y_{\min(k_1,k_2)}, & \text{otherwise}. \end{cases} \qquad (3)$$

The letter distance is zero if the two local averages lie in the same or in adjacent sub value spaces; otherwise, it is set by the intermediary sub value spaces between $S_{k_1}$ and $S_{k_2}$. For example, if $\bar{x}_{p_1} \in S_3$ and $\bar{x}_{p_2} \in S_3$, i.e., in the same sub value space, the distance is zero. For sub value spaces directly adjacent to each other, $\bar{x}_{p_1} \in S_3$ and $\bar{x}_{p_2} \in S_4$, the distance is also zero, since no in-between sub value space exists. For $\bar{x}_{p_1} \ge \bar{x}_{p_2}$, it straightforwardly follows that

$$\bar{x}_{p_1} - \bar{x}_{p_2} \ge \mathrm{dist}(\alpha_{k_1}, \alpha_{k_2}).$$

Algorithm 1 (partial listing)
Create sub value spaces $S_k$, $k = 1, \ldots, \alpha$, with equal probability $1/\alpha$ by fitting the $x(t)$ values to a normal distribution
6: for p = 1, ..., K do
7: end for
10: for each segment $X_p$, p = 1, ..., K do
12: Compute local average $\bar{x}_p$ and assign a symbol $\alpha_k$ (= w) if $\bar{x}_p \in S_k$
13: for each i, j = 1, ..., α do
15: Compute transitional information

We evaluate the proposed method by how closely the new representation $V = [W, Z]$ of a time series $X$ approximates the original time series. The distance measure associated with the new representation needs to satisfy a lower-bounding property to ensure no false dismissals [17,18]. For that purpose, we propose the following distance measure. Suppose that the transitional SAX method produces representations $V^{(1)} = [W^{(1)}, Z^{(1)}]$ and $V^{(2)} = [W^{(2)}, Z^{(2)}]$ of the same size for $X_1$ and $X_2$, respectively, in which $\tau$ is the size of $Z^{(1)}$ or $Z^{(2)}$, $\tau = |Z^{(1)}| = |Z^{(2)}|$; we define $D(\cdot, \cdot)$ to be: [Equation (4)]. We compare the distance measure (4) with the Euclidean distance of the original time series to verify that it satisfies the lower-bound condition: we show that $D(V_1, V_2) \le D_{\mathrm{Euclidean}}(X_1, X_2)$.
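The letter distance described in this subsection, zero for identical or adjacent sub value spaces and otherwise the width of the gap spanned by the intermediary ones, can be sketched as below. The function name `letter_distance`, the 1-based region indices, and the boundary list `y` (with `y[k]` standing for $y_k$, $k = 0, \ldots, \alpha$) are assumptions for illustration; the boundary values are made up.

```python
def letter_distance(k1, k2, y):
    """Gap between sub value spaces S_{k1} and S_{k2}, with S_k = [y[k-1], y[k])."""
    low, high = min(k1, k2), max(k1, k2)
    if high - low <= 1:
        return 0.0  # same or directly adjacent regions: nothing lies in between
    # Upper boundary of the lower region is y[low]; lower boundary of the
    # higher region is y[high - 1]; the gap covers the regions in between.
    return y[high - 1] - y[low]


# Hypothetical boundaries for alpha = 4 regions.
y = [-3.0, -0.67, 0.0, 0.67, 3.0]
same = letter_distance(3, 3, y)
adjacent = letter_distance(3, 4, y)
one_apart = letter_distance(1, 3, y)
two_apart = letter_distance(4, 1, y)
```

Because any two values drawn from $S_{k_1}$ and $S_{k_2}$ are at least this gap apart, the letter distance never exceeds the difference of the corresponding local averages.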
The right-hand side of the inequality becomes: [...] Since the sum of values centered by the average is zero, that is to say, $\sum_{t=K(p-1)+1}^{Kp} \big(x(t) - \bar{x}_p\big) = 0$ for every segment $p$, [...] The last inequality, $(\bar{x}_{1,p} - \bar{x}_{2,p})^2 \ge (W^{(1)}(p) - W^{(2)}(p))^2$, holds true by the letter-distance definition given in Equation (3). The difference of the values $Z(j)$ is lower bounded by $|Z^{(1)}$ [...] Combining the right-hand sides of Equations (6) and (7) completes the proof of $D(V_1, V_2) \le D_{\mathrm{Euclidean}}(X_1, X_2)$. Given that tighter lower bounds yield a better contractive property, the lower-bound relation of the associated distance measure implies the utility of the proposed transitional information in distance computation. We elaborate on its attributes in the Experiments section.
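For readers who want the local-average part of the argument spelled out, the following is the standard segment-wise decomposition used in PAA-style lower bounds. It is a sketch under assumed notation, not necessarily the exact derivation of the original proof, and it does not cover the transition part $Z$: here $d(t) = x_1(t) - x_2(t)$ and $\bar{d}_p = \bar{x}_{1,p} - \bar{x}_{2,p}$ are shorthand introduced for this illustration, and all sums run over $t = K(p-1)+1, \ldots, Kp$.

```latex
\begin{align*}
\sum_{t} d(t)^2
  &= \sum_{t} \bigl(d(t) - \bar{d}_p\bigr)^2
   + 2\,\bar{d}_p \sum_{t} \bigl(d(t) - \bar{d}_p\bigr)
   + K\,\bar{d}_p^{\,2} \\
  &= \sum_{t} \bigl(d(t) - \bar{d}_p\bigr)^2 + K\,\bar{d}_p^{\,2}
   \;\ge\; K\,\bigl(\bar{x}_{1,p} - \bar{x}_{2,p}\bigr)^2 .
\end{align*}
```

The cross term vanishes because the centered values sum to zero within a segment. Summing over $p = 1, \ldots, n$ gives $D_{\mathrm{Euclidean}}(X_1, X_2)^2 \ge K \sum_{p} (\bar{x}_{1,p} - \bar{x}_{2,p})^2$, and each squared difference of local averages is in turn at least the squared letter distance.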

Dataset
We used twenty UCR time series benchmark datasets [19] to compare the proposed method with previous algorithms. Table 1 describes the characteristics of the datasets, such as the number of classes and the dataset sizes. We split each dataset into training and testing sets as described in the table. The number of classes varied from 2 to 50. Training and testing set sizes varied from two dozen to thousands. The length of the time series ranged from 60 to 637.

Methods in Comparison and Parameter Settings
We evaluated the methods on classification, one of the major time series data mining tasks. We compared the classification accuracy of the proposed method, symbolic aggregate approximation with the transition matrix (denoted as SAX-TM), with that of the classic Euclidean distance (ED), SAX [10], SAX-TD [14], and SAX-SD [18]. We chose classification by one nearest neighbor (1NN) as the performance criterion, following most studies in time series representation [7,10,14]. The advantage of 1NN in time series representation is that the underlying distance measure is critical to the performance of the 1NN classifier. Therefore, the error rate of the 1NN classifier directly reflects the effectiveness of the distance measure. Besides, since the 1NN classifier is parameter-free, it allows a direct comparison of diverse distance measures.
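Because 1NN has no parameters of its own, swapping the distance function is the only thing that changes between the compared methods. The sketch below makes this explicit; the function names are hypothetical, the Euclidean distance is used as the default, and any representation-specific distance with the same call signature could be passed instead.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def one_nn_predict(query, train_items, train_labels, distance=euclidean):
    """Label of the training item closest to the query under the given distance."""
    nearest = min(range(len(train_items)),
                  key=lambda i: distance(query, train_items[i]))
    return train_labels[nearest]


def one_nn_error_rate(train_items, train_labels, test_items, test_labels,
                      distance=euclidean):
    """Fraction of test items whose predicted label differs from the true label."""
    wrong = sum(
        one_nn_predict(query, train_items, train_labels, distance) != label
        for query, label in zip(test_items, test_labels)
    )
    return wrong / len(test_items)


# Toy data (made up): two training series, three test series.
error = one_nn_error_rate(
    [[0, 0], [10, 10]], ["a", "b"],
    [[1, 1], [9, 9], [0, 1]], ["a", "b", "b"],
)
```

In this toy case the third test series is closest to the first training series but carries the other label, so one of three predictions is wrong.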
To obtain the best accuracy for each method, we used all training data to search for the best parameters n and α. For a given time series of length N, we chose the two parameters using the following criteria. To make the comparison fair, the criteria were the same as those in [17]: for n, we searched from 2 up to N/2, doubling the value each time; for α, we searched from 3 up to 10. If two parameter settings produced the same classification error rate, we chose the smaller one. Note that, given labeled data, a training phase benefits not only SAX-TM but also the other SAX methods; the traditional SAX, too, needs the number of letters to be set, among other parameters. In the absence of labeled data, one needs to set the parameters of the SAX methods, including SAX-TM, according to other criteria in an unsupervised manner.
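The search space described above is small enough to enumerate directly. The following sketch lists the candidate pairs and picks the one with the lowest training error; the function names are hypothetical, `error_of` stands for whatever routine evaluates a setting on the training data, and breaking ties by the smaller (n, α) pair in lexicographic order is an assumed reading of "the smaller one". The sketch does not check whether N is divisible by n.

```python
def segment_counts(N):
    """Candidate n values: 2, 4, 8, ... not exceeding N/2."""
    counts, n = [], 2
    while n <= N // 2:
        counts.append(n)
        n *= 2
    return counts


def parameter_grid(N, alpha_min=3, alpha_max=10):
    """All (n, alpha) pairs in the search space for a series of length N."""
    return [(n, alpha)
            for n in segment_counts(N)
            for alpha in range(alpha_min, alpha_max + 1)]


def select_parameters(grid, error_of):
    """Pair with the lowest error; ties go to the lexicographically smaller pair."""
    return min(grid, key=lambda pair: (error_of(*pair), pair))


grid = parameter_grid(60)
# Made-up error function for illustration: every setting with n >= 4 ties at 0.1.
best = select_parameters(grid, lambda n, alpha: 0.1 if n >= 4 else 0.3)
```

For a series of length 60 the grid holds 4 values of n times 8 values of α, that is, 32 settings.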

Experimental Results
The overall classification results for the testing datasets are listed in Table 2, where the lowest classification error is highlighted. SAX-TM has the lowest error on most of the datasets (14/20), followed by SAX-SD (6/20). On average, the classification error for SAX-TM is lower than half of that for the original SAX in 19 of the 20 datasets. The number of sub value spaces, i.e., the dimensionality reduction ratio, α, for SAX-TM is smaller than those for the others except SAX-SD. Figure 2 compares SAX-TM with ED, SAX, SAX-TD, and SAX-SD, respectively, in terms of the error rates of 1NN classification. Figure 3a depicts the changes of parameter n among SAX, SAX-SD, SAX-TD, and SAX-TM, and Figure 3b illustrates the changes of parameter α among the compared algorithms. In addition to the classification performance, we present the computation time of the proposed method in comparison with the original SAX. The comparison matters because SAX-TM requires memory of size n(1 + α²) for symbols, whereas the original SAX requires size n, as mentioned in Section 3.1. For this comparison, we used three datasets (Lighting2, SpaceShuttle, ECG2) of lengths 637, 5000, and 21,600; see the results in Table 3. The environment for the comparison was Matlab R2020b on an Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz, with α = 10. We observed that SAX is considerably faster than SAX-TM, which is reasonable since SAX-TM requires additional point-wise testing and storage for sub value spaces. The computation times of both methods increase with the dataset length but remain reasonably short. Notably, the speed of SAX-TM is relatively robust against n: SAX slows down considerably when n changes from 128 to 256, whereas the speed of SAX-TM hardly changes, and the standard deviation of SAX-TM is smaller than that of SAX on SpaceShuttle and ECG2.

Information Analysis
Next, we evaluate the performance of SAX-TM from the viewpoint of information. SAX has been regarded as a de facto standard to reduce the dimensionality of time series data. Despite its popularity and universality, the structural properties of SAX from the information viewpoint have been rarely researched to the best of our knowledge.
Among the statistical facets, Song et al. [20], in their investigation of time-series dimensionality reduction, proposed focusing on information loss and the efficiency of information embedding. Both minimizing the loss of useful information and preserving useful information in a raw time series are practical goals. Thus, we adopted their procedures to discover intrinsic properties of the proposed method from the perspective of information loss and information embedding. For this purpose, we calculated the information loss, denoted by $L_{info}$, as the mean squared error (MSE) between a raw signal and the reconstructed symbolic words:

$$L_{info} = \frac{1}{N} \sum_{t=1}^{N} \big(T(t) - \hat{T}(t)\big)^2,$$

in which $T$ is the raw signal and $\hat{T}$ is the reconstructed one. To conduct the comparison, we scaled the raw time series and the SAX words to [0, 1]. We also calculated the Kullback-Leibler (KL) divergence, which is a non-symmetric measure of the difference between two probability distributions. For distributions $P$ and $Q$ with $k$ points, the KL divergence is defined as follows:

$$D_{KL}(P \,\|\, Q) = \sum_{i=1}^{k} P(i) \log \frac{P(i)}{Q(i)}.$$

In our experiments, we take $P$ as the distribution of the original signal $T$ and $Q$ as that of the reconstructed signal $\hat{T}$, each estimated by a histogram with $\alpha$ as the number of bins.

Table 2. 1NN classification error rates of ED (Euclidean distance); 1NN best classification error rates, length n, and dimensionality reduction ratio α of SAX, SAX-TD, SAX-SD, and SAX-TM on 20 datasets. The lowest error rates are highlighted in bold.

Information loss measures the amount of information abandoned when converting the original time series to a symbolic representation. KL-divergence represents the closeness between the distribution of a raw signal and that of a reconstructed signal. To combine the two measures, Song et al. defined the information embedding cost (IEC) as the ratio of KL-divergence to information loss:

$$\mathrm{IEC} = \frac{D_{KL}(P \,\|\, Q)}{L_{info}},$$

where $D_{KL}(P \,\|\, Q)$ is the KL-divergence and $L_{info}$ is the information loss. Given a time series $T$ with distribution $P$ and a reconstructed signal $\hat{T}$ with distribution $Q$, the IEC score describes the number of extra bits needed to encode the output $\hat{T}$ per unit of information loss, revealing how much useful information is abandoned when transforming a raw signal [20]. A higher value of information loss together with a lower value of KL-divergence implies that the reconstruction preserves a large quantity of information while reducing complexity. Hence, we prefer a representation method with a lower IEC.
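The three quantities can be computed directly from a raw signal and its reconstruction. The sketch below is an illustration under assumptions rather than the procedure of [20]: function names are hypothetical, both signals are assumed to be already scaled to [0, 1], histograms use equal-width bins, a small constant `eps` is added to every bin so that empty bins do not cause division by zero, and the logarithm is taken to base 2 so that the divergence is in bits.

```python
import math


def information_loss(raw, reconstructed):
    """Mean squared error between the raw and the reconstructed signal."""
    return sum((a - b) ** 2 for a, b in zip(raw, reconstructed)) / len(raw)


def histogram(values, bins, eps=1e-9):
    """Relative frequencies over equal-width bins on [0, 1]; eps smooths empty bins."""
    counts = [eps] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1  # v == 1.0 goes to the last bin
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL divergence of q from p in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))


def information_embedding_cost(raw, reconstructed, bins):
    """Ratio of KL divergence to information loss (undefined for a perfect reconstruction)."""
    divergence = kl_divergence(histogram(raw, bins), histogram(reconstructed, bins))
    return divergence / information_loss(raw, reconstructed)


raw = [0.0, 0.2, 0.8, 1.0]            # made-up signal
reconstructed = [0.1, 0.1, 0.9, 0.9]  # made-up step-wise reconstruction
iec = information_embedding_cost(raw, reconstructed, bins=2)
```

In this toy case the reconstruction changes individual values but leaves the two-bin histogram untouched, so the divergence, and with it the IEC, is zero even though the information loss is not.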
For intuitive understanding, we graphically compare SAX with the proposed method on one representative dataset, Coffee, and provide only performance summaries for the others. Figure 4a shows the raw time series together with its SAX and SAX-TM representations for the Coffee dataset. Both SAX and SAX-TM follow the shape of the raw signal reasonably well. In Figure 4b, the information loss of SAX-TM is higher than that of SAX; that is, SAX-TM lost more information than SAX. However, the KL-divergence of SAX-TM is lower than that of SAX, as shown in Figure 4c. The same tendency holds for the IEC scores shown in Figure 4d. We used twelve datasets in total, including Coffee, to examine the amount of useful information preserved in terms of information loss, KL-divergence, IEC score, and 1NN classification error. The comparative results between SAX and the proposed method are shown in Table 4, where N is the data length. We applied the same parameter settings for SAX as in [20] and the parameter combinations from Table 2 for SAX-TM. Overall, in Table 4, the information loss of SAX is lower than that of SAX-TM; that is, SAX loses less information than SAX-TM. However, the KL-divergence values of SAX-TM are mostly lower than those of SAX: SAX-TM has the lower KL-divergence on 7 of the 12 datasets, and SAX on 5. This tendency is preserved in the IEC score, and the average IEC score of SAX-TM is also lower than that of SAX. That is, SAX-TM loses less useful information than SAX. Moreover, the 1NN classification error of SAX-TM is considerably lower than that of SAX. By appending transitional information to the original SAX, we obtained substantial gains in accuracy.

Conclusions
In this work, we described the popularity and universality of SAX, which is a symbolic aggregate approximation in the field of dimensionality reduction for time series data. The original SAX barely captures trend information from the perspective of time-series shape. Therefore, we proposed a symbolic aggregate approximation with transitional information, which can represent trend information by appending transition information to basic SAX.
In a given time window, a SAX word is created, and we trace how data points travel from the current quantile region to the next. We call this movement from the current location to the next location a transition. From its current location, a data point in a window can make one of three movements: an upward transition, a downward transition, or a sojourn transition. These movements are stored in a matrix. First, we conducted experiments to verify the effectiveness of SAX-TM compared with other state-of-the-art methods such as SAX-TD and SAX-SD. The experimental results show that SAX-TM has the lowest 1NN classification error among the algorithms on most of the datasets. Next, we identified intrinsic statistical properties of SAX-TM. From [20], we selected information loss, KL-divergence, and information embedding cost as important measurements. Overall, the information loss of SAX is lower than that of SAX-TM. However, the number of datasets with lower KL-divergence for SAX-TM is slightly larger than that for SAX. This tendency is also preserved in terms of the IEC score. In addition, SAX-TM substantially reduces classification error compared with SAX: appending transition information to SAX yields clear gains in accuracy.
In spite of the aforementioned advantages, the proposed algorithm has several limitations. Basically, SAX compresses raw data for smoothing. However, SAX-TM increases the complexity of SAX representation by appending a transition matrix. We plan to investigate the minimal effective information to add to SAX and compare it with well-known non-SAX methods. In addition, future research directions include the theoretical aspects of the transition information in several time-series models.