Similarity Measurement and Classification of Temporal Data Based on Double Mean Representation

Abstract: Time series data typically exhibit high dimensionality and complexity, necessitating specific approximation methods to perform computations on the data. Currently employed compression methods suffer from varying degrees of feature loss, which can distort similarity measurement results. To address these challenges, this paper proposes a double mean representation method for time series data, SAX-DM (Symbolic Aggregate Approximation Based on Double Mean Representation), along with a similarity measurement approach based on SAX-DM. Addressing the trade-off between compression ratio and accuracy in improved SAX representations, SAX-DM uses the segment mean together with a segment trend distance to represent each segment of a time series. This method reduces the dimensionality of the original sequences while preserving their essential features and trend information, yielding a unified representation of time series segments. Experimental results demonstrate that, at the same compression ratio, SAX-DM combined with its similarity measurement method achieves higher expression accuracy and a better balance between compression ratio and accuracy than SAX-TD and SAX-BD on over 80% of the UCR Time Series datasets. This approach improves the efficiency and precision of similarity calculation.


Introduction
Time series data refer to a sequence of data points arranged in chronological order. The widespread use of smartphones, various sensors, RFID, and other devices in recent years has laid a solid foundation for the generation of massive amounts of time series data. For efficient data mining and querying of time series data, it is crucial to employ a rational and effective representation method. Time series data representation methods can be broadly classified into four categories: data-adaptive representation methods, non-data-adaptive representation methods, model-based representation methods, and data-indicative representation methods, as described in Table 1. The content of this overview is derived from the literature [1].
(1) Data-adaptive representation methods express the original time series data as a combination of arbitrary-length segments, aiming to minimize the error in global representation. For example, the commonly used Singular Value Decomposition (SVD) method [4] is a typical data-adaptive representation method. It seeks c representative orthogonal vectors of dimensionality k (c ≤ k) and maps the original time series data into a smaller space, achieving dimensionality reduction. Another approach, the Piecewise Linear Approximation (PLA) method proposed by Keogh et al. in 1998 [11], fits the original time series data using line segments. Furthermore, the improved Piecewise Constant Approximation (PCA) method [12] approximates each time series data segment using a constant. Similarly, the Adaptive Piecewise Constant Approximation (APCA) method [13] divides the original time series data into variable-length segments and represents each segment using mean and time-scale values. Symbolic Aggregate Approximation (SAX) [14,15] partitions the original data and uses means to represent each segment. Then, based on the normal distribution of the values in the time series data, the means are mapped to symbols, transforming the complete time series data into a string.
Data-adaptive representation methods effectively capture the characteristics of the original time series data. However, due to the unequal lengths of the segments, similarity measurement based on this representation method becomes challenging.
(2) Among non-data-adaptive representation methods, those based on frequency domain transformation convert time series data from the time domain to the frequency domain and represent the time series using its spectral information. Commonly used methods include the Discrete Fourier Transform (DFT) [2], Discrete Wavelet Transform (DWT) [3], Discrete Cosine Transform (DCT) [4], and others. These methods approximate the original time series data by keeping the first k coefficients after the transformation, achieving compression and reducing the dimensionality of the time series data. Because parameter determination for transformation-based representation methods is challenging and requires extensive tuning experiments, non-data-adaptive methods based on piecewise approximation have also emerged, such as Piecewise Aggregate Approximation (PAA) [5] and Indexable Piecewise Linear Aggregation (IPLA) [8]. Non-data-adaptive representation methods use equally sized segments to approximate the time series data. While the approximation quality may not match that of data-adaptive representation methods, similarity measurement based on these methods is relatively direct and simple.
(3) Model-based representation methods assume that time series data are stochastic and use models such as AutoRegressive (AR) [24], AutoRegressive Moving Average (ARMA) [25], and Hidden Markov Models (HMMs) [26] for fitting. This approach requires the data to conform to certain assumptions and mathematical deduction theories; otherwise, distortion may occur. Additionally, time series data are complex in structure and contain a significant amount of noise, which poses significant challenges in constructing accurate models.
(4) The aforementioned three types of representation methods allow users to customize the data compression ratio (the ratio of the original sequence length to the processed sequence length). However, determining this parameter is often challenging and can significantly affect the quality of the approximation. In contrast, data-indicative representation methods can automatically define the compression ratio based on the original time series data. Common methods in this category include data pruning [21] and tree-based representation methods [22].
Among these, the SAX (Symbolic Aggregate Approximation) family of representation methods has received significant attention from researchers. As a data-adaptive representation method, SAX is known for its simplicity and comprehensibility. Its implementation mainly involves three steps: Z-normalizing the dataset, applying Piecewise Aggregate Approximation (PAA) for dimensionality reduction, and finally performing symbolic representation. SAX has two important parameters: the dimensionality reduction parameter (w) and the alphabet size (α). Figure 1 illustrates the symbolization process for a time series of length 128. The original time series data are transformed into a string of length 8, represented as "baabccbc," which contains three characters (a, b, c).
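The three steps just described can be sketched in Python. This is a minimal illustration, not the authors' implementation: the helper names are ours, and only the standard Gaussian-breakpoint values for α = 3 and α = 4 are tabulated here.

```python
import math

# Breakpoints that divide the standard normal curve into alpha
# equiprobable regions (standard SAX table; only alpha = 3, 4 shown).
BREAKPOINTS = {3: [-0.43, 0.43], 4: [-0.67, 0.0, 0.67]}

def z_normalize(series):
    """Shift/scale the series to mean 0 and (population) std 1."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    return [(x - mean) / std for x in series]

def paa(series, w):
    """Piecewise Aggregate Approximation: w equal-length segment means."""
    seg = len(series) // w
    return [sum(series[i * seg:(i + 1) * seg]) / seg for i in range(w)]

def sax(series, w, alpha):
    """Map each PAA mean to the letter of the region it falls into."""
    cuts = BREAKPOINTS[alpha]
    letters = "abcdefghij"
    out = []
    for m in paa(z_normalize(series), w):
        idx = sum(1 for b in cuts if m > b)  # index of the region holding m
        out.append(letters[idx])
    return "".join(out)
```

For example, `sax([0, 0, 0, 0, 10, 10, 10, 10], 2, 3)` normalizes the series, averages it into two segments, and symbolizes them as a two-letter string.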
However, considering the complexity and diversity of time series data, using only the mean of time series data segments to represent their information may overlook some important features, resulting in limited expressive power of SAX. Therefore, several improved representation methods based on SAX have been proposed, such as SAX-TD [16] and SAX-BD [17]. These methods enhance the accuracy of similarity measurement by incorporating additional key feature information for each time series data subsegment during compression and dimensionality reduction.
The SAX-TD method represents the trend distance using the distance from the endpoint value to the mean, but it has limitations for sequences with complex variations. The SAX-BD method represents the trend distance using the distance from the maximum and minimum values to the mean, but it requires more symbol representations and has poor dimensionality reduction performance.
In this paper, we propose a new method called SAX-DM, where "DM" stands for Double Mean. To better quantify the trend of the subsegments, we further divide each segment obtained by the PAA algorithm into two parts. We use the distance from the mean of the left part to the overall mean as the trend distance, which intuitively represents the trend of the subsegment. To capture the trend direction as well as its magnitude, we use the signed difference between the left mean and the overall segment mean. Since the PAA algorithm has already normalized the time series data, the range of the trend distance is ∆q ∈ [−1, 1].
Compared to SAX-TD and SAX-BD, SAX-DM provides a better representation of the overall trend of time series data segments with complex variations. Additionally, SAX-DM only adds one extra value on top of SAX, requiring fewer symbols. In our experiments, we propose a novel similarity measurement method based on the SAX-DM representation. The experimental results demonstrate that SAX-DM outperforms SAX-BD and SAX-TD in terms of effectiveness. Furthermore, we prove that our novel distance measure is a lower bound on the Euclidean distance, and that this bound is tighter than that of the original SAX method.

Related Work
Assuming that time series data follow a normal distribution, we can transform the original sequence data into a sequence that conforms to the standard normal distribution. The SAX representation method draws inspiration from this idea. It first segments the transformed sequence data and calculates the mean of each segment. These means are then mapped to probability intervals according to their magnitudes. Assigning a letter to each interval yields a sequence of letters, which forms the basis of SAX. Through this process, SAX transforms the original continuous time series data into a discrete symbol sequence, thereby simplifying the representation and processing of the data.

Distance Calculation by SAX
The implementation of SAX mainly consists of three steps: Z-normalization of the dataset, dimensionality reduction using PAA, and finally symbolization. First of all, it is common to normalize the time series data so that each time series has a mean of 0 and a variance of 1 because time series data with different offsets and amplitudes are not directly comparable. Then, using a sliding-window approach, the original time series data are divided into w equal-length subsequences, and each subsequence is represented by its mean value. The normalized time series data follow a normal distribution, which provides mathematical support for dividing the probability distribution corresponding to the time series data into equal area regions and can ensure that the probability of the time series data falling into each interval is equal. The number of regions is determined by the input parameter α, which also represents the size of the character set. Each region is then represented by a symbol. Finally, based on the region in which the mean value of each subsequence falls, the corresponding symbol for that region is used to replace the subsequence, resulting in the symbolization approximation representation of the time series data.
Assume we have two time series Q and C with the same length n, each divided into w time segments. Let Q̂ and Ĉ be the symbol strings obtained after applying the SAX algorithm, with symbols q̂_i and ĉ_i. The SAX distance between Q and C is calculated from the distances between corresponding symbols and can be expressed as follows:

MINDIST(Q̂, Ĉ) = √(n/w) · √( Σ_{i=1}^{w} dist(q̂_i, ĉ_i)² )
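The MINDIST computation can be sketched as follows. The breakpoints assume α = 3 (an assumption of this sketch), and `symbol_dist` implements the standard SAX cell-distance lookup: zero for identical or adjacent symbols, otherwise the gap between the breakpoints separating the two regions.

```python
import math

# Breakpoints for alpha = 3 (standard SAX table; assumed for this sketch).
CUTS = [-0.43, 0.43]

def symbol_dist(a, b, cuts=CUTS):
    """dist(q̂_i, ĉ_i): 0 for identical or adjacent symbols, otherwise
    the distance between the breakpoints separating the two regions."""
    i, j = sorted((ord(a) - ord("a"), ord(b) - ord("a")))
    if j - i <= 1:
        return 0.0
    return cuts[j - 1] - cuts[i]

def mindist(s1, s2, n):
    """SAX lower-bound distance between two symbol strings of length w,
    for original series of length n."""
    w = len(s1)
    total = sum(symbol_dist(a, b) ** 2 for a, b in zip(s1, s2))
    return math.sqrt(n / w) * math.sqrt(total)
```

For instance, two identical strings have distance 0, while `mindist("a", "c", 8)` scales the breakpoint gap 0.86 by √(8/1).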

SAX-TD
In order to enhance the accuracy of the SAX representation, it is important to preserve the trend information of the time series data during the dimensionality reduction process. For instance, the authors of reference [16] propose storing a value alongside each symbol in SAX to improve the distance calculation. Their improvement, called SAX-TD, uses the start and end points of each segment to calculate a trend distance. There are several scenarios for the trend variation of time series data segments, as illustrated in Figure 2. In this figure, t_s denotes the left (starting) endpoint and t_e the right (ending) endpoint.
The trend is indeed an important feature of time series data and plays a crucial role in their classification and analysis. For instance, if the endpoint value is greater than the starting point value, it indicates an upward trend, while the opposite suggests a downward trend. To describe this trend more accurately, it is necessary to use the actual values instead of the symbolized mapping when calculating the trend distance.
For the given time series segments q and c, their trend distance is calculated as follows:

td(q, c) = √( (∆q(t_s) − ∆c(t_s))² + (∆q(t_e) − ∆c(t_e))² )   (2)

where t_s and t_e denote the start and end points of the time series segment, ∆q(t) represents the distance from the endpoint value to the mean value of the segment (∆c(t) is the corresponding distance for the other sequence), calculated using Formula (3):

∆q(t) = q(t) − q̄   (3)

Although each segment has a starting point and an endpoint, in practice the starting point of a segment is the endpoint of the previous segment, so the trend distances can be embedded into the SAX symbol sequence. As a result, the time series data Q and C can be expressed using a representation in which q̂_1 q̂_2 … q̂_w is the sequence symbolized by SAX, ∆q(1), ∆q(2), …, ∆q(w) represent the trend of each segment as the distance between its endpoint value and mean value, and ∆q(w + 1) represents the change at the last point.
The distance measure for the time series data Q and C can then be expressed as follows:

TDIST(Q, C) = √(n/w) · √( Σ_{i=1}^{w} [ dist(q̂_i, ĉ_i)² + td(q_i, c_i)² ] )

where dist(q̂_i, ĉ_i) is the distance calculated by the SAX distance measure, td(q_i, c_i) is the distance calculated using Equation (2), n is the length of q and c, and w is the number of time segments.
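The SAX-TD trend distance of Equations (2) and (3) can be sketched directly; `trend_dist_td` is an illustrative helper name, not part of the original method's code.

```python
import math

def trend_dist_td(q_seg, c_seg):
    """SAX-TD trend distance between two aligned segments: compare the
    offsets of the start and end points from each segment's own mean
    (Equations (2) and (3))."""
    def delta(seg, idx):
        return seg[idx] - sum(seg) / len(seg)  # ∆q(t) = q(t) − mean
    d_start = delta(q_seg, 0) - delta(c_seg, 0)
    d_end = delta(q_seg, -1) - delta(c_seg, -1)
    return math.sqrt(d_start ** 2 + d_end ** 2)
```

Two segments with identical endpoint offsets, such as `[0, 1, 2]` and `[0, 1, 2]`, have trend distance 0, while a rising and a falling segment of the same mean differ.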

SAX-BD
The authors of reference [17] suggest using a boundary distance instead of the trend distance and propose the SAX-BD algorithm. This algorithm observes that each segmented time series fragment has maximum and minimum points, and their distances to the mean value are referred to as the boundary distance. The segment's boundary distances contribute to a more accurate measurement of the different trends in the time series data. Details are provided below.
From Figure 3, we can observe that the maximum and minimum values within each time segment serve as its boundaries. The boundary distances of a segment are denoted ∆q(t)_max and ∆q(t)_min, as shown in Equations (5) and (6):

∆q(t)_max = q(t_max) − q̄   (5)
∆q(t)_min = q(t_min) − q̄   (6)

In fact, the SAX-BD algorithm computes the trend changes (i.e., the boundary distances) of a segment as ∆q(t)_min and ∆q(t)_max; when the extrema fall at the endpoints, these values are equivalent to ∆q(t_s) and ∆q(t_e), so SAX-BD can also effectively distinguish such cases. For cases a and b, the distance calculated using SAX-TD is 0, whereas the distance calculated using SAX-BD is not equal to 0, indicating the potential to differentiate between these two sequences. Regarding situations c and d, the TD and BD quantities coincide: for case c, ∆q(t_s) = ∆q(t)_max and ∆q(t_e) = ∆q(t)_min; for case d, ∆q(t_s) = ∆q(t)_min and ∆q(t_e) = ∆q(t)_max. Similar to the SAX-TD distance measurement concept, the following SAX-BD distance measurement formula can be used for temporal data Q and C:

BDIST(Q, C) = √(n/w) · √( Σ_{i=1}^{w} [ dist(q̂_i, ĉ_i)² + bd(q_i, c_i)² ] )

where bd(q_i, c_i) = √( (∆q(t)_max − ∆c(t)_max)² + (∆q(t)_min − ∆c(t)_min)² ).
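The boundary distance of Equations (5) and (6) can be sketched as follows; `trend_dist_bd` is an illustrative helper name for this comparison between two aligned segments.

```python
import math

def trend_dist_bd(q_seg, c_seg):
    """SAX-BD boundary distance: compare the offsets of each segment's
    maximum and minimum from that segment's mean (Equations (5) and (6))."""
    def deltas(seg):
        m = sum(seg) / len(seg)
        return max(seg) - m, min(seg) - m
    q_max, q_min = deltas(q_seg)
    c_max, c_min = deltas(c_seg)
    return math.sqrt((q_max - c_max) ** 2 + (q_min - c_min) ** 2)
```

Segments with the same spread around their means have boundary distance 0 even if their absolute levels differ, since ∆q(t)_max and ∆q(t)_min are measured relative to each segment's own mean.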

SAX-DM Design and Analysis
In the SAX algorithm, expressing the information of a time series segment solely based on its mean would overlook some important features, resulting in limited expressive power. The SAX-TD method represents trend distance using the distance from the endpoints to the mean, which has certain limitations for sequences with complex variations. The SAX-BD method represents trend distance using the distance from the maximum and minimum values to the mean, requiring more symbols for representation and resulting in poor dimensionality reduction.
In this paper, we propose a new approach that utilizes the mean and trend of time series subsegments as key information. To better quantify the trend of a subsegment, we further divide each segment obtained through the PAA algorithm into two parts. The distance from the mean of the left part to the overall mean is used as the trend distance, which intuitively represents the trend of the subsegment. To capture direction as well as magnitude, we use the signed difference between the left mean and the overall segment mean. Since the PAA algorithm already normalizes the time series data, the trend distance range is ∆q ∈ [−1, 1]. Figure 4 illustrates some examples of determining the trend distance. When the left mean of the segment is smaller than the overall mean, the trend is considered increasing and the trend distance is positive, as shown in Figure 4a. When the left mean of the segment is larger than the overall mean, the trend is considered decreasing and the trend distance is negative, as shown in Figure 4b.
Compared to SAX-TD and SAX-BD, SAX-DM can better represent the overall trend of time series data segments with complex variations, while requiring fewer symbols. For a segment with mean value q̄, its trend distance ∆q lies within the range [−1, 1]. Therefore, when dividing time series data into n parts, the representation can be expressed as follows:

Q: ∆q_1 q̂_1 ∆q_2 q̂_2 ∆q_3 q̂_3 … ∆q_n q̂_n
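The double-mean trend value for one segment can be sketched as below. The sign convention follows Figure 4 (a rising segment yields a positive ∆q); the helper name `dm_trend` is ours.

```python
def dm_trend(segment):
    """SAX-DM trend value ∆q for one segment: the signed gap between the
    mean of the whole segment and the mean of its left half.
    Positive when the segment is rising (left mean below overall mean),
    negative when it is falling, per Figure 4's convention."""
    half = len(segment) // 2
    left_mean = sum(segment[:half]) / half
    overall_mean = sum(segment) / len(segment)
    return overall_mean - left_mean
```

On Z-normalized data the segment values lie in a bounded range, so ∆q stays within [−1, 1] as stated above.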

Similarity Measurement Based on SAX-DM Expression Method
In this paper, the Euclidean distance is employed as the fundamental method for measuring similarity. For a sequence approximated using SAX-DM, it can be viewed as a point in a w-dimensional space. Therefore, the computation of similarity between time series data can be transformed into calculating the distance between different points in this w-dimensional space.
Before proceeding, let us review the method for calculating the Euclidean distance on the original time series data. Suppose we have two time series T = t_1, t_2, t_3, …, t_n and S = s_1, s_2, s_3, …, s_n. Their Euclidean distance is the straight-line distance between the points that the two series represent in an n-dimensional space:

D(T, S) = √( Σ_{i=1}^{n} (t_i − s_i)² )

Due to the high-dimensional nature of time series data, calculating distances on the original data leads to significant memory pressure and a substantial number of computations, and the result is also susceptible to noise and deformations in the data. It is therefore common to compress the original time series data, reduce its dimensionality, and extract features. One popular method for compression and dimensionality reduction is Piecewise Aggregate Approximation (PAA), in which each time series subsegment uses its mean as the feature information. After applying PAA, the similarity distance can be calculated as follows:

DR(T̄, S̄) = √(n/w) · √( Σ_{i=1}^{w} (t̄_i − s̄_i)² )

In this article, the trend distance is also used to represent temporal data fragments. First, a representation based on the trend distance is defined. For temporal data Q and C, the following expressions are defined:

Q: ∆q_1 q̂_1 ∆q_2 q̂_2 ∆q_3 q̂_3 … ∆q_n q̂_n
C: ∆c_1 ĉ_1 ∆c_2 ĉ_2 ∆c_3 ĉ_3 … ∆c_n ĉ_n

According to the above expression, the trend distance based on the left mean value is defined as

dm(q_i, c_i) = ∆q_i − ∆c_i

and the overall distance measurement can be represented by the following equation:

DMDIST(Q, C) = √(n/w) · √( Σ_{i=1}^{w} [ dist(q̂_i, ĉ_i)² + dm(q_i, c_i)² ] )
In this paper, the PAA method is employed to segment the original time series data, and the average value and change trend of subsegments are used as the key information for each subsegment. This approach aims to improve the compression ratio of the approximate representation while preserving the essential features of the original data as much as possible. Furthermore, by considering the trend of temporal data changes, the similarity measurement method based on the approximate representation proposed in this paper can calculate similarity more accurately.
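As a rough sketch of how the pieces combine, assume the per-segment symbol distances and trend values have already been computed; the function name `dmdist` and its argument layout are illustrative, following the √(n/w)-scaled form shared by the SAX-TD and SAX-BD distances.

```python
import math

def dmdist(sym_dists, dq, dc, n):
    """SAX-DM distance: combine the SAX symbol distances of the w segments
    with the squared differences of their double-mean trend values.
    sym_dists[i] = dist(q̂_i, ĉ_i); dq, dc = per-segment trend values ∆q, ∆c;
    n = length of the original series."""
    w = len(sym_dists)
    total = sum(s ** 2 + (a - b) ** 2 for s, a, b in zip(sym_dists, dq, dc))
    return math.sqrt(n / w) * math.sqrt(total)
```

When the symbol strings match and the trend values agree, the distance is 0; otherwise both the symbolic gap and the trend gap contribute.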

Lower Bound
One of the most important features of the SAX algorithm, when applied for dimensionality reduction, is that it provides a lower-bounding distance measurement. The lower bound is useful for controlling errors and accelerating computations. However, when performing spatial queries on the dimensionally reduced sequence, there is a risk of false negatives. To reduce the occurrence of false negatives after dimensionality reduction, it is important to design algorithms that satisfy a good lower-bound property.
Next, we will prove that the distance we propose also serves as a lower bound for the Euclidean distance.
The lower bound of the PAA distance with respect to the Euclidean distance is given by the following expression:

√(n/w) · √( Σ_{i=1}^{w} (q̄_i − c̄_i)² ) ≤ √( Σ_{i=1}^{n} (q_i − c_i)² )   (15)

To prove that DMDIST is also a lower bound for the Euclidean distance, we reiterate some of the proofs here. Let Q̄ and C̄ be the means of the time series data Q and C, respectively. First, consider the single-frame case (i.e., w = 1); according to Equation (15), we obtain Equation (16). Recall that Q̄ is the average value of the temporal data, so each point can be written as q_i = Q̄ − ∆q_i, and the same applies to each point c_i of C. Equation (15) can then be rewritten in this form, yielding an inequality that clearly holds for the trend distance ∆q_i (Equation (17)). Substituting Equation (16) into Equation (17), and recalling the lower-bound analysis of MINDIST with respect to the PAA distance in Equation (20), where Q̂ and Ĉ are the symbolic representations of Q and C in the original SAX, transitivity establishes that the proposed distance lower-bounds the Euclidean distance. Recalling Equation (15), the N-frame case follows by applying the single-frame proof to each frame. The quality of a lower-bounding distance is usually measured by the tightness of the lower bound (TLB):

TLB = (lower-bounding distance) / (true Euclidean distance)

The value of TLB lies in the range [0, 1]; the higher the TLB value, the better the quality. Recalling the distance metric above, we obtain TLB(DMDIST) ≥ TLB(MINDIST), which means that SAX-DM has a tighter lower bound than the original SAX distance.
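The PAA lower-bound property that the proof starts from can be checked numerically. This is an illustrative sanity check rather than part of the proof; `euclid` and `paa_dist` are hypothetical helper names.

```python
import math
import random

def euclid(t, s):
    """True Euclidean distance between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t, s)))

def paa_dist(t, s, w):
    """PAA distance: sqrt(n/w) times the Euclidean distance of the
    w segment means; lower-bounds euclid(t, s)."""
    n, seg = len(t), len(t) // w
    tm = [sum(t[i * seg:(i + 1) * seg]) / seg for i in range(w)]
    sm = [sum(s[i * seg:(i + 1) * seg]) / seg for i in range(w)]
    return math.sqrt(n / w) * euclid(tm, sm)

random.seed(0)
for _ in range(100):
    t = [random.gauss(0, 1) for _ in range(64)]
    s = [random.gauss(0, 1) for _ in range(64)]
    # Lower-bound property: PAA distance never exceeds the true distance,
    # so the TLB ratio paa_dist / euclid always lies in [0, 1].
    assert paa_dist(t, s, 8) <= euclid(t, s) + 1e-9
```

Running the loop over random series never trips the assertion, matching Equation (15).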

Experimental Validation
In this section, we compare the classification results of SAX-DM representation with other representations on time series data through experiments. Firstly, we introduce the experimental dataset, followed by an explanation of the experimental methodology and parameter settings. Finally, we evaluate the advantages of SAX-DM based on the comprehensive assessment of classification accuracy and error rate.

Datasets
The evaluation of SAX-DM's classification performance utilized the UCR Time Series Archive [28], which is a widely used collection of time series datasets in the field of time series data mining. This dataset was introduced in 2002 and has been continuously expanded over time. Due to its inclusion of time series datasets from various domains, it has become an important resource in the time series data mining community and is recommended by researchers working with time series data.

Experimental Results and Analysis
We compared the classification accuracy and error rate of SAX-DM with those of ED, SAX, SAX-TD, and SAX-BD. The experimental results show that on most of the datasets, SAX-DM achieved accuracy comparable to SAX-TD and SAX-BD, with the additional advantage of lower classification error rates. Figure 5 presents the comparison between the SAX-DM and ED methods. Each point in the graph represents a dataset, where the x-axis gives the classification accuracy of the ED method and the y-axis that of the SAX-DM method. Intuitively, the more points above the diagonal line, the more datasets on which SAX-DM achieved higher accuracy than ED. Statistical analysis showed that SAX-DM had slightly lower classification accuracy than ED on 67% of the datasets. This is because the Euclidean distance measures similarity directly on the original time series data without compression and dimension reduction: it achieves high accuracy but places a significant burden on computer memory and has lower computational efficiency. It is typically used as a baseline to assess the viability of an approach rather than directly for data mining tasks, so the subsequent experiments on the data compression ratio did not include a comparison with ED. Although SAX-DM performs on par with or slightly worse than the Euclidean distance method on most datasets, the accuracy gap is within 0.1 on 80% of the datasets, indicating that it achieves effective data dimension reduction while maintaining accuracy reasonably close to the Euclidean distance method.
The comparison results between the SAX-DM and SAX methods in terms of classification accuracy are shown in Figure 6. The x-axis represents the classification accuracy using the SAX method, while the y-axis represents the classification accuracy using the SAX-DM method. The results represent the average classification accuracy of both representation methods over all α and w parameters, reflecting overall classification performance. The results are evident: using the SAX-DM method for approximate representation and similarity measurement in classification tasks yielded a higher accuracy in 98% of the datasets. Compared to SAX, SAX-DM further incorporates the trend distance, allowing a better representation of the trend characteristics of time series data. However, this comes at the cost of sacrificing some data compression and dimension reduction rate. In practical usage, different methods can be chosen based on the application's requirements for similarity accuracy in queries.
Figure 7 presents the comparison results between the SAX-DM and SAX-TD representation methods. From the figure, it can be observed that the two methods achieved similar accuracy in the classification task. Considering the reasons behind this, SAX-TD captures the trend by measuring the distances between the left and right endpoint values and the overall mean of the time series data, and combines this trend information with the mean value for symbolization.
This approach is almost identical to the feature extraction methodology used in this paper, but it has certain limitations in practice. Because the right endpoint value of one segment can be reused as the left endpoint value of the next, SAX-TD approaches two symbols per segment only when the number of segments is large; when the time series is divided into few segments, the average number of symbols required per segment increases.
In contrast, the SAX-DM method proposed in this paper represents any segment of the time series with exactly two symbols, while still achieving classification accuracy comparable to SAX-TD. The SAX-DM method is therefore better suited to subsequent similarity-query tasks on massive time series data. Additionally, the mean and trend distance can be further encoded to enhance compression and dimensionality reduction, thereby improving the compression ratio of the time series approximation.
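To make the two-values-per-segment idea concrete, the following sketch computes a double-mean style summary: for each segment, its mean plus a simple trend-distance term. The trend definition used here (second-half mean minus first-half mean) is our own illustrative assumption for the sketch; the exact SAX-DM formulation is the one defined earlier in the paper.

```python
import numpy as np

def double_mean_segments(series, w):
    """Illustrative double-mean summary: each of the w segments is
    reduced to two values, (segment mean, trend-distance term).
    The trend term here is a stand-in, not the paper's exact formula."""
    series = np.asarray(series, dtype=float)
    out = []
    for seg in np.array_split(series, w):
        mid = len(seg) // 2
        mean = seg.mean()
        trend = seg[mid:].mean() - seg[:mid].mean()  # crude trend distance
        out.append((mean, trend))
    return out
```

In the full method, both values per segment are then quantized to an alphabet of size α, giving the unified two-symbol representation per segment discussed above.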
The comparison between the SAX-DM and SAX-BD methods is shown in Figure 8. From the figure, it can be observed that SAX-BD performed better in the classification task on 60% of the datasets, and that its classification accuracy varied significantly across datasets. A likely reason is that SAX-BD incorporates the left and right extreme values in addition to the mean, whereas SAX-DM focuses on the mean feature; the two methods therefore differ markedly in how well they express data with smooth versus drastic changes. Although SAX-BD achieves higher classification accuracy, it uses three features per segment for dimensionality reduction, resulting in a lower compression ratio, which makes it unsuitable for subsequent similarity-query tasks on large-scale time series data.
Figure 9 illustrates the comparison of classification error rates among the SAX, SAX-TD, SAX-BD, and SAX-DM methods, using the classic ECG dataset as experimental data with different α and w parameter settings. From the graph, it can be observed that the α parameter had a negligible impact on classification accuracy, while a w parameter that is too large or too small degrades accuracy. Based on the results depicted in the graph, the optimal value of the w parameter can be chosen as 16.
Since comparing compression ratios alone cannot fully demonstrate the advantages of the representation methods, Figure 10 compares the classification accuracy of SAX, SAX-TD, SAX-BD, and SAX-DM at the same compression ratio, using SAX's compression ratio as the benchmark. A higher value in the graph indicates better overall performance in terms of both compression ratio and accuracy. It is evident from the graph that the SAX-DM method performed exceptionally well and offers a more convenient conversion to index items. In general, the SAX-DM method is better suited to similarity queries on time series data, especially in scenarios requiring a large number of iterations, as it effectively reduces memory pressure and enhances data mining efficiency.
Figure 10. SAX and its improved method accuracy and compression ratio.
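The compression-ratio comparison can be illustrated with simple per-segment value counts. The counts below (one value per segment for SAX, two for SAX-DM, three for SAX-BD) follow the discussion above; the series length and segment count are example values, not taken from the experiments.

```python
def compression_ratio(n, stored_values):
    """Ratio of original series length to the number of stored values."""
    return n / stored_values

n, w = 256, 16  # example series length and number of segments
ratios = {
    "SAX":    compression_ratio(n, 1 * w),  # one symbol per segment
    "SAX-DM": compression_ratio(n, 2 * w),  # mean symbol + trend distance
    "SAX-BD": compression_ratio(n, 3 * w),  # mean + two boundary values
}
```

Under these counts SAX compresses most aggressively, SAX-DM halves its ratio for the added trend information, and SAX-BD pays the largest storage cost, matching the trade-off described above.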

Conclusions

The SAX-DM algorithm is proposed in this paper, taking into account the strengths and limitations of SAX-TD and SAX-BD: SAX-TD has limitations in handling complex sequences with varying patterns, while SAX-BD requires more symbols for representation. Our novel representation method adds an additional character to the original SAX, allowing an intuitive representation of the change trend within each subsegment. SAX-DM not only uses fewer characters to represent time series but also employs a new boundary distance measure to quantify time series similarity. Furthermore, we demonstrate that our distance measure maintains a lower bound on the Euclidean distance. Using this distance measure significantly reduces the classification error rate, improves efficiency, and enhances accuracy in similarity calculations.
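The lower-bounding property can also be checked numerically. The sketch below does this for the plain PAA distance, which is known to lower-bound the Euclidean distance when segments have equal length; it is an illustrative check of the same kind of guarantee, not the paper's proof for the SAX-DM measure itself.

```python
import numpy as np

def paa(x, w):
    """Piecewise Aggregate Approximation: the w segment means."""
    return np.array([s.mean() for s in np.array_split(np.asarray(x, float), w)])

def paa_dist(x, y, w):
    """PAA distance; lower-bounds the Euclidean distance when the
    series length n is divisible by w (equal-length segments)."""
    n = len(x)
    return np.sqrt(n / w) * np.linalg.norm(paa(x, w) - paa(y, w))

# Empirical check of the lower bound on random series.
rng = np.random.default_rng(0)
for _ in range(200):
    x, y = rng.normal(size=64), rng.normal(size=64)
    assert paa_dist(x, y, 8) <= np.linalg.norm(x - y) + 1e-9
```

A distance with this property never produces false dismissals when used to prune candidates in a similarity search, which is why the lower bound matters for the query tasks discussed above.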

Data Availability Statement:
The data used in this experiment can be accessed through the URL https://www.cs.ucr.edu/~eamonn/time_series_data_2018/, accessed on 10 June 2023.