Distance- and Momentum-Based Symbolic Aggregate Approximation for Highly Imbalanced Classification

Time-series representation is the most important task in time-series analysis. One of the most widely employed time-series representation methods is symbolic aggregate approximation (SAX), which converts the results of piecewise aggregate approximation to a symbol sequence. SAX is a simple and effective method; however, it only focuses on the mean value of each segment of the time-series. Here, we propose a novel time-series representation method, distance- and momentum-based symbolic aggregate approximation (DM-SAX), that can capture time-series distributions by calculating the perpendicular distance from the time-axis to each data point and consider the time-series trend by adding a momentum factor reflecting the direction of previous data points. Experimental results for 29 highly imbalanced classification problems on the UCR datasets revealed that DM-SAX achieves the best area under the curve (AUC) among competing time-series representation methods (SAX, extreme-SAX, overlap-SAX, and distance-based SAX). We statistically verified that the performance improvements correspond to significant differences in the rankings. In addition, DM-SAX yielded the best AUC for a real-world wire cutting and crimping process dataset. Meaningful data points such as outliers can also be identified in a time-series outlier detection framework via the proposed method.


Introduction
A time-series is a collection of temporal data and is one of the most frequently generated data types in real-world applications. Thus, time-series analysis has been a crucial task in real-world data-mining research, since time-series can be easily obtained from various data sources. To appropriately analyze a time-series, the most important task is time-series representation, which involves the extraction of feature values from the time-series. Generally, a time-series consists of continuous values and can be enormously long; thus, extracting feature values that can summarize the given time-series is a crucial task.
The most widely employed approach for time-series representation is dimensionality reduction [1][2][3][4][5][6]. One of the earliest dimensionality reduction approaches is sampling [1], in which a single data point is selected for each time-series segment and considered the feature value that represents the corresponding segment. Although the sampling method is easy to implement, representing each segment of the time-series with only a single data point is difficult, particularly when there are numerous data points in each segment. To improve on the sampling method, extracting a feature value that can effectively represent a set of data points in each time-series segment has received significant attention. One notable method is piecewise aggregate approximation (PAA) [2], which computes the mean value of each segment in a time-series to represent the corresponding set of data points. PAA has been demonstrated to be effective for time-series representation. Consequently, various extensions have been introduced for time-series representation [3][4][5][6].

Conventional SAX
The conventional SAX represents and preserves time-series information using alphabetical symbols. It is well known for its effective representation of high-dimensional time-series while maintaining the properties of the given data points in the time-series [11]. Figure 1 presents the SAX procedure. The first phase employs dimensionality reduction using PAA [2]. As shown in Figure 1a, the time-series is divided into segments of a certain length, and each segment is summarized by the mean value of the data points it includes. Therefore, the time-series vector X = [x_1, . . . , x_N] of length N is converted into a PAA vector X_PAA = [x̄_1, . . . , x̄_S], where x̄_i = (S/N) Σ_{j=(N/S)(i−1)+1}^{(N/S)i} x_j, i ranges from 1 to S, and x_j is the jth element of X. Here, the constant N/S is called the time segment size (t_size), which is used as the main PAA hyperparameter.
The second phase involves discretizing the PAA values, as shown in Figure 1b. In this phase, the previously generated PAA vector is discretized by mapping each element of X_PAA into one of the discretization regions in accordance with its value. Note that the discretization regions are defined under a standard normal distribution, with the area of each region being equal to satisfy equiprobability. For instance, Figure 1b demonstrates a case with an alphabet size of 7, indicating that ±1.07, ±0.57, and ±0.18 are the 'breakpoints' of the separations, and that each alphabet (a, b, c, d, e, f, and g) occupies 14.3% of the area under the standard normal distribution. Table 1 lists the breakpoints. Finally, each element is converted to an alphabetical symbol, which becomes the represented value for the corresponding element of X_PAA. Here, the number of discretization regions is called the number of bins (n_bins), which is employed as the main SAX hyperparameter.
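The two phases above (PAA averaging, then equiprobable discretization) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `paa` and `sax`, and the use of `statistics.NormalDist` to compute the breakpoints, are our own choices.

```python
import numpy as np
from statistics import NormalDist

def paa(x, t_size):
    """Piecewise aggregate approximation: the mean of each segment of length t_size."""
    x = np.asarray(x, dtype=float)
    n_seg = len(x) // t_size
    return x[: n_seg * t_size].reshape(n_seg, t_size).mean(axis=1)

def sax(x, t_size, n_bins):
    """Convert a z-normalized time-series to a SAX word (supports n_bins <= 10)."""
    # Equiprobable breakpoints under the standard normal distribution,
    # e.g. n_bins = 7 gives approximately ±1.07, ±0.57, ±0.18.
    breakpoints = [NormalDist().inv_cdf(k / n_bins) for k in range(1, n_bins)]
    symbols = np.array(list("abcdefghij"[:n_bins]))
    # searchsorted assigns each PAA value to its discretization region.
    return "".join(symbols[np.searchsorted(breakpoints, paa(x, t_size))])

word = sax([-2, -2, -2, 0, 0, 0, 2, 2, 2], t_size=3, n_bins=7)  # -> "adg"
```

With t_size = 3 the example series is reduced to the PAA vector [−2, 0, 2], which falls in the lowest, middle, and highest of the seven regions, respectively.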

Real-World Applications of SAX
SAX is a popular time-series representation method that has been extensively studied in real-world applications. In general, SAX research in real-world applications falls into two topics: pattern discovery and prediction. Pattern discovery is considered first. Park and Jung [13] proposed a pattern-discovery framework that combined SAX with association rule mining (ARM). In the SAX-ARM method, time-series generated from sensors in a die-casting process are converted to symbols. Then, apriori, one of the most widely employed ARM algorithms, extracts deviant patterns from those symbols. Ferreira et al. [14] suggested adaptive SAX (ASAX) to analyze heat-wave patterns from daily information. The suggested approach applies SAX after time-series segments are automatically adapted by considering the difference between the current and average values. Similarly, Wu and Lee [15] introduced an algorithm called closed flexible patterns (CFP) for mining closed flexible patterns by utilizing SAX. CFP employs SAX to convert time-series into symbols; subsequently, frequent patterns are extracted through a depth-first search. Ohsaki et al. [16] suggested a rule discovery support system for sequential medical data. The proposed system utilizes SAX to extract patterns of glutamic pyruvic transaminase (GPT) from data obtained from patients with hepatitis. Also in the medical field, Tseng et al. [17] proposed a SAX modification to identify novel genetic relationships by mining similar subsequences in microarray data. Ordóñez [18] proposed a novel pattern-visualization algorithm that can differentiate between medical conditions such as renal and respiratory failure. The proposed algorithm applies SAX to help interpret time-series data obtained from the pediatric intensive care unit (PICU). Yaik et al. [19] employed SAX to identify frequent patterns generated in CPU traces; by using SAX, their method can predict more steps ahead than the conventional prediction technique (i.e., network weather services (NWS)).
Another research topic is prediction. Pouget et al. [20] suggested an approach that can detect attacks that occurred on the Internet. This approach uses SAX to transform data collected in a honeypot platform into symbols, which are used to detect attacks by systematically identifying similarities between the time signatures of the attack tools. On the other hand, Zoumboulakis and Roussos [21] proposed a novel method to detect complex events in sensor networks. Here, the real-valued sensor data are converted to symbols via SAX representation, and complex events that are difficult or impossible to describe using conventional SQL-like languages are detected using distance metrics. Meanwhile, McGovern et al. [22] introduced a prediction system that can detect severe weather conditions such as tornados. The introduced system applies SAX to convert large multidimensional time-series into symbolic representations. Symbols that satisfy the predefined probability of detection (POD) and false alarm ratio (FAR) are selected to create rules that can identify tornados. Ciompi et al. [23] adopted a technique for the automatic detection of diseased regions of vessels using intravascular ultrasound (IVUS) sequences. Morphological profiles from IVUS were obtained using the proposed technique. Thereafter, SAX was applied to convert morphological profiles to discrete codewords, which were used in the selection of keyframes that can detect unhealthy regions of the vessel. Shie et al. [24] proposed an online treatment system for panic patients by combining biofeedback therapy and web technologies. Numerical biofeedback data are transformed into symbolized sequence data by employing SAX, and the classify-by-sequence (CBS) algorithm is applied to detect whether the treatment is suitable. Morgan et al. [25] proposed an anomaly detection algorithm for marine engines. Here, the measured iron concentrations from the cylinder of the engine are collected. 
Then, these measurements are converted to symbols by applying SAX, and support vector machines (SVM) are employed to detect unexpected concentrations in the engine. He et al. [26] proposed an analog circuit fault detection system using SAX. In the proposed system, data are collected from four op-amp bi-quad low-pass filter circuits and then converted to symbols to detect the type of fault.

Variations of SAX
SAX results in an appropriate time-series representation. However, as previously discussed, SAX is based on the PAA representation. Therefore, it only symbolizes the mean value of each segment in the time-series, and this representation might cause information loss. Fuad and Marwan [27] proposed extreme-SAX (E-SAX), in which the symbols represent each segment more precisely than those of conventional SAX by considering only the minimum and maximum data points of the segment. Lkhagva et al. [28] proposed extended-SAX to reflect the trend of time-series containing a few critical data points, such as financial time-series. The proposed approach offsets the negative effect of considering only the mean value of the segment by adding the minimum and maximum values to the mean value. Lin et al. [29] proposed bag-of-patterns (BOP), which constructs a histogram of SAX words using the bag-of-features (BOF) framework. Thereafter, classification is performed by comparing the histograms to identify the nearest neighbor in the training set. One popular variation of BOP is the symbolic aggregate approximation and vector space model (SAX-VSM) [30], which introduces term frequency-inverse document frequency (TF-IDF) to assign weights to SAX words. Each SAX word has a different weight for each class to optimize the similarity computation. The major contribution of SAX-VSM is a parameter selection optimization method, DIRECT, to accelerate the SAX parameter search. Fuad and Marwan [31] suggested overlap-SAX (O-SAX) to include the trend information of a given time-series. The last data point of the previous segment and the first data point of the following segment are swapped to consider the trend of the data points. Song et al. [32] proposed a novel approach referred to as transitional-SAX (T-SAX) to incorporate transitional information into conventional SAX.
To retain meaningful information, the proposed approach keeps the upward and downward transitional information by tracing the data points traveling from the current quantile region to the next location. Sun et al. [33] suggested SAX-based trend distance (SAX-TD) to reflect the trend of the time-series using the first and last data points of a segment. Yin et al. [34] proposed trend feature symbolic approximation (TFSA) to enhance the classification performance of SAX. In this approach, a two-stage segmentation method for fast segmentation of long time-series is applied, and the experimental results demonstrate that it achieves better segmentation and classification accuracy than SAX. Malinowski et al. [35] proposed a novel algorithm, 1d-SAX, that outperformed SAX while retaining the compression ratio. In the algorithm, linear regression is applied to sub-segments of the time-series; then, symbols are created from the mean and slope values. Fuad and Marwan [36] proposed genetic-algorithm SAX (GASAX), which employs a genetic algorithm to determine the nearly optimal configuration of breakpoints that provides the best fitness during the SAX process. Additional variations of SAX are described in [37][38][39][40][41].

D-SAX
The conventional SAX approach results in an appropriate time-series representation. However, SAX is based on the PAA representation, minimizing the dimensionality by calculating the mean values of equal-sized segments. This implies that the mean value-based representation might overlook some important values in industrial time-series, such as outliers. In this section, we propose a two-stage time-series representation method that can summarize the time-series better than the conventional SAX algorithm.
The first stage of representing time-series in the proposed method involves the consideration of the distribution of the time-series by computing the perpendicular distance from the time-axis to each data point in the segment. Note that the perpendicular distance from the time-axis to a data point is simply the absolute value of the data point. For instance, the 2nd value in Figure 2 is −2; hence, the perpendicular distance from the time-axis to −2 is 2. By considering the distribution of the time-series, information carried by important data points such as outliers can be preserved. Therefore, the time-series vector X = [x_1, . . . , x_N] of length N is converted into a distance-based PAA (D-PAA) vector X_D-PAA = [x̄_1, . . . , x̄_S], where x̄_i = (S/N) Σ_{j=(N/S)(i−1)+1}^{(N/S)i} |x_j|, i ranges from 1 to S, and |x_j| is the absolute value of the jth element of X. In the next phase, the same discretization and symbolization are processed as in SAX. In this study, we refer to this method as distance-based SAX (D-SAX). The process of D-SAX is shown in Figure 2.
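The D-PAA stage described above differs from plain PAA only in that absolute values replace raw values before averaging. A minimal sketch (the helper name `d_paa` is ours, not the paper's):

```python
import numpy as np

def d_paa(x, t_size):
    """Distance-based PAA: the mean perpendicular distance from the time-axis,
    i.e., the mean absolute value of each segment of length t_size."""
    x = np.abs(np.asarray(x, dtype=float))  # perpendicular distance = |x|
    n_seg = len(x) // t_size
    return x[: n_seg * t_size].reshape(n_seg, t_size).mean(axis=1)

# A segment oscillating around zero averages to ~0 under plain PAA,
# but its distance-based summary preserves the magnitude of the swings.
values = d_paa([-2, 2, -2, 1, -1, 1], t_size=3)  # -> [2.0, 1.0]
```

The discretization and symbolization steps are then applied to these values exactly as in conventional SAX.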

DM-SAX
Although considering the distribution of data points in a time-series is effective, it does not reflect the trend of the time-series. As shown in Figure 3a, data points in the first segment show an increasing trend, while data points in the second segment show a decreasing trend. Considering only the distribution of data points by calculating the perpendicular distance from the time-axis to the data points is therefore not sufficient to appropriately represent a given time-series. The second stage of representing a time-series in the proposed method involves adding a momentum factor to consider the time-series trend. The momentum factor is computed as m_t = a · m_{t−1} + η · (x_t − x_{t−1}), where t runs within the ith segment up to t = (N/S)i, m_{(N/S)(i−1)+1} = 0, and x_t is the tth element of X. Note that a is the hyperparameter reflecting the direction of previous data points, and η is the hyperparameter controlling the gradient of the current and previous data points.
The trend of data points is effectively reflected by incorporating the momentum factor, which captures the direction of the time-series. After adding the momentum factor to the D-PAA process, the time-series vector X = [x_1, . . . , x_N] of length N is converted into a distance- and momentum-based PAA (DM-PAA) vector X_DM-PAA = [x̄_1, . . . , x̄_S], where x̄_i = (S/N) Σ_{j=(N/S)(i−1)+1}^{(N/S)i} (|x_j| + m_j). In the next phase, the same discretization and symbolization are processed in the same manner as in SAX. In this study, we refer to this method as DM-SAX. Figure 3 shows the process of DM-SAX.
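The DM-PAA computation can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: it assumes the momentum recursion m_t = a · m_{t−1} + η · (x_t − x_{t−1}) with m reset to zero at the start of each segment, and the helper name `dm_paa` is ours.

```python
import numpy as np

def dm_paa(x, t_size, a=0.9, eta=0.01):
    """Distance- and momentum-based PAA (sketch).

    Within each segment the momentum factor is accumulated as
    m_t = a * m_{t-1} + eta * (x_t - x_{t-1}), starting from m = 0,
    and the segment value is the mean of |x_t| + m_t.
    """
    x = np.asarray(x, dtype=float)
    n_seg = len(x) // t_size
    out = []
    for i in range(n_seg):
        seg = x[i * t_size : (i + 1) * t_size]
        m, total = 0.0, 0.0
        for t in range(t_size):
            if t > 0:  # momentum is reset to 0 at each segment start
                m = a * m + eta * (seg[t] - seg[t - 1])
            total += abs(seg[t]) + m
        out.append(total / t_size)
    return np.array(out)
```

With eta = 0 the momentum term vanishes and the result reduces to D-PAA; with the default hyperparameters (a = 0.9, η = 0.01, the values used in the experiments), an increasing segment is nudged upward relative to its distance-only summary.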

Experimental Validation
In this section, we experimentally evaluated whether the proposed DM-SAX is superior to other methods on various datasets provided by the University of California-Riverside (UCR) time-series classification archive [46], a well-known data repository for time-series data mining research, and real-world manufacturing processes.

Experimental Design
The comparative classification performances of five time-series representation methods (SAX, extreme-SAX (E-SAX), overlap-SAX (O-SAX), D-SAX, and DM-SAX) are presented on 29 different highly imbalanced datasets taken from the UCR time-series classification archive. This archive originally contained 128 datasets involving various numbers of data points, input features, and classes. For highly imbalanced classification, which is the scope of our study, we converted the class with the smallest number of data points to a positive class, whereas the other classes were converted to a negative class. Then, we calculated the imbalance ratio for each dataset (i.e., the proportion of the number of data points in the negative class to the number of data points in the positive class). Afterward, datasets with imbalance ratios greater than 10 were selected for this experiment, reducing the number of datasets from 128 to 29. Note that the datasets were originally divided into training and test sets. Table 2 lists the datasets used.
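The dataset selection rule above can be made concrete with a short sketch (the helper `imbalance_ratio` and the label counts are illustrative, not from the paper):

```python
from collections import Counter

def imbalance_ratio(labels):
    """Relabel the smallest class as positive and the rest as negative,
    then return the imbalance ratio #negative / #positive."""
    counts = Counter(labels)
    minority = min(counts, key=counts.get)  # class with fewest data points
    n_pos = counts[minority]
    n_neg = len(labels) - n_pos
    return n_neg / n_pos

# Hypothetical dataset: 44 negative-class and 4 positive-class data points.
ratio = imbalance_ratio([0] * 44 + [1] * 4)  # -> 11.0
keep = ratio > 10  # selected for the experiment, per the rule above
```

Any dataset whose ratio exceeds 10 is retained, which is how the archive shrinks from 128 to 29 datasets.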
The experiment was controlled such that a random forest with 20 iterations was used as the base classifier, since it is well known for its stable predictive performance [47][48][49][50]. As previously discussed, t_size and n_bins are the two main hyperparameters of SAX. In this experiment, we set t_size to 3 and 5 and n_bins to 4, 6, 8, and 10; thus, a total of eight experiments were conducted. Note that segments containing a positive-class label are labeled as positive in the PAA process. For example, with t_size set to 3, the class sequences 0, 1, 0 and 0, 0, 0 are converted to 1 and 0, respectively. For DM-SAX, a and η were fixed at 0.9 and 0.01, respectively. The area under the curve (AUC) was employed as the performance measure because it is regarded as a comprehensive and balanced metric that better reflects classification performance on highly imbalanced data [51,52].

Table 3 summarizes the results of the performance benchmarks. The AUCs were obtained by averaging the results of the eight experiments mentioned above. The highest AUCs obtained for each dataset are highlighted in bold. On average, DM-SAX achieved the highest AUC, 73.44%, followed by D-SAX, E-SAX, O-SAX, and SAX. Moreover, DM-SAX demonstrated the best performance with a mean rank value of 2.24. Specifically, DM-SAX outperformed the other methods in 10 out of 29 datasets. Furthermore, considering both the distribution and trend of the time-series was more beneficial than considering only the distribution in 16 out of 29 datasets.
Note that DM-SAX was superior to conventional SAX particularly when the dataset was difficult to classify; DistalPhalanxTW, MiddlePhalanxTW, Phoneme, PigArtPressure, and ProximalPhalanxTW are typical cases in point. It may be difficult to attribute these comparative results to a specific factor. Nevertheless, the results indicate that representing the time-series by calculating the perpendicular distance from the time-axis to each data point and computing the trend of the data points yields a representation that can appropriately deal with 'hard-to-classify' problems.
The Friedman omnibus test [53] was first performed on the rank values of the classification performances of each competing method across the datasets to verify the statistical significance of the differences between the methods. The p-value (<0.5 × 10^−4) was less than the alpha risk of 0.05, indicating statistically significant differences in the rankings of the AUCs of the time-series representation methods. Subsequently, a post-hoc Wilcoxon signed-rank test was employed for the pairwise comparisons. Table 4 presents the test results. Although there was no statistically significant difference between DM-SAX and D-SAX, DM-SAX significantly outperformed SAX, E-SAX, and O-SAX. This indicates that the computation of the time-series trend improved upon the method that only considered the distribution of the time-series. Figure 4 shows the ratio of each algorithm included in the top-n ranks by AUC. DM-SAX was the top-performing algorithm in 31% (73/232) of the repeated experiments across the 29 datasets, and it was at least the 2nd-ranked algorithm in 59% (137/232) of the results. Overall, DM-SAX showed better classification performance than the other methods.
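The two-step testing procedure (a Friedman omnibus test followed by post-hoc pairwise Wilcoxon signed-rank tests) can be sketched with SciPy. The AUC values below are synthetic, generated only to illustrate the procedure; they do not reproduce Table 3 or Table 4.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# Synthetic per-dataset AUCs for four baseline methods (rows: 29 datasets).
rng = np.random.default_rng(0)
baselines = rng.uniform(0.6, 0.9, size=(29, 4))
# A fabricated "DM-SAX" column that always wins by a small margin.
dm_sax = baselines.max(axis=1) + 0.01

# Step 1: Friedman omnibus test across all five methods.
stat, p = friedmanchisquare(dm_sax, *(baselines[:, j] for j in range(4)))
print(f"Friedman statistic = {stat:.2f}, p = {p:.2e}")

# Step 2: only if the omnibus test is significant, run post-hoc
# pairwise Wilcoxon signed-rank tests against the best method.
if p < 0.05:
    for j in range(4):
        _, p_pair = wilcoxon(dm_sax, baselines[:, j])
        print(f"baseline {j} vs DM-SAX: p = {p_pair:.4f}")
```

Because the Friedman test operates on within-dataset ranks, it matches the rank-based comparison reported above; the Wilcoxon tests then localize which pairwise differences drive the omnibus result.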

Experimental Design
A manufacturing process dataset compiled from the cutting and crimping process in wiring harness manufacturing was used to further demonstrate the applicability of the proposed DM-SAX. A wiring harness is used to transmit electrical signals between control devices in a vehicle. To produce a wiring harness, a cutting machine cuts the wire to a certain length. Then, both ends of the wire are connected to terminals and pressed using an applicator.
The dataset was collected from 20:38 on 19 July to 13:02 on 22 July 2021, comprising 285,297 data points, with each consecutive 100 data points representing approximately 1 min. Failures were recorded at 656 data points, and the imbalance ratio was 433, indicating a highly imbalanced dataset. In this section, three features (B/S, RCFA, and MPP) are used to predict whether the products prepared by wire cutting and crimping are normal or abnormal. Tables 5 and 6 list a brief description of and detailed statistical information on these features, respectively. Although the overall experimental design was almost the same as that for the UCR datasets, there were two major differences. One is the training and test split criterion. As previously mentioned, the UCR datasets were originally divided into training and test sets; in contrast, we divided the real-world dataset into training and test sets in a 0.7:0.3 ratio. The other difference is that we set t_size to 25, 50, 75, 100, and 150 for the real-world dataset, values larger than those used in the UCR experiments. Table 7 summarizes the detailed similarities and differences between the experiments on the real-world and UCR datasets.

Table 8 lists the experimental results, and the best AUCs for each case are marked in bold. The results demonstrate that DM-SAX obtained the best AUC (98.88%), followed by D-SAX, E-SAX, O-SAX, and SAX. In addition, DM-SAX demonstrated the best performance, outperforming the other methods in 10 out of 20 experimental cases, with a mean rank value of 1.15. Note that DM-SAX outperformed D-SAX particularly when t_size was larger than 100. This implies that the addition of a momentum factor has a favorable effect when there are sufficient data points to reflect the overall trend of the time-series.

Conclusions
In this study, we developed a novel time-series representation method, called DM-SAX, and compared it with other well-known time-series representation methods. The proposed method preserves the time-series distribution by computing the perpendicular distance from the time-axis to the data points and considers the trend of the time-series by employing a momentum factor that reflects the direction of previous data points.
The experimental results on 29 UCR problems showed that DM-SAX exhibited the best AUC among the competing methods. Moreover, we empirically verified that DM-SAX is superior to the other methods using real-world wire cutting and crimping process data. The proposed method could also be applied to real-time defect detection in industrial processes: if the symbols generated by the proposed method fall in the outermost discretization regions, one can readily determine that those symbols represent defects. Furthermore, the proposed method can also be employed in unsupervised learning, such as human behavior pattern discovery, traffic pattern discovery, and failure rule discovery.
As an extension of the proposed method, a new type of factor that can further represent the characteristics of a given time-series will be developed in the future. In particular, an additional factor related to the momentum factor that could better reflect the trend of the data will be considered. In addition, a heuristic method for selecting a and η is another future research topic. The current configuration (a = 0.9, η = 0.01) may have overlooked the optimal trend of the time-series; thus, it is necessary to investigate various search algorithms.