A Metric Learning-Based Univariate Time Series Classification Method

Abstract: High-dimensional time series classification is a challenging problem. Distance-based similarity measures are one family of methods for time series classification. This paper proposes a metric learning-based univariate time series classification method (ML-UTSC), which uses a Mahalanobis matrix obtained by metric learning to calculate the local distance between multivariate time series and combines Dynamic Time Warping (DTW) with nearest neighbor classification to achieve the final classification. In this method, the features of a univariate time series are represented as multivariate time series data consisting of the mean value, variance, and slope. Next, a three-dimensional Mahalanobis matrix is learned from the data by metric learning. The time series is divided into segments of equal intervals so that the Mahalanobis matrix describes the features of the time series data more accurately. Experimental results show that, compared with the most effective existing measures, the proposed algorithm has a lower classification error rate on most of the test datasets.


Introduction
Time series data are widely used in the real world, such as in the stock market [1], medical diagnosis [2], sensor detection [3], and marine biology [4]. With the deepening of studies on machine learning and data mining, time series is becoming a popular research field. Due to the high dimensionality and noise of time series data, in general, before analyzing the time series, dimension reduction and denoising of time series are very necessary. There are many common methods to reduce dimensionality and remove noise such as discrete wavelet transform (DWT) [5], discrete Fourier transform (DFT) [6], singular value decomposition (SVD) [7], piecewise aggregate approximation (PAA) [8], piecewise linear representation (PLR) [9], and symbolic aggregate approximation (SAX) [10].
Distance-based time series classification algorithms, such as k-nearest neighbors (k-NN) [11] and support vector machines (SVM) [12], depend on the similarity measure between time series. The measures commonly used for time series include the Euclidean distance, the Mahalanobis distance [13], and the DTW distance [14][15][16]. The Euclidean distance is the most common point-to-point distance and is efficient and easy to calculate. However, its disadvantage is that it requires series of equal length sampled at equal intervals. Unlike the Euclidean distance, DTW can calculate the distance between series of different lengths and intervals. DTW seeks the warping path with the minimum cumulative distance between two series and calculates the similarity by stretching or shrinking the time axis. It can therefore also accommodate distortion or translation of the series. However, the complexity of DTW is high, and its efficiency is low when long sequences are compared.
Mahalanobis distance is used to measure multivariable time series data. The traditional Mahalanobis matrix, based on covariance matrix inversion, is generally used to reflect the internal aggregation relations of data. However, in most classifications, it is not suitable for using a distance metric because it only reflects the internal aggregation, whereas it is more important to establish the relation between sample attributes and classification. In [17][18][19], metric learning is used to solve measurement problems in multivariate time series similarity, and a better result is obtained. Distance metric learning is used to obtain a Mahalanobis matrix that can reflect distances between data effectively by learning from training samples. In a new feature space, distributions of intraclass samples are closer, while interclass samples are spread further. Common distance metric learning methods include probabilistic global distance metric learning (PGDM) [20], large margin nearest neighbor learning (LMNN) [21], and information-theoretic metric learning (ITML) [22].
In recent years, many univariate time series classification methods have been proposed. The SAX and SAX_TD [23] algorithms are based on feature representations. In SAX and SAX_TD, the series is first reduced with PAA, the value range is divided into intervals of equal probability, and each interval is represented by a symbol, which transforms the time series into a symbol string. To some extent, SAX can compress the data length and reduce the dimensions. However, due to the adoption of PAA, the peak information is lost, resulting in low accuracy. LCSS [24] and EDR [25], which are variants of DTW, share its problem of high time complexity. Ye and Keogh [26] and Grabocka et al. [27] presented shapelet-based algorithms, which have high time complexity because a large number of shapelet candidates must be generated. We can conclude that there are three main problems with the above algorithms: (1) how to treat high-dimensional time series data; (2) how to find a suitable distance measure to improve classification accuracy; and (3) how to compare time series of unequal length.
To address these problems, a novel method, ML-UTSC, is proposed in this paper to classify univariate time series data. First, PLR is adopted to reduce the dimensions of the time series. Compared to PAA, PLR maintains the series tendency and peak information. Second, the mean value, variance, and slope of the fitting lines are calculated to form a triple. The univariate time series is thus transformed into a multivariate time series, and metric learning is used to learn the Mahalanobis matrix. Finally, the Mahalanobis matrix is combined with DTW to calculate the distance between the multivariate time series.
In this work, we make three main contributions. First, the problem of classifying univariate time series data by metric learning is proposed for the first time. Second, the time series is divided into equal-interval segments to ensure the consistency of the univariate feature representation. Third, the experimental results show that the Mahalanobis matrix obtained by metric learning has a better classification effect.
The rest of the article is organized as follows. The related background knowledge is introduced in the second part. The ML-UTSC algorithm is described in the third part. The experimental comparison results and analysis are given in the fourth part. The fifth part concludes the manuscript.

Dimension Reduction
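PLR represents a time series by piecewise linear fitting. It can compress a time series of length n into k straight lines (k < n), which may make data storage and calculation more efficient. Least-squares linear fitting is one of the most effective PLR methods. The linear regression is described by the following equation:

$$\hat{y}_i = \beta_0 + \beta_1 x_i \qquad (1)$$

where $\hat{y}_i$ is the fitted value at $x_i$, $\beta_0$ is the intercept, and $\beta_1$ is the corresponding slope. The error is only related to $\hat{y}_i$ and $y_i$. The fitting error of the least-squares fitting method [8] is shown in the following equation:

$$Q = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \qquad (2)$$

With Q defined as in (2), Q attains its minimum where its partial derivatives with respect to $\beta_0$ and $\beta_1$ are both zero:

$$\frac{\partial Q}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0, \qquad \frac{\partial Q}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0 \qquad (3)$$

The equations in (3) form a linear system in $\beta_0$ and $\beta_1$ that is easy to solve.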
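As an illustration of the least-squares line fit described above, the following minimal sketch solves the two normal equations in closed form. It assumes that the x-coordinates are the sample indices 0, 1, ..., n-1; the function name `fit_line` is hypothetical and not part of the original method description.

```python
def fit_line(y):
    """Least-squares line through the points (i, y[i]), i = 0..len(y)-1.

    Returns (beta0, beta1, q): intercept, slope, and the sum of squared
    residuals (the fitting error Q).
    """
    n = len(y)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(y) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (v - y_mean) for x, v in zip(xs, y))
    # A single point has no slope information; fall back to a flat line.
    beta1 = sxy / sxx if sxx > 0 else 0.0
    beta0 = y_mean - beta1 * x_mean
    q = sum((v - (beta0 + beta1 * x)) ** 2 for x, v in zip(xs, y))
    return beta0, beta1, q
```

For points that lie exactly on a line, such as `fit_line([1, 3, 5, 7])`, the fitting error is zero up to floating-point round-off.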

Metric Learning
In studies on metric learning [17], the Mahalanobis matrix is not defined as the inverse of the covariance matrix but is obtained by metric learning. Given two multivariate vectors $x_i$ and $x_j$ and a positive semidefinite matrix M, called the Mahalanobis matrix, the Mahalanobis distance can be formalized as follows:

$$D_M(x_i, x_j) = \sqrt{(x_i - x_j)^T M (x_i - x_j)} \qquad (4)$$

where $D_M(x_i, x_j)$ is the Mahalanobis distance between $x_i$ and $x_j$. Distance metric learning obtains a metric matrix that reflects the distances between the data by learning from a given training sample set. The goal of metric learning is to determine the matrix M. To ensure that the distance is nonnegative and satisfies the triangle inequality, M should be a symmetric positive semidefinite matrix. That is, there exists a matrix P such that $M = P P^T$.
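The following minimal sketch illustrates the Mahalanobis distance for a given matrix M and the construction of a positive semidefinite M from a factor P. It assumes the square-root form of the distance; the function names `mahalanobis` and `gram` are hypothetical, and matrices are plain lists of rows.

```python
import math

def mahalanobis(x, y, M):
    """sqrt((x - y)^T M (x - y)) for vectors x, y and a square matrix M."""
    diff = [a - b for a, b in zip(x, y)]
    size = len(diff)
    quad = sum(diff[r] * M[r][c] * diff[c]
               for r in range(size) for c in range(size))
    # Clamp tiny negative values caused by floating-point round-off.
    return math.sqrt(max(quad, 0.0))

def gram(P):
    """M = P P^T, which is symmetric positive semidefinite for any real P."""
    rows, cols = len(P), len(P[0])
    return [[sum(P[r][k] * P[c][k] for k in range(cols)) for c in range(rows)]
            for r in range(rows)]
```

With M set to the identity matrix, `mahalanobis` reduces to the Euclidean distance, which is a convenient sanity check.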
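PGDM is a typical algorithm that transforms metric learning into a constrained convex optimization problem. Taking the chosen pair constraints on the training samples as the constraint conditions, the main idea is to minimize the distance between intraclass sample pairs while the distance between interclass sample pairs is kept greater than a certain value. With $\mathcal{S}$ denoting the set of intraclass pairs and $\mathcal{D}$ the set of interclass pairs, the optimization model is as follows:

$$\min_{M} \sum_{(x_i, x_j) \in \mathcal{S}} D_M(x_i, x_j)^2 \quad \text{s.t.} \quad \sum_{(x_i, x_j) \in \mathcal{D}} D_M(x_i, x_j) \ge 1, \quad M \succeq 0 \qquad (5)$$

That is, a Mahalanobis matrix M is sought such that the sum of squared distances over the intraclass pairs $x_i$ and $x_j$ is minimized, subject to the constraints that the distance between the interclass pairs $x_i$ and $x_j$ is at least 1 and that M is positive semidefinite. The PGDM loss function is then

$$g(M) = \sum_{(x_i, x_j) \in \mathcal{S}} D_M(x_i, x_j)^2 - \log\Big( \sum_{(x_i, x_j) \in \mathcal{D}} D_M(x_i, x_j) \Big) \qquad (6)$$

Minimizing the loss function in (6) over positive semidefinite M is equivalent to solving the optimization model in (5). It is a convex optimization problem that can be solved with methods such as Newton's method and quasi-Newton methods.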
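To make the trade-off in the PGDM loss concrete, the sketch below evaluates one common form of the loss for a fixed M: the sum of squared intraclass distances minus the logarithm of the summed interclass distances. This form, the square-root distance, and the names `pgdm_loss`, `similar`, and `dissimilar` are assumptions made for illustration; the sketch only evaluates the loss and does not optimize it.

```python
import math

def _dist(x, y, M):
    """Mahalanobis distance sqrt((x - y)^T M (x - y)); M is a list of rows."""
    diff = [a - b for a, b in zip(x, y)]
    size = len(diff)
    quad = sum(diff[r] * M[r][c] * diff[c]
               for r in range(size) for c in range(size))
    return math.sqrt(max(quad, 0.0))

def pgdm_loss(M, similar, dissimilar):
    """Loss for a fixed M.

    similar:    list of (x_i, x_j) pairs from the same class
    dissimilar: list of (x_i, x_j) pairs from different classes
    The loss is undefined if all dissimilar pairs have zero distance.
    """
    pull = sum(_dist(a, b, M) ** 2 for a, b in similar)   # keep classes tight
    push = sum(_dist(a, b, M) for a, b in dissimilar)      # keep classes apart
    return pull - math.log(push)
```

A smaller value means that intraclass pairs are close while interclass pairs remain far apart under the current M.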

Dynamic Time Warping
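For two time series q = {q1, q2, …, qm} and c = {c1, c2, …, cn}, an m × n matrix D is constructed, where $d_{ij}$ is the Euclidean distance between $q_i$ and $c_j$. DTW finds an optimal path w = {w1, w2, …, wK}, where $w_k = (i, j)$ gives the positions of the k-th pair of aligned elements, with i ∈ [1:m], j ∈ [1:n], and k ∈ [1:K]. The DTW distance between q and c is

$$DTW(q, c) = \min_{w} \sum_{k=1}^{K} d_{w_k} \qquad (7)$$

The optimal path w can be obtained through dynamic programming with the distance matrix D:

$$R(i, j) = d_{ij} + \min\{R(i-1, j-1),\ R(i-1, j),\ R(i, j-1)\} \qquad (8)$$

where R(i, j) is the minimum cumulative distance up to position (i, j), and R(m, n) is the minimum distance between the time series q and c.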
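The dynamic programming procedure described above can be sketched as follows. This is a minimal, unoptimized version without a warping window; the function name `dtw` and the pluggable `local` distance argument are assumptions made for illustration.

```python
def dtw(q, c, local=lambda a, b: abs(a - b)):
    """DTW distance between sequences q and c.

    local(a, b) is the point-to-point distance; for scalars the absolute
    difference coincides with the Euclidean distance.
    """
    m, n = len(q), len(c)
    inf = float("inf")
    # R[i][j] holds the minimum cumulative distance aligning q[:i] with c[:j].
    R = [[inf] * (n + 1) for _ in range(m + 1)]
    R[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = local(q[i - 1], c[j - 1])
            R[i][j] = d + min(R[i - 1][j - 1], R[i - 1][j], R[i][j - 1])
    return R[m][n]
```

Because one element may be aligned with several elements of the other series, `dtw([1, 2, 3], [1, 2, 2, 3])` is zero even though the two series have different lengths.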

Least Squares Fitting
The univariate time series x and y are given as

$$x = \{x_1, x_2, \ldots, x_m\}, \qquad y = \{y_1, y_2, \ldots, y_n\}$$

where m and n are the lengths of the time series. To reduce the series dimensions, least-squares fitting is performed and the time series is divided into several segments. The bottom-up least-squares fitting of a time series proceeds in two steps. First, each point is taken as a basic unit and adjacent points are combined into initial segments. Then, each candidate segment is fit, and adjacent segments continue to be combined as long as the fitting error Q in (2) of the combined segment is lower than the threshold max_error; the combination stops once the error would exceed the threshold.
The fitting results of some time-series data are shown in Figure 1. A group of data was selected from the 50 Words dataset, in which the length of the data was 270. Figure 1A shows the original data, and the remaining panels of Figure 1 show the fitting results under different values of max_error. It can be seen from these figures that as max_error increases, the number of fitting segments decreases, and the fitted curves change from smooth to rough. In terms of the feature representation, as max_error increases, the number of segments decreases and the dimension reduction rate increases, which lowers the accuracy of the feature representation. The pseudocode for bottom-up time-series least-squares fitting is given in Algorithm 1.
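The bottom-up procedure can be sketched as follows. This is a simple, unoptimized illustration rather than the authors' Algorithm 1: it assumes initial segments of two points, merges the cheapest adjacent pair first, and recomputes all merge costs in every round. The names `fit_error` and `bottom_up_segments` are hypothetical.

```python
def fit_error(y):
    """Sum of squared residuals of the least-squares line through (i, y[i])."""
    n = len(y)
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(y))
    slope = sxy / sxx if sxx else 0.0
    return sum((v - y_mean - slope * (i - x_mean)) ** 2
               for i, v in enumerate(y))

def bottom_up_segments(series, max_error):
    """Return a list of (start, end) index pairs, end exclusive."""
    n = len(series)
    segs = [(i, min(i + 2, n)) for i in range(0, n, 2)]
    if len(segs) > 1 and segs[-1][1] - segs[-1][0] == 1:
        # Fold a trailing single point into the previous segment.
        last = segs.pop()
        segs[-1] = (segs[-1][0], last[1])
    while len(segs) > 1:
        # Cost of merging each pair of adjacent segments.
        costs = [fit_error(series[segs[k][0]:segs[k + 1][1]])
                 for k in range(len(segs) - 1)]
        k = min(range(len(costs)), key=costs.__getitem__)
        if costs[k] >= max_error:
            break  # every possible merge would exceed the threshold
        segs[k:k + 2] = [(segs[k][0], segs[k + 1][1])]
    return segs
```

For the series `[0, 1, 2, 3, 3, 2, 1, 0]` and `max_error = 0.1`, the rising and falling halves are each fit exactly by one line, so the sketch returns the two segments `(0, 4)` and `(4, 8)`.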

Feature Representation
To reduce the dimensions and eliminate the influence of noise, least-squares fitting is used to linearly represent the time-series segments. Here, each segment is further characterized by its mean value E, variance V, and slope S. Thus, the triple (E, V, S) is constructed to represent a time series segment. The triple matrix of time series x is

$$T_x = \begin{bmatrix} E_1 & E_2 & \cdots & E_K \\ V_1 & V_2 & \cdots & V_K \\ S_1 & S_2 & \cdots & S_K \end{bmatrix}$$

where K is the number of segments in the time series fitting. Therefore, the features of univariate time series data can be represented by three variables.
However, a problem may occur when different line segments have the same mean and slope values. As shown in Figure 2A, there are three parallel segments, l1, l2, and l3, with lengths of 5, 10, and 15, respectively. Their mean values and slopes are the same, while their variances are different. If the triple is used to calculate the distance between l1 and l2, the mean and slope contribute nothing to the distance, so the triple does not properly reflect the difference between the segments, which lies in their lengths. To reflect the features of the lines more accurately, the most feasible approach is to divide the segments into equal intervals so that each triple carries the same weight. Figure 2B shows a time series represented by three black segments after least-squares fitting, with lengths of 10, 5, and 8. With the interval distance d set to 5, the results of equal-interval segmentation are the red segments. The first segment is divided into two equal parts, the second is not divided, and the third is divided into two equal parts. After equal-interval segmentation, the mean value, variance, and slope all change, so the fitting must be calculated again. It can be seen from Figure 2B that the time lengths of the red, divided and refitted segments are almost the same, and so are their weights.
Using the 50 Words dataset with max_error set to 0.1, the least-squares fitting results are shown in Figure 1B. It can be seen from the figure that the time lengths of the segments differ, ranging from 4 to 24. With an interval distance d of 5, the results of equal-interval segmentation are shown in Figure 3. The segment time lengths are all approximately 5, with little difference in value, and compared with Figure 1B, the entire series is smoother. It cannot be guaranteed that all segments have exactly the same length after segmentation. For instance, the length of the third segment after segmentation is 4 in Figure 2B. However, the approximate homogeneity of the segments can be guaranteed. After equal-interval segmentation, the time series is represented as a matrix of triples.
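The equal-interval segmentation and the triple representation described above can be sketched as follows. Several details are assumptions made for illustration: each segment is split into a number of parts obtained by rounding its length divided by d (half rounds up), the variance is the population variance of the raw values in the segment, and the slope comes from a least-squares line over the sample indices. The names `split_equal`, `triple`, and `triple_matrix` are hypothetical.

```python
def split_equal(segs, d):
    """Split each (start, end) segment into pieces of length close to d (d >= 1)."""
    out = []
    for start, end in segs:
        length = end - start
        parts = max(1, int(length / d + 0.5))  # assumption: round half up
        bounds = [start + length * p // parts for p in range(parts + 1)]
        out.extend(zip(bounds[:-1], bounds[1:]))
    return out

def triple(y):
    """(mean, variance, slope) of one segment given as a list of values."""
    n = len(y)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n
    x_mean = (n - 1) / 2
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (v - mean) for i, v in enumerate(y))
    slope = sxy / sxx if sxx else 0.0
    return (mean, var, slope)

def triple_matrix(series, segs):
    """List of triples, one per segment; each triple is a column of T_x."""
    return [triple(series[start:end]) for start, end in segs]
```

With segments of lengths 10, 5, and 8 and d = 5, as in the Figure 2B example, `split_equal([(0, 10), (10, 15), (15, 23)], 5)` yields pieces of lengths 5, 5, 5, 4, and 4.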

Metric Learning
DTW is often used to calculate the univariable time series distance. In [28], DTW was extended to a multivariable time series, and the Euclidean distance was used for the local distance. The Euclidean distance considers each variable without considering the relationship between variables and is affected by noise and irregularity. In [19], DTW based on the Mahalanobis distance was used to calculate the multivariable distance for the first time. The Mahalanobis distance assigns different weights for different variables, and the relationships between variables are considered.

Calculate Multivariate Local Distance
As described above, the features of the time series are represented as a matrix of triples (Ek, Vk, Sk). In a triple matrix, each point is a vector. Therefore, the local distance between two triples is the distance between two vectors. The basic structure is shown in Figure 4, where Tx and Ty are two matrices of triples, the middle part of Figure 4 is the optimal path of DTW, and the local distance is calculated with the Mahalanobis matrix. In this paper, the Mahalanobis distance based on metric learning is used as the local distance, and the distance between multivariate sequences is calculated by combining it with DTW.
Figure 4. The optimal path of the DTW and the local Mahalanobis distance.
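As shown in (4), if there are triple matrices Tx and Ty, the local distance is calculated as:

$$D_M(T_x^{i}, T_y^{j}) = \sqrt{(T_x^{i} - T_y^{j})^T M (T_x^{i} - T_y^{j})} \qquad (11)$$

where $T_x^{i}$ and $T_y^{j}$ are the ith and jth columns of the matrices Tx and Ty, respectively. Combining (8) and (11) gives

$$R_M(i, j) = D_M(T_x^{i}, T_y^{j}) + \min\{R_M(i-1, j-1),\ R_M(i-1, j),\ R_M(i, j-1)\} \qquad (12)$$

where i ranges up to m, the number of columns of Tx, and j ranges up to n, the number of columns of Ty. The difference from the standard DTW recursion is that the Mahalanobis distance is used as the local distance instead of the Euclidean distance. Thus, DTW(Tx, Ty) is equal to $R_M(m, n)$.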
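The combination of DTW with a Mahalanobis local distance can be sketched as follows. The sketch assumes that each triple matrix is stored as a list of (E, V, S) triples, one per column, and that the local distance takes the square-root form; the names `local_distance` and `dtw_mahalanobis` are hypothetical.

```python
import math

def local_distance(u, v, M):
    """Mahalanobis distance between two triples u and v under matrix M."""
    diff = [a - b for a, b in zip(u, v)]
    size = len(diff)
    quad = sum(diff[r] * M[r][c] * diff[c]
               for r in range(size) for c in range(size))
    return math.sqrt(max(quad, 0.0))

def dtw_mahalanobis(Tx, Ty, M):
    """DTW distance between two lists of triples with Mahalanobis local cost."""
    m, n = len(Tx), len(Ty)
    inf = float("inf")
    R = [[inf] * (n + 1) for _ in range(m + 1)]
    R[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = local_distance(Tx[i - 1], Ty[j - 1], M)
            R[i][j] = d + min(R[i - 1][j - 1], R[i - 1][j], R[i][j - 1])
    return R[m][n]
```

Because only the local distance changes, the two series may still contain different numbers of segments; a diagonal M simply reweights the mean, variance, and slope, while off-diagonal entries account for relationships between them.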

Learning A Mahalanobis Matrix
In (12), a good Mahalanobis matrix M can accurately reflect the multivariate measurement in a given space [17]. To obtain a better Mahalanobis matrix, PGDM was selected in this paper. However, PGDM can learn from unordered data but cannot directly process time-series data. To learn a "good" Mahalanobis matrix, PGDM and DTW were combined into a learning algorithm for time-series data.
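First, DTW is a dynamic programming process, which makes the loss function nondifferentiable. Therefore, metric learning should transform the DTW into general paths. An optimal path w = {w1, w2, …, wK}, where $w_k = (w_x(k), w_y(k))$, is found with the DTW method, and the extracted general path is:

$$w_x = \{w_x(1), w_x(2), \ldots, w_x(K)\}, \qquad w_y = \{w_y(1), w_y(2), \ldots, w_y(K)\} \qquad (13)$$

Based on this path, the DTW distance is transformed into the general path distance:

$$D_w(T_x, T_y) = \sum_{k=1}^{K} D_M\big(T_x^{w_x(k)}, T_y^{w_y(k)}\big) \qquad (14)$$

where $T_x^{i}$ denotes the ith column of Tx. Then, the PGDM loss function is updated by (6) and (14), with $\mathcal{S}$ the set of intraclass pairs and $\mathcal{D}$ the set of interclass pairs:

$$g(M) = \sum_{(T_x, T_y) \in \mathcal{S}} D_w(T_x, T_y)^2 - \log\Big( \sum_{(T_x, T_y) \in \mathcal{D}} D_w(T_x, T_y) \Big) \qquad (15)$$

Combining (14) and (15) gives:

$$g(M) = \sum_{(T_x, T_y) \in \mathcal{S}} \Big( \sum_{k=1}^{K} D_M\big(T_x^{w_x(k)}, T_y^{w_y(k)}\big) \Big)^2 - \log\Big( \sum_{(T_x, T_y) \in \mathcal{D}} \sum_{k=1}^{K} D_M\big(T_x^{w_x(k)}, T_y^{w_y(k)}\big) \Big) \qquad (16)$$

Finally, for fixed paths, the transformed loss function is differentiable and can be optimized with Newton's method or the conjugate gradient method. In [18], a greedy strategy was proposed that treats the minimization as an iterative two-step optimization process. In this algorithm, first, with the Mahalanobis matrix M fixed, the optimal path between each pair of multivariate series is sought. Then, with the paths fixed, a gradient method is used to minimize the loss function. Theoretically, this method can ensure convergence, but not convergence to the global optimum, because the loss function is nonconvex. In practice, even though it may only reach a local optimum, the classification performance is usually good.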
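The first step of the two-step scheme, extracting the warping path for a fixed local distance and then evaluating the distance along that fixed path, can be sketched as follows. The sketch covers only path extraction and the path distance, not the gradient update of M; ties during backtracking are broken in favor of the diagonal step, and the names `dtw_path` and `path_distance` are hypothetical. For triple matrices, `local` would be the Mahalanobis local distance under the current M.

```python
def dtw_path(a, b, local):
    """Return (distance, path) for non-empty sequences a and b.

    path is a list of 0-based index pairs (i, j) from (0, 0) to
    (len(a) - 1, len(b) - 1).
    """
    m, n = len(a), len(b)
    inf = float("inf")
    R = [[inf] * (n + 1) for _ in range(m + 1)]
    R[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            R[i][j] = local(a[i - 1], b[j - 1]) + min(
                R[i - 1][j - 1], R[i - 1][j], R[i][j - 1])
    path, i, j = [], m, n
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        # Step to the cheapest predecessor; the diagonal wins ties.
        _, i, j = min((R[i - 1][j - 1], i - 1, j - 1),
                      (R[i - 1][j], i - 1, j),
                      (R[i][j - 1], i, j - 1),
                      key=lambda t: t[0])
    path.reverse()
    return R[m][n], path

def path_distance(a, b, path, local):
    """Sum of local distances along a fixed path (no minimization involved)."""
    return sum(local(a[i], b[j]) for i, j in path)
```

Once the path is fixed, `path_distance` is a plain sum of local distances, which is what makes the loss differentiable with respect to M between two path updates.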
The time cost of ML-UTSC includes two parts. The first part is data preprocessing and triple matrix generation. The second is the optimization with PGDM. In the data preprocessing step, the bottom-up least-squares fitting strategy is adopted with time complexity O(ln), where n is the average length of the time series and l is the number of segments. In the PGDM optimization, the approximate complexity is O(n^2), where n is the average length of the time series. Therefore, the time complexity of the ML-UTSC algorithm is O(n^2).

Experimental Confirmation
To verify the validity of ML-UTSC, time-series datasets were selected from the UCR Time Series Classification Archive, which can be found at http://www.cs.ucr.edu/~eamonn/time_series_data/, to compare the error rate, dimensionality reduction, and time efficiency of the algorithm under different parameters. All the tests in this paper were performed in MATLAB 2016a on the same computer with an Intel Core i5-4590 CPU at 3.3 GHz, 8 GB of memory, and Windows 10.

Data Set
A total of 20 representative time-series datasets were selected from the UCR Time Series repository, as shown in Table 1, which lists the dataset name, number of categories, training set size, test set size, series length, and type of time series. The number of categories ranged from 2 to 50, the training set size from 24 to 1000, the test set size from 30 to 900, and the time series length from 60 to 6174. The dataset types included synthetic, real (recorded from some processes), and shape (extracted by processing some shapes) [23].

Comparison Methods and Parameter Setting
To verify the effectiveness of ML-UTSC, three different methods were selected for comparison, namely the Euclidean distance (EUC), SAX_TD, and DTW. Because ML-UTSC compresses the data, SAX_TD, a similar compression-based method, was selected for comparison. SAX_TD accounts for the trend information and achieves higher classification accuracy than SAX. DTW is a classic elastic measure that can compare unequal-length time series with high scalability and accuracy. The experiments in [16] show that DTW is still one of the most accurate methods for time series classification. In addition, to verify the effect of equidistant segmentation on the classification error rate of ML-UTSC, ML-UTSC without equidistant segmentation was included and is denoted ML-UTSC-B. In the ML-UTSC algorithm, the least-squares fitting threshold is max_error, and the equidistant segmentation threshold is d. In ML-UTSC-B, only max_error is needed.
To obtain better accuracy for SAX_TD and ML-UTSC, we tested different parameter settings and recorded the highest accuracy and the corresponding parameters. For a given time series of length n, SAX_TD takes the argument w from 2 to n/2, doubling it each time, and the argument α from 3 to 10 [23]. ML-UTSC takes max_error values of 0.1, 0.5, 1, 1.5, and 2 and d values of 5, 10, 15, 20, and 25. The dimensionality reduction rate is equal to the number of reduced data points divided by the number of source data points. In the experimental analysis, it was found that these parameters were able to meet the dimensionality reduction range criteria.
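Since the reported error rates come from nearest neighbor classification, the evaluation step can be sketched as follows. The sketch assumes 1-NN and a pluggable distance function; the names `one_nn_error`, `train`, `test`, and `dist` are hypothetical, and any of the compared measures could be passed as `dist`.

```python
def one_nn_error(train, test, dist):
    """Fraction of test items misclassified by 1-nearest-neighbor.

    train, test: lists of (series, label) pairs
    dist:        function (series_a, series_b) -> nonnegative number
    """
    errors = 0
    for series, label in test:
        # Label of the closest training series under the chosen distance.
        _, predicted = min(((dist(series, s), l) for s, l in train),
                           key=lambda t: t[0])
        errors += predicted != label
    return errors / len(test)
```

A parameter search such as the one described above would call this function once per (max_error, d) setting and keep the setting with the lowest error rate.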

Classification Results Analysis
The results of the five methods on the 20 datasets are listed in Table 2. The parameters used to obtain the reported values are given in parentheses for SAX_TD and ML-UTSC. The minimum error rate in each row is shown in bold. Across the 20 datasets, ML-UTSC obtained 12 of the minimum values, DTW five, and SAX_TD two. When several methods share the same minimum value, none is shown in bold; for instance, four methods obtain the minimum value on the 19th dataset. Comparing the number of minimum values shows that ML-UTSC has a lower error rate on most of the datasets, and that its error rate stays close to the minimum even where the minimum is not obtained: on the other eight datasets, the average error rate of ML-UTSC was only 0.07 higher than the lowest average error rate. From the error rates of ML-UTSC and ML-UTSC-B, it can be clearly seen that the error rate of ML-UTSC with equidistant segmentation was lower than that without it. However, Table 2 also shows that, among the six datasets with a length less than 150 (Synthetic Control, ECG, CBF, Face (all), Two Patterns, and Swedish Leaf), ML-UTSC was not competitive on the first five, and on Swedish Leaf its result was not significantly different from those of the other three algorithms. That is, the Mahalanobis matrix learned from shorter sequences was insufficient to reflect the internal relations of the new feature space, which is a deficiency of ML-UTSC.
To further verify the test results, ML-UTSC and the other methods were compared by a sign test, in which a smaller p-value indicates a more significant difference. In Table 3, n+, n−, and n0 denote the numbers of datasets on which the error rate of ML-UTSC is lower than, higher than, or equal to that of the other method, respectively.
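The sign test described above can be sketched as follows. The sketch assumes a two-sided exact binomial test in which ties (n0) are discarded; whether the original analysis used a one-sided or two-sided test is not stated, and the function name `sign_test_p` is hypothetical.

```python
from math import comb

def sign_test_p(n_plus, n_minus):
    """Two-sided exact sign test p-value; ties are assumed to be dropped."""
    n = n_plus + n_minus
    if n == 0:
        return 1.0  # no informative comparisons
    k = min(n_plus, n_minus)
    # Probability of a split at least this uneven under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, five wins and no losses give a p-value of 2 × (1/32) = 0.0625 under these assumptions.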
The p-values in Table 3 indicate that the advantage of ML-UTSC is particularly significant when compared with EUC, significant when compared with SAX_TD, and moderately significant when compared with DTW. The minimum error rates of ML-UTSC in Table 2 were obtained with different parameters. To test the effect of the parameters max_error and d on the classification error rate, three datasets, Face (four), Lightning-2, and Fish, were selected. The test results are shown in Figure 5.
To test the effect of d, as shown in Figure 5A, max_error was fixed at 0.5, d took the values 5, 10, 15, 20, and 25, and the vertical axis shows the classification error rate. As d increased, the error rates on the three datasets also increased slowly, which indicates that a finer segmentation lowers the error rate. To test the effect of max_error, as reported in Figure 5B, d was fixed at 10, max_error took the values 0.1, 0.5, 1, 1.5, and 2, and the vertical axis again shows the classification error rate. As max_error increased, the error rates on the three datasets slowly increased; however, when max_error exceeded 1, the overall error rates began to decrease. According to the analysis of Figure 5, when max_error was small, the fitted segments were already very short, so d had no effect and the classification error rate was low. As max_error increased, the error rate also increased. Once the segment length reached a certain level, equidistant segmentation reduced the segment length and thereby the error rate.
The scatter diagram is an effective visualization method for comparing error rates. In Figure 6, four scatter plots are shown, in which the values on the axes are the error rates of the two methods being compared. The diagonal divides each plot into two regions, one for each method. The region with more points indicates the method that achieved lower error rates on most of the datasets. In addition, the farther a point is from the diagonal, the larger the difference. Figure 6A compares EUC and ML-UTSC. Most of the blue rectangular points are in the ML-UTSC region, and there are only two red points in the EUC region. In addition, most of the blue rectangular points are far from the diagonal, which indicates that ML-UTSC is much better than EUC on most datasets. In Figure 6B,C, there are few red points, and many blue rectangular points lie close to the diagonal in Figure 6C, which indicates that ML-UTSC is better than SAX_TD and DTW, and that DTW is the closest to ML-UTSC. In Figure 6D, there is no point in the ML-UTSC-B region, which indicates that ML-UTSC with segmentation has lower error rates on all datasets.

Dimension Reduction and Time Efficiency
Of the five test methods, SAX_TD, ML-UTSC-B, and ML-UTSC reduce the dimensionality of the data. In SAX_TD, if the number of segments is w, the dimensionality reduction rate is (2w + 1)/n. Data in ML-UTSC-B are reduced by least-squares fitting, with a reduction rate related to the threshold max_error: the smaller max_error is, the lower the reduction rate. For instance, when max_error was 0.1, the reduction rate was generally 1/5 of the dataset, and when max_error was 1, the reduction rate was generally 1/15 of the dataset. Because ML-UTSC applies equidistant segmentation on top of ML-UTSC-B, its dimensionality reduction rate is determined by max_error and d together. Generally, if max_error is small, d has less influence on the reduction rate. If max_error is larger, the fitted segments are longer, and d has a greater influence on the reduction rate. In Table 2, the max_error values in ML-UTSC-B and ML-UTSC are the same, which makes the comparison clearer. Figure 7 compares the dimensionality reductions of SAX_TD, ML-UTSC-B, and ML-UTSC at the settings that gave the lowest error rates.
It can be seen from Figure 7 that the bars of SAX_TD mostly show higher reduction rates than those of the other two methods. However, the rates are higher only on certain datasets. For instance, the reduction rate on Gun-Point, CBF, Lightning-2, and Lightning-7 was only approximately 1/10 of the dataset. In contrast, the reduction rate on 50Words, Trace, and Yoga was significantly lower because the value of w was 128 when the minimum error rate was obtained on these three datasets, so there was almost no reduction. Compared with SAX_TD, the reduction rates of ML-UTSC-B and ML-UTSC were higher on most of the datasets and slightly lower on certain others. Additionally, there was not much difference between ML-UTSC-B and ML-UTSC, and each has its own advantages.
Finally, the time efficiencies of EUC, SAX_TD, DTW, and ML-UTSC were compared on the Synthetic Control, ECG, and CBF datasets. The time efficiencies were compared at the settings that gave the minimum classification error rate. The total time included the data preprocessing time and the classification time, excluding the time spent learning the Mahalanobis matrix. The time taken by the four algorithms is shown in Figure 8.

Conclusions
In this paper, we proposed a method that combines statistics and metric learning to measure time series similarity. First, the features of the univariate time series data were represented by three variables: the mean value, variance, and slope. Next, these variables were used in the metric learning of a three-dimensional Mahalanobis matrix. To obtain a more accurate measurement, the time series was divided into equal-interval segments so that the three-variable data points carried the same weights. Then, the segmented data were used in the metric learning of the Mahalanobis matrix to ensure a more precise classification. Finally, the classification accuracy, the dimension reduction rate, and the time efficiency were compared with previously reported well-performing methods, including SAX_TD and DTW. On most of the datasets, the classification error rate of our proposed method was lower than those of SAX_TD and DTW, while the reduction rate and time efficiency were higher.
The PGDM algorithm was adopted for metric learning, which transforms metric learning into a convex optimization problem with constraints. This choice lowers the time efficiency of learning the Mahalanobis matrix. In future work, other metric learning methods, such as LMNN and ITML, will be studied to improve efficiency.
Author Contributions: For this research, H.W. and N.W. designed the concept of the research; S.K. implemented experimental design; H.W. and K.S. conducted data analysis; K.S. wrote the draft paper; N.W. reviewed and edited the whole paper; N.W. acquired the funding. All authors have read and agreed to the published version of the manuscript.