Convolutional Neural Network with an Elastic Matching Mechanism for Time Series Classification

Abstract: Recently, some researchers have adopted the convolutional neural network (CNN) for time series classification (TSC) and have achieved better performance than most hand-crafted methods on the University of California, Riverside (UCR) archive. The secret to the success of the CNN is weight sharing, which makes it robust to the global translation of the time series. However, global translation invariance is not the only property required for TSC. Temporal distortion is another common phenomenon besides global translation in time series. The scale and phase changes caused by temporal distortion bring significant challenges to TSC, which are beyond the scope of conventional CNNs. In this paper, a CNN architecture with an elastic matching mechanism, named the Elastic Matching CNN (EM-CNN), is proposed to address this challenge. Compared with the conventional CNN, the EM-CNN allows local time shifting between the time series and convolutional kernels, and a matching matrix is exploited to learn the nonlinear alignment between them. Several EM-CNN models based on diverse CNN backbones are proposed in this paper. The results on 85 UCR datasets demonstrate that the elastic matching mechanism effectively improves CNN performance.


Introduction
Time series classification (TSC) is an important research topic in the data mining community [1]. It has a wide range of applications, including human activity recognition [2], speech analysis [3], electrocardiogram (ECG) monitoring [4], and biological research [5].
Deep learning is a subfield of machine learning concerned with deep architectures whose parameters are learned from data. Many deep learning architectures exist for TSC. Compared with other classical architectures, such as the multilayer perceptron and the recurrent neural network (RNN), the convolutional neural network (CNN) has become one of the most prevalent architectures for TSC in recent years [6]. However, the CNN architecture is sensitive to temporal distortion [7], such as differences in rates and local translation within a pattern [8].
Many studies have been conducted on temporal distortion for TSC. One of the most representative is dynamic time warping (DTW). In conjunction with a one-nearest-neighbor (1NN) classifier, DTW achieves great success in TSC. Compared with the lock-step matching of the Euclidean distance (ED) [9], elastic matching is exploited in DTW to achieve invariance to temporal distortion. However, DTW is a global distance measure that discards the matching information [8]. In addition, DTW can match two series that have dissimilar local structures [10].
Inspired by the elastic matching in DTW, an elastic matching mechanism combined with the CNN, called the Elastic Matching CNN (EM-CNN), is proposed in this paper. Instead of the lock-step alignment between the time series and convolutional kernels used in the CNN, a matching matrix is used in the EM-CNN to adaptively learn the alignment between them. The EM-CNN is an architecture that learns the matching relationship and the convolutional kernels simultaneously. The primary contributions of this paper are summarized as follows:
• An elastic matching mechanism is proposed to measure the similarity between the time series and convolutional kernels. This mechanism can be extended to different CNN-based architectures.
• Experiments performed on 85 University of California, Riverside (UCR) time series datasets [11] demonstrate that the proposed mechanism improves the performance of the CNN on classification tasks.
The remainder of this paper is organized as follows. This paper briefly reviews the related work in Section 2. In Section 3, an elastic matching mechanism is proposed to learn the matching relationship between the time series and convolutional kernels. Next, the experiments are performed on 85 UCR datasets, and the results are analyzed in Section 4. Additional discussion is presented in Section 5. Finally, a conclusion is provided in Section 6.

Dynamic Time Warping
Dynamic time warping is a point-to-point matching method to measure the similarity between two different time series. In general, DTW allows a time series to be "stretched" or "compressed" to provide a better match with another time series [12]. Finding a better match in DTW is equivalent to finding an optimal path in the warping matrix with certain restrictions and rules. A dynamic programming algorithm is used to obtain the cumulative distance of the optimal path. A smaller cumulative distance results in a higher similarity between two time series.
The point-to-point matching in DTW depends on the value differences between two points. A point in one series may be mapped to a distant point or to multiple points of the other series, which can lead to misclassification, especially in applications such as image retrieval [13]. Constraint techniques, such as the Sakoe-Chiba band [14] and the Itakura parallelogram [15], are introduced into DTW to reduce the matching space. Weighted DTW [12] considers phase differences in addition to value differences to penalize distant points, which are probably outliers. Derivative DTW [16] and shapeDTW [10] encode the local neighborhood information, rather than the value at a point, to measure the similarity between two points.

Dynamic Time Warping with the Convolutional Neural Network
The artificial neural network (ANN) has been known for its powerful feature extraction capability over the last decades. Recently, ANNs such as the RNN and CNN have been used to learn supervised [17] or unsupervised representations [18] for time series analysis. The RNN is well known for time series forecasting [17] owing to its sequential learning. Improvements have been proposed to reduce inference time [19] and to predict sudden time-series changes [20]. Although the RNN is also exploited in TSC, the CNN achieves better performance in supervised learning on the UCR archive [21]. CNNs, such as the fully convolutional network (FCN) and the residual network (ResNet) [22], have established strong baselines for TSC. Some attempts have been made to combine DTW and the CNN to overcome the brittleness of the conventional CNN to temporal distortions. These attempts fall roughly into two categories. In the first category, DTW is used as a preprocessing method to transform the raw time series, and the transformed series are then used as inputs to the CNN. In [8], a multimodal fusion CNN (MMF-CNN) is employed to predict a label for the multidimensional time series, which is composed of coordinate features and local distance features extracted by measuring the DTW similarity between the original time series and prototypes. The second category directly incorporates DTW into the CNN and trains an end-to-end classification framework. In [23], DTW is used to determine a more optimal alignment between convolutional kernels and time series. DTWNet [7] replaces the inner-product kernel with a DTW kernel to counter the Doppler effect and improve the feature extraction capability.

Elastic Matching in Dynamic Time Warping
Elastic matching in DTW is first reviewed to better demonstrate the proposed mechanism. Consider two different time series $X = (x_1, x_2, \ldots, x_i, \ldots, x_n)^T$ and $W = (w_1, w_2, \ldots, w_j, \ldots, w_m)^T$. A dynamic programming algorithm composed of Equations (1) and (2) is used to decide which points should be matched:

$$\mathrm{DTW}(X, W) = c(n, m), \quad (1)$$

where $c(i, j)$ is the cumulative distance:

$$c(i, j) = (x_i - w_j)^2 + \min\{c(i-1, j), c(i, j-1), c(i-1, j-1)\}. \quad (2)$$

Using DTW, the second point in $X$ could match the third and fourth points in $W$ (red rhombuses in Figure 1). Compared with the lock-step matching used in the ED (blue circles in Figure 1), the matching relationship in DTW is data-dependent and elastic.

Figure 1. Example of elastic matching in dynamic time warping (DTW); the yellow rectangles are the beginning and end of the two paths; blue circles and red rhombuses represent the optimal paths in Euclidean distance and DTW, respectively.
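To make the recursion concrete, the following is a minimal NumPy sketch of Equations (1) and (2); the function name and the toy series are illustrative, not from the paper.

```python
import numpy as np

def dtw_distance(x, w):
    """DTW distance: the cumulative cost c(n, m) of Equations (1) and (2)."""
    n, m = len(x), len(w)
    c = np.full((n + 1, m + 1), np.inf)
    c[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c[i, j] = (x[i - 1] - w[j - 1]) ** 2 + min(   # Equation (2)
                c[i - 1, j], c[i, j - 1], c[i - 1, j - 1])
    return c[n, m]                                        # Equation (1)

# Usage: a phase-shifted copy of a pattern is matched with zero cost.
x = np.array([0., 1., 2., 1., 0., 0., 0.])
w = np.array([0., 0., 1., 2., 1., 0., 0.])
print(dtw_distance(x, w))  # 0.0: the elastic alignment absorbs the shift
```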

Elastic Matching in the Convolutional Neural Network
The CNN extracts features from the time series by measuring the local similarity between the time series $X = (x_1, x_2, \ldots, x_i, \ldots, x_n)^T$ and a convolutional kernel $W = (w_1, w_2, \ldots, w_j, \ldots, w_m)^T$. In general, the similarity measure adopted in the CNN is the inner product. Considering the definition of the inner product, its matching mechanism is similar to that of the ED: a point of one series only matches the point of the other series at the same position. Hence, the inner product is inappropriate for measuring similarity under temporal distortion. An elastic matching mechanism is incorporated into the inner product to better model the matching relationship between the time series and convolutional kernels. The elastic matching mechanism allows the kernel points to construct relationships with points at different positions of the time series. The similarity at the $i$th location is defined by Equation (3):

$$s_i = W^T M X_{i:i+m-1}, \quad (3)$$

where $X_{i:i+m-1} = (x_i, x_{i+1}, \ldots, x_{i+m-1})^T$ is the window of the time series starting at position $i$, $m$ is the length of the convolutional kernel, and $M$ is an $m \times m$ matching matrix. When $M$ is an identity matrix, Equation (3) degenerates to the inner product. The convolutional layer combined with the elastic matching mechanism is called the matching convolutional (MConv) layer. The structure of the MConv layer is presented in Figure 2. A fully-connected (FC) layer is used to learn the matching relationship between the series and kernels, and the weights of the FC layer in Figure 2 correspond to the matching matrix $M$ in Equation (3).
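Since $W^T M X = (M^T W)^T X$, the matching matrix can be folded into the kernel and a standard convolution applied afterwards. The following PyTorch sketch illustrates one possible realization of the MConv layer under the initialization choices described later ($W$ Xavier, $M$ identity); the class and parameter names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MConv1d(nn.Module):
    """Matching convolution (Equation (3)): each window of the input is
    compared with the kernel through a learnable matching matrix M."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_channels, in_channels, kernel_size))
        nn.init.xavier_uniform_(self.weight)           # W_0: Xavier initialization
        self.M = nn.Parameter(torch.eye(kernel_size))  # M_0: identity = plain convolution
        self.padding = kernel_size // 2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W^T M X = (M^T W)^T X: fold M into the kernel along its last axis,
        # then apply an ordinary convolution.
        w = torch.einsum('oik,kj->oij', self.weight, self.M)
        return F.conv1d(x, w, padding=self.padding)

# Usage: input is (batch, channels, length).
layer = MConv1d(1, 8, kernel_size=9)
y = layer(torch.randn(16, 1, 128))  # -> (16, 8, 128)
```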
Moreover, Equation (3) can be considered an extension of DTW; the proof is as follows. Without loss of generality, the proof is based on the example in Figure 1. The time series and the convolutional kernel have the same length within a sliding window of the CNN, and the red rhombuses represent the optimal path in Figure 1. Along this path, the series $X = (x_1, x_2, \ldots, x_7)^T$ and kernel $W = (w_1, w_2, \ldots, w_7)^T$ are transformed to $\bar{X}$ and $\bar{W}$, respectively, by repeating each point once for every pair of the path it belongs to, as shown in Equation (4):

$$\bar{X} = (x_{p_1}, x_{p_2}, \ldots, x_{p_L})^T, \quad \bar{W} = (w_{q_1}, w_{q_2}, \ldots, w_{q_L})^T, \quad (4)$$

where $(p_k, q_k)$ denotes the $k$th pair on the optimal warping path of length $L$. If a dot product is used to measure the similarity between two points, the DTW similarity between $X$ and $W$ is equivalent to the inner product between $\bar{X}$ and $\bar{W}$, as presented in Equation (5):

$$\mathrm{DTW}(X, W) = \bar{X}^T \bar{W} = \sum_{k=1}^{L} x_{p_k} w_{q_k}. \quad (5)$$

Equation (5) can be further expressed as a matrix multiplication, as indicated in Equation (6):

$$\bar{X}^T \bar{W} = W^T M X, \quad (6)$$

where $M$ is a binary matrix that satisfies the conditions shown in Equation (7):

$$M_{ji} = \begin{cases} 1, & x_i \text{ and } w_j \text{ are matched on the optimal warping path,} \\ 0, & \text{otherwise.} \end{cases} \quad (7)$$

Comparing Equations (3) and (7), DTW is a special case of the proposed matching mechanism.

EM-CNN

Compared with the EM-FCN and EM-ResNet, EM-Inception (Figure 5) is based on Inception [24], which extracts features in a multiscale manner. Inception is composed of two bottlenecks, one GAP layer, and one FC layer. Each bottleneck has three basic Inception modules. Multiple parallel convolutional operators of different kernel sizes, in conjunction with a max-pooling operator, are performed in each module, as shown in Figure 6. As in the EM-ResNet, a shortcut connection is used between the consecutive bottlenecks, and the MConv layers only take the place of the convolutional layers in the residual branches.

The MConv layer is the core of the EM-CNN. The matching matrix $M$ in the MConv layer is learned by backpropagation. Similar to the derivation in [25,26], the details for calculating the gradients needed for the backpropagation algorithm are as follows. As Figure 2 illustrates, assuming that the response at each location of the MConv layer is $y$, the optimized objective function $J(W, M)$ is as follows:

$$\min_{W, M} J(W, M), \quad \text{with } y = W^T M X. \quad (8)$$

Equation (8) becomes the objective function of the CNN if $M$ is an identity matrix. The gradient descent update for the CNN is easy to calculate:

$$W_{t+1} = W_t - \eta \frac{\partial J}{\partial y} \frac{\partial y}{\partial W_t} = W_t - \eta \frac{\partial J}{\partial y} X. \quad (9)$$

Similar to the derivation of Equation (9), letting $\hat{W} = W^T M$ and using the chain rule, the gradient descent updates for the EM-CNN can be calculated as shown in Equation (10):

$$W_{t+1} = W_t - \eta \frac{\partial J}{\partial y} M_t X, \qquad M_{t+1} = M_t - \eta \frac{\partial J}{\partial y} W_t X^T, \quad (10)$$

where $W_0$ is initialized using the Xavier method, and $M_0$ is initialized with an identity matrix.
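The updates in Equation (10) rest on the partial derivatives $\partial y / \partial W = M X$ and $\partial y / \partial M = W X^T$ of a single response $y = W^T M X$. A quick autograd check (a sketch with illustrative names, not the authors' code) confirms them:

```python
import torch

m = 5
W = torch.randn(m, requires_grad=True)
M = torch.eye(m, requires_grad=True)
X = torch.randn(m)

y = W @ (M @ X)  # one response of the MConv layer: y = W^T M X
y.backward()

# Analytic gradients used in Equation (10):
print(torch.allclose(W.grad, M.detach() @ X))              # dy/dW = M X
print(torch.allclose(M.grad, torch.outer(W.detach(), X)))  # dy/dM = W X^T
```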

Experiments
In this section, experiments are performed on the UCR archive to validate the effectiveness of the elastic matching mechanism.

Hyperparameter Settings
The EM-FCN, EM-ResNet, and EM-Inception were tested on the 85 'bake-off' datasets of the UCR archive. The default train/test split was used, as in [22,24], to train the models and evaluate the performance. The matching matrix $M$ varied per layer (shared by all kernels within a layer) and was initialized as an identity matrix. The Adam optimizer was used to train the EM-FCN (2000 epochs), EM-ResNet (1500 epochs), and EM-Inception (1500 epochs) with an initial learning rate of 0.001, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 1 \times 10^{-8}$. As in [21], the model corresponding to the minimum training loss was used to evaluate the generalization of the architecture over the testing sets.
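The setup can be summarized in a few lines of PyTorch. In the sketch below, the model and data are stand-ins (the actual EM-FCN/EM-ResNet/EM-Inception architectures and the UCR loading code are omitted); only the optimizer settings, epoch counts, and model-selection rule come from the text above.

```python
import torch
import torch.nn as nn

# Stand-in model and dummy batch for illustration only.
model = nn.Sequential(nn.Conv1d(1, 8, 8, padding=4), nn.ReLU(),
                      nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(8, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8)  # settings from the text
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(16, 1, 128), torch.randint(0, 2, (16,))

best_loss = float('inf')
for epoch in range(2000):            # 2000 epochs for EM-FCN; 1500 for the others
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    if loss.item() < best_loss:      # keep the minimum-training-loss model, as in [21]
        best_loss = loss.item()
        torch.save(model.state_dict(), 'best.pt')
```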

Metrics
The evaluation metrics used to compare the performance of different methods are the accuracy on each dataset, the number of wins (Win), the average arithmetic ranking (AVG-AR), the average geometric ranking (AVG-GR), and the mean per-class error (MPCE). The definition of the MPCE is presented in Equation (11):

$$\mathrm{MPCE}_i = \frac{1}{K} \sum_{k=1}^{K} \frac{e_k}{c_k}, \quad (11)$$

where $k$ refers to each dataset, $i$ represents each method, $K$ is the number of datasets, and $c_k$ and $e_k$ are the number of categories and the error rate for the $k$th dataset, respectively. The critical difference defined by Equation (12) is also used to compare different methods statistically over multiple datasets [27]:

$$\mathrm{CD} = q_{\alpha} \sqrt{\frac{N_c (N_c + 1)}{6K}}, \quad (12)$$

where the critical value $q_{\alpha}$ is the Studentized range statistic divided by $\sqrt{2}$, and $N_c$ is the number of methods. The value of $\alpha$ is set to 0.05 in the experiments. A critical difference diagram is used to visualize this comparison, where a cluster of methods (a clique) connected by a thick horizontal line is not significantly different in terms of accuracy [24].
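As an illustration of Equation (12), the snippet below computes the critical difference for six methods (as in the first experiment) over the 85 datasets; the $q_{\alpha}$ value is the standard table entry for six classifiers at $\alpha = 0.05$ from [27].

```python
import math

q_alpha = 2.850   # Studentized range / sqrt(2), alpha = 0.05, N_c = 6 (Demsar's table)
N_c, K = 6, 85    # number of methods and number of datasets

cd = q_alpha * math.sqrt(N_c * (N_c + 1) / (6 * K))  # Equation (12)
print(f"CD = {cd:.3f}")  # methods whose average ranks differ by less are in one clique
```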

Evaluation on the UCR Archive
The first experiment compares the EM-FCN, EM-ResNet, and EM-Inception with the FCN, ResNet, and Inception to demonstrate the effectiveness of the elastic matching mechanism for the CNN. Table 1 and Figure 7 indicate that the CNN architectures with the elastic matching mechanism exhibit better performance than those with lock-step matching. Compared with the other methods, EM-Inception obtains the best rank on all the metrics. The second experiment validates that the elastic matching mechanism is suitable for addressing temporal distortion. The methods surveyed in this experiment consist of the following: DTW [28], shapeDTW [10], and DTW features (DTW-F) [29]; edit-distance-based measures, namely the longest common subsequence (LCSS) distance [30], edit distance with real penalty (ERP) [31], time warp edit (TWE) distance [32], and Move-Split-Merge (MSM) [33]; ensembles of elastic distance measures (EE) [34]; the Hierarchical Vote Collective of Transformation-based Ensembles (HIVE-COTE) [35]; time warping invariant echo state networks (TWIESN) [36]; MMF-CNN [8]; and EM-Inception. The results in Table 2 and Figure 8 indicate that EM-Inception achieves a performance comparable with HIVE-COTE (the state-of-the-art method on the UCR archive). Moreover, HIVE-COTE is an ensemble method based on 35 different classifiers, including DTW-1NN, MSM-1NN, and others, and has a robust ability to address temporal distortion; hence, the experimental results also reflect the effectiveness of the proposed mechanism. Compared with methods such as MMF-CNN and shapeDTW, the superiority of EM-Inception demonstrates that an end-to-end learning architecture with an elastic matching mechanism is preferable.

Effects of the Different Numbers of Layers
The models EM-FCN(2) and EM-FCN(1) are generated from the EM-FCN to analyze the effects of the number of layers. Specifically, EM-FCN(2) removes the third basic module of the EM-FCN, and EM-FCN(1) removes both the second and third modules. The same technique is used to generate FCN(2) and FCN(1) from the FCN. As illustrated in Figure 9, regardless of the number of layers, the architectures based on the EM-FCN are superior to the corresponding architectures based on the FCN. Moreover, the performance difference between EM-FCN(2) and the FCN is small, which indicates that a deep CNN architecture could be replaced by a shallower architecture based on the elastic matching mechanism to mitigate overfitting on small datasets.

Effects of the Different Kernel Sizes
The kernel size of EM-FCN(1) is 8, which is relatively small for large-scale patterns. In this experiment, the kernels are enlarged from 8 to 20 and 40 to generate EM-FCN(1,20) and EM-FCN(1,40), respectively, which have larger receptive fields. Moreover, FCN(1,20) and FCN(1,40) are generated from FCN(1) in the same way. As depicted in Figure 10, even when the kernel size is 40, EM-FCN(1,40) still improves on the performance of FCN(1,40). The results demonstrate that the elastic matching mechanism strengthens the feature extraction capability of CNN architectures at multiple scales.

Effects of the Different Kernel Initialization
In this section, an experiment is performed to compare the EM-CNN and the kernel-varying EM-CNN (KEM-CNN). In the KEM-CNN, the matching matrix $M$ is initialized and learned independently for each kernel. Accordingly, KEM-FCN, KEM-ResNet, and KEM-Inception are the EM-FCN, EM-ResNet, and EM-Inception models with a separate matching matrix $M$ for each kernel, respectively. The comparison between the EM-CNN and KEM-CNN on the 85 UCR datasets is as follows.
Intuitively, the KEM-CNN should outperform the EM-CNN due to its larger modeling capacity. However, Figure 11a-c indicates that, no matter which backbone is used, the EM-CNN wins on more datasets than the KEM-CNN on the UCR archive. Furthermore, a comparison based on the MPCE (a lower value indicates better performance) shows the same result in Figure 11d. These results indicate that the KEM-CNN is more prone to overfitting on the UCR archive.

Computational Complexity
Compared with the conventional CNN, the extra parameters added in the EM-CNN come from the matching matrix $M$. The number of parameters in a matrix $M$ is determined by the corresponding kernel size $S_l$ (an $S_l \times S_l$ matrix). Moreover, the number of matrices $M$ is proportional to the number of kernels $N_l$ used in each of the $L$ layers of the network. Hence, the overall number of parameters $N_p$ added in the EM-CNN is as presented in Equation (13):

$$N_p = \sum_{l=1}^{L} N_l S_l^2. \quad (13)$$

Compared with the number of parameters learned in the convolutional layers, given in Equation (14),

$$N_{\mathrm{conv}} = \sum_{l=1}^{L} N_l S_l, \quad (14)$$

the parameters added by the matching matrices amount to at least $\min_l\{S_l\}$ times $N_{\mathrm{conv}}$. Thus, the EM-CNN could overfit on the UCR archive. Therefore, the matching matrix $M$ of the EM-CNN is shared within each layer in the experiments. The experimental comparison between the EM-CNN and KEM-CNN in the previous section confirms the necessity of this choice.
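Equations (13) and (14) can be checked with a few lines of Python. The layer configuration below (128, 256, and 128 kernels of sizes 8, 5, and 3) is a typical FCN setup used here only as an example, not a specification from the paper.

```python
# Extra parameters from per-kernel matching matrices (Equation (13)) versus
# the convolutional weights (Equation (14)); channels are ignored, as in the text.
layers = [(128, 8), (256, 5), (128, 3)]   # (N_l, S_l) per layer

n_p = sum(n * s * s for n, s in layers)   # Equation (13): one S_l x S_l matrix per kernel
n_conv = sum(n * s for n, s in layers)    # Equation (14): kernel weights
print(n_p, n_conv, n_p / n_conv)          # ratio >= min_l S_l = 3 here
```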

Discussion
From the results shown in Table 1 and Figure 7, the performance of the EM-FCN, EM-ResNet, and EM-Inception is better than that of the FCN, ResNet, and Inception, respectively. Nevertheless, it should be noted that the EM-CNN is not better than the corresponding CNN on all the datasets, and the base architecture is important to the performance. Combining the elastic matching mechanism with the CNN is most helpful on the 'motion' datasets, such as 'InlineSkate' and 'UwaveGestureLibraryAll', because it is common for different people to perform the same movement over different durations. Moreover, as shown in Figure 9, the improvement from the elastic matching mechanism decreases as the number of layers increases. The reason is that temporal distortion is adjusted layer by layer. In the limiting case, if the temporal distortion disappears at some layer, the EM-CNN is expected to degenerate to the CNN, and the performance improvement also disappears.
In addition, the EM-CNN is a static model because the matching matrix is fixed after training is completed; hence, it is not an optimal solution in theory. A possible solution is to train an auxiliary network to adjust the matching matrix according to the input. Furthermore, despite the results shown in Figure 11, the KEM-CNN has, in theory, a larger capacity to model the nonlinear relationship between the time series and convolutional kernels, so it is meaningful to apply the KEM-CNN to large-scale datasets.

Conclusions
In this paper, an elastic matching mechanism was proposed to learn the matching relationship between the time series and convolutional kernels. Experiments on the EM-FCN, EM-ResNet, and EM-Inception show that this elastic matching mechanism is appropriate for assisting the CNN in modeling the nonlinear alignment between the time series and convolutional kernels. As presented in the discussion, the elastic matching mechanism is also beneficial to CNNs with different numbers of layers and convolutional kernel sizes. Compared with the conventional CNN, the extra computational complexity of the elastic matching mechanism is small, which keeps the mechanism flexible. In future work, we will consider combining dynamic filters with the elastic matching mechanism and applying it to more complex tasks, such as multivariate time series classification and clustering.