Time-Series Classification based on Fusion Features of Sequence and Visualization

Abstract: For the task of time-series classification (TSC), some methods directly classify raw time-series (TS) data. However, certain sequence features are not evident in the time domain, whereas the human brain can extract visual features from visualizations to classify data. Therefore, some researchers have converted TS data to image data and used image processing methods for TSC. While human perception consists of a combination of senses from different aspects, existing methods use only sequence features or only visualization features. Therefore, this paper proposes a framework for TSC based on fusion features (TSC-FF) of sequence features extracted from raw TS and visualization features extracted from Area Graphs converted from TS. Deep learning methods have proven to be useful tools for automatically learning features from data; therefore, we use long short-term memory with an attention mechanism (LSTM-A) to learn sequence features and a convolutional neural network with an attention mechanism (CNN-A) for visualization features, in order to imitate the human brain. In addition, we use the simplest visualization method, the Area Graph, for visualization feature extraction, avoiding loss of information and additional computational cost. This article aims to show that using deep neural networks to learn features from different aspects and fusing them can replace complex, artificially constructed features, as well as remove the bias due to manually designed features, thereby avoiding the limitations of domain knowledge. Experiments on several open data sets show that the framework achieves promising results compared with other methods.


Introduction
Time-series (TS) data is a set of values sequentially ordered in time, seen frequently in real life, such as financial data, trajectory data, weather data, and so on. With the development and application of the Internet of Things (IoT), the data collected by various sensors is also TS data. Research on TS data is diverse, covering compression, storage and query, anomaly detection, prediction, and so on [1]. This paper focuses on the time-series classification (TSC) task, the purpose of which is to classify concrete TS data into pre-determined categories that share similar characteristics.
For the task of TSC, our predecessors have done a lot of research and produced many methods. These methods are mainly based on distance functions, such as dynamic time warping (DTW) [2]; features, such as the shapelet transform (ST) [3]; and ensemble methods, such as the hierarchical vote collective of transformation-based ensembles (HIVE-COTE) [4], which combine the former two types of methods.
The main contributions of this paper are summarized as follows:
• The framework imitates the mechanism of the human brain, whose cognition is a combination of the body's multiple senses. Thus, we use a DNN to learn features from different aspects to enhance the feature space. Then, based on the fused features, we carry out the TSC task and obtain promising results.
• The framework uses well-trained LSTM-A and CNN-A to extract sequence features and visualization features, combining an attention mechanism to extract the key features that contribute to classification. In particular, a novel category trend attention (CT-Attention) is computed from data belonging to the same category.
• The framework transforms TS data into Area Graphs. Compared with existing visualization methods (such as RP), this conversion is simpler and avoids both the loss of information in complex conversions and additional computational cost.
The rest of this paper is organized as follows: In Section 2, we present the related work. Section 3 describes the proposed method in detail, including the structure of the network and how to calculate the attention to find out the contributing region in the raw data for the specific labels. Section 4 presents the evaluation results. Finally, we present conclusions, discussions, and suggestions for future research in Section 5.

Related Work on TSC
There are a variety of methods for the task of TSC. These methods can be divided into four main categories: distance-based methods, feature-based methods, ensemble methods, and deep learning methods.
Distance-based methods. This type of method uses a variety of distance functions [16] to measure the similarity between TS records for classification. The most widely used distance function methods are DTW and its variants, such as the one nearest neighbor (1NN) classifier with DTW (1NN-DTW) [2]. In [17], the authors proposed a framework named Proximity Forest (PF), which uses Proximity Trees with 11 distance measures for the TSC task. The main drawback of this kind of method is the huge computational cost involved.
Feature-based methods. The basis of this type of method is a variety of features learned from TS data, through which we can distinguish differences between records and classify them. The methods in this class include ST [18,19], the bag of symbolic Fourier approximation (SFA) symbols (BOSS) [20], time-series forest (TSF) [21], and TS classification based on a bag-of-features representation (TSBF) [22]. Word ExtrAction for time SEries cLassification (WEASEL) [23] uses a novel discriminative feature generation and feature selection method based on bag-of-patterns (BOP). The Shapelet Transform Classification (STC) uses a novel way to find shapelets and increases accuracy for multi-class problems [24]. The Random Interval Spectral Ensemble (RISE) [25] combines a tree structure with multiple features for TSC. Some methods input hand-engineered features, built using domain knowledge, into a DNN discriminative classifier. The disadvantage of these methods lies in the complexity and weak generality of building features, which obviously limits their versatility. Besides hand-engineered features, some methods use a DNN to extract the features of TS for classification: in [26,27], the authors added a deconvolutional operation to a convolutional neural network (CNN)-based model to reconstruct a multivariate time series. Deep Belief Networks (DBNs) [28] and Recurrent Neural Network auto-encoders [29] have also been used to model latent features in an unsupervised manner. Other studies [30,31] have used self-predicting modeling to ensure the effectiveness of feature learning. Inspired by computer vision, some scholars have converted TS data into image data and used image processing methods to extract features for TSC. Typical image transform methods include the Gramian Angular Field (GAF) [8,9], RP [10,11], Markov Transition Fields (MTF) [12], and so on. However, all of these methods use only a single kind of feature.
Ensemble methods. This kind of method combines several effective methods, in order to obtain the most appropriate classification results from the results obtained through different mechanisms. It includes three typical methods: the Elastic Ensemble (EE) [32], the flat collective of transformation-based ensembles (Flat-COTE) [33], and HIVE-COTE [4]. Among them, EE integrates 13 classification methods based on distance measurement; Flat-COTE includes 35 classification methods, adding several feature-based classifiers; HIVE-COTE builds on Flat-COTE, improving the mechanism for obtaining the final classification result from each sub-classification result. The Time Series Combination of Heterogeneous and Integrated Embedding Forest (TS-CHIEF) [34] rivals HIVE-COTE in accuracy but is faster. The Random Convolutional Kernel Transform (ROCKET) [35] claims to achieve optimal performance with less computational cost. At the cost of a slight decrease in accuracy, Canonical Time-series Characteristics (Catch22) [36] generates a diverse and interpretable feature set with a greatly reduced number of features according to the properties of the TS data. The basis of these methods is still distance-based or feature-based; although the drawbacks of both types of methods are alleviated by ensembling, their flaws still exist. In addition, as a variety of classification models and features are ensembled, the model is very complex, which limits its practical application.
Deep learning methods. A variety of deep learning models have been proposed for TSC. This type of method trains a DNN model to form a mapping from data to categories. These models are discriminative: they input raw TS data and output a probability distribution over the class variables in a data set. The models include Multi-scale CNN (MVCNN) [37], fully convolutional networks (FCN) [27], and deep residual networks (ResNet) [38], as well as many hybrid models such as attention-based LSTM-CNNs [39], multivariate LSTM-FCNs [40], and LSTM-FCNs [41]. It is worth noting that InceptionTime [42] claims higher accuracy than HIVE-COTE along with faster training.

Related Work on Feature Extraction through DNN from Different Aspects
Deep learning has made amazing progress in many fields. However, the impact of changing network structure on classification accuracy has been getting smaller and smaller, so researchers have begun to focus on expanding data sets, as the scale of existing common data sets is inadequate relative to the current level of deep learning development. However, expanding data sets is not a simple task: manual labeling is generally needed, and human participation introduces errors. These errors inevitably become factors that affect training. Another approach is to use more effective features; however, artificial features can only achieve good results in specific tasks.
A DNN imitates the mechanisms of the human brain [43] and is able to automatically learn the characteristics of data without heavy pre-processing. However, a human's understanding of things is often a combination of multiple senses; therefore, some scholars have used DNNs to extract features from different perspectives for certain tasks. In [13], the authors trained two VAEs to extract visual and semantic features of images, respectively, for generalized zero- and few-shot learning. In [14], the authors used a pre-trained saliency model to segment the foreground and background, and trained two feature extractors (one for foreground features, the other for background features) to improve image classification. In the field of NLP, the authors of [6] used historical Chinese scripts to enrich the pictographic evidence in characters and designed CNN structures tailored to Chinese character image processing; the proposed glyph-based models gained outstanding results in multiple Chinese NLP tasks. In the field of TS data mining, TS have been converted to bar images and a CNN was used for TS prediction, the results of which were also promising [7].

Model Design
The proposed framework consists of two parts, LSTM-A and CNN-A (as shown in Figure 1). LSTM-A is pre-trained with the loss function AUX_1 on raw TS data, and CNN-A is pre-trained with the loss function AUX_2 on Area Graphs converted from the raw TS data. After LSTM-A and CNN-A are well-trained, raw test TS data are input to LSTM-A to extract sequence features, and the corresponding Area Graphs are input to CNN-A to extract visualization features. Fusion features are obtained through a fully-connected layer concatenating the sequence features and visualization features. The final classification is realized by a softmax layer. An attention mechanism is used to make the model focus on the key sub-sequences and sub-regions which contain more discriminative information for TSC.
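The fusion step described above (concatenation followed by a fully-connected layer and softmax) can be sketched in numpy as follows. The feature sizes, the `fuse_and_classify` helper, and the random weights are illustrative assumptions, not the paper's trained parameters:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def fuse_and_classify(f_seq, f_vis, W, b):
    """Concatenate sequence and visualization features, then apply a
    fully-connected layer followed by softmax (illustrative sketch)."""
    fused = np.concatenate([f_seq, f_vis])   # fusion by concatenation
    logits = W @ fused + b                   # fully-connected layer
    return softmax(logits)                   # class probabilities

# Toy example: 64-dimensional features from each sub-model, 3 classes.
rng = np.random.default_rng(0)
f_seq, f_vis = rng.normal(size=64), rng.normal(size=64)
W, b = rng.normal(size=(3, 128)) * 0.1, np.zeros(3)
p = fuse_and_classify(f_seq, f_vis, W, b)
```

In the actual framework, the fully-connected layer is trained jointly with the softmax classifier on top of the frozen pre-trained sub-models.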

Data Notation
We use D = {(x_i, y_i)}, i = 1, ..., N, to represent a time-series database, where N represents the number of records, x_i = (x_1, x_2, ..., x_m) represents the i-th record, x_j represents the j-th ordered observation value of the record over the whole time series of length m, and y_i ∈ {1, 2, ..., C} represents the category of the i-th record, where C ∈ Z+ is the number of classes. Generally speaking, the goal of TSC is to learn a mapping function from x_i to y_i, as shown in the following formula:

y_i = f(x_i),

where f(·) is the mapping function we aim to learn. The following two sections describe how to extract sequence features and visualization features through LSTM-A and CNN-A, respectively.

LSTM-A
An RNN is a type of neural network used to process sequence data. There are many variants of RNNs, some representative ones being LSTM, the Gated Recurrent Unit (GRU) [44], the Bi-directional RNN [45], and so on. Greff et al. [46] compared popular RNN variants, showing that they have almost the same performance and that LSTM is superior to its simplified variants (such as GRU). In [47], the authors tested more than 10,000 RNN structures and found that, in certain tasks or situations, some RNN variants work better than LSTM, but only in special cases. Comparing GRU and LSTM: on the one hand, GRU has fewer parameters, so it trains faster and requires less data to generalize; on the other hand, given enough training data, the greater expressive power of LSTM may produce better results than GRU. In addition, in the field of TSC research, many studies have chosen LSTM [39][40][41] for TSC and achieved promising results. We therefore chose LSTM to learn the temporal dependencies of TS data. However, the temporal dependencies of long input sequences cannot be reasonably learned by LSTM alone, so we added an attention mechanism to learn these long-term dependencies [48]. Finally, we used the LSTM-A model (Figure 2) to learn sequence features. Given a time-series data set D, the LSTM-A model processes the records as follows:

z = tanh(W ⊗ [h_{t-1}, x_t])
z_i = σ(W_i ⊗ [h_{t-1}, x_t])
z_f = σ(W_f ⊗ [h_{t-1}, x_t])
z_o = σ(W_o ⊗ [h_{t-1}, x_t])
c_t = (z_f ⊗ c_{t-1}) ⊕ (z_i ⊗ z)
h_t = z_o ⊗ tanh(c_t)

where h_{t-1} and c_{t-1} are the outputs of the previous LSTM cell; x_t is the current input; [·, ·] stands for the splicing operation; z_i, z_f, and z_o are the input, forget, and output gates obtained with different parameters, respectively; ⊗ and ⊕ represent matrix multiplication and matrix addition, respectively; and h_t and c_t are the outputs of the current LSTM cell.
The T-Attention block is used to enhance the performance of sequence feature learning for very long input records. Its calculation is as follows:

a_t = softmax(w_t tanh(w_h h_t))
o_t = a_t ⊗ h_t

where h_t is the output of the t-th hidden unit of the last LSTM layer, a_t is the t-th attention weight, w_h and w_t are weight matrices, a_t and h_t are merged, and ⊗ represents the matrix merge operation. Then, we use a fully-connected layer to transform the result and obtain the sequence features F_s. To prevent overfitting and vanishing gradients, we apply Dropout and Batch Normalization after the fully-connected layer. Finally, a softmax layer is used to obtain the classification results, where P_s is the output prediction probability sequence of LSTM-A. In the pre-training phase of LSTM-A, we use cross-entropy as the loss function AUX_1:

AUX_1 = -Σ_{i=1}^{C} y_i log(p_s,i),

where Y is the true probability sequence, and p_s,i and y_i are the predicted probability and true probability of the record belonging to category i, respectively.
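As a rough illustration of attention re-weighting over LSTM hidden states, the following numpy sketch uses one common additive form: a tanh projection of each hidden state is scored, the scores are softmax-normalized into attention weights, and each state is re-weighted. The exact parameterization of T-Attention may differ from this; the `t_attention` helper and all dimensions here are assumptions:

```python
import numpy as np

def t_attention(H, w_h, w_t):
    """Additive attention over LSTM hidden states H with shape (T, d):
    scores from a tanh projection, softmax-normalized weights a_t,
    then each h_t re-weighted by its attention weight (sketch)."""
    scores = np.tanh(H @ w_h) @ w_t      # one scalar score per time step
    a = np.exp(scores - scores.max())
    a = a / a.sum()                      # attention weights a_t, sum to 1
    return a[:, None] * H                # merge a_t with h_t

# Toy example: 10 time steps, 8 hidden units.
rng = np.random.default_rng(1)
H = rng.normal(size=(10, 8))
out = t_attention(H, rng.normal(size=(8, 8)), rng.normal(size=8))
```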

CNN-A
TS data is sequential, and it is difficult for human beings to classify it from ordered numbers alone; vision, however, is an important sense for human beings, and by visualizing TS data, humans can classify it easily. Deep learning is a technology that imitates the human brain, so we consider visualizing time-series data and extracting features that imitate human vision for TSC. In addition, through data visualization, human vision notices the parts with the largest differences and uses them to classify. This is similar to the attention mechanism in deep learning. Therefore, we added an attention mechanism to the CNN to extract key visualization features.
We used CNN-A to imitate human vision (the network structure is shown in Figure 3). For univariate TS data D, we convert each TS record x_i into a black-and-white Area Graph (examples shown in Figure 4a,b) as input. In order to avoid the influence of unnecessary information such as co-ordinates, we removed the co-ordinates and other information from the Area Graphs, retaining only the graph part showing the data fluctuations. Differing from existing visualization transformation methods (e.g., Gramian fields [8,9], RP [10,11], and MTF [12]), the Area Graph is close to human vision, directly reflecting information such as the fluctuation of TS data. Most importantly, this transformation is very simple and requires no additional calculation, avoiding the loss of information in the calculation process. Visualization features can then be extracted from such image data through the CNN-A model.
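A minimal, dependency-free sketch of such a conversion: the series is resampled to the image width, rescaled to the image height, and the area under the curve is filled in a binary pixel grid with no axes or decoration. The paper presumably renders Area Graphs with a plotting library; this rasterizer and its `area_graph` helper are illustrative assumptions:

```python
import numpy as np

def area_graph(ts, h=30, w=30):
    """Rasterize a time series as a black-and-white Area Graph:
    a binary h-by-w image with the area under the curve filled,
    with no co-ordinates or other chart decoration (sketch)."""
    ts = np.asarray(ts, dtype=float)
    # Resample to w columns, then rescale values into [0, 1].
    cols = np.interp(np.linspace(0, len(ts) - 1, w), np.arange(len(ts)), ts)
    lo, hi = cols.min(), cols.max()
    cols = (cols - lo) / (hi - lo) if hi > lo else np.zeros(w)
    img = np.zeros((h, w), dtype=np.uint8)
    for x, v in enumerate(cols):
        top = h - 1 - int(round(v * (h - 1)))   # pixel row of the curve
        img[top:, x] = 1                        # fill the area below it
    return img
```

The 30-by-30 default matches the smaller of the two pixel sizes considered in the experiments.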
In human vision, information such as peak value and mean value is more likely to attract attention. TS data belonging to the same category have similar peaks and averages after visualization, while data of different categories have large differences in such information. Therefore, we used the CT-Attention block to extract these features and improve TSC. This block extracts the mean and maximum information of a category through a pixel average pooling layer applied to the Area Graphs of TS records belonging to the same category; the output is the CT-Attention. The extracted CT-Attention is combined separately with the features (average and max features) extracted from each time-series record and then passed through a global average pooling layer, a Dropout layer, and a Batch Normalization layer to obtain the final visualization features. Finally, classification is carried out through the softmax layer.
The process of learning visualization features by CNN-A can be expressed by Equation (4):

I = trans(x)
C = f(W * I ⊕ B)
H = P(C)
CT(x, y) = Avg{ I(x, y) : y_I = c }
F_v = GAP(H ⊗ CT),    (4)

where trans means converting the TS record x to an Area Graph I; C denotes the result of a convolution (where * indicates the standard dot product) applied to every I; W and B are filter parameters; and f represents the combination of Batch Normalization and a final activation operation such as the Rectified Linear Unit (ReLU). P stands for max and average pooling, and H represents the pooling results. CT is the category-related attention obtained by average pooling on every pixel (x, y) of the Area Graphs belonging to the same category, and y_I is the category that image I belongs to (as shown in Figure 5). H and CT are merged by ⊗ and, after GAP processing, the final visualization features F_v are obtained. In the pre-training phase of CNN-A, we use cross-entropy as the loss function AUX_2, where P_v is the output prediction probability sequence of CNN-A, Y is the true probability sequence, and p_v,i and y_i are the predicted probability and true probability of the record belonging to category i, respectively.
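The per-category pixel averaging behind CT-Attention can be sketched as follows: all Area Graphs sharing a label are averaged pixel-wise, yielding one attention map per category. The `ct_attention` helper is an illustrative assumption:

```python
import numpy as np

def ct_attention(images, labels):
    """Category Trend attention (sketch): pixel-wise average of all
    Area Graphs that share a label, giving one map per category."""
    images = np.asarray(images, dtype=float)   # shape (N, h, w)
    labels = np.asarray(labels)
    return {c: images[labels == c].mean(axis=0) for c in np.unique(labels)}

# Toy example: three 2x2 binary "images" in two categories.
imgs = [[[0, 0], [1, 1]],
        [[1, 1], [1, 1]],
        [[0, 0], [0, 0]]]
maps = ct_attention(imgs, [0, 0, 1])
```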

Experiment
In this section, we analyze the framework from different aspects. The TS data used in our experiments was the UCR collection [15], which is the largest public TS data classification archive. The data sets in UCR involve multiple different domains and consist of clusters of different sizes, shapes, and densities. We used 112 of the data sets and excluded 15 data sets with unequal length and one (Fungi) which had a single record per category in the training data set. Through experiments, we first compared the performance of the integrated model with the selected baseline model and evaluated the factors that affected the performance of the model. Secondly, we evaluated two sub-models, including the convergence status and the effectiveness of the learned features. We also evaluated whether the attention mechanism helped to learn features better. Our source code has been uploaded to https://github.com/wangbaoquan520/TSC-FF.

Experiment Setting and Model Configuration
All experiments were conducted using Python (Keras 2.2.4 [49] with Tensorflow 1.12.0 [50] as the backend). The Python packages used for data loading, visualization, and pre-processing included Numpy 1.16. Table 1 shows the hyper-parameters of LSTM-A and CNN-A. ReLU was selected as the activation function of the activation layer. As the basis of the proposed model is the features extracted by the pre-trained LSTM-A and CNN-A from TS data, the training of the two sub-models had a great impact on feature extraction. Therefore, we adopted multiple methods to improve the training of the two sub-models. First, we trained the two sub-models using the categorical cross-entropy loss with the Adam optimizer [51]. The initial learning rate was 1e-3, reduced by a factor of 0.5 down to a final learning rate of 1e-4 after every 50/100 epochs (50 for epoch sizes smaller than 100, otherwise 100) of no improvement in the validation score. Second, we used early stopping to prevent the model from overfitting: when the change in validation accuracy was less than 0.0003% over 50 epochs, training was stopped. Finally, we replaced the fully-connected layer before the softmax layer in the CNN-A model with a global average pooling layer, which greatly reduced the number of parameters. In addition, we did not perform additional pre-processing on the data (e.g., regularization), as the UCR data sets have already been z-normalized. The batch and epoch sizes were chosen from {5, 20, 100, 200} and {50, 250, 1000, 2000}, respectively. The pixel size of the Area Graphs converted from TS data was chosen from {30 × 30, 150 × 150}.
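The plateau schedule described above can be sketched as a small helper; in practice this would be implemented with framework callbacks such as Keras's ReduceLROnPlateau, and the function name and its triggering logic here are assumptions:

```python
def next_learning_rate(lr, epochs_without_improvement,
                       patience=50, factor=0.5, floor=1e-4):
    """Plateau schedule (sketch): halve the learning rate after
    `patience` epochs without validation improvement, never dropping
    below the floor of 1e-4. Patience would be 50 or 100 depending on
    the epoch size, per the experiment setting."""
    if epochs_without_improvement >= patience:
        return max(lr * factor, floor)
    return lr
```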
In the field of TSC research, much work based on LSTM, CNN [27,52], and hybrid LSTM-CNN [39][40][41] has been carried out, achieving promising results. The architectures of these models are used for reference.
In the experiments, we also tried models with different numbers of LSTM or CNN layers. We found that reducing the number of layers damaged the performance of the model, while increasing the number of layers also damaged the performance while increasing the model training time.
For the detailed structure and parameters of each method, please refer to their respective references. All results were obtained by taking the average value over multiple experiments. All models were evaluated using classification accuracy and mean-per-class-error (MPCE), which is defined as the average error of each class over all data sets and is mathematically represented in Equation (5):

MPCE = (1/K) Σ_{k=1}^{K} e_k / c_k,    (5)

where K is the number of data sets, e_k is the error rate on the k-th data set, and c_k is its number of classes. The Average Arithmetic Rank (AVG Rank) is the mean of the classification accuracy ranking over the 112 data sets. Table 2 lists the comparison results. In order to facilitate the display, the names of some methods have been abbreviated again, including 1NN_DTW (labelled DTW), WEASEL (WS), HIVE-COTE (HCT), TS-CHIEF (CHI), ROCKET (RK), Catch22 (C2), ResNet (RN), and InceptionTime (IcT).
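A sketch of the MPCE computation, assuming the common definition of per-class error as a data set's error rate divided by its class count:

```python
def mpce(error_rates, class_counts):
    """Mean per-class error (sketch of Equation (5)): for each data set,
    divide its error rate by its number of classes, then average these
    per-class errors over all data sets."""
    pce = [e / c for e, c in zip(error_rates, class_counts)]
    return sum(pce) / len(pce)

# Toy example: two data sets with error rates 0.2 (2 classes) and
# 0.0 (4 classes) give per-class errors 0.1 and 0.0.
score = mpce([0.2, 0.0], [2, 4])
```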
Compared with the baseline methods on the 112 data sets in UCR, the proposed method TSC-FF won or tied on 20 data sets. Judging from the AVG Rank and MPCE, its overall ranking was fifth, after HIVE-COTE, TS-CHIEF, ROCKET, and InceptionTime. By comparison, we found that TSC-FF was far better than the six feature-based methods. Therefore, even though the deep learning-based method TSC-FF proposed in this paper did not achieve the best performance, it showed the great potential of the deep learning methods considered for feature learning, demonstrating that a DNN can automatically complete feature extraction without requiring complicated feature engineering. With the development of DNNs and the accumulation of data, deep learning methods may achieve even better performance.
Although TSC-FF could not defeat the deep learning method InceptionTime, which achieved the lowest MPCE overall, by comparing the results on FordA, FordB, and Wafer, we found that our framework achieved higher accuracy on these data sets. This shows that, when a DNN is given enough data to learn features, using pre-trained sub-models to learn features from different aspects and then fusing the features can improve DNN performance. Of course, the amount of data and the number of categories also have an effect on DNN performance, which will be explored in later experiments. On the data set ElectricDevices, the feature-based method BOSS had the highest accuracy. The reason for this is that the artificially constructed features used by BOSS are better than those learned by the DNN; after all, there still exists a certain gap between DNNs and the human brain, and on some data sets the features learned by a DNN are not as effective as those designed by humans. However, the results of BOSS on the other data sets demonstrate the weak generality of artificially constructed features.
Among all methods, HIVE-COTE achieved the best performance. Comparing the two distance-based methods, the accuracy of PF was much improved compared to DTW; the difference between the two is that PF uses multiple distance functions. Among the feature-based methods, STC achieved the best performance; what makes STC special is that it uses innovative methods to find more effective shapelet features. Among the deep learning methods, the best performance was achieved by InceptionTime, which combines several DNNs with the same structure but different parameters; it can be considered that the multiple DNNs learn different features, which is how InceptionTime improved performance over the other deep learning methods.
The methods achieving excellent performance either combine multiple distance functions and features, or find and select more effective features, or use multiple DNNs to learn various features from limited data. The common point of these methods is using more effective features (or more features, to cover the most effective ones) to improve performance in the case of limited data. Similarly, the proposed method TSC-FF purposefully selected and used the fusion of multiple features learned from different aspects, and so achieved promising results.
In the Wilcoxon Signed-Rank Test (Table 3), the p-values of TSC-FF against 1NN_DTW, TSF, RISE, and Catch22 were smaller than 0.05, while those against the other methods were larger than 0.05, indicating that TSC-FF had almost equally good results as the best-performing methods. The reason why TSC-FF did not achieve the best overall result is that TSC-FF is a method based on deep learning. Compared with the ensemble method HIVE-COTE, TSC-FF does not suffer from the high computational complexity of employing multiple different classifiers; however, given its difference in structure compared to InceptionTime, FCN, and ResNet, the sub-models used by TSC-FF are more affected by overfitting when the training set is too small. Some data sets in UCR had a small training set and a large number of categories, such that the sub-models were easily over-fitted, resulting in insufficient extracted features and affecting the classification performance.
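A Wilcoxon signed-rank comparison of two classifiers over paired per-data-set accuracies can be run as below. The accuracy values are toy numbers, not results from the paper:

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired per-data-set accuracies for two hypothetical classifiers.
acc_a = np.array([0.91, 0.85, 0.78, 0.88, 0.93, 0.80, 0.75, 0.90])
acc_b = np.array([0.89, 0.84, 0.80, 0.85, 0.92, 0.78, 0.74, 0.88])

# The test ranks the absolute paired differences; a p-value above 0.05
# would indicate no significant difference at the 5% level.
stat, p = wilcoxon(acc_a, acc_b)
```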
We used the DNN to learn the characteristics of different classes. Whether there are enough records for each class to learn is one of the problems that must be considered. Table 4 shows the performance of all methods with different amounts of records in each category. The methods based on deep learning achieved the best performance. In particular, InceptionTime, which used multiple DNNs, achieved the highest average accuracy in most cases with different data volumes in each category. This shows that DNNs can learn more effective features than artificial features and replace the work of constructing features by human beings.
In our proposed framework, TS data is converted into Area Graphs to extract visualization features. However, some existing methods convert TS data into image data through other methods to extract features, so we also carried out a comparison with these methods. We chose two such methods: the first uses Tiled Convolutional Neural Networks (tiled CNNs) on Gramian Angular Summation/Difference Fields (GASF/GADF) and MTF to extract features for classification, abbreviated as GASF-GADF-MTF; the second uses a Support Vector Machine (SVM) on texture features extracted from RP for classification, abbreviated as TFRP. Table 5 shows the comparison results. Through the comparison, we can see that the proposed framework had better performance than the selected comparison methods. The reason for this may be that the visualization method in this paper is simple, avoiding the loss of some information in the conversion process. In addition, the combination of features from different aspects (i.e., sequence features and visualization features) makes up for the deficiencies of single-category features.
Figure 6 depicts the accuracy and loss curves of the two sub-models trained on the data set StarLightCurves. From the figure, we can see that, although the accuracy and loss fluctuated, the overall accuracy gradually increased and the loss gradually decreased, finally reaching a stable state; the results on the training set were only slightly better than those on the validation set, which shows that the model has good generalization ability.
Figure 4 shows an example of transformed Area Graphs and features from the 'BeetleFly' data set, in which the records are divided into two categories.
We randomly selected one record from the validation set of each category and show the corresponding Area Graphs (Figure 4a,b), sequence features (Figure 4c,d), and visualization features (Figure 4e,f) extracted by LSTM-A and CNN-A, respectively. Using the area in the red box in Figure 4a,b for comparison, the numbers in this area of the original records were quite different, which may be the main trend used to distinguish the record categories. Correspondingly, in the learned sequence features, the corresponding positions were a wave crest and a wave trough; that is, they were quite different. In the visualization features, most of the features at the corresponding positions were similar, as the numbers around the red box in the original records are similar; only the part of the visual features corresponding to the value at the red box position is different, which shows that CNN-A has learned the features of the similar parts, as well as the features of the different parts, of the records.
We also compared the accuracy of the pre-trained sub-models with that of the integrated model. Table 6 shows the accuracy comparison of the two sub-models and the integrated model on the 112 data sets. The bold values in the TSC-FF column denote wins or ties against the sub-models, while those in the LSTM-A and CNN-A columns denote wins or ties against LSTM and CNN, respectively. The integrated model TSC-FF achieved higher accuracy than the two sub-models on 72 data sets, showing that the combination of the two features generally improved the classification performance. By comparing the accuracy of the sub-models with and without the attention mechanism, we can see that LSTM-A achieved higher accuracy than plain LSTM on 95 data sets, and CNN-A higher than plain CNN on 103.
This indicates that, for the two sub-models, when using the attention mechanism, the model can learn more effective features. In addition, the difference in performance between CNN-A and CNN also illustrates the effectiveness of the CT-Attention method proposed in this paper.

Finally, we explored the influence of the length of TS records on the feature learning of sub-models. From Table 7, we can see that, when the record length was small (<80), the average test accuracy obtained by LSTM-A was higher than that of CNN-A and, as the record length increased, CNN-A performed better than LSTM-A. This shows that the sequence features learned by LSTM-A on short TS records were more effective than the visual features learned by CNN-A. This is because, as the record length increases, the LSTM-A may "forget" some features, while using CNN-A to learn visualization features is not affected by this. In addition, when we visualize long data, some small fluctuations will not be clearly displayed on the picture. Therefore, CNN-A can learn some more distinguishable features, such as the main trends of the data.

Conclusions and Future Work
In this paper, we used a DNN to learn features from different aspects to enhance the feature space. Specifically, we used LSTM-A to learn sequence features from raw TS data and CNN-A to learn visual features from Area Graphs converted from TS data. Then, we classified TS data based on the fused features. Through various forms of comparison and analysis, we found that the well-trained LSTM-A and CNN-A learned features that could effectively distinguish the TS data and that complement each other. With the proposed framework, we have shown that, in the task of TSC, deep learning methods can achieve performance similar to that of complex ensemble methods, and that the features extracted by deep learning are more effective and general than artificially constructed features.
We proposed the use of CNN-based methods to extract visualization features, which is feasible for univariate TS data. However, for multivariate TS data, how to perform visualization to extract visual features and how to handle the correlations between dimensions remain open problems that must be solved before the proposed framework can be extended to multivariate TS data.

Conflicts of Interest:
The authors declare no conflict of interest.