Encoding Time Series as Multi-Scale Signed Recurrence Plots for Classification Using Fully Convolutional Networks

Recent advances in time series classification (TSC) have exploited deep neural networks (DNN) to improve performance. One promising approach encodes time series as recurrence plot (RP) images in order to leverage state-of-the-art DNNs for higher accuracy. This approach has achieved impressive results and has attracted growing interest from the community. However, it remains unsolved how to handle not only the variability in the scale of distinctive regions and in the length of sequences but also the tendency confusion problem. In this paper, we tackle these problems using Multi-scale Signed Recurrence Plots (MS-RP), an improvement of RP, and propose a novel method based on MS-RP images and Fully Convolutional Networks (FCN) for TSC. The method first introduces the phase space dimension and the embedding time delay of RP to produce multi-scale RP images; then, by using an asymmetrical structure, the constructed RP images can represent very long sequences (>700 points). Next, MS-RP images are obtained by multiplying the RP images by designed sign masks in order to remove the tendency confusion. Finally, FCN is trained with MS-RP images to perform classification. Experimental results on 45 benchmark datasets demonstrate that our method improves on the state of the art in terms of classification accuracy and visualization evaluation.


Introduction
In the era of big data, the real world produces a huge number of time series worth analyzing. Among all time series analysis tasks, classification, which predicts the category labels of the sequences under investigation, is likely the most fundamental one. Due to the development and maturity of sensor technology, time series classification (TSC) problems arise across a wide range of domains, e.g., action recognition, medical diagnosis, natural language processing, machinery fault diagnosis, and electrical energy monitoring [1][2][3], and have received more and more attention.
In the literature, TSC approaches fall into three popular categories: feature-based, distance-based, and ensemble-based. Feature-based approaches extract representative features from time series, and then use a classifier to map each of them to a category [4][5][6][7][8][9][10]. Distance-based approaches measure the elastic distance between the testing and training sequences, and assign a label to each testing sequence based on distance similarity [11][12][13][14][15]. Ensemble-based methods integrate different features and multiple classifiers in one framework, thus obtaining a complementary effect and better classification accuracy [16,17]. Recent years have witnessed the great success of deep neural networks (DNN) in various domains. In particular, DNN-based methods have also been explored in TSC and show encouraging progress. Reference [18] proposes a multi-scale Convolutional Neural Network (CNN), which first preprocesses the raw data through down-sampling, smooth filtering, and slicing to perform data augmentation; then, a traditional CNN is applied. Reference [19] proposes Multilayer Perceptrons (MLP), Fully Convolutional Networks (FCN), and Residual Networks (ResNet) as baseline architectures for TSC, which are the most traditional forms of DNNs. It is worth mentioning that FCN and ResNet are regarded as the best DNN-based classifiers for TSC [3]. Reference [20] proposes a multilevel Wavelet Decomposition Network, which first decomposes a time series into subsequences of different frequencies through a fine-tuned Daubechies 4 wavelet; these subsequences are then handled with FCN and ResNet for classification. This work achieves very strong performance among existing DNN-based methods. Reference [21] augments FCN with an Attention LSTM (ALSTM) module. ALSTM captures temporal information and clearly improves the performance of FCN. This work achieves state-of-the-art accuracy among existing DNN-based methods.
To make better use of the outstanding classification power of CNN, some recent works first encode time series as images, and then transform the TSC problem into image classification. In this way, the distinctive regions of a sequence are magnified and the temporal correlations are constructed, so an improvement in accuracy can be achieved. References [22][23][24] assemble Gramian Angular Fields (GAF) and Markov Transition Fields (MTF) of sequences in multi-channel images, then train a Tiled CNN on these images for classification. Reference [25] transforms the acceleration data and angular velocity data to multi-channel GAF images respectively, and then utilizes a multi-branch residual network to fuse these images for human activity recognition. Similarly, reference [26] encodes time series as images through Recurrence Plots (RP), and then trains a CNN on RP images to perform classification. This work achieves the best performance among representative sequence-to-image methods, which has raised the interest of the community in it.
In this paper, inspired by the rich texture information provided by RP and the outstanding results of DNN in image classification [27][28][29], we incorporate RP images and the state-of-the-art DNN classifiers of TSC in one framework. RP is a widely used visualization technique for analyzing dynamical systems [30,31]. Due to the graphical nature of exposing hidden patterns and local correlation information of a sequence, RP has been introduced to TSC for representing time series as images [26,32,33]. However, several defects limit its further application in TSC. In this paper, we summarize three major challenges and provide our solutions as follows.
Firstly, sequences of different datasets vary significantly in length, and their distinctive regions are usually distributed over various scales. Existing methods deal with this problem only by adjusting the image size [26,32,33,34]. However, to avoid high computational overhead, the adjustable size is limited to a small range, which often decreases the representation ability of RP. We address this challenge by additionally introducing the phase space dimension (m) and the embedding time delay (τ) of RP, which are fixed in other methods [26,32,33,34]. Different values of m, τ, and image size are explored for different datasets in order to enrich the scales of RP images. The images with suitable scales are then selected as the input to a classifier.
Secondly, RP is not good at encoding very long sequences, especially when the length of a sequence is larger than 700. For very long sequences, the RP images are so large that they must be downsampled to a manageable size, which causes a loss of image information. We address this challenge by constructing asymmetrical RP images. Specifically, a sequence is first divided into two pieces, and each piece is encoded as an RP image. Then, because each RP image is symmetrical, one triangular half of each RP image is extracted, and the two halves are reassembled into one image. Constructing an asymmetric matrix in this way clearly alleviates the information loss caused by downsampling.
Thirdly, RP confuses the tendencies of time series. The reason is that the norm operation, which is used to calculate the distances between states in the phase space, leaves all of the pixel values in RP images non-negative. Thus, RP cannot distinguish the rising and falling trends of sequences. We address this challenge by introducing the rule of signs. Specifically, designed sign masks are calculated, which use positive and negative values to indicate the rising and falling trends of the sequences. Then, the RP images are multiplied by these masks to supplement the trend information.
We combine the aforementioned solutions to propose Multi-scale Signed Recurrence Plots (MS-RP); then FCN and ResNet, which are the state-of-the-art classifiers of TSC, are applied to classify MS-RP images. Compared with state-of-the-art time series classification algorithms, the advantages of this algorithm are mainly reflected in two aspects. First, the proposed MS-RP preserves the advantages of RP in providing temporal correlations and magnifying the distinctive regions. Moreover, compared with other image encoding algorithms, MS-RP better accommodates the variations of sequences in tendency, length, and scale. Second, the state-of-the-art deep learning classifiers, FCN and ResNet, are used to handle the transformed MS-RP images, which further improves classification performance.
Our proposed method achieves superior performance in 45 UCR (University of California, Riverside) time series classification datasets [35] and the validation experiments are provided hierarchically. Moreover, utilizing t-Distributed Stochastic Neighbor Embedding (t-SNE) [36], we visualize the spatial distribution of the latent representation learned by the networks. It clearly demonstrates that MS-RP better utilizes the advantage of DNN in extracting features.

Approaches
The proposed approach in this paper consists of two stages. In the first stage, we improve RP comprehensively into MS-RP, and encode time series as MS-RP images. In the second stage, FCN and ResNet are applied to handle these images. The framework of our approach is shown in Figure 1.

Figure 1. The framework of our approach (stages: multi-scale signed RP transform, MS-RP images, choose best scale). The x-axis and y-axis of time series represent the length and the amplitude, respectively.
As shown in the figure, the input time series are first transformed into images through MS-RP. Then, the MS-RP images are produced in various scales, and the images with best scale are selected. Finally, the selected images are taken as the input of FCN and ResNet classifiers instead of the original sequences for classification.

Proposed MS-RP
In this section, MS-RP is introduced in four parts. The basic theory of RP is described in Section 2.1.1, the process of multi-scale RP is described in Section 2.1.2, Section 2.1.3 illustrates how to encode very long sequences, and Section 2.1.4 introduces the rule of signs for RP. The general overview of MS-RP is shown in Figure 2.
As shown in the figure, we separate the input sequences into two cases. For short sequences (fewer than 700 points), a sequence is first encoded as RP images in two different scales, and the sign masks are extracted and multiplied by these RP images. Then, these multi-scale signed RP images are resized into multiple sizes. Finally, based on classification performance on the validation sets, the image with the best scale is selected and taken as the input instead of the original sequence for classification. For long sequences (more than 700 points), a sequence is first divided into two pieces of equal length, and each piece is encoded as RP images just like a short sequence. Then, utilizing the symmetrical structure of RP, the two RP images corresponding to the two pieces are reassembled into one asymmetrical image. The rest of the encoding process stays consistent with that for short sequences.

Review of RP
RP is a visualization tool widely used to analyze the recurrent behaviors of time series generated by dynamical systems [30,31]. Concretely, a sequence is mapped to an m-dimensional phase space; then, the RP image of the sequence is obtained by calculating the distance matrix between the states in the phase space. RP reveals the local correlation information of a sequence through distance calculation between subsequences, and autocorrelation information is crucial to TSC [16]; thus, RP is widely used to encode sequences as images (see Figure 3). Equation (1) defines RP formally:

$$RP_{i,j} = \Theta\left(\epsilon - \lVert \vec{x}(i) - \vec{x}(j) \rVert\right), \quad \vec{x}(\cdot) \in \mathbb{R}^{m}, \quad i, j = 1, \ldots, N \qquad (1)$$

where N is the number of states, m is the phase space dimension, $\vec{x}(i)$ is the i-th state in the phase space, i.e., the subsequence observed at the i-th position of the sequence, $\lVert \cdot \rVert$ is a norm operation, $\epsilon$ is a threshold, $\Theta$ is the Heaviside function used to binarize the distance matrices, whose value is zero for a negative argument and one for a positive argument, and $RP_{i,j}$ is the pixel at position (i, j) of the RP image. Moreover, another important parameter for the generation of states is the embedding time delay τ, which can be regarded as the sampling interval of the time series. In practice, the binarization step is usually omitted in TSC to avoid the loss of texture information; thus, Equation (1) can be simplified into Equation (2):

$$RP_{i,j} = \lVert \vec{x}(i) - \vec{x}(j) \rVert, \quad \vec{x}(\cdot) \in \mathbb{R}^{m}, \quad i, j = 1, \ldots, N \qquad (2)$$

Though RP provides rich texture information [32,34] and facilitates the application of convolutional networks [26,37], the challenges mentioned in Section 1 limit its further application in TSC. In the following sections, the improvements of RP will be illustrated in detail.
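As a minimal sketch of Equations (1) and (2), the following NumPy code builds the states by time-delay embedding and then computes the pairwise distance matrix. The function names `embed` and `recurrence_plot` are illustrative and not taken from the paper; the L2-norm and the strict inequality used for the Heaviside step at a zero argument are assumptions.

```python
import numpy as np

def embed(series, m=2, tau=1):
    """Time-delay embedding: state i is (x[i], x[i+tau], ..., x[i+(m-1)*tau])."""
    x = np.asarray(series, dtype=float)
    n_states = len(x) - (m - 1) * tau
    if n_states < 1:
        raise ValueError("series too short for the chosen m and tau")
    # Row i holds the i-th state vector in the m-dimensional phase space.
    return np.stack([x[k * tau : k * tau + n_states] for k in range(m)], axis=1)

def recurrence_plot(series, m=2, tau=1, eps=None):
    """Distance matrix between states; binarized only when eps is given."""
    states = embed(series, m, tau)
    # Broadcasting yields every pairwise difference x(i) - x(j).
    diff = states[:, None, :] - states[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)  # L2-norm (assumed choice of norm)
    if eps is None:
        return dist  # unthresholded form, Equation (2)
    # Heaviside step of (eps - dist), Equation (1); the value at exactly
    # zero is a convention, here taken as 0 (assumption).
    return (dist < eps).astype(float)
```

Calling `recurrence_plot(x, m, tau)` without `eps` gives the grey-level image used for TSC, while passing `eps` reproduces the classical binary plot.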

Multi-Scale RP: An Improvement of RP
The distinctive regions of sequences appear in various scales, and the lengths of sequences vary largely. Existing methods adapt to these variabilities through adjusting the image sizes. However, considering the computing costs, these image sizes are controlled in a relatively small range, thus limiting the representation ability of RP.
The generation process of RP images is similar to the process of dilated convolution operations. The subsequences sliding over the raw data can be regarded as dilated convolution kernels, except that the inner product is replaced by a norm calculation. The lengths and sampling intervals of the subsequences correspond to the kernel sizes and dilation rates, respectively, and can be controlled by m and τ. Different values of m and τ vary the receptive fields of the sliding subsequences, so temporal correlations can be constructed in various scales.
Thus, the phase space dimension m and the embedding time delay τ of RP, which are usually kept fixed in other work, are introduced to address this challenge. For different datasets, the values of m and τ are adjusted together with the image sizes to produce multi-scale images. By selecting among the multi-scale images properly, temporal correlations can be built in suitable scales, and image sizes can better adapt to the length variability of sequences as well as the receptive field of the network.
The most commonly used values of (m, τ) are either (2, 1) or (3, 4) [26,32,33,34]. Both of them are adopted in this paper, corresponding to two different scales of RP images. The search scope of (m, τ) is kept this small because our initial motivation is to validate the significance of adjusting these two parameters, rather than to search for the best values. Figure 4 shows a triangle periodic sequence and its RP images with these two groups of m and τ. It can be seen that, even with the same image size, smaller values of (m, τ) produce a more fine-grained image, while larger values of (m, τ) produce an image that captures more global information.
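To make the two scales concrete, the sketch below encodes a triangle periodic sequence with (m, τ) = (2, 1) and (3, 4) and records how many raw samples each state spans, namely (m - 1)τ + 1, which plays the role of the receptive field in the dilated-convolution analogy. The wave period, the sequence length, and the nearest-neighbour resizing are illustrative assumptions; the paper does not state which interpolation it uses when resizing.

```python
import numpy as np

def rp_distance_matrix(x, m, tau):
    """Unthresholded RP (Equation (2)) with the L2-norm."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (m - 1) * tau
    states = np.stack([x[k * tau : k * tau + n] for k in range(m)], axis=1)
    return np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)

def resize_nearest(img, size):
    """Nearest-neighbour resize to size x size (assumed interpolation)."""
    rows = np.linspace(0, img.shape[0] - 1, size).round().astype(int)
    cols = np.linspace(0, img.shape[1] - 1, size).round().astype(int)
    return img[np.ix_(rows, cols)]

# Hypothetical triangle periodic sequence: 200 points, period 40.
t = np.arange(200)
triangle = np.abs((t % 40) - 20) / 20.0

scales = {}
for m, tau in [(2, 1), (3, 4)]:
    rp = rp_distance_matrix(triangle, m, tau)
    scales[(m, tau)] = {
        "span": (m - 1) * tau + 1,      # raw samples covered by one state
        "n_states": rp.shape[0],        # side length before resizing
        "image": resize_nearest(rp, 64),
    }
```

With (2, 1) each state covers 2 consecutive samples, whereas with (3, 4) each state covers a window of 9 samples sampled every 4 steps, so the same 64 x 64 image summarizes correlations at a coarser scale.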

Asymmetrical RP for Encoding Very Long Sequences
For very long sequences (>700 points), the sizes of RP images can be very large. On the one hand, RP images of such large sizes lead to an explosion in computation; on the other hand, if these images are resized to reasonable sizes, serious information loss results.
To address this challenge, asymmetrical RP is proposed. Figure 5 shows the process of constructing an asymmetrical RP image. As is shown, a sequence is halved into two pieces, and each piece is encoded as an image. Then, utilizing the symmetrical structure of RP, the upper triangular matrix of one image and the lower triangular matrix of the other are extracted and reassembled into one image. By constructing asymmetrical RP images, we alleviate the information loss introduced by the resizing process and overcome the information redundancy of symmetrical RP.
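The construction can be sketched as follows. Which half supplies the upper triangle and which supplies the lower one, and the handling of odd lengths (the last point is dropped here), are assumptions made for illustration; the helper names are hypothetical.

```python
import numpy as np

def rp_distance_matrix(x, m=2, tau=1):
    """Unthresholded RP (Equation (2)) with the L2-norm."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (m - 1) * tau
    states = np.stack([x[k * tau : k * tau + n] for k in range(m)], axis=1)
    return np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)

def asymmetric_rp(series, m=2, tau=1):
    """Pack the RPs of the two halves of a sequence into one square image."""
    x = np.asarray(series, dtype=float)
    half = len(x) // 2
    first, second = x[:half], x[half : 2 * half]  # equal-length pieces
    rp_first = rp_distance_matrix(first, m, tau)
    rp_second = rp_distance_matrix(second, m, tau)
    # Each RP is symmetric with a zero diagonal, so one triangle per piece
    # carries all of its information.
    return np.triu(rp_first, k=1) + np.tril(rp_second, k=-1)
```

The resulting image has half the side length of the RP of the full sequence, so less downsampling is needed to reach a given input size.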

Figure 5. The illustration of constructing an asymmetrical RP image (the sequence comes from 'Mallat' dataset). Panels: symmetrical RP images of the two pieces, and the reassembled asymmetrical RP image.

Rule of Signs
As indicated in Equation (2), a norm operation is used to calculate the distances between states in the phase space, and these distances correspond to the pixel values in the RP image. The commonly used norm operations are the L1-norm, the L2-norm, and the L∞-norm; however, no matter which norm is selected, all of the pixel values of RP images are non-negative, which leads to a serious tendency confusion problem for RP.
A simple example illustrates this problem. Let $s_1 = [1, 2, 3]$ and $s_2 = [3, 2, 1]$ be two short sequences with opposite monotonic trends. Equation (2) is used to calculate the RP matrices, where $\lVert \cdot \rVert$ is the L2-norm and (m, τ) is (2, 1). The RP matrices of $s_1$ and $s_2$ are shown in Equations (3) and (4), respectively. As is shown, the RP matrices of these two sequences are exactly the same:

$$RP_{s_1} = \begin{bmatrix} 0 & \sqrt{2} \\ \sqrt{2} & 0 \end{bmatrix} \qquad (3)$$

$$RP_{s_2} = \begin{bmatrix} 0 & \sqrt{2} \\ \sqrt{2} & 0 \end{bmatrix} \qquad (4)$$

To address this challenge, the rule of signs is introduced for RP. Firstly, a sequence is mapped to the phase space; then the subtraction and norm operations between states in the phase space are performed to obtain the state difference vectors and the RP image pixel values, respectively. Secondly, we sum each state difference vector separately; then, the signs of the sums are extracted to construct a sign mask with the same size as the RP image. Finally, the RP image is multiplied by the sign mask to obtain the signed RP image. The whole process is defined by Equation (5):

$$RP_{i,j} = \frac{\mathrm{sum}\left(\vec{x}(i) - \vec{x}(j)\right)}{\left|\mathrm{sum}\left(\vec{x}(i) - \vec{x}(j)\right)\right|} \, \lVert \vec{x}(i) - \vec{x}(j) \rVert, \quad i, j = 1, \ldots, N \qquad (5)$$

where sum is the vector summation function, $\lVert \cdot \rVert$ is the L2-norm, and $|\cdot|$ is the function calculating absolute values.

As a visual illustration, Figure 6 (left) shows two sequences with opposite tendencies. These sequences come from the 'SyntheticControl' dataset. Figure 6 (middle) shows the RP images of the two sequences, which can hardly be distinguished. Figure 6 (right) shows the signed RP images of the two sequences; these signed images reflect the trends of the sequences and can be easily distinguished.
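The rule of signs can be checked on the two sequences $s_1$ and $s_2$ with a short NumPy sketch. The function name `signed_rp` is illustrative, and treating the sign as 0 when the summed difference is exactly zero (where the ratio in Equation (5) is undefined) is an assumption of this sketch.

```python
import numpy as np

def signed_rp(series, m=2, tau=1):
    """Signed RP: distance matrix multiplied by a sign mask (Equation (5))."""
    x = np.asarray(series, dtype=float)
    n = len(x) - (m - 1) * tau
    states = np.stack([x[k * tau : k * tau + n] for k in range(m)], axis=1)
    diff = states[:, None, :] - states[None, :, :]  # x(i) - x(j)
    rp = np.linalg.norm(diff, axis=-1)               # unsigned pixel values
    # Sign of the summed state difference; np.sign returns 0 for a zero sum
    # (assumption for the undefined 0/0 case).
    sign_mask = np.sign(diff.sum(axis=-1))
    return sign_mask * rp

rising = signed_rp([1, 2, 3])
falling = signed_rp([3, 2, 1])
```

Here `np.abs(rising)` and `np.abs(falling)` coincide, which is the tendency confusion of the plain RP, while the signed matrices are negatives of each other, so the rising and falling sequences are no longer mapped to the same image.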

Classification Using FCN on MS-RP Images
In the last section, RP is modified comprehensively into MS-RP to encode time series as images. A high performance classifier should be applied for these images. Existing methods usually combine RP with k-nearest neighbor (kNN) classifiers [32,34] or traditional CNN classifiers [26,37]. However, the performances of kNN classifiers are heavily dependent on the handcrafted features; in addition, although the traditional CNN classifiers unify feature learning and classification in one framework, the pooling operation leads to serious information loss, and the fully connected layers with huge number of parameters may overfit the MS-RP images.
To address these problems, FCN and ResNet are introduced in this paper to handle MS-RP images; both are expanded to 2D versions to match the image data format. FCN and ResNet were first proposed as baseline classifiers in [19], and they are widely regarded as the state-of-the-art classifiers for TSC [3]. The architectures of these two networks are shown in Figure 7b,c. FCN is a fully convolutional network with three convolution layers; each layer is followed by a Batch Normalization (BN) layer and a Rectified Linear Unit (ReLU) activation function. FCN has no Fully Connected (FC) layers. After the convolution process, the features pass through a Global Average Pooling layer and a Softmax layer for classification. ResNet expands FCN through residual connections. It has three residual blocks, and each block has the same structure as FCN. ResNet explores a deeper architecture; it is a compromise between better representations and overfitting.
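The classification head described above (global average pooling followed by a softmax layer) can be illustrated in isolation with NumPy. This is only a sketch of the head, not of the convolutional layers; the channel count, the number of classes, the random feature maps, and the linear projection inside the softmax layer are all hypothetical choices made for illustration.

```python
import numpy as np

def classification_head(feature_maps, weights, bias):
    """feature_maps: (channels, height, width) -> class probabilities."""
    # Global average pooling collapses each feature map to one number,
    # so the output length depends only on the number of channels.
    pooled = feature_maps.mean(axis=(1, 2))
    logits = weights @ pooled + bias  # linear map of the softmax layer (assumed)
    exp = np.exp(logits - logits.max())  # subtract the max for stability
    return exp / exp.sum()

rng = np.random.default_rng(0)
n_channels, n_classes = 128, 4  # hypothetical sizes
weights = rng.normal(size=(n_classes, n_channels))
bias = np.zeros(n_classes)

# Stand-ins for the last convolutional features of a small and a large image.
probs_small = classification_head(rng.normal(size=(n_channels, 16, 16)), weights, bias)
probs_large = classification_head(rng.normal(size=(n_channels, 128, 128)), weights, bias)
```

Because pooling removes the spatial dimensions, the same head parameters apply to feature maps of any size, which is convenient when MS-RP images of different sizes are explored.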

Experimental Setup
Our proposed method and the state-of-the-art competitors are evaluated on 45 datasets of the UCR archive, which is a collection of TSC datasets from various real-world domains [35]. The competing approaches are listed as follows:
• FCN and ResNet: These two models are proposed in [19] and have been regarded as strong baselines and the best DNN-based classifiers for TSC.
• RP-CNN: [26] combines RP with a traditional CNN, which is similar to our proposed approach. We take it as the baseline for methods encoding time series as images. The RP image sizes are consistent with our approach for fairness, and (m, τ) of RP are (3, 4). The architecture of the traditional CNN is shown in Figure 7a.
• ALSTM-FCN: [21] augments FCN with an Attention LSTM (ALSTM) module. ALSTM supplements important temporal information for FCN, which clearly improves classification performance. The proposed ALSTM-FCN achieves state-of-the-art performance among DNN-based methods.
The adjustable parameters of MS-RP are m, τ, and the image size, which vary across datasets. The values of (m, τ) are selected between (2, 1) and (3, 4), and the image sizes range over (16, 48, 64, 80, 96, 112, 128). Suitable parameters are obtained according to classification performance on the validation set. We first initialize the image size (usually 64, larger or smaller according to the sequence length) and search the values of (m, τ). Then, we fix (m, τ) and search the image size over the aforementioned range. Note that the search scope of image sizes can be narrowed according to the sequence length. For the network parameter configuration, the sizes and channel numbers of the convolutional kernels in FCN are given in Table 1. ResNet is expanded from FCN; thus, the parameters of each residual block in ResNet stay consistent with FCN. The MS-RP image sizes are provided in Table 2, and the three numbers in the parentheses represent the image size, m, and τ, respectively. FCN and ResNet are trained using the categorical cross-entropy loss function and the Adam optimizer [38], with a learning rate of 5e-5.
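The two-stage search described above can be sketched as follows. The callable `validation_error` is a hypothetical stand-in for training a classifier on images generated with the given parameters and measuring the error on the validation set; `toy_error` is synthetic and exists only to make the sketch runnable.

```python
from typing import Callable, Sequence, Tuple

MT_CANDIDATES = [(2, 1), (3, 4)]
SIZE_CANDIDATES = [16, 48, 64, 80, 96, 112, 128]

def search_scale(
    validation_error: Callable[[int, int, int], float],
    init_size: int = 64,
    mt_candidates: Sequence[Tuple[int, int]] = MT_CANDIDATES,
    size_candidates: Sequence[int] = SIZE_CANDIDATES,
) -> Tuple[int, int, int]:
    """Return (image size, m, tau), the order used in Table 2."""
    # Stage 1: keep the image size fixed and choose (m, tau).
    best_m, best_tau = min(
        mt_candidates, key=lambda mt: validation_error(mt[0], mt[1], init_size)
    )
    # Stage 2: keep (m, tau) fixed and choose the image size.
    best_size = min(
        size_candidates, key=lambda s: validation_error(best_m, best_tau, s)
    )
    return best_size, best_m, best_tau

def toy_error(m: int, tau: int, size: int) -> float:
    """Synthetic validation error, for illustration only."""
    return abs(size - 96) / 100 + (0.1 if (m, tau) == (2, 1) else 0.0)

best = search_scale(toy_error)
```

This coordinate-wise search evaluates 2 + 7 configurations instead of the 14 required by a full grid, at the cost of possibly missing interactions between (m, τ) and the image size.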
The classification results of our proposed approach are the average of five repeated experiments. The performances of the competitors are directly obtained from the corresponding articles [17,19,21,26], and we supplement the missing experimental results using author-provided code. For the evaluation of our proposed approach and the competitors, Number of Wins, Average Arithmetic ranking, Average Geometric ranking, and Mean Per-Class Error (MPCE) are adopted from [19]. Then, we follow the recommendations of [39] and adopt the Friedman test to reject the null hypothesis [40]. Finally, using a Wilcoxon signed-rank test with Holm correction (α = 0.05) [41,42], we measure the significance of the differences between classifiers. A critical difference (CD) diagram [39] is drawn to visualize these comparisons intuitively.
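As a hedged sketch of the ranking-based metrics, the code below computes MPCE (the error rate of each dataset divided by its number of classes, averaged over datasets, following the definition in [19]) together with the arithmetic and geometric mean ranks. The error rates and class counts are synthetic placeholders, not results from the paper, and the tie handling (tied methods share the mean rank) is an assumption.

```python
import numpy as np

def mpce(error_rates, n_classes):
    """Mean Per-Class Error: mean over datasets of error / number of classes."""
    e = np.asarray(error_rates, dtype=float)
    c = np.asarray(n_classes, dtype=float)
    return float(np.mean(e / c))

def ranks_per_dataset(errors):
    """errors: (n_datasets, n_methods). Rank 1 is the lowest error."""
    errors = np.asarray(errors, dtype=float)
    # For method i on each dataset, count methods that are strictly better
    # and methods with the same error (including method i itself).
    less = (errors[:, None, :] < errors[:, :, None]).sum(axis=-1)
    equal = (errors[:, None, :] == errors[:, :, None]).sum(axis=-1)
    return 1.0 + less + 0.5 * (equal - 1)  # ties share the mean rank

# Synthetic error rates: 3 datasets (rows) x 3 methods (columns).
errors = np.array([[0.10, 0.20, 0.20],
                   [0.05, 0.03, 0.08],
                   [0.30, 0.25, 0.20]])
n_classes = np.array([2, 5, 10])

ranks = ranks_per_dataset(errors)
arithmetic_rank = ranks.mean(axis=0)
geometric_rank = np.exp(np.log(ranks).mean(axis=0))
mpce_per_method = [mpce(errors[:, k], n_classes) for k in range(errors.shape[1])]
```

The geometric mean rank penalizes occasional poor rankings less than the arithmetic mean does, which is why the two indexes can order methods differently.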

Results and Analysis
The classification results are compared in Table 2, with the best performance on each dataset highlighted in bold. The CD diagram is shown in Figure 8. Moreover, pairwise comparisons between MS-RP-FCN and its competitors are provided in Figure 9. The CD diagram of Figure 8 clearly shows that MS-RP-FCN and MS-RP-ResNet achieve the best performance among all of the competitors. The evaluation indexes of Table 2 show that MS-RP-FCN wins three of the four metrics and MS-RP-ResNet wins two of the four metrics. For the MPCE index, MS-RP-FCN and MS-RP-ResNet are tied for first. For the Arithmetic ranking and Geometric ranking indexes, MS-RP-FCN ranks first and MS-RP-ResNet ranks second. The relative weakness of MS-RP-FCN is the Number of Wins index. This is because the sizes of the datasets in the UCR archive vary greatly, and FCN is slightly inferior on large datasets due to its shallow structure.
Some more detailed observations can be made as follows. First, compared with FCN and ResNet applied to raw sequences, the advantages of our proposed methods are obvious. This demonstrates that the texture information provided by MS-RP can be more easily distinguished by the networks. Second, the performance of MS-RP-FCN is far better than that of RP-CNN and RP-FCN, due to better classifiers and the improvement of RP. Finally, although HIVE-COTE, FCN-RCF, and ALSTM-FCN achieve very competitive performances, they still fall slightly behind MS-RP-FCN and MS-RP-ResNet, as shown in Figure 8, which further demonstrates the effectiveness of our proposed methods.
MS-RP is composed of three parts, as mentioned in Section 2.1, and the effectiveness of each part should be demonstrated. Thus, the validation experiments for each part are provided in Tables 3-5, respectively, with FCN selected as the classifier. For visual convenience, the best performance on each dataset in these tables is highlighted in bold.
Comparison between Different Values of m and τ. Table 3 provides ten pairs of classification error rates for performance comparison between two different groups of m and τ, with the image sizes staying consistent with Table 2. Distinct gaps between the two groups of error rates can be found in the table. As mentioned above, different values of (m, τ) enrich the scales of RP images, which helps to better represent time series.

Table 3. Comparison in terms of error rates between different m and τ.

Comparison between Symmetric RP and Asymmetric RP. Asymmetric RP images are proposed for encoding very long sequences. To compare the performances of symmetric and asymmetric RP, six UCR datasets with very long sequences are selected, and the error rates are provided in Table 4. It can be seen that the asymmetric structure is helpful, though the gains are small except for the 'CinCECGTorso' dataset. This is likely because asymmetric RP is more capable of preserving detailed information, while most of the selected datasets are characterized by global shapes.

Comparison between Signed RP and Unsigned RP. The rule of signs is introduced to overcome the tendency confusion problem of RP. To evaluate its effectiveness, we select ten datasets and provide the performances of signed and unsigned RP in Table 5. As is shown, signed RP has a large advantage. Thus, the sign masks, which describe the tendency variations of sequences, are effective supplements to RP. In particular, the sign masks are more critical for action recognition datasets, because the sequences of these datasets are more sensitive to tendency changes.

Visualization
In order to visually demonstrate that MS-RP better utilizes the advantage of DNN in extracting features, we gain some insights on the spatial distribution of the latent representation learned by the networks. Specifically, we feed the raw data, RP images, and MS-RP images into FCN respectively, and extract the last latent representations (feature vectors of global average pooling layer) learned by the network. Then, t-Distributed Stochastic Neighbor Embedding (t-SNE) [36] is introduced to visualize the classification results of different input data. It is a technique embedding high-dimensional vectors into a two-dimensional map.
We select the 'TwoPatterns' and 'Fish' datasets to produce the three kinds of input data mentioned above, and FCN is trained for 2000 epochs with each kind of data. Figure 10 shows the visualization results. As is shown, when FCN is trained with the raw data, the feature clusters of different classes are close to each other and hard to separate. When FCN is trained with RP images, the results on the 'Fish' dataset are good, though slight category confusion still exists, while the results on the 'TwoPatterns' dataset are poor, due to the tendency confusion problem of RP. When FCN is trained with MS-RP images, the feature clusters of different classes are clearly separated by relatively large distances on both datasets.

Conclusions
In this paper, we improve RP comprehensively into MS-RP, and then transform TSC problems into image classification tasks for DNN. Firstly, the phase space dimension m and the embedding time delay τ of RP are introduced to enrich the scales of RP images. Secondly, asymmetrical RP is constructed to encode very long sequences. Finally, the rule of signs is introduced to overcome the tendency confusion problem of RP. Moreover, FCN and ResNet, which are state-of-the-art classifiers for TSC, are trained to handle MS-RP images.
Experimental results on 45 UCR datasets demonstrate that our proposed method outperforms the state-of-the-art, and each block of MS-RP is also demonstrated hierarchically through validation experiments. Furthermore, utilizing t-SNE, the classification results of different input data are analyzed visually, which further supports the effectiveness of our proposed approach.
Thanks to the popularity of wearable sensors, our work can be easily extended to practical applications, e.g., motion recognition, ECG health monitoring, and sleep state monitoring on mobile phones. We leave these applications as future work.