Random Subspace Ensembles of Fully Convolutional Network for Time Series Classification

Abstract: Time series classification (TSC) is one of the most significant topics in data mining. Among the methods for this task, deep-learning-based approaches show superior performance, owing to their good adaptation to raw series data and automatic feature extraction. However, little attention has been paid to composing ensembles of these strong individual classifiers to achieve further breakthroughs. The existing deep learning ensemble NNE undertakes the heavy work of combining 60 individuals but does not maximize the attainable improvement, since it merely attends to the diversity of individuals while ignoring their accuracy. In this paper, we propose to construct an ensemble of Fully Convolutional Networks (FCN) by the Random Subspace Method (RSM), named RSE-FCN. FCN is a simple but outstanding individual classifier, and RSM is well suited to high-dimensional data with few instances, such as time series. By combining these strengths, RSE-FCN provides a highly cost-effective approach that yields promising results. Experiments on the UCR archive demonstrate the effectiveness and reasonability of the proposed method.


Introduction
Time series classification (TSC) has attracted great attention because it arises in a wide variety of fields [1], such as recognition and diagnosis problems in finance and medicine. Meanwhile, it is regarded as a challenging data mining problem [2] due to its high dimensionality and nonalignment. These two drawbacks make it difficult to find distinctive features in time series.
To tackle this issue, hundreds of novel methods have been proposed, including distance-based [3], feature-based, deep-learning-based [4] and ensemble-based approaches. Among them, deep-learning-based methods show impressive performance, as they have a certain tolerance for unaligned time series and are able to perceive features beneficial to the classification task automatically. An illustrative example is the work of Z. Wang [5], which evaluates several simple end-to-end deep learning baselines: the Multilayer Perceptron (MLP), the Residual Network (ResNet) and the Fully Convolutional Network (FCN). Their experimental comparisons show that FCN and ResNet outperform several complicated non-deep-learning TSC methods. The two winners, FCN and ResNet, are two specific forms of the Convolutional Neural Network (CNN). CNN is indeed the most highly regarded deep learning model in the TSC field [4], and FCN is one of its common application forms. Ismail F. et al. [6], attracted by the robustness and good performance of FCN, use it as the classifier in transfer learning experiments across datasets in the UCR archive. Karim F. et al. [7] employ FCN to perceive spatial features of time series curves and augment it with an LSTM module to extract temporal features simultaneously. Their LSTM-FCN achieved the state-of-the-art performance at that time, while Elsayed N. et al. [8] use a GRU module to extract time-related features, and their GRU-FCN also shows accurate results. In fact, RNN modules such as LSTM and GRU are not suitable for solving TSC problems alone [9]; such superior results rest mainly on the contribution of FCN. Moreover, MFCN [10], proposed on the basis of FCN to learn multi-scale features of time series, also achieves improved performance. The successes of the above models have repeatedly demonstrated the excellence of FCN for the TSC problem.
Ensemble learning is an approach to further improve accuracy when a basic model reaches its bottleneck. COTE [11] and PROP [12] are representative examples in the TSC field. The former combines a total of 35 weak classifiers, trained on information from four domains, into a strong one; the latter integrates 11 kinds of classifiers that evaluate similarity via different distance measures. Besides methods like these, which combine different existing TSC techniques, classical ensemble learning ideas, which provide heuristic rules for constructing diverse individual classifiers from a given dataset, are also employed. Ji C. et al. [13] propose XG-SF based on the XGBoost algorithm and Shapelet features. Raza A. et al. [14] construct ensemble classifiers called EnRS-Bagging and EnRS-Boosting in the classical Bagging and Boosting manners, respectively. As is well known, the individual classifiers within an ensemble should be both accurate and diverse, whereas most of the mentioned methods employ inferior decision trees or distance-based models; consequently, they cannot compete with state-of-the-art results, though they achieve some improvements in accuracy. Furthermore, no method focused on constructing deep neural network ensembles until NNE [15] was proposed in 2019. NNE composes an ensemble of 60 different neural networks generated from 6 kinds of models with 10 initial weights each. It does pay attention to the diversity of individual classifiers, but it is unwise to involve models with more complex structures yet less accurate judgment than FCN, such as Time-CNN [16] and MCD-CNN [17]; as a result, the achieved improvement does not match the devoted effort. Though NNE exceeds COTE [11] in accuracy thanks to its huge number of basic classifiers, it does not maximally exploit the role of the ensemble.
In addition, approaches like XG-SF and EnRS are imperfect in another respect: they do not choose a data perturbation method appropriate for time series. Bagging further reduces the number of training instances, and Boosting complicates the fusion phase for multi-class problems, especially on datasets with dozens of categories.
Based on the above analysis, we propose to employ the simple but excellent FCN and the classical data perturbation technique, the Random Subspace Method (RSM) [18], to build a deep learning ensemble. RSM, which samples in attribute subspaces, is appropriate for time series, which have high dimensionality but few instances. A classic example of RSM is the machine learning algorithm Random Forest [18]. Other applications, such as creating different subsets of the same features [19] or of the dataset [20] for sentiment classification and image classification, support our task as well. However, unlike those discrete attributes or sets, value-continuous and order-dependent time series cannot be sampled at random positions. Fortunately, the feature-based TSC solution Shapelet offers an inspiration. Shapelet views a high-dimensional time series as a combination of multiple shape primitives [21] and learns the few most discriminative primitives rather than the whole series. Many Shapelet-based studies have achieved success in the TSC field, such as Shapelet Transformation [22] and Logical Shapelet [23], as well as COTE, XG-SF and EnRS mentioned above. Thus, it makes sense to focus only on the discriminative local information of a time series. Despite its advantages, the brute-force Shapelet search has a huge time complexity of O(k²m⁴) [21], so we imitate the idea of Shapelet but discard its search procedure. In our method, equally-divided time series intervals are regarded as primitives.
Therefore, in this paper, a lightweight RSE-FCN (Random Subspace Ensemble of FCN) is designed to tackle the challenging TSC problem. Equally-divided time series intervals are regarded as candidate subsequences, and the Top-K ones with significantly discriminative features are screened out by an evaluation model; then, with the Random Subspace Method deployed on the Top-K subsequences and the superior FCN serving as the individual classifier, the neural network ensemble RSE-FCN is constructed. Through screening and random selection, the input dimension is reduced without losing discriminative features, so FCNs focusing on the key task-related information can be well trained. The basic classifier FCN has superior performance, and the ensemble built by RSM makes a further breakthrough; thus, promising results can be expected.
The rest of this paper is organized as follows. Section 2 describes the details of the proposed RSE-FCN. Experimental results are presented and analyzed in Section 3. Finally, the conclusions are drawn.

FCN Ensemble Built by Random Subspace Method
FCN has proven to be an effective vehicle for TSC problems. Ensemble learning can improve further on the accuracy of a single classifier, and RSM prepares diverse individuals for the ensemble by offering various combinations of inputs. Meanwhile, a classification model trained on low-dimensional key features obtained via screening usually performs well. This work proposes a Random Subspace Ensemble of FCN (RSE-FCN), combining the above strengths to yield promising TSC results. It converts raw continuous time series into discriminative subsequences, then deploys the Random Subspace Method on these subsequences with FCN classifiers. An overview of RSE-FCN is shown in Figure 1. In this section, we first introduce the computationally inexpensive converting method, established by imitating Shapelet discovery, and then describe the precise working of the proposed RSE-FCN.
The procedure of Shapelet discovery can be summarized in four stages: enumerate candidate subsequences, evaluate the candidates and form the Shapelet binary tuples, screen, and re-rank. The converting operation adopted in our method generally follows this logic. However, we prefer to roughly eliminate some irrelevant and redundant subsequences rather than accurately search out the K most representative primitives as Shapelet does. As shown in Algorithm 1, first, N equally-divided intervals of the time series are viewed as N candidate subsequences. Then, with classification accuracy as the score, each subsequence is evaluated by a simple classifier; here, a roughly trained FCN is leveraged as the evaluator. Finally, the K subsequences with the highest scores are retained for the Random Subspace Method.
Algorithm 1: GetTopKBestSubsequences
1: Input: Training set D and its label set Y_d; test set T and its label set Y_t; number of subsequences N; number of optimal subsequences K
2: Output: kBestID, D_S, T_S
3: L = X.dimension //The time step (attribute dimension) of the instances
4: SL = L/N //Rounding down; the end of the series is discarded
5: SubsequencesID = ∅
6: for i in range(1, N) do
7:   start = SL*(i − 1), end = SL*i
8:   Score = Evaluate(D[:, start:end], Y_d) //Roughly trained FCN as evaluator
9:   SubsequencesID.append(i, Score) //Record every part and its score
10: end for
11: SubsequencesID.sortByScore()
12: return the top K entries as kBestID, together with the interval sets D_S and T_S
This ensemble framework requires four structural hyperparameters: the number of intervals to be divided, N; the number of optimal subsequences, K; the number of subsequences received by each classifier, M; and the number of basic classifiers, E. They directly affect the performance of the proposed method. Figure 2 gives an example with concrete hyperparameter values. Too large an N makes each subsequence too short to be learned. The diversity of subsequence selections is positively related to N, and the benefit of a larger number of individual classifiers E is also positively related to that diversity, so we consider N and E within 10 to be sufficient. Furthermore, M must obviously be less than or equal to K. Moreover, M/N > 1/2 should be guaranteed: since the above discovery of key discriminative subsequences is rough, retaining too small a proportion as input may miss beneficial features hidden in the discarded parts. In the ensemble phase, the Random Subspace Method is deployed on FCNs (Figure 3) that keep the same structure as in [5]. The FCNs are trained on different combinations selected from the top-K discriminative subsequences. Finally, the soft classification scores given by the individual classifiers are combined by soft voting. Based on the above settings and descriptions, the details of RSE-FCN are presented in Algorithm 2.
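The converting step can be sketched in plain Python as follows. This is a minimal illustration, not the paper's implementation: the function names are our own, and a leave-one-out 1-NN scorer stands in for the roughly trained FCN evaluator (an assumption made purely to keep the sketch self-contained).

```python
# Sketch of Algorithm 1 (GetTopKBestSubsequences). The 1-NN scorer below is a
# stand-in for the roughly trained FCN evaluator described in the paper.

def split_into_intervals(series, n):
    """Equally divide a series into n intervals; the tail remainder is discarded."""
    sl = len(series) // n  # interval length, rounded down
    return [series[i * sl:(i + 1) * sl] for i in range(n)]

def loo_1nn_score(X, y):
    """Leave-one-out 1-NN accuracy on one interval (stand-in evaluator)."""
    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    correct = 0
    for i, xi in enumerate(X):
        j = min((j for j in range(len(X)) if j != i), key=lambda j: sqdist(xi, X[j]))
        correct += int(y[j] == y[i])
    return correct / len(X)

def top_k_subsequences(D, y, n, k):
    """Score each of the n intervals and return the ids of the k best,
    together with the per-instance interval sets (D_S in the paper)."""
    D_S = [split_into_intervals(x, n) for x in D]
    scored = []
    for i in range(n):
        Xi = [parts[i] for parts in D_S]           # i-th interval of every instance
        scored.append((loo_1nn_score(Xi, y), i))
    k_best = [i for _, i in sorted(scored, reverse=True)[:k]]
    return k_best, D_S
```

With K < N, the lowest-scoring intervals are simply dropped before the random subspace stage.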
First, by the GetTopKBestSubsequences method (Algorithm 1), kBestID, the set of ordinal numbers of the K optimal intervals, is obtained. The sets of N intervals D_S and T_S, into which the training set D and test set T are divided, should be reserved. Second, M ordinal numbers RsID are randomly selected from kBestID, and this operation is repeated E times. Next, according to D_S, T_S and the current RsID, a new training set D' and test set T' are generated. To preserve features that stretch over division points, the subsequences are combined in ascending order of their ordinal numbers rather than in selection order; equivalently, one can simply remove from the interval sets of the whole time series those subsequences whose ordinal numbers are not among the selected M. Then, an FCN is trained on the current dataset D' to yield a judgment for the test set T'. Finally, the classification results of the E individual FCN classifiers are integrated by soft voting, giving the ensemble decision Y for the test set T; Algorithm 2 accordingly appends each individual result r_i to IndividualResults inside the loop and returns the vote Y over IndividualResults at the end.
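The ensemble phase just described can be sketched as follows. The `train_fn` parameter is a stand-in interface where the FCN classifier would be plugged in; its signature, like the function names, is an assumption made for illustration.

```python
import random

def select_subspace(k_best_ids, m, rng):
    """Draw M interval ids, sorted ascending so that concatenation preserves
    features stretching across division points, as noted in the text."""
    return sorted(rng.sample(k_best_ids, m))

def build_input(interval_sets, ids):
    """Concatenate the chosen intervals of each instance into one series."""
    return [sum((parts[i] for i in ids), []) for parts in interval_sets]

def soft_vote(all_probs):
    """Average per-class scores over individuals, then take the argmax."""
    n_cls = len(all_probs[0][0])
    votes = []
    for t in range(len(all_probs[0])):
        avg = [sum(p[t][c] for p in all_probs) / len(all_probs) for c in range(n_cls)]
        votes.append(max(range(n_cls), key=avg.__getitem__))
    return votes

def rse_ensemble(D_S, y, T_S, k_best_ids, m, e, train_fn, seed=0):
    """train_fn(Xtr, y, Xte) -> per-class probability vectors for Xte;
    in RSE-FCN an FCN classifier would fill this role."""
    rng = random.Random(seed)
    all_probs = []
    for _ in range(e):
        ids = select_subspace(k_best_ids, m, rng)
        all_probs.append(train_fn(build_input(D_S, ids), y, build_input(T_S, ids)))
    return soft_vote(all_probs)
```

Each of the E individuals thus sees a different (ascending-ordered) combination of the top-K intervals, and the final decision is the class with the highest averaged score.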

Experiments
The 44 datasets from the UCR archive [24] adopted in Wang's work [5] are employed by most existing methods. Within this scope, we consider the datasets that fit our problem setting, selected according to the number of training instances, the number of categories and the time-step length. Validation experiments are thus carried out on 26 datasets.
The number of training iterations of the evaluator FCN in Algorithm 1 is set to 300, and that of the classifier FCN to 1600. The four hyperparameters N, K, M and E depend on the specific dataset. For the special dataset ItPwDmd, with only 24 time steps, K = N is used, meaning no elimination along the time dimension. The other hyperparameter settings for the experiments are given in Table 1.

Hyper-Parameters Setting
Batch size: 128
Optimizer: Adam
Learning rate: 10⁻³ to 10⁻⁴
Loss function: Cross-entropy

Comparisons are conducted between RSE-FCN and related benchmarks, including: (1) classical single-model deep learning TSC methods [5], namely FCN, ResNet and MLP; (2) the multimodal neural networks MCNN [25] and MFCN [10]; (3) non-deep-learning and deep-learning ensemble models: SE (Shapelet Ensemble), COTE [11] and NNE. Since the authors of the UCR archive noted that "beating the performance of DTW on some data sets should be considered a necessary" baseline, we also select the classical DTW [26] and the elastic-distance-based ensemble PROP [12]. In addition, we conduct another set of comparisons among RSE-FCN, EnRS [14] and XG-SF [13] on several common datasets, since they are all ensemble classifiers built by learning different local discriminative information.
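Table 1 gives only the endpoints of the learning-rate range; a simple schedule covering that range might look as follows. The exponential interpolation shape is an assumption for illustration, as the text does not specify how the rate moves between 10⁻³ and 10⁻⁴.

```python
def lr_schedule(epoch, total_epochs=1600, lr_start=1e-3, lr_end=1e-4):
    """Exponentially interpolate the learning rate from lr_start down to
    lr_end over the classifier FCN's training iterations (assumed shape)."""
    frac = epoch / max(total_epochs - 1, 1)   # 0.0 at the first epoch, 1.0 at the last
    return lr_start * (lr_end / lr_start) ** frac
```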
Four evaluation metrics are adopted in the above comparisons: the average accuracy, the number of wins, the number of wins excluding ties and the average rank of each algorithm. Further, the average rank, together with a critical distance (CD) given by hypothesis testing [27], measures the significance of performance differences.
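The average-rank and CD computations can be sketched as follows. `nemenyi_cd` uses the standard critical-distance formula of the post-hoc Nemenyi test (Demšar-style CD diagrams); the studentized-range constant `q_alpha` must be looked up for the given number of methods, and the function names are our own.

```python
import math

def average_ranks(acc):
    """acc: {method: [accuracy per dataset]}; higher accuracy -> better (rank 1).
    Tied methods receive the mean of the ranks they span."""
    methods = list(acc)
    n_data = len(next(iter(acc.values())))
    totals = {m: 0.0 for m in methods}
    for d in range(n_data):
        order = sorted(methods, key=lambda m: -acc[m][d])
        i = 0
        while i < len(order):
            j = i
            # extend over a run of equal accuracies (ties)
            while j + 1 < len(order) and acc[order[j + 1]][d] == acc[order[i]][d]:
                j += 1
            mean_rank = (i + j) / 2 + 1   # ranks are 1-based
            for m in order[i:j + 1]:
                totals[m] += mean_rank
            i = j + 1
    return {m: totals[m] / n_data for m in methods}

def nemenyi_cd(q_alpha, n_methods, n_datasets):
    """Critical distance: two methods differ significantly if their
    average ranks differ by more than this value."""
    return q_alpha * math.sqrt(n_methods * (n_methods + 1) / (6.0 * n_datasets))
```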
The experimental results are shown in Tables 2 and 3, and Figure 4 presents the CD diagram. Observation shows that RSE-FCN is in the leading position among all compared models. Its performance is slightly better than NNE, the ensemble of 60 deep learning classifiers, and much better than FCN, COTE and all the other methods. Figure 4 shows that most ensemble learning methods and deep learning methods lie in the first group, with no significant difference between them; however, RSE-FCN ranks high within this group, exceeds its competitors on the other specific metrics and is on a par with NNE. Through further comparative analysis, it can be concluded that, according to the three metrics, RSE-FCN achieves higher accuracy and more wins than COTE and EnRS. XG-SF gives more top results than RSE-FCN but has lower average accuracy, which also means it is not as robust as RSE-FCN. These three methods all learn discriminative local (Shapelet-style) information; consequently, the results suggest that our method also makes successful use of such information. Although the converting process is simple and rough, it retains the key subsequences and eliminates information weakly related to the TSC task.
After presenting the experimental results, we briefly analyze the running time of the model. It should be noted that, in the TSC field, FCN is conventionally trained for 2000 epochs to allow fair comparisons, since Wang's initial experiments did so, and a fixed number of training epochs is likewise kept for other deep learning models. We assume that NNE also follows this default rule. In our RSE-FCN, the feature-picking and basic-model-training processes are approximately equivalent to constructing eight FCNs in total. For FCN, only a single model needs to be trained, while NNE integrates 60 models. Given the sizes of these models, under the same hardware conditions, the time consumption of RSE-FCN is about 6-10 times that of a single FCN, and about 1/6-1/10 that of NNE.
The above comparisons establish the significance of the proposed RSE-FCN. Furthermore, its mechanism is analyzed on the representative dataset Beef, which has balanced classes and a medium time-step length: instances in Beef have a dimensionality of 470 and 5 categories. Taking N = 8 as an example, as shown in Figure 5, the first image in the upper left corner is the complete curve of instances from the five categories, and images 2 to 9 show the 8 equally divided intervals and their evaluation accuracies. It can be seen from Figure 5 that in the second, seventh and eighth intervals, all curves differ mainly in amplitude; although some fluctuations exist, they are tiny and irregular compared with those in the other intervals, and may be caused by noise rather than being category-related. Thus, in these intervals, all instances produce similar activation values when convolved with the filter matrices, making them indistinguishable. In the remaining intervals, differences appear in amplitude, trend, fluctuation and so on, so the instances can be distinguished more easily and achieve higher accuracies. During the converting procedure of RSE-FCN, intervals with weakly discriminative features are discarded, so every random combination involves key information, and the FCNs trained on them give sound judgments. In fact, on this dataset, some individual models receiving the discriminative local information generated by RSM achieved higher accuracies than the model based on the whole series. This result shows that our treatment of time series is reasonable and provides a degree of interpretability for the success of RSE-FCN.

Conclusions
Deep-learning-based methods show impressive performance on the TSC task. However, few studies pay attention to building neural network ensembles, and the only existing one, NNE, is not cost-effective. This paper proposes RSE-FCN, which uses the Random Subspace Method to integrate the simple but outstanding FCN. It combines the strengths of learning key local information, employing the superior classifier FCN and gaining improvement through the ensemble. First, time series are equally divided into subsequences, and the top-K discriminative ones are obtained via evaluation and screening; in that process, the main features of the raw time series are retained while its dimensionality is reduced. Then, individual models are generated by deploying RSM on the reserved subsequences with FCN serving as the classifier. Finally, the ensemble is built by voting.
In experiments on the UCR datasets, the lightweight RSE-FCN, with no more than 10 individuals, is as competitive as the heavy NNE and outperforms the other benchmarks. The well-trained FCNs establish a superior performance basis, and the ensemble built purely from them makes a further breakthrough, so promising results are achieved. Further analysis indicates that our usage of time series can indeed retain discriminative features and eliminate less relevant information, which contributes to generating strong individual classifiers. Moreover, the Random Subspace Method proves an effective manner of building a good ensemble by deploying it on time series subsequences.
Author Contributions: Y.Z. and C.M. came up with the creation and methodology of the models, provided the design of experiment and wrote the initial draft; J.M. participated in the programming phase and the analysis of the experimental data; L.Z. supervised the research activity planning and execution and was also responsible for ensuring that the descriptions are accurate and agreed upon by all authors. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.