Using Permutations for Hierarchical Clustering of Time Series

Two distances based on permutations are considered to measure the similarity of two time series according to their strength of dependency. The distance measures are used together with different linkages to get hierarchical clustering methods of time series by dependency. We apply these distances to both simulated theoretical and real data series. For simulated time series the distances show good clustering results, both in the case of linear and non-linear dependencies. The effect of the embedding dimension and the linkage method are also analyzed. Finally, several real data series are properly clustered using the proposed method.


Introduction and Main Definitions
The goal of time series clustering is to split a set of time series into homogeneous groups, that is, similar time series should lie in the same cluster. However, there are many distances to measure the degree of similarity between two time series, depending on the clustering objectives. The most popular are the Euclidean and correlation-based distances. However, if we want to cluster time series in shape, it has been proved that Dynamic Time Warping distance (see [1]) is more appropriate. Some other distance measures are the short time series given in [2], the Kullback-Leibler studied in [3] or the recent copula-based distance introduced in [4].
Interesting surveys in the field can be found in [5,6], where three different types of clustering approaches are distinguished: shape-based, feature-based and model-based approach. In most cases, the procedures only take into account univariate features of the time series and do not consider the possible relationships among them. Therefore, these methods are useful in the case of independent time series and when the objective is to group them looking at the similarity of their univariate models. It is well known that the selection of a suitable distance measure mainly depends on the objective of clustering. Typically three objectives can be distinguished [5]: finding similar time series in time, finding similar time series in shape and finding similar time series in change (structural similarity). In the first case, the Euclidean distance on raw time series or the Wavelet Transform distance are proper for the first goal. The distances based on Pearson's correlation are usually considered of this type, even though the aim is to put the time series that are correlated in the same cluster. In the second case, where the time of occurrence of patterns in not important, the use of elastic methods like the Dynamic Time Warping distance is highly encouraged. In the third case, where the aim is to cluster time series with similar structure, autocorrelation-based measures are appropriate. All these similarity (dissimilarity) measures have been applied to some simulated data and compared to our proposed measures to illustrate that they are not proper for clustering time series by dependency. Many of these clustering methods are available in the R package TSclust, developed by Montero and Vilar [7].
Clustering time series by dependency has recently been analyzed in [8][9][10]. In these works, the goal is to cluster the time series according to their degree of dependency. The two former assume that the vector of time series is generated by a Dynamic Factor Model where some factors affect different groups of series. The latter proposes the generalized cross correlation as a general measure of linear association between two time series.
There are a wide range of papers dealing with the applications of ordinal patterns in time series analysis. For example, the utility of permutations to detect structural changes in time series has been widely studied and applied to real data such as speech signals [11], electroencephalogram signals [12][13][14][15] or economic and seismic data [16]. On the other hand, the discriminative power of ordinal pattern statistics and symbolic dynamics to classify cardiac biosignals has been evaluated in [17], whereas [18] uses ordinal patterns to detect and locate change points in the time series and classify the segments with similar dynamics. In this context, Ref. [19] proposes a new metric called Ordinal Synchronization to evaluate the level of synchronization between time series by means of a projection into ordinal patterns.
The aim of the present paper is to illustrate the utility of symbolic dynamic (the time series are codified by means of permutations) in clustering time series by dependency, where the main contribution respect to the works mentioned above are the absence of assumptions and the detection of linear and non-linear dependencies. For that, we have based on the work developed by Ruiz-Abellon et al. [20], where the next three aspects are combined: the time series codification by means of permutations, the distance measures among time series and different linkages for hierarchical clustering.
In this paper we will consider the labeling by permutations technique jointly with some measures which allow us to group different time series, or at least to state what time series are closer in the sense of the considered measure. Let us introduce the basic notation.
Several authors have pointed out that permutations can be a good tool to study time series. In [21] (see also [22][23][24]) the notion of permutation entropy has been used for analyzing time series. Let us recall a few notions on permutation labeling of a time series.
Let (x n ) T n=1 , T ∈ N ∪ {∞}, be a sequence which comes from real or simulated data. For m ≥ 2, let S m be the group of permutations of length m, whose cardinality is m!. Let x m (r) = (x r , x r+1 , ..., x r+m−1 ), 1 ≤ r < T − m + 1, be a sliding window taken from the sequence (x n ) T n=1 . We say that the window x m (r) is of π-type, π ∈ S m , if and only if π = (i 1 , i 2 , ..., i m ) is the unique element of S m satisfying the two following conditions: The positive integer m is usually known as embedding dimension. Fix r < T − m + 1. For each π ∈ S m , the probability of occurrence of π is given by: A permutation π ∈ S m with p(π) > 0 for some r < T − m + 1 is called an admissible permutation of (x n ) T n=1 . It is clear that permutations are linked to data series complexity. For instance, a periodic or increasing time series has at most a finite number of admissible permutations which are bounded for any embedding dimension m. Conversely, it is also clear that a big enough i.i.d noise should admit any permutation of length m. On the other hand, for piecewise monotone maps, it is proved that topological entropy, a useful tool to decide whether a deterministic time series is complicated, can be computed by using permutations [25,26].
The above-described codification can be extended in a direct way when we have two dimensional time series. Let (x n ) T n=1 and (y n ) T n=1 be two real time series and let (z n ) T n=1 be the corresponding two-dimensional time series with z n = (x n , y n ), for all n = 1, ..., T. Let z m (r) = (x m (r), y m (r)), 1 ≤ r < T − m + 1, be a two-dimensional sliding window taken from the sequence (z n ) T n=1 . The window z m (r) is said to be of type π i × π j if and only if x m (r) is a π i -type and y m (r) is of π j -type. After the codifying process, all of the empirical information is collected in a contingency table, where O i,j denotes the observed frequency of the symbol π i × π j and as usual, the relative frequency is given by: The contingency table is shown below.
(x n )/(y n ) π 1 -type π 2 -type . . . π m! -type The above contingency table was used in [27] to decide whether two time series were independent or not. Among others, the Pearson's chi-square statistic was used for the above contingency table given by: e i,j denotes the expected frequencies under the hypothesis of independence and is given by: Hence, we can define the Crammer's V measure as follows, see [28]: Values of Cramer's V close to zero means no association (independency) and close to one mean strong association (dependency). Cramer's V measure allows us to define the "distance": It is unclear whether D V is a metric, but as we will show later, it is a good help in clustering data series. Before that, we will introduce another measure based on mutual information measures.
Given two discrete random variables X and Y, the mutual information coefficient (see [29]) is defined by: where p(x i , y j ) is the join probability function of (X, Y) and p 1 (x i ) and p 2 (y j ) are the marginal probability functions of X and Y, respectively. The mutual information coefficient can be computed using the concept of Shannon entropy (see [30]) by: are the Shannon entropies of X and Y, respectively, and is the Shannon entropy of (X, Y). The mutual information coefficient is a dependency measure because I(X, Y) = 0 if and only if X and Y are independent. Moreover, it is symmetric and non-negative, but there is not a fixed upper bound. Hence, we can consider the metric given in [29] which is a metric because it is non-negative, symmetric and holds the triangular property. Applied to the codified time series it reads as: Remark that the efficiency of the Pearson's chi-square statistic (applied to symbolic dynamic) to detect linear and non-linear dependencies between two time series was illustrated in [27]. Therefore, a dissimilarity measure based on this statistic can be useful to develop a new clustering approach. Analogously, the efficiency of the mutual information coefficient for detecting dependencies was shown, among others, in [29], so a dissimilarity measure combining symbolic dynamic and the mutual information coefficient can be proposed as a good tool to cluster time series by dependency.
In the next section, different scenarios are simulated: linear dependency among time series using several models (the logistic map, uniformly distributed noise and autoregressive process); non-linear dependency for deterministic systems and non-linear dependency for non-deterministic systems. Additionally, simulations were carried out for different embedding dimensions and three hierarchical linkages (single, complete and average) to analyze their effect on the resulting dendrograms.

Linear Dependence
We consider data series by following the system of difference equations as follows: where n ≥ 0, k ≥ 3, f j are real functions for 1 ≤ j ≤ k, and for 1 ≤ i, j ≤ k, we have that λ ij ≥ 0 and These maps have been introduced as a model for migration in population dynamics (see e.g., [31] and some references therein). In principle, if λ ij = λ ji = 0, then there is no relationship between sequences x i n and x j n . Of course, we are assuming that some of them are not completely independent. Note that if both sequences are independent and are generated by the same deterministic map f with the same initial condition, then they are the same sequence indeed.
The experiments we have done are as follows. We generate several data series and apply D V and D to cluster them. The smaller values of D V or D we get, the closer the two data series are in the sense of these measures.
Below, we summarize the experiments we have done. We take k = 5 and consider the matrix: According to this matrix, sequences x 1 n and x 2 n are linked, as well as x 3 n and x 4 n , while sequence x 5 n is isolated. We consider several possibilities for f j , 1 ≤ j ≤ 5. Firstly, we take f j (x) = 4x(1 − x) (the logistic map) and random initial conditions. The data length is 10,000 and we consider different embeddings m = 2, 3, 4. We consider distances D V and D and three different linkages for the hierarchical method (single, complete and average). This is the general procedure along the paper.
As we can see, there are not significative differences in applying the three different hierarchical linkages.
In this example, and as it is expected, x 3 n and x 4 n are linked together as well as x 1 n and x 2 n , but the result shows a deep link between variables x 3 n and x 4 n . Figures 1 and 2 show the results we obtain in detail. To improve the visibility we write 1 for x 1 n and so on. We see a closed connection between x 3 n and x 4 n as expected, then between x 1 n and x 2 n . The difference appears with x 5 n which is more close to x 1 n and x 2 n for m = 2, 3 and to x 3 n and x 4 n for m = 4. To improve the visibility we write 1 for x 1 n and so on. We see a closed connection between x 3 n and x 4 n as expected, then between x 1 n and x 2 n . In all cases x 5 n is closer to x 1 n and x 2 n .
In order to compare the results of the proposed approach with some traditional ones, Figure 3 shows the dendrograms obtained for system (1) and the logistic time series using the Wavelet Transform, Pearson's correlation and Dynamic Time Warping distances. For the correlation-based distance the right clustering is achieved as expected, because the dependency among the time series given by system (1) is linear. However, the Wavelet Transform distance is not able to detect the correct relationships (recall that the objective of this distance is to find similar time series in time). Even though the Dynamic Time Warping distance provides the expected clustering results in this case, we will see later an example where it does not. To improve the visibility we write 1 for x 1 n and so on.
Then, we repeat the experiments when f j is an i.i.d. uniformly distributed noise for 1 ≤ j ≤ 5. To avoid repetitive graphics we only use the average linkage in Figure 4 for obtaining the dendrograms. No significative differences are obtained when the other two linkages (single and complete) are used instead of the average one. Figure 4 shows the clustering results obtained for both distances D V and D. We obtain the same results as in the previous purely deterministic case for m = 3 and m = 4. However, for m = 2 the actual relationships are not properly detected. Again, we repeat the experiment when f j (x) = 0.9x + ε, where ε is an i.i.d. uniformly distributed noise, that is, f j is an autoregressive process. The dendrograms shown in Figure 5 give us the same result as in the two previous cases.

Non Linear Dependence: Deterministic Systems
Now, we consider non linear time series constructed by deterministic systems. We consider the system: with initial conditions 0.45 for all the variables. We consider a sample of 10,000 points and show the results with the average linkage method. We obtain that x 1 n and x 2 n are linked as expected, then x 3 n and x 4 n , as one can expect given the system shape. Dendrograms are shown in Figure 6. Note that for the D distance, the expected results are obtained using different embeddings, whereas for the D V distance, only with m = 4. Next, we consider another nonlinear deterministic time series given by the system: with initial conditions 0.45 for all the variables. Dendrograms are shown in Figure 7, where as usual we have considered a sample of 10,000 points and the average method. We see that x 1 n and x 2 n are connected first and then x 3 n , x 4 n and x 5 n are clustered together as one may expect for the system shape. Note that the appropriate clustering is obtained when m = 3 and m = 4, but not for the case m = 2.

Non Linear Dependence: Non Deterministic Systems
First, we consider the time series generated by the difference equations: where u n and v n are i.i. d. N(0, 1) variables. We consider 0.45 as initial conditions and generate samples of 60,000 points for each variable. Then we divide them into five time series of 12,000 points labeled by x 1 ,..., x 5 , y 1 ,..., y 5 in Figure 8, where the results are presented with the average linkage method. We see that embedding dimensions m = 3 and 4 give us the expected results.  Next, we consider a similar example generating 60,000 data points and dividing them into data series of 12,000 points labeled by x 1 ,..., x 5 , y 1 ,..., y 5 . The points are generated by the system of difference equations: x n+1 = 0.6x n + u n y n+1 = 0.6(u n ) 2 + v n (5) where u n and v n are i.i.d. N(0, 1) variables. We consider 0.45 as initial conditions and the results are given in Figure 9 for average method. We see how embedding dimensions m = 3 and 4 give us the results that one can expect.  Simulated time series given by (5) have also been clustered using traditional distance measures. Figure 10 depicts the results obtained for the Euclidean, Pearson's correlation, autocorrelation-based, Wavelet Transform and Dynamic Time Warping distances. In the first case, the Euclidean distance does not provide the expected clustering results, because the goal of this distance is to find similar time series in time. In the second case, the correlation-based distance is not able to detect the non-linear relationships among the time series. Regarding the autocorrelation-based distance, which computes the dissimilarity between two time series as the distance between their estimated simple or partial autocorrelation coefficients, two groups are distinguished (one cluster formed by the time series split from x n and the other formed by the time series split from y n ). Once again, the Wavelet Transform distance does not achieve the correct clustering results. Finally, the Dynamic Time Warping distance is not appropriate for clustering time series by dependency (recall that this distance leads to a shape-based approach in time series clustering, where the time series with similar shape are put together in the same cluster; however, two time series with different shapes can be strongly dependent).

Real Data Experiments
The above section introduced several simulated time series proving that the distances D and D V are useful tools to establish what time series are closer for both linear and nonlinear dependencies. In this section we apply the theoretical framework for real time series.

Latin American Exchange Rate Dependencies
We consider the log-returns of daily exchange rates of six Latin American currencies vs. US dollar. Namely, we consider the currency of Argentina (ARS), Brasil (BRL), Chile (CLP), Colombia (COP), Mexico (MXN) and Peru (PEN), from 22 June 2005 to 25 April 2012. The data length is 1754, and it has been already considered in paper [32]. In that paper, authors evaluate the level of contagion among the exchange rates of the previous six Latin American countries, and using copulas they conclude that two blocs were distinguished: the first bloc consists of Brazil, Colombia, Chile and Mexico, whose exchange rates exhibit the largest dependence coefficients, and the second bloc consists of Argentina and Peru, whose exchange rate dependence coefficients with other Latin American countries are low. Figure 11 shows that the same conclusion can be obtained using the method proposed in this paper with embedding dimension m = 3 and complete linkage.

Tumor Clustering According to RNA Sequences
The data analyzed below consists of 50 time series of length 20,531 (RNA-sequence). Each series belongs to one of the five types of tumors which can be found and downloaded at the web page https://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq (see also https://www. synapse.org/#!Synapse:syn2812961). In fact, the original dataset consists of 800 time series, but we have selected the first 10 series of each type of tumor (a total of 50 time series) in order to get a simpler dendrogram. This collection of data is part of the RNA-Seq (HiSeq) PANCAN data set, it is a random extraction of gene expressions of patients having different types of tumors: BRCA (breast carcinoma), KIRC (kidney renal clear-cell carcinoma), COAD (colon adenocarcinoma), LUAD (lung adenocarcinoma) and PRAD (prostate adenocarcinoma).
Our results show that distance D V clusters properly each data series with its tumor group for embedding dimensions m = 3 and average linkage, although the most similar pair of tumors (BRCA and LUAD) cannot be properly grouped. With the same conditions for distance D, a completely correct clustering is obtained (see Figures 12 and 13).

Tumor Clustering According to RNA Sequences
The data analyzed below consists of 50 time series of length 20,531 (RNA-sequence). Each series belongs to one of the five types of tumors which can be found and downloaded at the web page https://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq (see also https://www. synapse.org/#!Synapse:syn2812961). In fact, the original dataset consists of 800 time series, but we have selected the first 10 series of each type of tumor (a total of 50 time series) in order to get a simpler dendrogram. This collection of data is part of the RNA-Seq (HiSeq) PANCAN data set, it is a random extraction of gene expressions of patients having different types of tumors: BRCA (breast carcinoma), KIRC (kidney renal clear-cell carcinoma), COAD (colon adenocarcinoma), LUAD (lung adenocarcinoma) and PRAD (prostate adenocarcinoma).
Our results show that distance D V clusters properly each data series with its tumor group for embedding dimensions m = 3 and average linkage, although the most similar pair of tumors (BRCA and LUAD) cannot be properly grouped. With the same conditions for distance D, a completely correct clustering is obtained (see Figures 12 and 13).

Evolution of Spanish IBEX35 Banks
An interesting application consists in analyzing the evolution of log-returns of Spanish banks at IBEX35 Spanish index. In Figure 14 we show the index evolution and the days we use to divide the data. Namely, before the 2008 financial crisis, from 2005 to February of 2009, Banco Popular (POP), Banco Santander (SAN), Banco de Sabadell (SAB) and Bankinter (BKT) were grouped while Banco Bilbao Vizcaya (BBV) was not grouped with them as Figures 15 and 16

Conclusions
This paper proposes the using of permutations as an efficient tool in time series clustering. Although traditional approaches (shape-based, feature-based and model-based) are useful to cluster time series that are not related among them, we show that permutations play an important role to cluster time series according to their degree of dependency.
Two distances based on permutations have been considered for the simulations, as well as three different embedding dimensions and three linkages methods for the hierarchical procedure. Simulation results demonstrate that: • The proposed clustering approach is able to detect linear and non-linear dependencies among time series.
• In some cases, a very small embedding dimension like m = 2 is not enough to detect dependencies among time series, thus a greater embedding dimension is required.
• The distance measure based on the mutual information has revealed a better performance than the distance measure based on the Crammer's V statistic.
• There are not significant differences with respect to the selected linkage method.
The necessity of the proposed approach to detect dependencies among time series has been shown using simulated data, by comparing the results obtained with some traditional distances. Furthermore, the performance of this clustering approach has also been validated using real data in the fields of Finance and Health Science.