Comparative Assessment of Hierarchical Clustering Methods for Grouping in Singular Spectrum Analysis

Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1 Research Institute of Energy Management and Planning (RIEMP), University of Tehran, Tehran 1417466191, Iran
2 Department of Statistics, Payame Noor University, Tehran 19395-4697, Iran; kalantarimahdi@gmail.com
3 Department of Tourism, Faculty of Economic Sciences, Ionian University, 49100 Corfu, Greece; benekic@ionio.gr
* Correspondence: hassani.stat@gmail.com


Introduction
Singular spectrum analysis (SSA) is a model-free technique that decomposes a time series into a number of meaningful components. Owing to its widespread applications in various fields, this non-parametric method has received much attention in recent years. Examples of the wide variety of SSA applications can be found in [1][2][3][4][5][6][7][8][9]. A thorough and detailed summary of the theory and applications of SSA can be found in [10,11]. Additionally, other books are devoted to SSA; for example, Refs. [12][13][14]. A comprehensive review of SSA, together with a description of its modifications and extensions, can be found in [15].
One of the challenging problems of SSA is identifying the interpretable components of a time series. The conventional method of detecting interpretable components, such as trend and periodic components, is to use the information contained in the singular values and singular vectors of the trajectory matrix of the time series. In this approach, a scree plot of the singular values, one-dimensional and two-dimensional plots of the singular vectors, and the matrix of the absolute values of the weighted correlations provide visual tools to identify the appropriate components. More details on group identification based on visual tools can be found in [11,14].
There is another, complementary approach to identifying the meaningful components of a time series, which performs clustering based on a dissimilarity matrix defined by the weighted correlations between elementary components. In this simple method, first, the similarity of the elementary components is measured by means of the weighted correlations between them. Then, a proximity matrix is constructed from these weighted correlations. Finally, the elementary components are grouped via distance-based clustering techniques such as hierarchical methods. Although this approach is appealing, it has not yet been established which hierarchical clustering method provides an accurate and reasonable grouping. For instance, hierarchical clustering with complete linkage was used in [16], but the reason for selecting the complete linkage was not made clear. Since, to the best of our knowledge, no comparison study has been conducted to determine an adequate hierarchical clustering method on the basis of weighted correlations, the present research tries to fill this gap. In this investigation, we compare the performance of several hierarchical clustering linkages in order to find a proper linkage.
This paper is divided into five sections. In Section 2, the theoretical background and general scheme of basic SSA are reviewed; a brief overview of the hierarchical clustering methods and of the measure of similarity between clusterings exploited in this research is also given there. Section 3 is dedicated to comparing the performance of hierarchical clustering linkages by simulating a variety of synthetic time series. Two case studies are presented in Section 4 using real-world time series datasets; this section also demonstrates, step by step, how to conduct grouping in SSA using a hierarchical clustering method. Conclusions and a discussion are given in Section 5, and supplementary R codes are presented in Appendix A.

Theoretical Background
In this section, we first briefly explain the theory underlying Basic SSA, which is the most fundamental version of the SSA technique. Then, the hierarchical clustering methods that are of interest to this study are reviewed.

Review of Basic SSA
Basic SSA consists of four steps, which are similar to those of other versions of SSA. These steps are briefly described below; in doing so, we mainly follow [11,14]. More detailed information on the theory of the SSA method can be found in [10].
Step 1: Embedding. In this step, the time series X = {x_1, ..., x_N} is transformed into the L × K matrix X whose columns are X_1, ..., X_K, where X_i = (x_i, ..., x_{i+L-1})^T ∈ R^L and K = N - L + 1. The matrix X is called the trajectory matrix. It is a Hankel matrix in the sense that all the elements on the anti-diagonals i + j = const. are equal. The embedding step has only one parameter, L, called the window length or embedding dimension. The window length is commonly chosen such that 2 ≤ L ≤ N/2, where N is the length of the time series X.

Step 2: Decomposition. In this step, the trajectory matrix X is decomposed into a sum of rank-one matrices using the conventional singular value decomposition (SVD) procedure. Denote by λ_1, ..., λ_L the eigenvalues of XX^T in decreasing order of magnitude (λ_1 ≥ · · · ≥ λ_L ≥ 0) and by U_1, ..., U_L the corresponding eigenvectors of the matrix XX^T. If d = max{i : λ_i > 0} = rank(X), then the SVD of the trajectory matrix can be written as

X = X_1 + · · · + X_d,  (1)

where X_i = √λ_i U_i V_i^T and V_i = X^T U_i / √λ_i.

Step 3: Grouping. The aim of this step is to group the components of (1). Let I = {i_1, ..., i_p} be a subset of the indices {1, ..., d}. Then, the resultant matrix X_I corresponding to the group I is defined as X_I = X_{i_1} + · · · + X_{i_p}, that is, the sum of the matrices within the group. Given the SVD of X, the split of the set of indices {1, ..., d} into the m disjoint subsets I_1, ..., I_m corresponds to the following decomposition:

X = X_{I_1} + · · · + X_{I_m}.  (2)

If I_i = {i}, for i = 1, ..., d, then the corresponding grouping is called elementary.
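The embedding and decomposition steps above can be sketched in a few lines. The following is an illustrative Python/NumPy sketch of expansion (1), not the Rssa implementation used in the paper:

```python
import numpy as np

def trajectory_matrix(x, L):
    """Embed a series of length N into its L x K trajectory (Hankel) matrix."""
    N = len(x)
    K = N - L + 1
    return np.column_stack([x[i:i + L] for i in range(K)])

# Step 2: SVD of the trajectory matrix into rank-one components
# X_i = sqrt(lambda_i) * U_i V_i^T, as in expansion (1).
x = np.sin(2 * np.pi * np.arange(48) / 12)   # toy periodic series
X = trajectory_matrix(x, L=24)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
components = [s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s))]
assert np.allclose(sum(components), X)       # the rank-one terms sum back to X
```

Note that a pure sine series has rank 2, so only the two leading singular values of its trajectory matrix are non-negligible.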
Step 4: Diagonal Averaging. The main goal of this step is to transform each matrix X_{I_j} of the grouped matrix decomposition (2) into a Hankel matrix, which can subsequently be converted into a new time series of length N. Let A be an L × K matrix with elements a_{lk}, 1 ≤ l ≤ L, 1 ≤ k ≤ K. By diagonal averaging, the matrix A is transformed into the Hankel matrix HA with elements ã_s over the anti-diagonals (1 ≤ s ≤ N) using the following formula:

ã_s = (1 / |A_s|) Σ_{(l,k) ∈ A_s} a_{lk},  (3)

where A_s = {(l, k) : l + k = s + 1, 1 ≤ l ≤ L, 1 ≤ k ≤ K} and |A_s| denotes the number of elements in the set A_s. By applying the diagonal averaging (3) to all the matrix components of (2), the following expansion is obtained: X = X̃_{I_1} + · · · + X̃_{I_m}, where X̃_{I_j} = HX_{I_j}, j = 1, ..., m. This is equivalent to the decomposition of the initial series X = {x_1, ..., x_N} into a sum of m series, where the k-th series corresponds to the matrix X_{I_k}. Usually, Steps 1 and 2 of the SSA technique are called the Decomposition Stage, and Steps 3 and 4 are called the Reconstruction Stage. There are many software applications implementing SSA, such as Caterpillar-SSA and SAS/ETS. In this investigation, we apply the freely available R package Rssa to conduct the decomposition and reconstruction stages of SSA. More details on this package can be found in [17][18][19].
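Formula (3) amounts to averaging each anti-diagonal of the matrix. A minimal NumPy sketch (an illustration, not the Rssa routine):

```python
import numpy as np

def diagonal_average(A):
    """Hankelize an L x K matrix via formula (3): the s-th value of the
    output series is the mean of the entries with l + k = s + 1."""
    L, K = A.shape
    N = L + K - 1
    sums = np.zeros(N)
    counts = np.zeros(N)
    for l in range(L):
        for k in range(K):
            sums[l + k] += A[l, k]    # 0-based indices: l + k runs over 0..N-1
            counts[l + k] += 1
    return sums / counts

# Diagonal averaging of a true trajectory matrix recovers the series itself.
x = np.arange(10.0)
A = np.column_stack([x[i:i + 4] for i in range(7)])  # 4 x 7 Hankel matrix
assert np.allclose(diagonal_average(A), x)
```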
There is a fundamental concept in SSA called separability, which indicates the quality of the decomposition by determining how well different components of a time series are separated from each other. Let us describe this important concept in more detail. Let A = (a_{ij}) and B = (b_{ij}) be L × K matrices. The inner product of the two matrices A and B is defined as

⟨A, B⟩ = Σ_{i=1}^{L} Σ_{j=1}^{K} a_{ij} b_{ij}.

Based on this definition of the inner product, the Frobenius norms of the matrices A and B are ‖A‖_F = √⟨A, A⟩ and ‖B‖_F = √⟨B, B⟩. Now, suppose that X̃_i = HX_i and X̃_j = HX_j are, respectively, the trajectory matrices of the two reconstructed series X̃_i and X̃_j, which are obtained from the diagonal averaging step of SSA based on the elementary grouping I_i = {i}. The measure of approximate separability between the two reconstructed series X̃_i and X̃_j is the weighted correlation or w-correlation, which is defined as

ρ^(w)_{ij} = ⟨X̃_i, X̃_j⟩ / (‖X̃_i‖_F ‖X̃_j‖_F).  (4)

Note that ρ^(w)_{ij} is equal to the cosine of the angle between the two trajectory matrices of the reconstructed components X̃_i and X̃_j. Using the Hankel property of the matrices X̃_i and X̃_j, it can easily be shown that the w-correlation is equivalent to the following form [10]:

ρ^(w)_{ij} = Σ_{k=1}^{N} w_k x̃^{(i)}_k x̃^{(j)}_k / (√(Σ_{k=1}^{N} w_k (x̃^{(i)}_k)^2) √(Σ_{k=1}^{N} w_k (x̃^{(j)}_k)^2)),

where w_k = min{k, L, N - k + 1}. It is noteworthy that the weight w_k is equal to the number of times the element x_k appears in the trajectory matrix X of the initial time series. If |ρ^(w)_{ij}| is large, then it can be deduced that X̃_i and X̃_j are highly correlated. Hence, these two components should be included in the same group. Conversely, a small value of the w-correlation indicates good separability of the components.
Consequently, an absolute value of the w-correlation close to zero (|ρ^(w)_{ij}| ≈ 0) implies a high separation between the two reconstructed components X̃_i and X̃_j.
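The series form of the w-correlation, with the weights w_k = min{k, L, N - k + 1}, can be computed directly. A small illustrative Python sketch:

```python
import numpy as np

def w_correlation(y1, y2, L):
    """w-correlation between two series of length N (series form of (4))."""
    N = len(y1)
    k = np.arange(1, N + 1)                            # 1-based index
    w = np.minimum.reduce([k, np.full(N, L), N - k + 1])
    inner = lambda a, b: float(np.sum(w * a * b))
    return inner(y1, y2) / np.sqrt(inner(y1, y1) * inner(y2, y2))
```

By construction, every series has w-correlation 1 with itself, and well-separated reconstructed components give values near zero.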

Hierarchical Clustering Methods
In this paper, we apply hierarchical clustering methods to cluster the eigentriples in the grouping step of SSA. Hierarchical clustering is a widely used clustering approach that connects objects to form clusters based on their distance. This distance-based method requires the definition of a dissimilarity measure (or distance) between objects. A multitude of dissimilarity measures is available in the literature for hierarchical clustering, such as the Euclidean, Manhattan, Canberra, Minkowski, binary and maximum distances. Because different distances result in different clusters, some important aspects should be considered when choosing the similarity measure. The main considerations include the nature of the variables (discrete, continuous, binary), the scales of measurement (nominal, ordinal, interval, ratio), and subject matter knowledge [20]. In this article, we define the dissimilarity measure between the two reconstructed components X̃_i and X̃_j as d_ij = 1 - |ρ^(w)_{ij}|. As illustrated in [14], Algorithm 1 for auto-grouping based on clustering methods can be written as follows: Algorithm 1 Auto-grouping using clustering methods.
1: Perform the decomposition and diagonal averaging steps of SSA with the elementary grouping to obtain the elementary reconstructed series. 2: Compute the w-correlations (ρ^(w)_{ij}) between the elementary reconstructed series X̃_i and X̃_j using (4). 3: Compute the dissimilarity between the elementary reconstructed series X̃_i and X̃_j as d_ij = 1 - |ρ^(w)_{ij}|. 4: Apply a clustering method to the dissimilarity matrix to group the elementary components. It should be noted that the automatic identification of SSA components has a limitation. As mentioned in [11], this limitation is the assumption that the components to be identified are (approximately) separated from the rest of the time series.
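The clustering steps of the algorithm can be sketched with SciPy, under the assumption that SciPy's hierarchical linkages mirror those of R's hclust. The w-correlation matrix below is a hypothetical block-structured example for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def group_components(wcor, n_clusters, method="average"):
    """Dissimilarity d_ij = 1 - |rho_ij|, then cut the dendrogram of the
    chosen linkage into n_clusters groups."""
    d = 1.0 - np.abs(np.asarray(wcor, dtype=float))
    np.fill_diagonal(d, 0.0)
    Z = linkage(squareform(d, checks=False), method=method)
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# Hypothetical w-correlation matrix: components {0, 1} and {2, 3} are
# strongly correlated within pairs and weakly correlated across pairs.
wcor = np.array([[1.00, 0.95, 0.10, 0.10],
                 [0.95, 1.00, 0.10, 0.10],
                 [0.10, 0.10, 1.00, 0.90],
                 [0.10, 0.10, 0.90, 1.00]])
labels = group_components(wcor, n_clusters=2)
assert labels[0] == labels[1] and labels[2] == labels[3]
assert labels[0] != labels[2]
```

Changing `method` to "single", "complete", "centroid" or "ward" reproduces the linkages compared in this paper.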
There are generally two types of hierarchical clustering approaches:
• Divisive. In this approach, an initial single cluster of objects is divided into two clusters such that the objects in one cluster are far from the objects in the other cluster. The process proceeds by splitting the clusters into smaller and smaller clusters until each object forms a separate cluster [20,21]. We implemented this method via the function diana from the cluster package [22] of the freely available statistical software R [23].
• Agglomerative. In this approach, each individual object is initially treated as a cluster, and then the most similar clusters are merged according to their similarities. This procedure proceeds by successive fusions until a single cluster containing all the objects is eventually obtained [20,21]. Several agglomerative hierarchical clustering methods are employed in this paper, including single, complete, average, mcquitty, median, centroid and Ward. There are two algorithms, ward.D and ward.D2, for the Ward method, which are available in R packages such as stats and NbClust [24]. In the ward.D2 algorithm, the dissimilarities are squared before the cluster updates.
All of the agglomerative hierarchical clustering methods implemented in this research are available via the function hclust from the stats package of R. More information about hierarchical clustering algorithms can be found in [20,21,25,26].

Cluster Validation Measure
In this paper, we used the corrected Rand (CR) index to measure the similarity between the grouping obtained with a hierarchical clustering method and a given "correct" grouping. Let us explain the CR index in more detail. Given an n-element set S, suppose that X = {X_1, ..., X_r} and Y = {Y_1, ..., Y_c} are two different partitions of S. The information on class overlap between the two partitions X and Y can be written in the form of Table 1. Table 1. Notations for the CR index.
Here, n_{ij} denotes the number of objects common to classes X_i and Y_j, while n_{i·} and n_{·j} refer, respectively, to the number of objects in classes X_i (sum of row i) and Y_j (sum of column j). Based on the notations provided in Table 1, the CR index is defined as follows:

CR = [Σ_{i,j} C(n_{ij}, 2) - Σ_i C(n_{i·}, 2) Σ_j C(n_{·j}, 2) / C(n, 2)] / [(1/2)(Σ_i C(n_{i·}, 2) + Σ_j C(n_{·j}, 2)) - Σ_i C(n_{i·}, 2) Σ_j C(n_{·j}, 2) / C(n, 2)],

where C(·, 2) denotes the binomial coefficient. The CR index is an external cluster validation index and can be computed with the R function cluster.stats from the fpc package [27]. This index varies from -1 (no similarity) to 1 (perfect similarity), and high values indicate great similarity between the two partitions. However, this index does not take into consideration the numbers and contributions of the time series components within a wrong partition. More details on this index can be found in [25,[28][29][30].
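The CR index can be computed directly from the contingency table of Table 1. A self-contained Python sketch (the paper itself uses cluster.stats from the R package fpc):

```python
from math import comb
from collections import Counter

def corrected_rand(part_x, part_y):
    """Corrected Rand index between two partitions given as label lists."""
    n = len(part_x)
    nij = Counter(zip(part_x, part_y))   # contingency table n_ij
    ni = Counter(part_x)                 # row sums n_i.
    nj = Counter(part_y)                 # column sums n_.j
    sum_ij = sum(comb(v, 2) for v in nij.values())
    sum_i = sum(comb(v, 2) for v in ni.values())
    sum_j = sum(comb(v, 2) for v in nj.values())
    expected = sum_i * sum_j / comb(n, 2)
    max_index = (sum_i + sum_j) / 2
    return (sum_ij - expected) / (max_index - expected)

# Identical partitions (up to relabeling) score a perfect 1.
assert corrected_rand([1, 1, 2, 2], [2, 2, 1, 1]) == 1.0
```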

Simulation Study
In this section, the performances of agglomerative and divisive hierarchical clustering methods are evaluated by applying them to various simulated time series with different patterns. In each of the simulated series below, normally distributed noise with zero mean (ε_t) is added to each point of the generated series. Various signal-to-noise ratios (SNRs), including 0.25, 0.75, 5, and 10, were employed to assess the impact of the noise level on the grouping results. The simulation was repeated 5000 times for each SNR of each scenario (a)-(f), and then the average of the CR index was computed. In order to enable better comparison, the dashed horizontal line y = 1 was added to all figures of the CR index.

To observe the structure of each simulated series, the simulation was performed once with SNR = 10. The time series plots of the simulated series are depicted in Figure 1. As can be seen in this figure, the simulated series (a)-(f) have various patterns, such as upward and downward trends and periodicity. In Table 2, the "correct" groups for each of the simulated series are determined based on the rank of the trajectory matrix of the simulated series and the grouping rules proposed in [10]. A major step in hierarchical clustering is the decision on where to cut the dendrogram. In this simulation study, we used the number of groups (or clusters) reported in Table 2 as the cutting rule for the dendrogram.

Figure 2 shows the CR index for the exponential series (case a), computed for various hierarchical clustering algorithms and different values of the window length (L). It can be concluded from this figure that the ward.D and ward.D2 methods have the worst performance at each level of SNR. In addition, the similarity between the grouping by the complete method and the "correct" groups decreases as L increases. However, the other methods exhibit good performance in detecting the "correct" groups, especially for a larger SNR.
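A noisy series at a prescribed SNR can be generated as follows. This sketch assumes the common definition of SNR as the ratio of the signal variance to the noise variance; the paper does not state its definition explicitly:

```python
import numpy as np

def add_noise(signal, snr, rng=None):
    """Add zero-mean Gaussian noise scaled so that var(signal)/var(noise)
    equals the target SNR (assumed definition)."""
    if rng is None:
        rng = np.random.default_rng(0)
    sigma = np.sqrt(np.var(signal) / snr)
    return signal + rng.normal(0.0, sigma, size=len(signal))

t = np.arange(200)
clean = np.sin(2 * np.pi * t / 12)     # toy periodic signal
noisy = add_noise(clean, snr=0.25)     # heavy noise, as in the study
```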
In the case of SNR = 0.25, the centroid, single, and average methods are better than the mcquitty, median, and diana methods. Figure 3 depicts the CR index for the Sine series (case b). It can be deduced from this figure that the ward.D and ward.D2 methods are unable to identify the "correct" groups at any level of the SNR when L ≥ 18. In addition, the capability of the complete method seems to increase as the SNR reaches high levels, and its capability for moderate values of L is better than for other values when SNR > 1. Moreover, the diana method generally outperforms the ward.D, ward.D2, and complete methods. When SNR < 1, the centroid, single, and average methods provide more reliable outputs than the mcquitty, median, and diana methods. Furthermore, there seems to be no significant difference among the centroid, single, average, and mcquitty techniques when SNR > 1. In Figure 4, the CR index for the Linear+Sine series (case c) is presented. Once again, it can be concluded from this figure that the ward.D and ward.D2 methods have the worst performance at each level of SNR. The other conclusions from this figure are similar to those for the Sine series (case b) and are hence not repeated here.
As we can see in Figure 4, the CR index of none of the methods reaches the value of one, and this holds for all levels of SNR. In other words, none of the methods could exactly identify the correct groups at any level of SNR. This is due to the fact that the rank of a linear time series is equal to 2, so it generates two non-zero eigenvalues. However, the second eigenvalue of the linear time series is relatively small; consequently, it is likely mixed with noise and cannot be identified, especially for a high level of noise. In Figure 5, the CR index for the Sine+Sine series (case d) is shown. Once again, the ward.D and ward.D2 methods present poor results at each level of the SNR. When SNR < 1, the single, centroid and average methods provide better results in comparison to the other methods. Furthermore, it seems that these methods can exactly identify the correct groups if SNR > 1. Figure 6 depicts the CR index for the Exponential×Cosine series (case e). As can be seen in this figure, the ward.D and ward.D2 methods cannot detect the "correct" groups at any level of the SNR. The single and average methods show better outputs when SNR < 1. Additionally, in the case of SNR > 1, it seems that there is no significant difference among the single, average, and mcquitty approaches.
In Figure 7, the CR index for the Exponential+Cosine series (case f) is presented. Similar to the previous results, the ward.D and ward.D2 methods cannot provide effective results at any level of the SNR. However, it seems that the centroid, single, and average methods are better than the other methods, especially for SNR < 1.
The outputs of the simulation study clearly indicate that the ward.D and ward.D2 methods cannot detect the "correct" groups at any level of the SNR. In addition, these outputs reveal that choosing the best hierarchical clustering linkage is not straightforward: it depends on the structure of the time series and the contribution of the noise. In order to give a general overview of the simulation results, the findings of the simulation study are summarized in Table 3. Based on these findings, it seems that the single, centroid, average and mcquitty methods identify the "correct" groups better than the other linkages.

Case Studies
Here, let us apply the hierarchical clustering methods to identify adequate groups using real-world datasets. To this end, the following two time series datasets were considered:
1. FORT series: Monthly Australian fortified wine sales (abbreviated to "FORT") in thousands of liters from January 1980 to July 1995, with 187 observations [14,31]. This time series is part of the dataset AustralianWine from the R package Rssa. Each of the seven variables of the full dataset contains 187 points. Since the data contain missing values after June 1994, we used the first 174 points.
2. Deaths series: Monthly accidental deaths in the USA from 1973 to 1978, with a total of 72 observations. This well-known time series dataset has been used by many authors and can be found in many time series books (see, for example, [32]). In this study, the USAccDeaths data from the R package datasets were used.
Figure 8 shows the time series plots of these datasets. It is evident that there is a seasonal structure in both time series. Furthermore, there is a downward linear trend in the FORT time series. The following steps provide a simple step-by-step procedure to group the elementary components of a time series by means of a hierarchical clustering method:
Step 1: Choosing the window length. One of the most important parameters in SSA is the window length (L). This parameter plays a pivotal role in SSA because the outputs of reconstruction and forecasting are affected by changing it. There are some general recommendations for the choice of the window length L. It should be sufficiently large, and the most detailed decomposition is achieved when L is close to half of the time series length (L ≈ N/2) [10]. Furthermore, in order to extract the periodic components of a time series with a known period P, window lengths divisible by the period P provide better separability [14].
In the FORT time series, which is periodic with the period P = 12, L = 84 meets these recommendations, since 84 is close to N/2 = 174/2 = 87 and L/P = 84/12 = 7. Furthermore, the Deaths time series is periodic with the period P = 12. Therefore, L = 36 satisfies these recommendations, because 36 is equal to N/2 = 72/2 and L/P = 36/12 = 3.
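The rule of thumb above, taking L close to N/2 and divisible by the period P, can be encoded directly. A small sketch with a hypothetical helper function:

```python
def choose_window_length(N, period):
    """Largest window length L <= N/2 that is divisible by the period."""
    return (N // 2 // period) * period

# The two choices made in the case studies follow from this rule:
assert choose_window_length(174, 12) == 84   # FORT series
assert choose_window_length(72, 12) == 36    # Deaths series
```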

Step 2: Determining the number of clusters
An important step in hierarchical clustering is the decision on where to cut the dendrogram. As in the simulation study of Section 3, here we use the number of clusters as the cutting rule for the dendrogram. If the purpose of the time series analysis is to extract the signal from the noise, determining the number of clusters is quite straightforward; it is sufficient to set the number of clusters equal to two. However, if we want to retrieve the different components concealed in a time series, such as the trend and the periodic components, determining the number of clusters requires more information. In this case, we recommend using the w-correlation matrix, since it is also the basis of the distance matrix used for hierarchical clustering. Figure 9 shows the matrix of the absolute values of the w-correlations between the 30 leading elementary reconstructed components of the FORT series. The matrix of the absolute values of the w-correlations between the 36 elementary reconstructed components of the Deaths series is depicted in Figure 10. In these figures, the white color corresponds to zero and the black color corresponds to absolute values equal to one. Highly correlated elementary reconstructed components can easily be found by inspecting the w-correlation matrix, and these components can then be placed in the same cluster. From Tables 4 and 5, it can be concluded that the ward.D and ward.D2 linkages have the worst performance in both the FORT and Deaths series. These findings are in good agreement with the simulation results obtained in Section 3. Figure 11 shows the dendrogram of the single linkage for the FORT series. The dendrogram of the centroid linkage for the Deaths series is depicted in Figure 12.
The reconstruction of the FORT series with the help of the first 13 eigentriples is presented in Figure 13. Furthermore, Figure 14 shows the plot of the signal reconstruction of the Deaths series based on the first 10 eigentriples.

Figure 11. Dendrogram of the single linkage for the FORT series.

Conclusions
In this paper, we conducted a comparison study in order to find a proper hierarchical clustering method for identifying appropriate groups at the grouping step of SSA. In general, the simulation results provided a clear indication that the ward.D and ward.D2 linkages could not detect meaningful groups. In addition, the choice of the best hierarchical clustering linkage is not straightforward: it depends on the structure of the time series and the level of noise that disturbs it. In general, the evidence from this investigation suggests using the single, centroid, average and mcquitty linkages to group eigentriples in the grouping step of SSA.
It should be noted that our study has a clear limitation: in the automatic identification of SSA components, it must be assumed that the components to be identified are (approximately) separated from the rest of the time series.

Conflicts of Interest:
The authors declare no conflict of interest.