Weighted z-Distance-Based Clustering and Its Application to Time-Series Data

Abstract: Clustering is the practice of dividing given data into similar groups and is one of the most widely used methods for unsupervised learning. Lee and Ouyang proposed a self-constructing clustering (SCC) method in which the similarity threshold, instead of the number of clusters, is specified in advance by the user. For a given set of instances, SCC performs only one training cycle on those instances. Once an instance has been assigned to a cluster, the assignment will not be changed afterwards. The clusters produced may depend on the order in which the instances are considered, and assignment errors are more likely to occur. Also, all dimensions are equally weighted, which may not be suitable in certain applications, e.g., time-series clustering. In this paper, improvements are proposed. Two or more training cycles on the instances are performed. An instance can be re-assigned to another cluster in each cycle. In this way, the clusters produced are less likely to be affected by the feeding order of the instances. Also, each dimension of the input can be weighted differently in the clustering process. The values of the weights are adaptively learned from the data. A number of experiments with real-world benchmark datasets are conducted, and the results demonstrate the effectiveness of the proposed ideas.


Introduction
In the field of artificial intelligence, clustering techniques play a very important role [1,2]. Clustering is an unsupervised learning technique whose purpose is to form meaningful clusters from the unlabeled data instances under consideration. Intuitively, similar instances are grouped in the same cluster and dissimilar instances in different clusters. Clustering has been widely utilized in a variety of applications, such as revealing the internal structure of the data [3], deriving segmentations of the data [4,5], preprocessing the data for other artificial intelligence (AI) techniques [6,7], business intelligence [1,8], and knowledge discovery in data [9,10]. For example, in electronic text processing [11][12][13], clustering is used to reduce the dimensionality in order to improve the efficiency of the processing, or to ease the curse of dimensionality encountered in high-dimensional problems. In recommendation applications in e-commerce [14], the size of the information matrix is reduced by clustering to enhance the efficiency of making recommendations. In power systems, clustering helps predict the trend of future electricity demand [15]. In stock market forecasting and social media data analysis [1,16], clustering is an indispensable core technique. Therefore, developing good clustering technology is a critical issue.

Lee and Ouyang proposed a self-constructing clustering (SCC) algorithm [20] which has been applied in various applications [4,6,13,14,16,55]. SCC is an exclusive clustering method. For a given training set, SCC performs only one training cycle on the training instances. Initially, no clusters exist. Training instances are considered one by one. If an instance is close enough to an existing cluster, the instance is assigned to the most suitable cluster. Otherwise, a new cluster is created and the instance is assigned to it. SCC offers several advantages. First, the algorithm runs through the training instances only once, so it is fast. Second, the distributions of the data are statistically characterized. Third, the similarity threshold, instead of the number of clusters, is specified in advance by the user. However, once an instance is assigned to a cluster, the assignment will not be changed afterwards. The clusters produced may depend on the order in which the instances are considered, and assignment errors are more likely to occur. As a result, the accuracy of the result can be low. Also, all dimensions are equally weighted in the clustering process, which may not be suitable in certain applications, e.g., time-series clustering [56][57][58].
In this paper, we propose improvements to the SCC method to overcome its shortcomings. Two or more training cycles on the instances are performed. In each cycle, training instances are considered one by one. An instance can be added into or removed from a cluster, and thus it is allowed to be re-assigned to another cluster. A desired number of clusters is obtained when all the assignments are stable, i.e., no assignment has been changed, in the current cycle. In this way, the clusters produced are less likely to be affected by the feeding order of the instances. Furthermore, each dimension can be weighted differently in the clustering process. The values of the weights are adaptively learned from the data. Having different weights is useful when certain relevance exists between different dimensions in many applications, e.g., clustering of time series data [56,57]. The effectiveness of the proposed ideas is demonstrated by a number of experiments conducted with real-world benchmark datasets.
The rest of this paper is organized as follows. SCC is briefly reviewed in Section 2. The proposed methods are described in Sections 3.1 and 3.2, respectively. Experimental results are presented in Section 4. Finally, a conclusion is given in Section 5.

Self-Constructing Clustering (SCC)

Suppose X = {x_i | 1 ≤ i ≤ N} is a finite set of N unlabeled training instances, where x_i = [x_{i,1}, ⋯, x_{i,n}] ∈ ℝ^n is the ith instance. Each instance is a vector with n features. SCC [20,55] does clustering in a progressive way. Only one training cycle on the instances is performed. Each cluster is described by a Gaussian-like membership function characterized by the center and deviation induced from the data assigned to the cluster. The instances are considered one by one sequentially, and clusters are created incrementally. Let J denote the number of currently existing clusters. Initially, no clusters exist and thus J = 0. When instance x_1 comes in, the first cluster, C_1, is created, instance x_1 is assigned to it, and J = 1. Then, for instance x_i, i ≥ 2, the z-distance between instance x_i and every existing cluster C_j is calculated by

Z(i, j) = \sum_{d=1}^{n} \left( \frac{x_{i,d} - c_{j,d}}{v_{j,d}} \right)^{2}, \quad j = 1, 2, ⋯, J, (1)

where c_j = [c_{j,1}, ⋯, c_{j,n}] and v_j = [v_{j,1}, ⋯, v_{j,n}] denote the mean and deviation, respectively, induced from the instances assigned to cluster C_j. The membership degree (MD) of instance x_i belonging to cluster C_j is defined as

M(i, j) = \exp(-Z(i, j)), (2)

with the value lying in the range of (0, 1]. Let S_j denote the number of instances assigned to cluster C_j. There are two cases:


• If all the MDs are less than ρ^n, where ρ is a pre-specified similarity threshold in one dimension, i.e.,

M(i, j) < \rho^{n} (3)

for all existing clusters C_j, 1 ≤ j ≤ J, a new cluster, C_{J+1}, is created and instance x_i is assigned to it by

c_{J+1} = x_i, \quad v_{J+1} = [v_0, ⋯, v_0], \quad S_{J+1} = 1, \quad J ← J + 1, (4)

where v_0 is a pre-specified constant.
• Otherwise, instance x_i is assigned to the cluster with the largest MD, say cluster C_a. The center c_a = [c_{a,1}, ⋯, c_{a,n}], deviation v_a = [v_{a,1}, ⋯, v_{a,n}], and size S_a of cluster C_a are updated by

\hat{c}_{a,d} = \frac{S_a\, c_{a,d} + x_{i,d}}{S_a + 1}, \quad \hat{v}_{a,d} = \sqrt{\frac{(S_a - 1)(v_{a,d} - v_0)^2 + S_a\, c_{a,d}^{2} + x_{i,d}^{2}}{S_a} - \frac{S_a + 1}{S_a}\,\hat{c}_{a,d}^{\,2}} + v_0, \quad S_a ← S_a + 1, (5)

for d = 1, ⋯, n, where \hat{c}_{a,d} and \hat{v}_{a,d} denote the updated center and deviation. Note that J is not changed in the latter case. When all the instances have been considered, SCC stops with J clusters. Algorithm 1 is summarized below.
Algorithm 1. SCC
J ← 0;
for each instance x_i, 1 ≤ i ≤ N
    Compute Z(i, j) and M(i, j), 1 ≤ j ≤ J, by Equations (1) and (2);
    if M(i, j) < ρ^n for 1 ≤ j ≤ J then
        Create a new cluster according to Equation (4);
    else
        Instance x_i is assigned to the cluster with the largest MD, according to Equation (5);
    end if
end for
end SCC

Note that SCC takes the training set X as input and outputs J clusters C_1, ⋯, C_J. SCC runs fast since it runs through the training instances only once. Unlike K-means, the similarity threshold, instead of the number of clusters, is specified in advance by the user.
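To make the one-pass procedure concrete, the following is a minimal Python sketch of an SCC training cycle. The paper's experiments used Matlab code; the function name scc_pass, the parameter names rho and v0, and the recomputation of each cluster's deviation from its current members are our own illustrative choices, not the authors' implementation.

import numpy as np

def scc_pass(X, rho=0.5, v0=0.2, clusters=None):
    """One SCC training cycle (sketch): assign each instance to the most
    suitable cluster by membership degree, or create a new cluster."""
    n = X.shape[1]
    if clusters is None:
        clusters = []                                 # each cluster: members, center, deviation
    for i, x in enumerate(X):
        best_j, best_md = -1, -1.0
        for j, c in enumerate(clusters):
            z = np.sum(((x - c["center"]) / c["dev"]) ** 2)   # z-distance, cf. Eq. (1)
            md = np.exp(-z)                                   # membership degree, cf. Eq. (2)
            if md > best_md:
                best_j, best_md = j, md
        if best_md < rho ** n:                                # test of Eq. (3)
            clusters.append({"members": [i], "center": x.copy(),
                             "dev": np.full(n, v0)})          # new cluster, cf. Eq. (4)
        else:
            c = clusters[best_j]                              # join the cluster with the largest MD
            c["members"].append(i)
            pts = X[c["members"]]
            c["center"] = pts.mean(axis=0)                    # recompute center and deviation, cf. Eq. (5)
            c["dev"] = pts.std(axis=0, ddof=1) + v0
    return clusters

For instance, scc_pass(np.random.rand(100, 4)) groups 100 four-dimensional instances in a single pass and returns the induced clusters.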

Proposed Methods
The clusters produced by SCC may depend on the feeding order of the instances, and the accuracy of the result can be low. Furthermore, all dimensions are equally weighted in the clustering process, which may not be appropriate in certain applications, e.g., time-series clustering. In this section, two improvements, SCC-I and SCC-IW, are proposed to address these issues.

Iterative SCC (SCC-I)
For convenience, the proposed approach is abbreviated as SCC-I, standing for the iterative version of SCC. SCC-I consists of multiple iterations. An instance is allowed to be re-assigned to another cluster. The clustering work stops when all the assignments are stable, i.e., no assignment will be changed. A training cycle on the instances is performed in each iteration.
In the first iteration, SCC is applied. Consider the rth iteration, r ≥ 2. For any instance x_i, 1 ≤ i ≤ N, we first remove it from the cluster, say C_b, to which it is currently assigned. Three cases may arise, depending on the number of instances remaining in C_b after the removal, and the existing clusters are updated accordingly. Then, we calculate the z-distance and MD between x_i and each existing cluster by Equations (1) and (2). A new cluster may be created and x_i assigned to it, following Equation (4), or x_i is assigned to the cluster with the largest MD, following Equation (5).
The current rth iteration ends when all the instances have been run through. If any cluster assignment has been changed in this iteration, the next iteration, i.e., the (r+1)th iteration, proceeds. Otherwise, the assignments are stable and SCC-I stops with J clusters. The SCC-I procedure is summarized in Algorithm 2 below.
Algorithm 2. SCC-I
Perform SCC on X;
repeat
    for each instance x_i, 1 ≤ i ≤ N
        Remove instance x_i from its cluster and update the existing clusters;
        Compute Z(i, j) and M(i, j), 1 ≤ j ≤ J;
        Create a new cluster or assign instance x_i to the cluster with the largest MD;
    end for
until assignments are stable;
end SCC-I

Note that SCC-I takes the training set X as input and outputs J clusters C_1, ⋯, C_J. Since re-assignments are allowed, SCC-I produces clusters that are less likely to be affected by the feeding order of the instances. As a result, more stable clusters can be obtained by SCC-I. By "more stable" we mean that both the number of clusters and the clusters themselves are less likely to be affected by the feeding order of the instances. However, SCC runs faster than SCC-I, since SCC-I performs more than one iteration. To illustrate the advantage of SCC-I over SCC, a simple example is given in Appendix A.
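Building on the sketch above, the iterative re-assignment loop of SCC-I could look as follows. This is again a hedged sketch: the helper _refit, the max_iter safety cap, and the handling of a cluster that becomes empty after a removal are our own simplifications, not the authors' code.

import numpy as np   # reuses scc_pass from the previous sketch

def scc_i(X, rho=0.5, v0=0.2, max_iter=50):
    """Iterative SCC (SCC-I) sketch: repeat re-assignment cycles until no
    instance changes its cluster assignment."""
    n = X.shape[1]
    clusters = scc_pass(X, rho, v0)                     # first iteration: plain SCC
    for _ in range(max_iter):
        changed = False
        for i, x in enumerate(X):
            old = next(c for c in clusters if i in c["members"])
            old["members"].remove(i)                    # remove instance i from its cluster
            if old["members"]:
                _refit(old, X, v0)                      # update the cluster it left
            else:
                clusters.remove(old)                    # the cluster became empty: delete it
            # re-assign instance i exactly as in one SCC step
            mds = [np.exp(-np.sum(((x - c["center"]) / c["dev"]) ** 2)) for c in clusters]
            if not mds or max(mds) < rho ** n:          # create a new cluster, cf. Eq. (4)
                clusters.append({"members": [i], "center": x.copy(), "dev": np.full(n, v0)})
                changed |= bool(old["members"])         # re-creating a deleted singleton is not a change
            else:
                new = clusters[int(np.argmax(mds))]     # join the cluster with the largest MD, cf. Eq. (5)
                new["members"].append(i)
                _refit(new, X, v0)
                changed |= new is not old
        if not changed:                                 # assignments are stable: stop
            break
    return clusters

def _refit(c, X, v0):
    """Recompute a cluster's center and deviation from its current members."""
    pts = X[c["members"]]
    c["center"] = pts.mean(axis=0)
    c["dev"] = pts.std(axis=0, ddof=1) + v0 if len(pts) > 1 else np.full(X.shape[1], v0)

The loop ends as soon as a full cycle produces no change of assignment, mirroring the stability test of Algorithm 2; max_iter is only a safeguard for this sketch.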

Weighted SCC-I (SCC-IW)
In SCC and SCC-I, each dimension is equally weighted in the calculation of z-distances, as shown in Equation (1). For some applications, e.g., time-series clustering, where certain relevance exists between different dimensions, allowing different dimensions to be weighted differently could be a useful idea [56]. This motivates the development of the weighted SCC-I, abbreviated as SCC-IW.
Let the weighted z-distance between instance x_i and cluster C_j, 1 ≤ i ≤ N, 1 ≤ j ≤ J, be defined as

Z_w(i, j) = \sum_{d=1}^{n} w_{j,d} \left( \frac{x_{i,d} - c_{j,d}}{v_{j,d}} \right)^{2}, (7)

where w_j = [w_{j,1}, ⋯, w_{j,n}] is the weight vector associated with C_j, w_{j,d} ≥ 0 for 1 ≤ d ≤ n, and w_{j,1} + ⋯ + w_{j,n} = 1. Accordingly, the MD of instance x_i belonging to C_j is defined as

M_w(i, j) = \exp(-Z_w(i, j)). (8)

Clearly, we have 0 < M_w(i, j) ≤ 1. If

M_w(i, j) < \rho (9)

for all existing clusters C_j, 1 ≤ j ≤ J, a new cluster is created; otherwise, instance x_i is assigned to the cluster with the largest MD, as described before.
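As a small illustration (the function and variable names are our own), the weighted z-distance and MD of Equations (7) and (8) amount to a few lines:

import numpy as np

def weighted_md(x, center, dev, w):
    """Membership degree with per-dimension weights (sketch of Eqs. (7)-(8))."""
    zw = np.sum(w * ((x - center) / dev) ** 2)   # weighted z-distance, Eq. (7)
    return np.exp(-zw)                           # weighted MD, Eq. (8)

# With equal weights w = [1/n, ..., 1/n], the weighted MD equals the unweighted MD
# raised to the power 1/n, so the test M(i,j) < rho**n of Eq. (3) becomes
# Mw(i,j) < rho, which is the test of Eq. (9); see Remark 1 below.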

Remark 1. In Equation (3), the test is M(i, j) < ρ^n, where ρ is the pre-specified similarity threshold in one dimension. Note that for w_{j,d} = 1/n, 1 ≤ d ≤ n, we have Z_w(i, j) = Z(i, j)/n and thus M_w(i, j) = M(i, j)^{1/n}. Hence, for M(i, j) < ρ^n, we have M_w(i, j) < ρ. This is Equation (9).
Now the problem left is how the weights are determined to optimize the clustering under consideration. Here, we consider time-series clustering as an exemplar application [56]. Firstly, it is required that the instances assigned to each cluster be as close together as possible. In other words, we want to maximize

\sum_{j=1}^{J} \sum_{i=1}^{N} u_{i,j}\, M_w(i, j), (10)

where u_{i,j} = 1 if instance x_i is assigned to cluster C_j and u_{i,j} = 0 otherwise, for 1 ≤ i ≤ N and 1 ≤ j ≤ J, and M_w(i, j) is the MD defined in Equation (8). Note that M_w(i, j) is an exponential function which is non-linear in w_{j,1}, ⋯, w_{j,n}, so maximizing Equation (10) is hard. However, maximizing M_w(i, j) is identical to minimizing Z_w(i, j). Therefore, instead of maximizing Equation (10), we minimize

\sum_{j=1}^{J} \sum_{i=1}^{N} u_{i,j}\, Z_w(i, j). (11)

Since Z_w(i, j) is linear in w_{j,1}, ⋯, w_{j,n}, minimizing Equation (11) is a kind of linear optimization which is much easier. Secondly, since neighboring dimensions are next to each other on the time line, the weights of neighboring dimensions should be close to each other. Therefore, we also want to minimize

\sum_{j=1}^{J} \sum_{d=1}^{n-1} \left( w_{j,d} - w_{j,d+1} \right)^{2}. (12)
Combining Equations (11) and (12), together with the constraints on the weights, we would like the weights to minimize

\sum_{j=1}^{J} \sum_{i=1}^{N} u_{i,j}\, Z_w(i, j) + \alpha \sum_{j=1}^{J} \sum_{d=1}^{n-1} \left( w_{j,d} - w_{j,d+1} \right)^{2} (13)

subject to \sum_{d=1}^{n} w_{j,d} = 1 and w_{j,d} ≥ 0 for 1 ≤ j ≤ J and 1 ≤ d ≤ n, which, by Equation (7), is equivalent to minimizing

\sum_{j=1}^{J} \sum_{i=1}^{N} \sum_{d=1}^{n} u_{i,j}\, w_{j,d} \left( \frac{x_{i,d} - c_{j,d}}{v_{j,d}} \right)^{2} + \alpha \sum_{j=1}^{J} \sum_{d=1}^{n-1} \left( w_{j,d} - w_{j,d+1} \right)^{2} (14)

subject to the same constraints, where α is a positive real constant. Through quadratic programming, optimal values for the weights can be derived from Equation (14). Now we are ready to present the SCC-IW algorithm. We adopt Z_w(i, j) and M_w(i, j), instead of Z(i, j) and M(i, j), in SCC-IW. Also, whenever a new cluster is created, its weights are each initialized to 1/n. At the end of the current iteration, we minimize Equation (14) to find the optimal weights, which are then used in the next iteration. The SCC-IW procedure is summarized in Algorithm 3 below.

Algorithm 3. SCC-IW
Perform SCC on X with the weighted z-distance, and initialize the weights for each newly created cluster;
repeat
    for each instance x_i, 1 ≤ i ≤ N
        Remove instance x_i from its cluster and update the existing clusters;
        Compute Z_w(i, j) and M_w(i, j), 1 ≤ j ≤ J;
        Create a new cluster or assign instance x_i to the cluster with the largest MD, and initialize the weights for each newly created cluster;
    end for
    Derive optimal weights by solving Equation (14) through quadratic programming;
until assignments are stable;
end SCC-IW

Note that SCC-IW takes the training set X as input and outputs J clusters C_1, ⋯, C_J.
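For concreteness, the per-cluster subproblem of Equation (14) can be handed to an off-the-shelf constrained optimizer. The sketch below uses scipy's SLSQP routine and made-up example values; the paper itself solves Equation (14) by quadratic programming in Matlab, so this is only an illustration of the objective and constraints, not the authors' solver.

import numpy as np
from scipy.optimize import minimize

def update_weights(members_X, center, dev, alpha):
    """Optimal weights for one cluster: minimize its part of Eq. (14), i.e.,
    the sum of weighted z-distances plus a smoothness penalty on neighboring
    weights, subject to the weights being nonnegative and summing to 1."""
    n = center.shape[0]
    # b[d] = sum over the cluster's instances of ((x_d - c_d)/v_d)^2
    b = np.sum(((members_X - center) / dev) ** 2, axis=0)

    def objective(w):
        smooth = np.sum(np.diff(w) ** 2)              # sum_d (w_d - w_{d+1})^2
        return b @ w + alpha * smooth

    w0 = np.full(n, 1.0 / n)                          # start from equal weights
    res = minimize(objective, w0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * n,
                   constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}])
    return res.x

# Example with assumed values: three length-5 instances in one cluster, alpha = 0.5.
pts = np.array([[1.0, 2.0, 3.0, 2.0, 1.0],
                [1.1, 2.1, 3.2, 2.0, 0.9],
                [0.9, 1.8, 2.9, 2.1, 1.2]])
w = update_weights(pts, pts.mean(axis=0), pts.std(axis=0, ddof=1) + 0.2, alpha=0.5)
print(np.round(w, 3), w.sum())   # weights are nonnegative and sum to 1

Dimensions on which the cluster's instances agree closely receive larger weights, while the alpha term keeps neighboring weights from differing sharply, as described above.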
It is not surprising that SCC-IW can work well. In each iteration, optimal weights are derived. For a dimension which is more useful for clustering, it is more important and is, therefore, given a larger weight. To illustrate how SCC-IW works, a simple example is given in Appendix B.

Experimental Results
In this section, we demonstrate empirically the superiority of our proposed methods. The proposed methods and others are applied to do clustering on benchmark datasets. Three external measures of evaluation, Fscore, Rand Index (RI), and Normalized Mutual Information (NMI) [59], and another three internal measures, Dunn index (DI), Davies-Bouldin index (DBI), and Silhouette index (SI) [60], are adopted.
The Fscore is defined as

Fscore = \sum_{k=1}^{K} \frac{N_k}{N} \max_{1 \le j \le J} \frac{2\, n_{k,j}}{N_j + N_k},

where K is the number of classes, J is the number of clusters, N is the size of the entire data set, n_{k,j} is the number of data instances belonging to class k in cluster j, N_j is the size of cluster j, and N_k is the size of class k. A higher Fscore is better.
The Rand Index is defined as

RI = \frac{a + b}{N(N-1)/2},

where a is the number of pairs of data instances having different class labels and belonging to different clusters, b is the number of pairs of data instances having the same class labels and belonging to the same clusters, and N is the size of the entire data set. A higher RI is better.
The Normalized Mutual Information is defined as

NMI = \frac{\sum_{k=1}^{K} \sum_{j=1}^{J} n_{k,j} \log \frac{N\, n_{k,j}}{N_j N_k}}{\sqrt{\left( \sum_{j=1}^{J} N_j \log \frac{N_j}{N} \right) \left( \sum_{k=1}^{K} N_k \log \frac{N_k}{N} \right)}},

where K is the number of classes, J is the number of clusters, N is the size of the entire data set, n_{k,j} is the number of data instances belonging to class k in cluster j, N_j is the size of cluster j, and N_k is the size of class k. A higher NMI is better.
The Dunn index is defined as

DI = \frac{\min_{1 \le p < q \le J} d(C_p, C_q)}{\max_{1 \le j \le J} \mathrm{diam}(C_j)},

where d(C_p, C_q) is the minimum distance between clusters C_p and C_q, and diam(C_j) is the largest distance between the instances contained in cluster C_j. A higher DI is better.
The Davies-Bouldin index is defined as

DBI = \frac{1}{J} \sum_{j=1}^{J} \max_{p \ne j} \frac{\mathrm{avg}(C_j) + \mathrm{avg}(C_p)}{d(c_j, c_p)},

where avg(C) is the average of the distances between the instances contained in cluster C and the center of C, and d(c_j, c_p) is the distance between the centers of clusters C_j and C_p. A lower DBI is better.
The Silhouette index is defined as

SI = \frac{1}{N} \sum_{i=1}^{N} \frac{b(i) - a(i)}{\max\{a(i), b(i)\}},

where a(i) is the average of the distances between instance x_i and all other instances within the same cluster, b(i) is the lowest average distance of instance x_i to all instances in any other cluster of which instance x_i is not a member, and N is the size of the entire data set. A higher SI is better.
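For reference, the three external measures can be computed from ground-truth class labels and predicted cluster labels as in the following sketch, which is a direct transcription of the formulas above rather than the evaluation code used in the paper; NMI is also available as sklearn.metrics.normalized_mutual_info_score.

import numpy as np
from itertools import combinations

def external_measures(classes, clusters):
    """Fscore, Rand Index, and NMI from class labels and cluster labels."""
    classes, clusters = np.asarray(classes), np.asarray(clusters)
    N = len(classes)
    ks, js = np.unique(classes), np.unique(clusters)
    # contingency table: n[k, j] = number of instances of class k in cluster j
    n = np.array([[np.sum((classes == k) & (clusters == j)) for j in js] for k in ks])
    Nk, Nj = n.sum(axis=1), n.sum(axis=0)

    # Fscore: class-size-weighted best per-class F-measure, F(k,j) = 2 n_kj / (N_j + N_k)
    fscore = np.sum(Nk / N * np.max(2 * n / (Nj[None, :] + Nk[:, None]), axis=1))

    # Rand Index: pairs on which class labels and cluster labels agree, over all pairs
    agree = sum((classes[p] == classes[q]) == (clusters[p] == clusters[q])
                for p, q in combinations(range(N), 2))
    ri = agree / (N * (N - 1) / 2)

    # NMI with the geometric-mean normalization
    nz = n > 0
    mi = np.sum(n[nz] * np.log(N * n[nz] / (Nk[:, None] * Nj[None, :])[nz]))
    denom = np.sqrt(np.sum(Nj * np.log(Nj / N)) * np.sum(Nk * np.log(Nk / N)))
    nmi = mi / denom
    return fscore, ri, nmi

print(external_measures([0, 0, 1, 1, 1], [1, 1, 1, 0, 0]))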

Non-Time-Series Datasets
To illustrate the effectiveness of SCC-I, fourteen benchmark non-time-series datasets are selected from the UCI repository [61] for the experiments. The characteristics of these datasets are shown in Table 1. For example, there are 569 instances in the Breast dataset; each instance has 30 features, or dimensions, and belongs to one of 2 classes. For each dataset, an instance belongs to one and only one class. We compare SCC-I with K-means [62], DSKmeans [59], Fuzzy C-means (FCM) [63], the Gaussian mixture model (Gmm) [64], DBSCAN [65], and SCC [20]. The codes for K-means, DBSCAN, and FCM are adopted from Matlab [66], and the code for Gmm is adopted from [67]. We wrote the codes for DSKmeans, SCC, and SCC-I in Matlab.

Table 2 shows comparisons of Fscore, RI, and NMI among the different methods for each dataset. To have fair comparisons among the methods, the number of clusters is tuned to be identical to the number of classes for each dataset. For K-means, the number of clusters, k, is set equal to the number of classes. For DSKmeans, its two parameters are set to values between 0.01 and 5 and between 0.01 and 0.3, respectively. For SCC and SCC-I, ρ and v_0 are set to values between 0.1 and 0.95. Also, each method performed 25 runs on a dataset and the averaged result is shown. For K-means and DSKmeans, each run started with a different set of initial seeds. For SCC and SCC-I, each run was given a different feeding order of the training instances. For FCM, the maximum number of iterations is set to 50. For DBSCAN, the neighborhood radius ε is set to a value between 0.1 and 3, while the minimum number of neighbors required for a core point, minpts, is set to a value between 1 and 10. In addition to the values of the measures, the performance ranking is also indicated at the right side of '/' for each dataset in Table 2. For example, consider the Breast dataset. The Fscore values are 0.9270, 0.8710, 0.9274, 0.7470, 0.7983, and 0.9264 by K-means, DSKmeans, FCM, Gmm, DBSCAN, SCC, and SCC-I, respectively. FCM has the best value, 0.9274, so it ranks first, indicated by a 1 at the right side of '/'; K-means has the second best value, 0.9270, so it ranks second, indicated by a 2 at the right side of '/'; etc. From this table, we can see that (1) SCC-I outperforms SCC significantly, and (2) SCC-I is no less effective than the other methods. Table 3 shows the averaged ranking over all 14 datasets for each method. As can be seen, SCC-I is the best in Fscore and NMI, indicated by boldfaced numbers, and is the second best in RI.
Although the overall ranking of SCC-I is better than the others, there are some variations in the individual results for each dataset. For example, K-means outperforms SCC-I on Heart, Ionosphere, and Seeds. Compared with K-means, SCC-I has two advantages: (1) SCC-I considers the deviation in the computation of distance; (2) SCC-I allows ellipsoidal cluster shapes. Note that SCC-I is less affected by the feeding order of instances, and thus can give a more stable and accurate clustering than SCC. However, given the number of clusters, the clusters obtained by K-means are not affected by the feeding order of instances at all. For datasets with ellipsoidal clusters, SCC-I is more likely to perform better; by contrast, for datasets with spherical clusters, SCC-I may be inferior to K-means.

Table 2 also shows the execution time, in seconds, of each method on each dataset. The computer used for running the codes is equipped with an Intel(R) Core(TM) i7-4770 CPU at 3.40 GHz, 16 GB RAM, and Matlab R2011b. The times shown in the table only provide an idea of how efficiently these methods can run. Note that SCC-I runs slower than SCC: SCC performs only one training cycle on the instances, while SCC-I requires two or more training cycles. SCC-I also takes more training time than the other baselines, for several reasons: (1) the codes of these baselines, e.g., K-means and FCM, were adopted from established websites, while the code for SCC-I was written by graduate students; (2) SCC-I has to compute z-distances and Gaussian values, which is more computationally expensive; (3) in order to do re-assignment, the operation of removing instances from clusters is performed during the clustering process.

Comparisons of DBI, DI, and SI among the different methods for each dataset are shown in Table 4. As can be seen from the table, SCC-I is better than the other methods. SCC-I gets the lowest DBI for 9 out of 14 datasets, the highest DI for 9 out of 14 datasets, and the highest SI for 7 out of 14 datasets.

Now we use the paired t-test [68] to test whether the differences between SCC-I and the other methods are statistically significant. Table 5 shows the t-values based on the values of Fscore, NMI, and DBI, respectively, under the 90% confidence level. Note that 14 datasets are involved, so the degree of freedom is 13 and the corresponding threshold is 1.771. From Table 5, we can see that all the t-values are greater than 1.771, except for Gmm with NMI. Therefore, by statistical test we observe that SCC-I shows significantly better performance than the other methods. A multiple comparison test may also be used, since multiple algorithms are involved; analysis of variance (ANOVA) [69] provides such tests. We have tried ANOVA in two ways. Firstly, we used ANOVA to return a structure that can be used to determine which pairs of algorithms are significantly different; results similar to those shown in Table 5 were obtained. Secondly, we used ANOVA to test the null hypothesis that all algorithms perform equally well against the alternative hypothesis that at least one algorithm is different from the others. A p-value smaller than 0.05 indicates that at least one of the algorithms is significantly different from the others, although it is not certain which ones differ from each other.
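The paired t-test reported above is straightforward to reproduce; a minimal sketch with synthetic stand-in scores (not values from Table 2) is shown below.

import numpy as np
from scipy.stats import ttest_rel

# Stand-in per-dataset scores for two methods over 14 datasets (synthetic, not from the paper).
rng = np.random.default_rng(0)
scc_i = rng.uniform(0.70, 0.95, size=14)
kmeans = scc_i - rng.uniform(0.00, 0.05, size=14)   # a slightly worse competitor

t, p = ttest_rel(scc_i, kmeans)              # paired t-test over the 14 datasets, df = 13
print(round(float(t), 3), round(float(p), 4), t > 1.771)   # 1.771: critical t-value, df = 13, 90% confidence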

Time-Series Datasets
Next, we show the effectiveness of SCC-IW in clustering time-series data. Ten benchmark time-series datasets are taken from the UCR repository [70] for the experiments. The characteristics of these datasets are shown in Table 6. In addition to the previous methods, we also compare with TSKmeans [56] here; we wrote the code for TSKmeans in Matlab.

Table 6. Characteristics of the 10 time-series datasets.

Dataset        Instances  Length  Classes
SynControl     600        60      6
Coffee         56         286     2
Light7         143        319     7
OSU_Leaf       442        427     6
Sony_Surface   621        70      2
Trace          200        275     4
CBF            930        128     3
ECGFiveDays    884        136     2
FaceFour       350        112     4
OliveOil       60         570     4

Tables 7 and 8 show the Fscore, RI, and NMI obtained by the different methods for each dataset. Again, each method performed 25 runs on a dataset and the averaged result is shown. From these two tables, we can see that (1) SCC-IW outperforms both SCC and SCC-I significantly, and (2) SCC-IW is no less effective than the other methods. Table 8 shows the averaged ranking over all 10 time-series datasets for each method. As can be seen, SCC-IW is the best in Fscore, RI, and NMI. However, SCC-IW runs slower than SCC and SCC-I, since weights are involved in SCC-IW and they have to be optimally updated in each training cycle.

Comparisons of DBI, DI, and SI among the different methods for each dataset are shown in Table 9. As can be seen from the table, SCC-I is better than the other methods: SCC-I gets the lowest DBI for 8 out of 10 datasets, the highest DI for 5 out of 10 datasets, and the highest SI for 4 out of 10 datasets. The main reason that SCC-IW outperforms all the other competing methods in Table 10 is the consideration of both weights and deviations in SCC-IW. Gmm, SCC, and SCC-I do not use weights. K-means, DSKmeans, and FCM use neither deviations nor weights in the clustering process. TSKmeans considers weights, but deviations are not involved.

Now we use the paired t-test [68] to test whether the differences between SCC-IW and the other methods are statistically significant. Table 11 shows the t-values based on the values of Fscore, RI, and DBI, respectively, at the 90% confidence level. Note that 10 datasets are involved, so the degree of freedom is 9 and the corresponding threshold is 1.833. From the table, we can see that all the t-values are greater than 1.833, except for DSKmeans and FCM with RI. Therefore, by statistical test we observe that SCC-IW shows significantly better performance than the other methods.

Comparisons with Other Methods
Recently, evolutionary algorithms, e.g., simulated annealing and differential evolution, have been proposed to perform clustering [71]. Evolutionary algorithms can perform clustering using either a fixed or a variable number of clusters and find the clustering that is optimal with respect to a certain validity index. Either a population of solutions or only one solution can be used. Single-solution-based evolutionary algorithms have a smaller evaluation count, but their solution quality is usually not as good as that of population-based ones.
Siddiqi and Sait [72] propose a heuristic for data-clustering problems, abbreviated as HDC. It comprises two parts, a greedy algorithm and a single-solution-based heuristic. The first part selects the data points that can act as the centroids of well-separated clusters. The second part performs clustering with the objective of optimizing a cluster validity index. The heuristic consists of five main components: (1) genes; (2) fitness of genes; (3) selection; (4) mutation; and (5) diversification. The objective functions used in the heuristic are the Calinski-Harabasz index and the Dunn index. Zhang et al. [73] propose a clustering algorithm, called ICFSKM, for clustering large dynamic data in the industrial IoT. Two cluster operations, cluster creating and cluster merging, are defined to integrate the current pattern into the previous one for the final clustering result. Also, k-medoids is used for modifying the clustering centers according to the newly arriving objects. Table 12 shows a DI comparison between SCC-I and HDC, and Table 13 shows an NMI comparison between SCC-I and ICFSKM, for some datasets. In these tables, the datasets Balance scale, Banknote authentication, Landsat satellite, Pen-based digits, Waveform-5000, and Wine are also selected from the UCI repository [61]. The results for HDC and ICFSKM are copied directly from [72,73]. We can see that SCC-I is comparable to HDC and ICFSKM. Note that HDC and ICFSKM are goal-oriented; for example, the objective functions used in HDC include the Dunn index, so DI values can be optimized by HDC intentionally. Furthermore, the objective function is usually computationally intensive, and evolutionary algorithms are considered to be slow.

Setting of α
In Equation (14), the difference between the weights of neighboring dimensions is controlled by α. For α = 0, no constraint is imposed on the weight differences. As α increases, neighboring dimensions are forced to be increasingly equally weighted, and SCC-IW behaves more and more like SCC-I. Therefore, the setting of α can affect the performance of SCC-IW. In [56], a constant defined in terms of a sum over the data is introduced, and it is shown empirically that the performance of TSKmeans varies with the value of this constant [56].

Conclusions
SCC is an exclusive clustering method, performing only one training cycle on the training instances. Clusters are created incrementally. However, the clusters produced may depend on the feeding order in which the instances are considered, and assignment errors are more likely to occur. Also, all dimensions are equally weighted in the clustering process, which may not be suitable in certain applications, e.g., time-series clustering. We have presented two improvements, SCC-I and SCC-IW. SCC-I performs two or more training cycles iteratively, and allows instances to be re-assigned afterwards. In this way, the clusters produced are less likely to be affected by the feeding order of the instances. On the other hand, SCC-IW allows each dimension to be weighted differently in the clustering process. The values of the weights are adaptively learned from the data. Experiments have shown that SCC-IW performs effectively in clustering time-series data.
SCC-I and SCC-IW take more training time because (1) they have to compute z-distances and Gaussian values, which is computationally expensive, and (2) in order to do re-assignment, the operation of removing instances from clusters is performed during the clustering process. We will investigate these issues to reduce the training time in the future. Spectral clustering [74] and multidimensional scaling [75] deal with the extraction of new features from the original ones. SCC-IW probably has some relationship with these methods, since SCC-IW is also somehow able to extract new "axes or principal components" through the adaptation of the weights. It will be interesting to explore such a relationship in the future.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
A simple example is given here to illustrate how SCC-I works. Suppose X has 12 training instances, shown in Figure A1a and marked as circles; note that N = 12 and n = 2. Let ρ = 0.55 and v_0 = 0.2. Below, we consider two feeding orders of the instances.
For the first feeding order, after performing SCC in the first iteration there are six clusters, C_1 through C_6, as shown in Figure A1b; the clusters are numbered and wrapped in dashed contours, with their centers marked with crosses. After the second iteration, there are 4 clusters, as shown in Figure A1c. After the third iteration, there are 3 clusters, as shown in Figure A1d. In the fourth iteration, no assignment is changed. Therefore, SCC-I stops with the three clusters shown, with 5, 3, and 4 instances assigned to them, respectively.


For the second feeding order, after performing SCC in the first iteration there are 5 clusters, as shown in Figure A2a. Iterations 2 and 3 are performed subsequently. After the fourth iteration, there are 3 clusters, as shown in Figure A2b. The cluster assignments are then stable, so SCC-I stops with the three clusters shown, with 4, 5, and 3 instances assigned to them, respectively.
Note that SCC produces 6 clusters (Figure A1b) with the first feeding order and 5 clusters (Figure A2a) with the second, and the two sets of clusters are different. However, SCC-I produces 3 clusters (Figure A1d and Figure A2b) with both feeding orders, and the two sets of clusters are essentially the same. Clearly, the clusters obtained by SCC-I are more stable and reasonable.

Appendix B
Another simple example is given here to illustrate how SCC-IW works. Suppose X has 15 training time-series instances, shown in Figure A3a. By SCC, 7 clusters, C_1, ..., C_7, are obtained, with sizes 3, 1, 5, 1, 1, 3, and 1, respectively; four of them are singletons. By SCC-I, convergence is achieved in the 3rd iteration, with 4 clusters of sizes 5, 5, 4, and 1, respectively. By SCC-IW, convergence is also achieved in the 3rd iteration, but with only 3 clusters, each containing 5 instances. The weights learned for the three clusters are depicted in Figure A3d. Intuitively, the clustering done by SCC-IW is the most suitable. Each of the three clusters groups instances that have similar time samplings over a particular range of indices, namely 2 to 4, 4 to 6, and 1 to 3, respectively, which is manifested by large weights at those indices.