Comparison of Internal Clustering Validation Indices for Prototype-Based Clustering

: Clustering is an unsupervised machine learning and pattern recognition method. In general, in addition to revealing hidden groups of similar observations and clusters, their number needs to be determined. Internal clustering validation indices estimate this number without any external information. The purpose of this article is to evaluate, empirically, characteristics of a representative set of internal clustering validation indices with many datasets. The prototype-based clustering framework includes multiple, classical and robust, statistical estimates of cluster location so that the overall setting of the paper is novel. General observations on the quality of validation indices and on the behavior of different variants of clustering algorithms will be given.


Introduction
Clustering aims to partition a given dataset (a set of observations) into groups (clusters) that are separated from other groups in a twofold manner: observations within a cluster are similar to each other and dissimilar to observations in other clusters [1].Diverse sets of clustering approaches have been developed over the years, e.g., density-based, probabilistic, grid-based, and spectral clustering [2].However, the two most common groups of crisp (here, we do not consider fuzzy clustering [3]) clustering algorithms are partitional and hierarchical clustering [4].Hierarchical clustering constructs a tree structure from data to present layers of clustering results, but because of the pairwise distance matrix requirement, the basic form of the method is not scalable to a large volume of data [5].Moreover, many clustering algorithms, including hierarchical clustering, can produce clusters of arbitrary shapes in the data space, which might be difficult to interpret for knowledge discovery [6].
The two aims of clustering for K groups in data are approached in the partitional algorithms, most prominently in the classical K-means [4,7], by using two main phases: initial generation of K prototypes and local refinement of the initial prototypes.The initial prototypes should be separated from each other [4,8].Lately, the K-means++ algorithm [9], where the random initialization is based on a density function favoring distinct prototypes, has become the most popular variant to initialize the K-means-type of an algorithm.Because the prototype refinement acts locally, we need a globalization strategy to explore the search space.This can be accomplished with repeated restarts through initial prototype regeneration [10] or by using evolutionary approaches with a population of different candidate solutions [11].
One can utilize different error (score) functions in partitional clustering algorithms [12].Mean is the statistical estimate of the cluster prototype in K-means and the clustering error is measured with the least-squares residual.This implies the assumption of spherically symmetric, normally distributed data with Gaussian noise.These conditions are relaxed when the cluster prototype is replaced, e.g., with a robust location estimate [13][14][15].The two simplest robust estimates of location are median and spatial median, whose underlying spherically symmetric distributions are uniform and Laplace distributions, respectively.If the type of data is discrete, for instance, an integer variable with uniform quantization error [16], then the Gaussian assumption is not valid.Median, given by the middle value of the ordered univariate sample (unique only for odd numbers of points [17]), can, like the mean, be estimated from the marginal distribution being inherently univariate.The spatial median, on the other hand, is truly a multivariate, orthogonally equivariant location estimate [18].These location estimates and their intrinsic properties are illustrated and more thoroughly discussed in [17,19].The median and spatial median have many attractive statistical properties, especially since their so-called breakdown point is 0.5, i.e., they can handle up to 50% of contaminated and erroneous data.
In a typical unsupervised scenario, one does not possess any prior knowledge of the number of clusters K. Finding the best possible representation of data with K groups is difficult because the number of all possible groupings is the sum of Stirling numbers of the second kind [19].Defining validation measures for clustering results has been, therefore, a challenging problem that different approaches have tried to overcome [20][21][22][23][24][25].The quality of a clustering result can be measured with a Clustering Validation Index (CVI).The aim of a CVI is to estimate the most appropriate K based on the compactness and separation of the clusters.Validation indices can be divided into three categories [26]: internal, external, and relative.An external validation index uses prior knowledge, an internal index is based on information from the data only, and in a relative CVI, multiple clustering results are compared.A comprehensive review of clustering validation techniques up to 2001 was provided in [27].There exists also alternative approaches for determining the number of clusters, e.g., by measuring the stability of the clustering method [28] or using multiobjective evolutionary approaches [11,29].
In this paper, we continue the previous work reported in [30] by focusing on a comparison of the seven best internal CVIs, as identified in [30] and augmented by [22].The earlier comparisons, typically reported when suggesting novel CVIs, only include K-means as the partitional clustering algorithm [22,[30][31][32][33][34][35][36].Here, this treatment is generalized by using multiple statistical estimates as a cluster prototype and to define the clustering error, under the currently most common initialization strategy as proposed in [9] (which is also generalized).Note that prototype-based clustering can also be conducted with an incremental fashion [37][38][39].However, here were restrict ourselves on the batch versions of the algorithms, which can be guaranteed to converge in a finite number of iterations (see Section 2.1).The definitions of the considered validation indices are also extended and empirically compared with K-means, K-medians, and K-spatialmedians (using spatial median as a prototype estimate) clustering results for a large pool of benchmark datasets.According to our knowledge, there exists no previous work that compares CVIs with multiple different distance metrics.Our aim is to sample the main characteristics of the indices considered and to identify what indices most reliably refer to ground truth values of the benchmark datasets.Note that by their construction, all CVIs considered here can also be used to suggest the number of clusters in hierarchical clustering.
The structure of the article is as follows.After this introductory section, we describe generalized prototype-based clustering, discuss its convergence, and also present the generalized versions of cluster initialization and indices in Section 2. Our experimental setup is described in Section 3, and the results are given and discussed in Section 4. Finally, conclusions are drawn in Section 5.

Methods
In this section, we introduce and analyze all the necessary formulations for clustering and cluster validation indices.

General Prototype-Based Clustering and Its Convergence
As described above, prototype-based partitional clustering algorithms comprise two main phases.First, they start with an initial partition of the data, and second, the quality of this partition is improved by a local search algorithm during the search phase.The initial partition can be obtained based on many different principles [4,8], but a common strategy is to use distinct prototypes [9].Most typically, the globalization of the whole algorithm is based on random initialization with several regenerations [10].Then, the best solution with the smallest clustering error is chosen as the final result.The iterative relocation algorithm skeleton for prototype-based partitional clustering is presented in Algorithm 1 [12,16].As stated in [17] (see also [19]), the different location estimates for a cluster prototype arise from different l p -norms to the q-th power as the distance measure and the corresponding clustering error function; mean refers to • 2 2 (p = q = 2), median is characterized by • 1 1 (p = q = 1), and the spatial median is given by • 1 2 (p = 2, q = 1).Hence, generally the repetition of Steps 1 and 2 from the search phase of Algorithm 1 locally minimize the following clustering error criterion: Here {b k }, k = 1, . . ., K denote the prototype vectors to be determined and {x i } N i=1 , x i ∈ R n refers to the given set of n-dimensional observations.The interpretation of (1) is that each observation x i is assigned to cluster k with the closest prototype on the l p -norm: Hence, as noted in [40], another more compact way of formalizing the clustering error criterion reads as which more clearly shows the nonsmoothness of the clustering problem, because the min-operator is not classically differentiable (see [17] and references therein).This observation gives rise to a different set of clustering algorithms that are based on nonsmooth optimization solvers [41].However, despite the nonsmoothness of the error function, it can be shown that the search phase of Algorithm 1 decreases the clustering error, ensuring local convergence of the algorithm in finite many steps.We formalize this in the next proposition.The proof here is a slight modification and simplification of the more general treatment in [19], Theorem 5.3.1, along the lines of the convergence analyses in different problem domains, as given in [42][43][44].

Proposition 1. The repeated Steps 1 and 2 of Algorithm 1 decrease the clustering error function (2).
This guarantees convergence of the algorithm in finite many steps.
Proof.Let us denote by superscript t the current iterates of the prototypes {b t k } with the initial candidates for t = 0.If assignments to clusters and to the closest cluster prototypes do not change, we are done, so let us assume that the repeated step 1 in Algorithm 1 has identified at least one 1 ≤ j ≤ N such that, for x j ∈ C t k , there exists a better prototype candidate: Then, a direct computation, using monotonicity of the function • q for q = {1, 2} and reflecting the change in the assignments, gives Here, the last inequality follows from the repeated Step 2 of Algorithm 1 and from the optimization-based definitions of mean/median/spatial median as minimizers of the l q p -norm [17] over a dataset: Because ( 4) and ( 5) are valid for any reallocated index j satisfying (3), we conclude that the clustering error strictly decreases when a reallocation for a set of observations occurs in Algorithm 1.
To this end, because there exists only a finite number of possible sets C t k , we must have after a finite number of steps t → t + 1.This ends the proof.
The K-means++ initialization method utilizes squared Euclidean distance-based probabilities to sample initial prototypes from the data points.In Algorithm 2, this initialization strategy is generalized for varying l p -norms to the q-th power.Note that Algorithm 2 is the same as the K-means++ initialization algorithm when p = q = 2.In order to be successful, one needs to assume in Algorithm 2 that the dataset {x i } N i=1 has at least K distinct data points, which is a natural and reasonable assumption.In Step 2, the probability for each point x i to be selected as the next initial prototype is proportional to the distance to the closest already selected prototype divided by the value of the clustering error Function (2) for the already selected prototypes.Clearly, in each iteration, the most distant points (with respect to the previously selected initial prototypes) have the highest probability of being selected.
i=1 and the number of clusters K.

Cluster Validation Indices
From now on, for a given number of clusters K, we denote, by {c k } and {C k }, k = 1, . . ., K, the best prototypes and divisions obtained after a fixed number of repeated applications of Algorithm 1.When K = 1, we denote the prototype, i.e., the mean, median, or spatial median, of the whole data with m.Moreover, we let J K denote the corresponding clustering error over the whole data and p to refer to the corresponding final within-cluster errors.Cluster validation considers the quality of the result of a clustering algorithm, attempting to find the partition that best fits the nature of the data.The number of clusters, given as a parameter for many clustering algorithms (such as the ones presented in Section 2.1), should be decided based on the natural structure of the data.Like the best clustering solution, the number of clusters is also not always clear and many 'right' answers can exist (see, e.g., [19], Figure 5).The number can also depend on the resolution, i.e., whether the within-and between-cluster separabilities are considered globally or locally.Here, we focus on the CVIs based on the internal criteria.
The validation indices measure how well the general goal of clustering-high similarity within clusters and high separability between clusters-is achieved.These are considered with measures of within-cluster (Intra) and between-cluster (Inter) separability, for which lower and higher values are better, respectively.Normally, a division between Intra and Inter is made and the optimal value is at the minimum or maximum, based on the order of the division.
In Table 1, the best internal validation indices (as determined for the K-means-type of clustering in [30], augmented with [22]) are introduced in a general fashion for the l q p -norm setting.All except one of the indices have been modified in such a way that the optimal number of clusters can be found at the minimal value.Only the Wemmert-Gançarski index, which has a unique pattern of the general formula, is given in the original form, where the maximum value indicates the optimal number of clusters.In Table 1, if the formula that combines Intra and Inter depends on the generated clusters, then this is indicated with the corresponding parameters.

Name
Notation Intra Inter Formula Pakhira, Bandyopadhyay, and Maulik [45] PBM As can be seen from the formulas of the indices, there are a lot of similarities in how different indices measure the within-and between-cluster separability.For example, the clustering error straightforwardly measures the similarity within clusters and, therefore, almost all of the indices include it in their measure of Intra.In case of between-cluster separability, it is common to measure, for instance, the distance between cluster prototypes or between cluster prototypes and the whole data prototype.The rationale behind the index structures and their more detailed descriptions can be found in the original articles.

On Computational Complexity
The computational complexity of the prototype-based clustering is O(RTKNn), where T is the number of iterations needed for convergence and R is the chosen number of repetitions of joint Algorithms 1 and 2. As the clustering itself is quite time-consuming, especially with large datasets, it is only natural to also consider the complexity of the indices to avoid excessively complex computations.Here, the KCE index has an advantage in not requiring any extra calculation after the clustering solution has been obtained.For the indices that go through the prototypes once, here the WB and CH that measure the prototype distances to the whole data prototype, the complexity is O(Kn).Indices that measure the distances between all the prototypes, such as DB, PBM, and RT, have complexity O(K 2 n).
In our tests, WG is the index with the highest complexity, O(KNn), going through the whole data, comparing points and the prototypes.A commonly used and generally well performing index, Silhouette (see, e.g., [30,33]), goes through the whole data twice when calculating its values and therefore its complexity is O(N 2 n).With large datasets of at least hundreds of thousands of observations, this might be even more complex than the clustering task itself, with the chosen values of R and K and the observed value of T (see Figure A1) in Section 4. If computationally more involved indices would be used, also application of more complex clustering algorithms should be considered.Therefore, Silhouette was excluded from our tests.

About Earlier Validation Index Comparisons
There has been a lot of research on cluster validation, including comparisons of different validation indices and clustering algorithms.Often when a new index is proposed, the work also includes a set of comparisons that conclude that the new index is the best one.In [34], eight common CVIs were compared and with 5% additional noise, different densities, and skewed distributions, most indices were able to find the correct number of clusters.However, only three of them were able to recognize close subclusters.In their tests, S_Dbw was the only CVI that suggested the correct number of clusters for all datasets.In our previous tests in [30], the S_Dbw also recognized the close subclusters in Sim5D2, but it did not perform that well in general.
Often no single CVI has a clear advantage in every context, but each is best suited to a certain kind of data.This was also the conclusion in [33], where 30 different indices with 720 synthetic and 20 real datasets were compared.However, a group of about 10 indices were found to be the most recommendable, including Silhouette, Davies-Bouldin * and Calinski-Harabasz at the top.Also, in the earlier extensive comparison [47], where 30 indices were compared, the authors suggested that if different datasets were used for testing, the order of the indices would change but the best ones-including Calinski-Harabasz, Duda-Hart, and the C-index-would still perform well.

Experimental Setup
Next, we test the indices in Table 1 with different distance measures, i.e., with different prototype-based clustering algorithms.The index values are calculated with the distance corresponding to the clustering method used, i.e., city-block with the K-medians, squared Euclidean with the K-means, and Euclidean with the K-spatialmedians.All datasets were scaled to the range of [−1, 1].All the tests were run on MATLAB (R2014a), where a reference implementation on both the validation indices and the general clustering Algorithm 1 with the initialization given in Algorithm 2 were prepared.The impact of the necessary amount of repetitions of Algorithms 1 and 2 was tested with multiple datasets (S -sets, Dim-sets, A1, Unbalance), comparing the clustering error and the cluster assignments.With 100 repetitions, the minimum clustering error and the corresponding cluster assignments were stabilized and an appropriate clustering result was found.This result with the minimum clustering error was used for computing the CVI value.
To test the indices, we used the basic benchmark datasets described in detail in [48], with the two other synthetic datasets http://users.jyu.fi/~jookriha/CVI/Data/ as given in [30] (see Figure 1).These benchmark sets are synthetic datasets, suggested for use when testing any algorithm dealing with clustering spherical or Gaussian data.Here, we restrict ourselves to the benchmarks with at most 20 clusters, because the interpretation and knowledge discovery from the clustering results with a large number of prototypes might become tedious [6,49].Therefore, the number of clusters was also tested with K = 2 − 25.

Sim5D2
PCA of Sim5D10 In addition, we use six real datasets that include Steel Plates, Ionosphere, and Satimage (Train) from the UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets.htmlIris and Arrhythmia from MATLAB's sample datasets, https://se.mathworks.com/help/stats/_bq9uxn4.html, and the USPS dataset.https://www.otexts.org/1577A summary of all these datasets can be seen in Table 2.For these real datasets, even if there are class labels provided, we do not compare the clustering results with the class information because of the completely unsupervised scenario with the internal validation indices.Moreover, the classes need not correspond to the clusters determined by the prototype-based clustering algorithm, since the probability density functions of the classes are not necessarily spherically symmetric.For these datasets, we therefore only study and compare the stability of the suggestions on the numbers of clusters for different indices.Finally, our datasets include the following: A1, with 20 spherical clusters and some overlap; the S -sets, including four datasets with 15 Gaussian clusters, and cluster overlap increasing gradually from S1 to S4 ; Dim-sets, including six datasets with 16 well-separated clusters in high-dimensional space and dimensions varying from 32 to 1024; a subset of Birch2, including 19 datasets with 2-20 clusters with their centroids on a sine curve; Unbalance, with eight clusters in two separate groups, the one having three dense and small clusters and the other five more sparse and bigger clusters; and a subset of G2, with 20 datasets-the lowest and highest overlap in ten different dimensions from 1 to 1024.Finally, together with the six real datasets, we had altogether 62 datasets in our tests.

Results
In this section, we provide the results of the clustering validation index tests with K-medians, K-means, and K-spatialmedians clustering.Results for the synthetic datasets are combined in Table A1, where each cell includes the results for all three methods with 'cb' referring to the city-block distance (p = q = 1), 'se' to the squared Euclidean distance (p = q = 2), and 'ec' to the Euclidean distance (p = 2, q = 1).In addition, the convergence properties of the clustering algorithms are compared for varying K values.

CVIs for Synthetic Datasets
For the Dim-sets, results were equal (and correct) in all dimensions from 32 to 1024 and for the G2 -sets results were equal (and correct) from dimension eight upwards.Therefore, these have been excluded from Table A1.Correct suggestions for the number of clusters are marked in bold.
The most challenging synthetic datasets seem to be Unbalance, Sim5D2, and Sim5D10.Only a few indices were able to recognize the correct number and no single index managed to solve both the Unbalance and the Sim sets.
As concluded in the previous studies (see Section 2.4), different indices seem to work better with different kinds of datasets.However, there are also a lot of differences in the general performances of the tested CVIs.The overall success rates for the CVIs, i.e., for how many of the datasets (%) it gave correct suggestions, can be seen in Table 3, listed separately for each distance measure.In conclusion, the WG index outperforms all the other indices in all three distance measures and clustering approaches.WB and KCE also perform very well in general.For some indices, the performances vary between different distances; for example, CH works very well with the squared Euclidean distance, while PBM clearly works better with city-block and Euclidean distances.As a whole, the recommendation for the use of indices is as follows: for the original K-means, WG, KCE, and CH are the three best indices, and for the robust variants with K-medians and K-spatialmedians, WG, PBM, and WB have the highest success rate.

CVIs for Real Datasets
As mentioned, here we only observe and compare the stability of the clustering results.The results are combined in Table 4.With real datasets, a typical behavior of the internal validation indices is the suggestion of only a small number of clusters.This is especially true for KCE and CH, even if we know that there are observations from multiple classes, with the class boundaries having unknown forms and shapes.Different from the other indices, the results of DB seem to deviate a lot, with high variability over the datasets.The same can happen for RT, with squared Euclidean distance.
) 2, 3, 2 3, 6, 2 2, 3, 2 Based on the observed behavior, the most stable, and therefore the recommended indices, seem to be WB, PBM, and WG.However, when the data is of higher dimension without a Gaussian structure, the curse-of-dimensionality [50] might be the reason for the basic suggestions of a low number of clusters.Therefore, to obtain more fine-tuned clustering results and validation index evaluations, it might be necessary to use the prototype-based clustering in a hierarchical manner, as suggested in [16,51].For example, many indices suggested three clusters for the Sim5 datasets, but after a further partitioning of the data, new index values could be calculated for those three clusters separately, and the correct division into five clusters altogether could be revealed.

Convergence
For each repetition, the number of iterations needed for convergence, T, was saved.Median values of T for synthetic datasets were: 19 for K-medians, 19 for K-means, and 21 for K-spatialmedians.K-spatialmedians requires slightly more iterations than K-means and K-medians.In practice, the effect of the total running time between T = 19 and T = 21 is negligible.For real datasets, median values of T are again similar: 15 for K-medians, 17 for K-means, and 16 for K-spatialmedians.K-means performs slightly worse than the robust K-medians and K-spatialmedians for real datasets.It is known that K-means is sensitive to noise, which could explain these results.Overall, based on the median values of T, there seems to be no practical difference between the convergence of K-medians, K-means, and K-spatialmedians.
We plotted and analyzed the median of T as a function of K for each dataset separately in order to compare the convergence characteristics of K-means, K-medians, and K-spatialmedians.The most relevant plots are shown in Figure A1.Even though the median values of T are close to each other in general, there are some interesting differences with multiple datasets.
From Figure A1, we can observe that the robust location estimates, median and spatial median, clearly converge faster than the mean for the S4 dataset, which has highly overlapping clusters.For the USPS dataset, K-medians seems to converge slightly faster than K-means.In the figure, plots for datasets b2-sub-5, b2-sub-10, b2-sub-15, and b2-sub-20 show that K-means converges faster than K-medians and K-spatialmedians when K is smaller than the number of clusters in the dataset.This might be because the mean location estimate is more sensitive to movement towards the center of multiple combined clusters than the robust location estimates since outlying points within a cluster affect it more than the robust location estimates.The plots of datasets G2-2-10 and G2-64-10 demonstrate how the curse-of-dimensionality greatly affects K-medians when compared to K-means and K-spatialmedians.K-medians converge faster than K-means and K-spatialmedians in the 2-dimensional case; however, from the 64-dimensional case onward the reverse is observed.

Conclusions
Tests for a representative set of previously qualified internal clustering validation indices with many datasets, for the most common prototype-based clustering framework with multiple statistical estimates of cluster location, were reported here.This study further confirmed the conclusions in the previous index comparisons that no single CVI dominates in each context, and some indices are better suited to different kinds of data.Therefore, it is recommended to utilize multiple indices when doing cluster analysis.However, in our tests with the synthetic datasets, the WG index outperformed other indices in all distance measures used, also showing stable behavior with real datasets.It found the correct number of clusters for all datasets, except the Sim5 with different sized clusters and overlap.Due to this high performance, the WG index would be worth studying more in the future.In addition, PBM was an index with a high and stable performance, along with WB for robust clustering and KCE and CH for K-means.The correct number of clusters in the Sim5 datas was only found with WB and KCE for K-means and PBM with robust estimates.Based on these tests, DB and RT are not recommended.In general, the indices seem to perform worse with higher cluster overlap and some of them (CH, DB, RT) fail more often when the number of clusters grows.
Here, we also extended previous index comparisons using K-means clustering, with robust K-medians and K-spatialmedians clusterings.Just like the indices, the different distance measures also seem to work differently with different datasets and, in addition, some indices seem to work better with different distances.For instance, for the synthetic datasets, the PBM index clearly works better with city-block and Euclidean distances.Therefore, the usage of robust K-medians and K-spatialmedians could bring added value when trying to find the optimal number of clusters present in the data.
In addition, the convergence properties of the clustering algorithms were compared.Based on the experiments, the number of iterations needed for convergence varies for different datasets between the methods, e.g., K-means requires more iterations than the robust clustering methods for noisy datasets, while K-medians is most affected by an increase in dimensionality.Moreover, K-means++-type initialization strategy seems to work well with city-block and Euclidean distance.
Moreover, as many indices often suggest quite low numbers of clusters, due to clusters being seemingly close to each other, when observed globally in a high-dimensional space, the hierarchical application of a partitional prototype-based clustering algorithm is recommended, in order to improve recognition of possible clusters at different resolution levels.

Algorithm 1 : 1 .
Prototype-based partitional clustering algorithm.Input: Dataset and the number of clusters K. Output: Partition of dataset into K disjoint groups.Select K points as the initial prototypes; repeat Assign individual observation to the closest prototype; 2. Recompute the prototype with the assigned observations; until the partition does not change;

Figure A1 .
Figure A1.Median of the number of iterations needed for convergence with varying K.

Table 2 .
Description of datasets.

Table 3 .
Right suggestions for the 56 synthetic datasets (number of right suggestions/number of datasets).

Table 4 .
Internal CVIs results for real datasets.