Understanding and Enhancement of Internal Clustering Validation Indexes for Categorical Data

Clustering is one of the main tasks of machine learning. Internal clustering validation indexes (CVIs) are used to measure the quality of several clustered partitions in order to determine the locally optimal clustering result in an unsupervised manner, and can act as the objective function of clustering algorithms. In this paper, we first study several well-known internal CVIs for categorical data clustering, and prove that they cannot effectively evaluate partitions with different numbers of clusters unless an inter-cluster separation measure or assumption is involved; the accuracy of the separation measure, along with its coordination with the intra-cluster compactness measure, notably affects performance. Then, aiming to enhance internal clustering validation measurement, we propose a new internal CVI, the clustering utility based on the averaged information gain of isolating each cluster (CUBAGE), which measures both the compactness and the separation of the partition. The experimental results support our findings with regard to the existing internal CVIs, and show that the proposed CUBAGE outperforms other internal CVIs with or without a pre-known number of clusters.


Introduction
Clustering analysis is the unsupervised process of partitioning a group of data objects into clusters, with the objective of grouping objects of high similarity into the same cluster while separating dissimilar objects into different clusters. Clustering is a main task of data analysis, and it has been studied extensively in the fields of data mining and machine learning [1,2]. Clustering techniques can be roughly divided into hard and soft clustering; this study is limited to hard clustering analysis, in which each object belongs to one and only one cluster.
The results of clustering (partitions) vary with the parameter settings, the clustering method, and the criteria of similarity (or dissimilarity) [3]. Mechanisms of the clustering method, such as random initialization, can cause inconsistencies in the clustering results as well. How do we determine the final result from multiple possible partitions? As shown in Figure 1, the standard solution is to conduct several clustering processes with different schemes, then select the partition of the highest quality [1,4,5]. The key is to define and measure the 'quality of partitions' with clustering validation indexes (CVIs), the third step in Figure 1.
Depending on whether external information is used, CVIs can be summarized into two categories, i.e., external and internal CVIs. External CVIs use external information to evaluate the quality of the clustering results. For instance, if prior knowledge, such as the true partition (or a partition designated by experts), exists, external CVIs can be used to evaluate the conformity of the clustered partition to the prior partition [6-8]. Such prior knowledge is absent in the unsupervised scenario, which makes external CVIs inapplicable.
Internal CVIs, on the other hand, require no such prior knowledge, and have extensive practical applications in information retrieval, text and image analysis, biological engineering, and other domains of data mining [9-16]. The quality of clustering results is usually inspected internally from two aspects: the intra-cluster compactness, and the inter-cluster separation (also known as isolation) [5,17-23]. The compactness reflects the degree of similarity of the objects in the same cluster, while the separation reflects how dissimilar the objects in one cluster are to those in the others.

Meanwhile, the internal CVIs for numerical data, such as the Dunn index [24], the I index [25], the Silhouette index [26], and the Calinski-Harabasz index [27], use intuitive geometric information to evaluate the partitions, which makes them unsuitable for categorical data clustering. Considering the increasing amount of categorical data in practical applications and the challenging issues that have not been adequately addressed in the literature, further research on internal CVIs for categorical data is needed [9,14,15,28,29].
Therefore, in this paper, we limit our scope to providing insight into and enhancement of the internal CVIs for categorical data, guided by three questions: 1.
Do internal CVIs for categorical data show monotonicity with respect to the number of clusters? One should avoid monotonicity in validation measurement, to prevent a bias towards partitions with more clusters, which would leave the outcome of the evaluation to the boundary of the number of clusters among the candidate partitions.

2.
Do internal CVIs for categorical data that use no separation measures really ignore the separation? A partition of good compactness is not necessarily a good partition, since the objects in one cluster may be similar to the objects in other clusters as well. Some internal CVIs for categorical data use no separation measures based on the attribute distribution between clusters, and have been shown to be effective when the number of clusters is constant [9].
However, if the impact of separation on clustering validation is ignored, the compactness measure alone may not be effective in evaluating partitions of different sizes, due to the first issue.

3.
What can we offer to enhance performance? After researching the above issues, we wish to offer an alternative internal CVI that has improved performance on categorical data clustering validation measurement.
To better understand the internal CVIs for categorical data, we investigate five well-known internal CVIs for categorical data clustering validation, i.e., the information entropy function (E) [30], the k-modes objective function (F) [31], the category utility function (CU) [32], the objective function of clustering with slope (Clope_r) [33], and the objective function of categorical data clustering with subjective factors (R) [34]. We attempt to reveal the nature of the five internal CVIs by investigating their compactness measures and separation measures, and discuss whether assumptions of separation exist and can substitute for separation measures. Meanwhile, we theoretically analyze whether the compactness measures for categorical data show monotonicity in certain circumstances, and the role of separation measures (or assumptions) in neutralizing the monotonicity.
Then, to enhance the internal measurement, we propose a new internal CVI that has improved performance, namely, the clustering utility based on the averaged information gain of isolating each cluster (CUBAGE). CUBAGE uses the proposed averaged information gain of isolating each cluster (AGE) to measure the separation, and the reciprocal entropy of the dataset conditioned on the partition to measure the compactness.
The paper is organized into six sections. Section 2 covers the related work. Section 3 provides an in-depth analysis and discussion of the five CVIs regarding the first two issues. In Section 4, we present the proposed internal CVI. Section 5 presents our experimental results and a detailed discussion. Finally, we state our conclusions in Section 6.

Related Work
In this section, we first clarify the notations used throughout this paper, then introduce some widely used internal CVIs for categorical data. Additionally, we provide a brief comparison of internal and external CVIs.

Notations
Unless stated otherwise, we use the following notations in this paper. U = {X_1, ..., X_n} is a set of n objects; each object is described by the same m independent attributes A_1, ..., A_m. The value of attribute A_j (j = 1, ..., m) can be taken only from the domain D(A_j) = {a_j^(1), ..., a_j^(dj)}, where d_j is the number of possible values of the attribute. p(a_j^(i)) is the probability of attribute A_j taking the value a_j^(i) (i = 1, ..., d_j). C ≠ Φ is a set of objects (a cluster); a partition P = {C_1, ..., C_k} divides U into k non-empty, disjoint clusters. D(A_j|C_l) is the domain of attribute A_j in cluster C_l; obviously, D(A_j|C_l) ⊆ D(A_j).

The information entropy function (E).
The information entropy of a random variable indicates the information and uncertainty that the variable carries [35]. Considering attribute A_j as a random categorical variable, the entropy H(A_j) is defined as follows:

H(A_j) = −Σ_{i=1..d_j} p(a_j^(i)) log p(a_j^(i)).   (1)

Given a set of independent variables V = {A_1, ..., A_m}, the entropy H(V) is:

H(V) = Σ_{j=1..m} H(A_j).   (2)

A lower H(V) indicates less uncertainty of V.
Given a partition P = {C_1, ..., C_k}, the entropy of V conditioned on P, i.e., H(V|P), is considered the 'whole entropy of the partition' [30]:

E(P) = H(V|P) = Σ_{l=1..k} p(C_l) Σ_{j=1..m} H(A_j|C_l) = −Σ_{l=1..k} p(C_l) Σ_{j=1..m} Σ_{i=1..d_j} p(a_j^(i)|C_l) log p(a_j^(i)|C_l),   (3)

where p(a_j^(i)|C_l) is the conditional probability of the value a_j^(i), given cluster C_l. Notice that the probability of C_l is p(C_l) = |C_l|/n. E(P) represents the total entropy of the partition, which can be construed as its degree of disorder, by summing the weighted entropy of each cluster. Minimizing E(P) means finding a partition in which the attribute values describing the objects in each cluster are concentrated, which indicates that the objects within each cluster are more similar.
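As a concrete reference, E(P) can be sketched in a few lines of Python. The representation is our own assumption, not the paper's: each object is a tuple of categorical values, a partition is a list of clusters (lists of objects), and entropies use base-2 logarithms.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def E(partition):
    """E(P) = H(V|P): sum over clusters of p(C_l) * sum_j H(A_j | C_l)."""
    n = sum(len(cluster) for cluster in partition)
    total = 0.0
    for cluster in partition:
        weight = len(cluster) / n                     # p(C_l)
        m = len(cluster[0])                           # number of attributes
        total += weight * sum(entropy([obj[j] for obj in cluster])
                              for j in range(m))
    return total
```

A perfectly pure partition (identical objects within each cluster) yields E(P) = 0, matching the intent of minimizing E(P).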

The k-modes objective function (F).
Similar to the k-means clustering algorithm [36], k-modes compares each object in a cluster with the cluster center, and sums the dissimilarities [31]. Since it is improper to take the means of categorical values as the cluster center, k-modes uses the modes of the values of each attribute. The dissimilarity between an object and the center is defined as:

d(X_li, Z_l) = Σ_{j=1..m} δ(x_lij, z_lj),

where X_li is the ith object in cluster C_l, x_lij is the value of attribute A_j describing object X_li, Z_l is the center of cluster C_l, z_lj is the value of attribute A_j describing center Z_l, and:

δ(x_lij, z_lj) = 0 if x_lij = z_lj, and 1 otherwise.

Therefore, the k-modes objective function is:

F(P) = Σ_{l=1..k} d_cluster(C_l),

where d_cluster(C_l) is the sum of the dissimilarities between each object in cluster C_l and its center:

d_cluster(C_l) = Σ_i d(X_li, Z_l).

F(P) describes the overall dissimilarity between objects and centers; a lower F(P) indicates that partition P has higher quality.
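Under the same assumed representation (objects as tuples of categorical values, a partition as a list of clusters), a minimal sketch of F(P):

```python
from collections import Counter

def F(partition):
    """k-modes objective: total simple-matching dissimilarity between
    each object and its cluster's mode vector."""
    total = 0
    for cluster in partition:
        m = len(cluster[0])
        # cluster center: the mode (most frequent value) of each attribute
        center = [Counter(obj[j] for obj in cluster).most_common(1)[0][0]
                  for j in range(m)]
        total += sum(sum(1 for j in range(m) if obj[j] != center[j])
                     for obj in cluster)
    return total
```

For example, a cluster {('a','x'), ('a','x'), ('a','y')} has center ('a','x') and contributes a dissimilarity of 1.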

The category utility function (CU).
For the objects in the same cluster, CU measures the possibility of these objects taking the same attribute values [32]:

CU(P) = Σ_{l=1..k} p(C_l) Σ_{j=1..m} Σ_{i=1..d_j} [p(a_j^(i)|C_l)^2 − p(a_j^(i))^2].

This attempts to maximize both the probability that two objects in the same category have attribute values in common, by p(a_j^(i)|C_l)^2, and the probability that objects from different categories have different attribute values, by −p(a_j^(i))^2. However, the last term −p(a_j^(i))^2 is invariable once the dataset is given. Therefore:

CU(P) = Σ_{l=1..k} p(C_l) Σ_{j=1..m} Σ_{i=1..d_j} p(a_j^(i)|C_l)^2 + C,

where C is a constant. This means that CU only measures how similar the objects in the same cluster are. The authors of References [37,38] further averaged the value of CU(P) over the clusters, i.e., they used CU(P)/k instead of CU(P) to compare partitions of different sizes. In this paper, we refer to the modified function as CU_1/k(P).
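A sketch of CU(P) and its averaged variant, under the same assumed tuple/list representation; the constant term −p(a_j^(i))^2 is kept so the value matches the original definition rather than the shifted form:

```python
from collections import Counter

def CU(partition):
    """Category utility: sum_l p(C_l) * sum_{j,i} [p(a|C_l)^2 - p(a)^2]."""
    data = [obj for cluster in partition for obj in cluster]
    n, m = len(data), len(data[0])
    # sum_{j,i} p(a_j^(i))^2 over the whole dataset (the constant term)
    base = sum((c / n) ** 2
               for j in range(m)
               for c in Counter(o[j] for o in data).values())
    cu = 0.0
    for cluster in partition:
        nl = len(cluster)
        within = sum((c / nl) ** 2
                     for j in range(m)
                     for c in Counter(o[j] for o in cluster).values())
        cu += (nl / n) * (within - base)
    return cu

def CU_per_k(partition):
    """CU_1/k(P): category utility averaged over the k clusters."""
    return CU(partition) / len(partition)
```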

The CLOPE objective function (Clope r )
CLOPE is an efficient clustering algorithm for large-scale datasets, and the basic idea of its criterion function is simple and straightforward [33,39]. CLOPE first defines the size and the width of cluster C_l:

Size(C_l) = Σ_{j=1..m} Σ_{i=1..d_j} Occ(a_j^(i)|C_l),  Width(C_l) = Σ_{j=1..m} |D(A_j|C_l)|,

where Occ(a_j^(i)|C_l) is the number of occurrences of value a_j^(i) in cluster C_l. Then, the objective is to maximize the following function:

Clope_r(P) = (1/n) Σ_{l=1..k} |C_l| Size(C_l)/Width(C_l)^r,

where r is the parametric power. Moreover, since we can show that Size(C_l) = m|C_l|, for consistency of expression, we rewrite the function as:

Clope_r(P) = (m/n) Σ_{l=1..k} |C_l|^2 / Width(C_l)^r.

In this paper, we refer to the function with parameter r as Clope_r(P).
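A sketch of the criterion, following the standard CLOPE profit form (cluster size over width to the power r, weighted by cluster size and normalized by n); the exact normalization used here should be treated as our assumption, and the representation (objects as tuples) is ours:

```python
def clope(partition, r=2.0):
    """CLOPE-style profit: (1/n) * sum_l |C_l| * Size(C_l) / Width(C_l)^r,
    where Size(C_l) = m*|C_l| and Width(C_l) is the number of distinct
    (attribute, value) pairs occurring in the cluster."""
    n = sum(len(cluster) for cluster in partition)
    score = 0.0
    for cluster in partition:
        m = len(cluster[0])
        size = m * len(cluster)
        width = len({(j, obj[j]) for obj in cluster for j in range(m)})
        score += len(cluster) * size / width ** r
    return score / n
```

As the text notes, larger r weighs the width (compactness) more heavily, while small r lets the quadratic |C_l| weighting (the separation assumption) dominate.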

The CDCS objective function (R).
The CVIs above use only intra-cluster information to measure the partition. CDCS uses both the intra-cluster similarity intra(P) and the inter-cluster similarity inter(P), which are [34]:

where Sim(C_t, C_s) is the similarity score between two clusters C_t and C_s:

and where ε is a small value preventing Sim(C_t, C_s) = 0. The objective of CDCS is to maximize the ratio of intra(P) to inter(P); partitions that have both higher intra-cluster similarity and lower inter-cluster similarity receive better scores:

R(P) = intra(P)/inter(P).

We refer to this function as R(P) in this paper.

Comparison of Internal and External Clustering Validation Indexes
Internal CVIs use only internal information to identify commonalities in the data and react based on the presence or absence of such commonalities, in order to measure the quality of the clustering result [2,9,17,23]. The internal information includes, but is not limited to, the attribute value distribution, the similarities of objects or clusters, and the partition size, which are quantities and features inherited from the dataset and the clustering process.
Requiring no external information makes internal CVIs applicable to unsupervised scenarios and allows them to act as the objective functions of the clustering process. For instance, internal CVIs are used as the objective functions of COOLCAT clustering [30], k-modes clustering [31], CLOPE clustering [33], and CDCS clustering [34].
External CVIs use external information. In the literature, there are two types of expression for this external information: (1) explicit expressions, such as 'true partition', 'class labels', 'data division', and 'pre-specified/pre-known structure' [1,4,5,23,40-48]; and (2) vague expressions, such as 'prior knowledge' and 'ground truth' [17,49,50]. In the literature adopting the first type of expression, the usage of external CVIs was representatively described in Reference [5]: 'Based on the external criteria we can work in two different ways. Firstly, we can evaluate the resulting clustering structure C, by comparing it to an independent partition of the data P built according to our intuition about the clustering structure of the data set. Secondly, we can compare the proximity matrix P to the partition P.' The external CVIs used in the references adopting the second type of expression also require a pre-known partition, although this universal requirement is not explicitly stated. Therefore, the applications of typical external CVIs are limited to scenarios where a true or designated partition is available for comparison, for instance, choosing the optimal clustering algorithm for a specific dataset.
In the experimental section, we use external CVIs as the evaluation metrics to examine the performance of internal CVIs.

Understanding of Internal Clustering Validation Indexes
In this section, we theoretically analyze the effectiveness of the clustering validity evaluation of the five internal CVIs mentioned above, i.e., the entropy function (E), the k-modes objective function (F), the CLOPE objective function (Clope_r), the averaged category utility function (CU_1/k), and the CDCS objective function (R). We point out their compactness measures and separation measures (or assumptions), and analyze the ineffectiveness of using compactness alone in clustering validation measurement.

Generalization and an Example
To better understand the composition of the CVIs, we first generalize them in Table 1. The compactness cores use intra-cluster information to measure the compactness of each cluster, based on the consensus that the attribute values in a compact cluster should be concentrated. As shown in Table 1, all five CVIs build their compactness measures from different cores.

Table 1 lists, for each CVI, its compactness core, its averaged compactness (Com), and its objective. * Note that the number of attributes m and the number of objects n are invariable for any partition.
We address our concerns about the effectiveness of these CVIs with an example. Table 2 shows an example dataset and five partitions of it. We evaluated the five partitions with E, F, Clope_1 to Clope_3, CU_1/k, and R, respectively; the results are shown in Table 3, with the optimal score of each CVI in bold. As we can see, the different CVIs did not agree, and as the number of clusters increased, the values of E, F, Clope_1, and R decreased monotonically. In general, if the evaluation outcome tends to change monotonically with the number of clusters, the evaluation method is not suited for comparing partitions with different numbers of clusters, since it is biased towards the partition with the fewest or the most clusters. We start our discussion with the effectiveness of E and F, the two CVIs that showed monotonicity and use only compactness measures.

Analysis of Indexes E and F
The indexes E and F use compactness measures only, and average the compactness by a weight linear in the cluster size p(C_l). We can show that the values of E and F are monotonic with respect to the number of clusters when clustering hierarchically:

Theorem 1. Given dataset U, described by a set of independent attributes V = {A_1, ..., A_m}, let P_1 and P_2 be two partitions of U. If P_1 = {C_1, ..., C_k} and P_2 = {C_1, ..., C_{k−1}, C_s1, ..., C_st}, where C_k = C_s1 ∪ C_s2 ∪ ... ∪ C_st, C_sl ≠ Φ, and C_sl ∩ C_sl' = Φ (l ≠ l'; l, l' = 1, ..., t), then E(P_1) ≥ E(P_2), and the equality holds if and only if p(V|C_s1) = p(V|C_s2) = ... = p(V|C_st).
Proof of Theorem 1. According to Equations (2) and (3), we know that:

Then:

Therefore, the necessary and sufficient condition of E(P_1) ≥ E(P_2) is:

Meanwhile, due to Bayes' theorem, we can establish that:

Since the attributes are independent, and H(X) is a concave function, Inequality (20) can be proven true by Jensen's inequality [51].
Theorem 2. If the conditions are the same as in Theorem 1, then F(P_1) ≥ F(P_2), and the equality holds if and only if z_1j = z_2j = ... = z_tj (j = 1, ..., m), where z_lj is the mode of the values of attribute A_j in cluster C_sl (l = 1, ..., t).
Proof of Theorem 2. Similar to the proof of Theorem 1, we can show that:

Since C_k = C_s1 ∪ ... ∪ C_st, for any attribute A_j, the occurrence of value a_j^(i) in cluster C_k is equal to the sum of the occurrences of the same value in clusters C_s1 to C_st:

Meanwhile, we can rewrite the dissimilarity of any cluster C_q of dataset U as:

Applying Equation (25) to Equation (23) yields:

By the definition of the mode, we know that:

Then:

Therefore:

By Equation (24), we can show that the right-hand side of Inequality (29) equals 0. Therefore, F(P_1) ≥ F(P_2), and the equality holds if and only if:

Theorems 1 and 2 show that one cannot determine by E or F whether any clusters in a partition should be merged, since the evaluation results always suggest dividing. Even if the attribute distributions in the candidate clusters are equivalent (in which case they are generally regarded as the most similar clusters), the scores for merging and not merging would be tied. This most affects hierarchical clustering, in which objects are clustered in either an agglomerative or a divisive manner: the layer of the hierarchy suggested by these CVIs would always be the layer with the most clusters, unless the layer with the second-most clusters has the same evaluation score.
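The monotonicity stated by Theorem 1 is easy to check numerically. The following self-contained sketch (our own representation: objects as tuples of categorical values, clusters as lists; base-2 entropy) shows that splitting a cluster never increases E(P):

```python
import math
from collections import Counter

def E(partition):
    """E(P) = H(V|P): cluster-size-weighted sum of per-attribute entropies."""
    n = sum(len(cluster) for cluster in partition)
    total = 0.0
    for cluster in partition:
        for j in range(len(cluster[0])):
            counts = Counter(obj[j] for obj in cluster)
            # p(C_l) * H(A_j|C_l) contributed term by term
            total += sum(-(c / n) * math.log2(c / len(cluster))
                         for c in counts.values())
    return total

# Splitting a cluster never increases E(P) (Theorem 1), so E alone
# always favors the finer partition, regardless of cluster structure.
whole = [[('a', 'x'), ('a', 'y'), ('b', 'x'), ('b', 'y')]]
split = [[('a', 'x'), ('a', 'y')], [('b', 'x'), ('b', 'y')]]
assert E(split) <= E(whole)
```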
We should point out that, in some studies, the separation coefficient 1/k of index CU_1/k (which will be discussed shortly) is multiplied directly with the function E(P). This would aggravate the monotonicity, since 1/k is also a monotonically decreasing function of the partition size. Therefore, we do not discuss this method in this paper.

Analysis of Indexes CU_1/k, Clope_r, and R
Besides the compactness measure, the indexes CU_1/k, Clope_r, and R use separation measures or assumptions as well. CU(P) is the compactness averaged over the clusters with weight p(C_l), and CU_1/k(P) is the further-averaged CU(P). The role of the multiplicand 1/k is a crude overfitting control, and it can be regarded as an assumed separation coefficient with respect to the number of clusters. We can show that the compactness measure in CU_1/k(P) also shows monotonicity in the previous scenario:

Theorem 3. If the conditions are the same as in Theorem 1, then CU(P_1) ≤ CU(P_2), and the equality holds if and only if p(V|C_s1) = p(V|C_s2) = ... = p(V|C_st).
Proof of Theorem 3. Similar to the proof of Theorem 1, we can show that:

C_k ≠ Φ, so we can rewrite Equation (31) as:

Therefore, the necessary and sufficient condition of CU(P_1) ≤ CU(P_2) is:

We know that y = x^2 is a convex, non-linear function. Therefore, Inequality (33) is true by Jensen's inequality with Equations (21) and (22) in the proof of Theorem 1, and the equality holds if and only if p(V|C_s1) = p(V|C_s2) = ... = p(V|C_st).
Therefore, using the category utility function to evaluate partitions of different sizes without the separation coefficient 1/k is questionable. However, multiplying the compactness by such a coefficient is based on the assumption that the separation of a partition is negatively correlated with the partition size. This assumption ignores the attribute value distribution between clusters, and may generate a new bias towards partitions with fewer clusters, since 1/k would dominate the result when the change in compactness is relatively gradual.
The compactness core of Clope_r(P) is also monotonic. However, unlike the other indexes, Clope_r(P) uses the quadratic weight p(C_l)^2 instead of p(C_l) to average the compactness. In consequence, partitions in which the objects are concentrated in fewer clusters score better with Clope_r(P), which is also an assumption of separation for adjusting the evaluation results. Like the separation coefficient 1/k of CU_1/k(P), this assumption is irrelevant to the differences in attribute values between clusters. If we remove the assumption, i.e., we use p(C_l) instead of p(C_l)^2, the averaged compactness is also monotonic with respect to the partition size in the previous scenario:

Theorem 4. If the conditions are the same as in Theorem 1, and the compactness measure is as follows:

where r is a parameter greater than zero, then G(P_1) ≤ G(P_2), and the equality holds if and only if D(A_j|C_k) = D(A_j|C_s1) = D(A_j|C_s2) = ... = D(A_j|C_st) (j = 1, ..., m).
Proof of Theorem 4. Similar to the proof of Theorem 1, we can show that:

Since

the value of Equation (35) is not greater than zero, and the equality holds if and only if

Looking more deeply, the essential effect of the parameter r in Clope_r(P) is a subjectively adjusted trade-off between compactness and separation (assumption). The compactness becomes less important in the evaluation as r decreases, and is entirely ignored when r = 0 (although this setting is avoided). As a result, in Table 3, Clope_2(P) and Clope_3(P) choose the partition with more clusters, due to the effect of compactness, while Clope_1(P) chooses the partition with the fewest clusters, due to the effect of the separation assumption. Therefore, setting r is actually setting a preference for compactness or separation, which could be unfounded in the unsupervised scenario.
The compactness measure of index R is also monotonic with respect to the partition size, since maximizing the compactness of R(P) is easily proven to be equivalent to minimizing the function F(P). The separation measure of R(P) evaluates the inter-cluster similarity Sim(C_t, C_s) pairwise. The compactness and the separation of R(P) consider only the most and the least common values, respectively, which might lower the sensitivity to the value distribution. We will test and discuss the effectiveness of these methods in the experimental section.

Internal Clustering Validation Index: CUBAGE
As discussed previously, these CVIs cannot effectively evaluate partitions of different sizes if separation measures or assumptions are absent, and crude separation assumptions that disregard the attribute values are rather questionable.
In this section, we first propose a new method to measure the inter-cluster separation. Then, we present our index for internally measuring clustering validation, namely, the clustering utility based on the averaged information gain of isolating each cluster (CUBAGE).

Inter-Cluster Separation Measure: AGE
Our measure of inter-cluster separation, the averaged information gain of isolating each cluster (henceforth AGE), is based on the idea of information gain in information theory. Before presenting AGE, we review the concept of information gain.
The mutual information is a measure of the shared information of two discrete variables X and Y [52]:

I(X; Y) = Σ_x Σ_y p(x, y) log [p(x, y)/(p(x)p(y))].
In machine learning, the mutual information I(X; Y) is the expected information gain; that is, the reduction in the entropy of X achieved by learning the state of Y [53]. In general terms, the expected information gain is the change in entropy from a prior state (X) to a state that takes some information (X|Y):

IG(X; Y) = H(X) − H(X|Y).

Given a partition P = {C_l} (l = 1, ..., k) of dataset U, which is described by V = {A_1, ..., A_m}, we define the information gain of separating cluster C_l from the other clusters as:

GE(C_l) = H(V) − H(V|P_l),

where P_l = {C_l, U − C_l} is the partition that separates C_l from the other clusters, and U − C_l is the complementary set of C_l. GE(C_l) is the information gain of V from the unpartitioned state to the state where the objects are divided into C_l and U − C_l; in other words, the degree of certainty that we gain by separating the objects in C_l from the other objects. Therefore, GE(C_l) equals the dissimilarity between C_l and the other clusters (U − C_l); a higher value of GE(C_l) indicates that more separation is achieved by separating C_l from the other clusters.
As Figure 2 illustrates, we average the value of GE over all the scenarios of isolating each cluster to measure the overall separation:

AGE(P) = (1/k) Σ_{l=1..k} GE(C_l).

Explicitly, AGE(P) is calculated as:

AGE(P) = H(V) − (1/k) Σ_{l=1..k} [p(C_l) H(V_{C_l}) + (1 − p(C_l)) H(V_{U−C_l})],

where V_{C_l} is the set of attributes describing the objects in C_l, and V_{U−C_l} is the set of attributes describing the other objects.
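A minimal sketch of AGE(P), assuming objects are tuples of categorical values and a partition is a list of clusters (our representation); H sums the per-attribute base-2 entropies, matching the attribute-independence assumption:

```python
import math
from collections import Counter

def H(objs):
    """Sum of per-attribute Shannon entropies (base 2) of a set of objects."""
    n = len(objs)
    return sum(-sum((c / n) * math.log2(c / n)
                    for c in Counter(o[j] for o in objs).values())
               for j in range(len(objs[0])))

def AGE(partition):
    """AGE(P) = (1/k) * sum_l [H(V) - H(V | {C_l, U - C_l})]."""
    data = [o for c in partition for o in c]
    n, hu = len(data), H(data)
    gain = 0.0
    for cluster in partition:
        rest = [o for c in partition if c is not cluster for o in c]
        # conditional entropy of the two-block partition {C_l, U - C_l}
        cond = (len(cluster) / n) * H(cluster)
        if rest:
            cond += (len(rest) / n) * H(rest)
        gain += hu - cond
    return gain / len(partition)
```

Two perfectly separated pure clusters yield AGE(P) = H(V), while a single cluster (k = 1) yields AGE(P) = 0, in line with the bounds discussed below.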

Upper and Lower Bounds of AGE
By the property of information gain, the lower bound of AGE(P) is zero: AGE(P) ≥ 0. We discuss the upper bound in two cases:

1. When the number of clusters k ≤ 2:

AGE(P) = H(V) − E(P).   (44)

This indicates that for a given dataset, maximizing AGE(P) is equivalent to minimizing E(P) when the size of the partition is no greater than 2. This is because the whole partition is determined once one of the clusters is learned. Moreover, when the objects are not separated at all, i.e., k = 1, the value of AGE(P) is 0;

2. When the number of clusters k > 2, due to Theorem 1, we can establish that:

AGE(P) ≤ H(V) − E(P).   (45)

The equality in Inequality (45) holds if and only if the attribute value distributions in all clusters are the same, under which circumstance the value of E(P) equals H(V); therefore, AGE(P) = 0.
To sum up, the upper bound of AGE(P) is H(V) − E(P), and the lower bound is 0. The upper bound coincides with the lower bound 0 when the objects are not separated at all, or when the attribute value distributions in all clusters are the same; this means that AGE(P) yields the minimum possible value when the objects are least separated.


CUBAGE Index
Our internal clustering validation index, CUBAGE, uses AGE(P) as the inter-cluster separation measure, and the reciprocal of the conditional entropy, E(P)^{−1}, as the intra-cluster compactness:

CUBAGE(P) = AGE(P) × E(P)^{−1} = AGE(P)/E(P).   (46)

The index takes the form of a product of the separation and the compactness. A higher value of CUBAGE(P) indicates a better clustering result. As shown in Table 4, neither AGE(P) nor CUBAGE(P) shows monotonicity with respect to the partition size. Additionally, when the size of the partition is no greater than 2, we can establish the following by Equations (44) and (46):

CUBAGE(P) = [H(V) − E(P)]/E(P).   (47)

Equation (47) is actually the information gain ratio of the partition. This means that for a given dataset, maximizing CUBAGE(P) is equivalent to minimizing E(P) when the number of clusters is no greater than 2, since the term H(V) is a constant for the dataset.
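Putting the pieces together, the index can be sketched as below, under the same assumed representation as before. The handling of the E(P) = 0 edge case (every cluster perfectly pure), mapped here to infinity, is our convention, not the paper's:

```python
import math
from collections import Counter

def H(objs):
    """Sum of per-attribute Shannon entropies (base 2) of a set of objects."""
    n = len(objs)
    return sum(-sum((c / n) * math.log2(c / n)
                    for c in Counter(o[j] for o in objs).values())
               for j in range(len(objs[0])))

def cubage(partition):
    """CUBAGE(P) = AGE(P) / E(P): separation (averaged gain of isolating
    each cluster) times compactness (reciprocal conditional entropy)."""
    data = [o for c in partition for o in c]
    n, hu, k = len(data), H(data), len(partition)
    e = sum((len(c) / n) * H(c) for c in partition)           # E(P)
    age = 0.0                                                  # AGE(P)
    for cluster in partition:
        rest = [o for c in partition if c is not cluster for o in c]
        cond = (len(cluster) / n) * H(cluster)
        if rest:
            cond += (len(rest) / n) * H(rest)
        age += (hu - cond) / k
    return age / e if e > 0 else float('inf')
```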
Given a partition P, the value of CUBAGE(P) can be calculated, as shown in Figure 3. Algorithms 1 and 2 are the pseudocode of CUBAGE.

Algorithm 1 Clustering Utility based on Entropy (CUBAGE)
Input: dataset with n objects: U = (X_i); labels of a partition with k clusters;
Output: CUBAGE value of the partition;
Called function: entropy calculation function Entropy(objects);
Begin:
1. Calculate the entropy of the whole dataset, save as HU: HU = Entropy(U);
2. For each cluster C_l:
3. Calculate the entropy of the objects in C_l, save in vector H: H(l) = Entropy(C_l);
4. Calculate the entropy of the objects in U − C_l, save in vector HC: HC(l) = Entropy(U − C_l);
5. Generate the weight vector W: W(l) = |C_l|/n;
6. End for;
7. Calculate the dot product E = W·H;
8. Calculate AGE from HU, H, HC, and W, and return CUBAGE = AGE/E.

The time complexity of CUBAGE(P) is O(kmn), where k is the number of clusters, m is the number of attributes, and n is the total number of objects. Note that there is no extra time cost for computing the compactness E^{−1}(P), since the weighted entropy of each cluster is already calculated while computing the separation AGE(P). The time cost can be even lower if the data is sparse. Furthermore, one can easily apply parallel or distributed computing to CUBAGE(P) over the objects or the attributes to reduce the computing time. This time complexity makes CUBAGE(P) scalable to large datasets.

Algorithm 2 Entropy calculation function
Input: a set of x objects;
Output: entropy of objects in a single set;
Begin:
1. For each attribute A j :
2. Calculate the entropy of the attribute by Equation (1), save in vector HA;
3. Return the sum of the elements in HA as the entropy of the set;
End.
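To make the pseudocode concrete, the following is a minimal Python sketch of Algorithms 1 and 2. It is an illustration under stated assumptions, not the authors' implementation: Equation (1) is assumed to be the per-attribute Shannon entropy in bits, GE l is computed as HU − W(l)·H(l) − (1 − W(l))·HC(l) following Figure 2, and CUBAGE(P) = AGE(P)/E(P); all function and variable names are ours.

```python
import math
from collections import Counter

def entropy(objects):
    """Sum of per-attribute Shannon entropies (bits) of a set of records.
    Each record is an equal-length tuple of categorical values."""
    if not objects:
        return 0.0
    n = len(objects)
    total = 0.0
    for column in zip(*objects):                 # one attribute at a time
        counts = Counter(column)
        total -= sum((c / n) * math.log2(c / n) for c in counts.values())
    return total

def cubage(clusters):
    """CUBAGE(P) = AGE(P) / E(P) for a partition given as a list of clusters."""
    universe = [obj for cluster in clusters for obj in cluster]
    n = len(universe)
    hu = entropy(universe)                       # HU = Entropy(U)
    k = len(clusters)
    e = 0.0                                      # E = W . H, the conditional entropy
    age = 0.0
    for cluster in clusters:
        w = len(cluster) / n                     # W(l) = n_l / n
        h = entropy(cluster)                     # H(l) = Entropy(C_l)
        rest = [obj for other in clusters if other is not cluster
                for obj in other]
        hc = entropy(rest)                       # HC(l) = Entropy(U - C_l)
        e += w * h
        age += hu - w * h - (1 - w) * hc         # GE_l: gain of isolating C_l
    return (age / k) / e                         # separation times compactness E^-1
```

The sketch assumes E(P) > 0, i.e., at least one cluster is not perfectly pure; a practical implementation should guard the final division.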


Experiments and Discussion
In this section, we present the results of the comparative experiments to evaluate the effectiveness of CUBAGE, along with the five internal CVIs mentioned above. We used −F(P) and −E(P) instead of F(P) and E(P) to unify the objectives, and maximized each function to search for the local optimal partition.

Experimental Methods
We tested the CVIs on eight datasets from the UCI (University of California, Irvine) Machine Learning Repository (http://archive.ics.uci.edu/mL/index.php), as shown in Table 5; records with missing values were removed. To compare the quality of the partitions chosen by the internal CVIs, we used two external CVIs as benchmark evaluation criteria:

1.
The adjusted Rand index (ARI), the corrected-for-chance version of the Rand index, is based on the numbers of objects in common (or not) between the pre-defined classes and the produced clusters [6]. Given two partitions P = {C 1 , . . ., C k } and P' = {C' 1 , . . ., C' k' }, ARI is defined as

$$\mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\right]\Big/\binom{n}{2}}{\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\right]\Big/\binom{n}{2}}$$

where n ij is the number of common objects in C i and C' j , a i = |C i |, and b j = |C' j |. In a specific dataset, a partition that is more similar to the pre-defined classes scores higher in ARI. Note that ARI may take negative values.

2.
The normalized mutual information (NMI) calculates the mutual information of two partitions and normalizes it with the sum of their entropies [16]: NMI(P, P') = 2I(P; P')/(H(P) + H(P')). In a specific dataset, a higher value of NMI indicates that the partition is closer to the pre-defined classes.
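For reference, both external CVIs can be computed with a few lines of standard-library Python. This is an illustrative sketch, assuming the arithmetic-mean ("sum of entropies") normalization NMI = 2I(P; P')/(H(P) + H(P')); the function names are ours.

```python
import math
from collections import Counter

def comb2(x):
    """Number of unordered pairs among x objects: C(x, 2)."""
    return x * (x - 1) // 2

def adjusted_rand_index(labels_a, labels_b):
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))   # the n_ij table
    sum_ij = sum(comb2(c) for c in contingency.values())
    a = sum(comb2(c) for c in Counter(labels_a).values())
    b = sum(comb2(c) for c in Counter(labels_b).values())
    expected = a * b / comb2(n)                      # chance-corrected term
    max_index = (a + b) / 2
    return (sum_ij - expected) / (max_index - expected)

def nmi(labels_a, labels_b):
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * math.log2(c / n) for c in ca.values())
    h_b = -sum(c / n * math.log2(c / n) for c in cb.values())
    mi = sum(c / n * math.log2((c / n) / ((ca[i] / n) * (cb[j] / n)))
             for (i, j), c in joint.items())
    return 2 * mi / (h_a + h_b)                      # sum-of-entropies normalization
```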
For example, suppose that in one measurement of several partitions, partitions P 1 and P 2 score best in internal CVIs IV 1 and IV 2 , respectively. Let EV be an external CVI; if EV(P 1 ) > EV(P 2 ), we can establish that partition P 1 is closer to the pre-defined classes than partition P 2 in the opinion of EV. Therefore, IV 1 performs better than IV 2 in this example.
We first compare the NMI values of the partitions selected by each internal CVI to find out which internal CVI selects the better partition, i.e., performs better in the opinion of NMI. Then we use ARI to evaluate the internal CVIs in the same manner.
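The comparison protocol above (see also Figure 5) can be sketched as follows; the function and argument names are hypothetical, and both CVIs are passed in as callables scored higher-is-better:

```python
def evaluate_internal_cvi(candidates, internal_cvi, external_cvi):
    """Pick the partition the internal CVI prefers, then report the external
    CVI score of that choice; comparing these scores across internal CVIs
    ranks the internal CVIs themselves."""
    best = max(candidates, key=internal_cvi)
    return external_cvi(best)
```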
On each dataset, we used the classical agglomerative hierarchical clustering and the k-modes clustering to produce the partitions; both clustering methods applied the same cluster dissimilarity defined in F. The experimental procedures are shown in Figures 4 and 5.

1.
The agglomerative hierarchical clustering is a 'bottom-up' approach; each object is treated as an individual cluster at the beginning. When moving up the hierarchy, pairs of clusters are progressively merged if the dissimilarity of their union is lower than that of the other pairs in the same layer, until all objects are eventually merged into one cluster. The layers in the hierarchy are different partitions of the dataset. From the generated partitions with the number of clusters ranging from 2 to 10, we selected one 'optimal' partition by each internal CVI, then compared the external CVI values (NMI and ARI, respectively) of the selected partitions.

2.
The k-modes clustering is a partitioning approach similar to the better-known k-means clustering. It starts with k randomly generated cluster centers (seeds), and each object is assigned to the most appropriate cluster, i.e., the one for which the dissimilarity of their union is the lowest. In the next iteration, the centers of the clusters are updated by the attribute modes, and the objects are reassigned in the previous manner. The iteration ends when the value of the objective function F stabilizes. The clustering results are inconsistent over different seeds, even if the number of clusters k is fixed.

To test the performance when the number of clusters is unknown, we used the internal CVIs to search for the optimal partition among all the partitions produced by k-modes with k ranging from 2 to 10 (each value of k was run 100 times, so 900 candidate partitions were generated). We further repeated the process 100 times and compared the average external CVI values (ARI and NMI, respectively) of the partitions selected by each internal CVI. Additionally, we tested the internal CVIs with k set to the pre-defined number of clusters to examine the performance when the number of clusters is determined.
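As an illustration of the k-modes procedure described above, here is a minimal Python sketch with random seeds; it uses simple-matching dissimilarity as a stand-in for the cluster dissimilarity defined in F, and the function names are ours, not the paper's:

```python
import random
from collections import Counter

def mismatches(a, b):
    """Simple-matching dissimilarity: number of differing attribute values."""
    return sum(x != y for x, y in zip(a, b))

def k_modes(objects, k, max_iter=100, seed=None):
    rng = random.Random(seed)
    modes = list(rng.sample(objects, k))         # random initial centers (seeds)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [min(range(k), key=lambda l: mismatches(obj, modes[l]))
                          for obj in objects]
        if new_assignment == assignment:         # objective has stabilized
            break
        assignment = new_assignment
        for l in range(k):
            members = [o for o, a in zip(objects, assignment) if a == l]
            if members:                          # update center to attribute modes
                modes[l] = tuple(Counter(col).most_common(1)[0][0]
                                 for col in zip(*members))
    return assignment
```

With distinct initial seeds drawn from each natural group, the sketch converges to the expected split; with degenerate seeds, it may collapse to one cluster, which reflects the seed sensitivity the text describes.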

Results of the Hierarchical Clustering Validation Evaluation
The NMI and ARI scores of the partitions chosen by each internal CVI in the hierarchical clustering are shown in Tables 6 and 7; the bracketed figures are the performance ranks over indexes. In general, the indexes CUBAGE and CU 1/k performed the best when evaluating the layers of the hierarchical clustering. The results of indexes −F and −E were worse than those of the other indexes in both NMI and ARI. Figure 6 illustrates the change of validation scores over the layers of the hierarchy; as we can see, the indexes CUBAGE and CU 1/k matched the benchmark evaluation criteria the best.

Results of the k-Modes Clustering Validation Evaluation
The NMI and ARI scores of the partitions chosen by each internal CVI in the k-modes clustering are shown in Tables 8-11.
When k was not determined (Tables 8 and 9), CUBAGE outperformed the other indexes on most of the datasets, and CU 1/k came second. When k was determined (Tables 10 and 11), CUBAGE still outperformed the others, while indexes −E and −F advanced in performance. Indexes F and R had lower sensitivity to the value distribution, since they only consider the most and/or the least common values in the cluster; the performances of these CVIs were below average in the experiments. The performance drop of F appeared when the number of clusters was unknown, similar to index E. As the compactness measures of F and R are equivalent, by comparing the trends in the hierarchical clustering experiments (Figure 6), we can observe that the compactness measure of R had little effect on the partition evaluation, and the role of the separation was dominant.
Index CU 1/k uses 1/k as the separation coefficient, without respect to the value distribution between clusters. In the k-modes clustering experiments, the performance of CU 1/k dropped on most occasions when the number of clusters changed from known to unknown, as shown in Figure 7. This indicates that the separation assumption is not universally suitable, although it corrected the monotonicity of the compactness core.

The performance of the index Clope r is highly dependent on the parameter r. As we discussed, the effect of compactness drops as r decreases. In the hierarchical clustering experiments, the value of Clope 1 (P) monotonically decreased as the number of clusters increased under the effect of the separation measure. For the same reason, in the k-modes clustering, Clope 1 (P) was outperformed by Clope 2 (P) and Clope 3 (P). However, Clope 2 (P) and Clope 3 (P) outperformed each other on different datasets, which indicates that setting r appropriately to compromise between the compactness and the separation is difficult. Additionally, the performance drop of Clope r when the number of clusters changed from known to unknown, as shown in Figure 7, indicates that the separation assumption of index Clope r is not suitable for most datasets either.
Such a performance drop happened least with our internal CVI, CUBAGE. The separation measure AGE showed advantageous applicability to the datasets and clustering methods, and coordinated well with the compactness measure E −1 . When the number of clusters was unknown, CUBAGE improved upon index E by introducing inter-cluster information. When the number was pre-known, the partitions chosen by E were identical to those chosen by CUBAGE on the datasets Voting, Breast, and Mushroom (datasets with two classes), which agrees with Equation (47); CUBAGE also performed better than E on the other datasets when k was fixed, which indicates that the separation measure AGE not only contributes to the performance of evaluating partitions of different sizes, but also has a positive impact when evaluating partitions of the same size. For comparison, R was outperformed by its equivalent compactness measure F when k was fixed, and the separation coefficient of CU 1/k had no impact on the performance, since 1/k is a constant under such circumstances. As a result, CUBAGE performed better than the other internal CVIs in general in the conducted experiments.

Conclusions and Future Work
This paper studied internal clustering validation measures for categorical data. We analyzed the compactness and separation measures or assumptions of five well-known internal CVIs, and proposed a new index, CUBAGE.
The analysis showed that the indexes without separation measures based on the attribute distribution do not necessarily ignore the impact of separation, since some indexes (i.e., CU 1/k and Clope r ) adjust the evaluation results by separation assumptions with respect to the partition size or the object distribution, although these assumptions may be crude and not universally suitable.
The compactness cores of the indexes are all monotonic with respect to the number of clusters in the hierarchical clustering, which biases them toward partitions with more clusters. The separation measures or assumptions corrected such biases through their preference for concentrated partitions. Therefore, the coordination of separation and compactness affects the evaluation considerably. For instance, as discussed, the role of the parameter r is to adjust the relative importance of the compactness and the separation in the index Clope r , which influenced its effectiveness.
The proposed internal CVI, CUBAGE, is based on a new separation measure that uses the averaged information gain of isolating each cluster to measure the overall separation of the partition. Theoretical analysis showed that this separation measure scores the minimum possible value when the objects are least separated. Meanwhile, CUBAGE uses the reciprocal entropy of the dataset conditioned on the partition, which is also the reciprocal of index E, to measure the overall compactness. As the product of the separation and compactness measures, CUBAGE showed better performance in the experiments than the other indexes, which indicates that the separation and compactness measures are accurate and coordinate well on most datasets.
In the future, we will investigate internal CVIs over a broader range to provide systematic studies on the relation between intra-cluster compactness and inter-cluster separation. Secondly, the performance in evaluating partitions with different characteristic patterns can be studied. Additionally, we will analyze the possibility of using CUBAGE as the objective function in the clustering process, from aspects such as convergence speed.

Figure 1. Clustering procedure consists of four steps with a feedback pathway.

Figure 2. This figure illustrates how the information gain of separating each cluster, GE l , is calculated and averaged to represent the whole separation of the partition. In the figure, H U is the entropy of the unpartitioned dataset U (n objects), H l is the entropy of cluster l (n l objects), and H rest is the entropy of the rest of the objects.

Figure 4. Procedures of the hierarchical and k-modes clustering validation experiments.

Figure 5. Determining the external index value of the partition selected by an internal index, where IV and EV are the values of the internal and the external index, respectively.

Figure 6. The validation scores of each layer of the hierarchy on different datasets. On axis OX are the layers, i.e., partitions of different sizes; on axis OY are the values of the validation (quality) of each partition measured by different CVIs. Note that the values of Clope r (P) with the parameter r = 1~3 have been normalized for convenience of comparison.

Figure 7. The number of incidences of increased, equivalent, and decreased performances in both NMI and ARI when the number of clusters changes from pre-known to unknown in the k-modes clustering.

Table 2. Example of a dataset with five partitions. The number of clusters is 2, 3, 4, 5, and 6, respectively.

Table 3. Evaluation results of the partitions in Table 2. * To optimize E and F is to minimize them, while the other functions are to be maximized.

Table 4. The AGE (averaged information gain of isolating each cluster) and CUBAGE (clustering utility based on AGE) outcomes of the partitions in Table 2.