A Similarity Measurement with Entropy-Based Weighting for Clustering Mixed Numerical and Categorical Datasets

Abstract: Many mixed datasets with both numerical and categorical attributes have been collected in various fields, including medicine, biology, etc. Designing appropriate similarity measurements plays an important role in clustering these datasets. Many traditional measurements treat various attributes equally when measuring the similarity. However, different attributes may contribute differently, as the amount of information they contain can vary considerably. In this paper, we propose a similarity measurement with entropy-based weighting for clustering mixed datasets. The numerical data are first transformed into categorical data by an automatic categorization technique. Then, an entropy-based weighting strategy is applied to denote the different importance of various attributes. We incorporate the proposed measurement into an iterative clustering algorithm, and extensive experiments show that this algorithm outperforms the OCIL and K-Prototype methods with 2.13% and 4.28% improvements, respectively, in terms of accuracy on six mixed datasets from UCI.


Introduction
The main purposes of clustering analyses are to discover the implicit class structure in the data and divide the physical or abstract objects into different classes, where the similarity between a pair of objects in the same class is large and in different classes is small. As a major exploratory data analysis tool, clustering analysis has been widely researched and applied in many fields, such as sociology, biology, medicine, etc. [1][2][3]. Most current methods are designed to address single dataset types (numerical or categorical). For example, classical clustering methods, such as the k-means algorithm [4,5], the EM algorithm [6], etc., are limited to numerical datasets, while some algorithms are also proposed for clustering categorical datasets [7,8]. However, in the medical and biology fields, many datasets are collected with both numerical and categorical attributes. Hence, many researchers are dedicated to discovering clustering algorithms for mixed types of datasets with categorical and numerical attributes [9,10].
Many unsupervised clustering algorithms for mixed datasets have been proposed over the years, which can be classified into two types. The first type designs different similarity measurements for numerical and categorical data and then calculates the weighted sum of the two parts. For example, the K-Prototypes algorithm [11] for clustering mixed datasets was put forward simply by combining the k-means algorithm and the K-Modes algorithm, which are used for single types of numerical and categorical datasets, respectively. Additionally, the OCIL algorithm proposed by Cheung and Jia [12] is an iterative clustering learning algorithm based on object-cluster similarity metrics.
In the second type, the algorithms transform categorical attributes into numerical ones and then apply clustering methods designed for purely numerical datasets to the transformed dataset, or vice versa. The most direct method is to map categorical values into numerical vectors. If a categorical attribute contains n unique values, then each value is mapped into an n-dimensional vector. This strategy increases the dataset dimensions, resulting in higher computational complexity. Alternatively, numerical attributes can be transformed into categorical ones. For instance, SpectralCAT, proposed by David and Averbuch [13], automatically transforms high-dimensional data into categorical data and then applies spectral clustering [14] to reduce the dimensionality of the transformed datasets through automatic non-linear transformations.
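The direct mapping of categorical values into n-dimensional vectors can be illustrated with a minimal sketch. The function name `one_hot_encode` and the example column are hypothetical and only serve the illustration; sorting the categories is an arbitrary choice made here for reproducibility.

```python
def one_hot_encode(values):
    """Map each categorical value to an n-dimensional 0/1 vector,
    where n is the number of unique values in the column."""
    categories = sorted(set(values))               # fixed, reproducible order
    position = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        vec = [0] * len(categories)
        vec[position[v]] = 1                       # single 1 marks the value
        vectors.append(vec)
    return categories, vectors


# Hypothetical column with three unique values, so each vector has 3 entries.
categories, vectors = one_hot_encode(["red", "green", "red", "blue"])
```

A column with n unique values therefore adds n numerical dimensions, which is the growth in dimensionality mentioned above.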
When designing clustering algorithms, the similarity or dissimilarity measurement plays an important role. Due to the different nature of numerical attributes and categorical attributes, they should be handled differently. Numerical data use a continuous variable to represent the values of each attribute, and a common distance such as the Euclidean distance usually measures the similarity between numerical objects. However, the values of the categorical data have neither a natural ordering nor a common scale. Due to this distinct nature of these two different data types, methods designed for single-type datasets cannot be applied to other types of datasets. The most direct way is the second of the two types mentioned above. However, this method ignores the similarity information in the categorical attribute values [15]. Therefore, the Hamming distance is used in many dissimilarity measurements. For example, in the K-Prototypes algorithm, the dissimilarity measurement uses the Euclidean distance for the numerical attributes and the Hamming distance for the categorical attributes. This algorithm also controls the contribution of the numerical attributes and the categorical attributes through a user-defined parameter. The K-Prototypes algorithm is simple and easy to implement, so it has been widely used in clustering mixed datasets. However, when implementing similarity measurements for categorical attributes, the Hamming distance is rough, and the clustering result is very sensitive to this parameter in the K-Prototypes algorithm. Subsequently, some improved similarity measurements for categorical attributes are proposed, which are based on the frequency of categorical values, the co-occurrence, and the conditional probability estimate [7,16,17]. Based on these improved similarity measurements for categorical attributes, some combined similarity measurements for both categorical and numerical datasets have been developed. 
For instance, the OCIL algorithm [12] uses the frequency of categorical object values that occur in the cluster for categorical attributes and the numerical distance for numerical attributes when measuring similarity.
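As an illustration of the K-Prototypes style of dissimilarity described above, the following sketch combines a numerical term with a Hamming term scaled by the user-defined parameter. It assumes the commonly stated form with a squared Euclidean term; the function name, argument names, and example values are hypothetical.

```python
def k_prototypes_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma):
    """Squared Euclidean distance on numerical attributes plus
    gamma times the Hamming distance on categorical attributes."""
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical_part = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric_part + gamma * categorical_part


# Hypothetical object and prototype: two numerical and two categorical attributes.
d = k_prototypes_dissimilarity([0.2, 0.5], ["yes", "A"],
                               [0.1, 0.9], ["no", "A"], gamma=1.5)
```

Because `gamma` multiplies the whole categorical term, a change in its value shifts the balance between the two attribute types for every object at once, which is why the clustering result is sensitive to it.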
In many practical applications, each attribute contributes differently to the desired clustering results, which should be considered when measuring the similarities. For example, suppose we want to cluster a mammographic mass dataset into two groups, corresponding to benign types and malignant types. In this task, the age attribute may play a more important role than the mass density attribute. Therefore, it is very important to identify different attribute contributions to improve the quality of the clustering results. Some researchers have recognized this problem and proposed several strategies. However, most research focuses on single-type datasets; e.g., for categorical attributes, the weights could be assigned based on the overall distribution of attribute values [18] or based on the frequency with which the class center appears and the average distance between objects and the clustering center [19]. When handling mixed datasets, current algorithms only assign weights for single-type attributes. For example, when measuring the similarity, the OCIL algorithm [12] assigns a weight only to each categorical attribute based on information entropy and lets each numerical attribute take the same weight, which weakens the importance of the numerical attributes. Ahmad and Dey proposed an algorithm for mixed datasets by adding weights to only numerical attributes [20]. Both the numerical attributes and the categorical attributes should be evaluated when designing the similarity, and the weight strategy should be applied to both types of attributes in order to reduce the computational complexity.
In this paper, we propose a similarity measurement with entropy-based weighting for mixed datasets with both categorical and numerical attributes. First, a similarity metric for the categorical attributes is designed by assigning a different weight to each attribute based on information entropy theory. Second, we present an automatic categorization technique that transforms numerical data into categorical data, which is achieved by automatically discovering the optimal number of categorizations for each attribute based on the Calinski-Harabasz index. Then, the similarity metric for categorical data can be used to measure the similarity for transformed data. In this way, this similarity measurement can be applied to the mixed dataset containing both numerical and categorical attributes. Subsequently, this similarity measurement with entropy-based weighting is applied to the k-means framework. We accessed several datasets from UCI and compared the proposed algorithm with the OCIL and K-Prototype methods on mixed datasets as well as with the k-means algorithm on numerical datasets. The experimental results show that the iterative clustering algorithm based on the proposed similarity measurement is superior to these three algorithms.
The remainder of this paper is organized as follows. Section 2 introduces the problem formulation and then proposes a similarity measurement with entropy-based weighting for mixed datasets and applies this similarity measurement to the k-means algorithm framework. In Section 3, experiments are conducted to compare the proposed algorithm with three existing methods. Finally, we draw conclusions in Section 4.

Problem Formulation
Clustering means classifying the given unlabeled objects into several clusters according to certain criteria, so similar objects are classified as one cluster, and dissimilar objects are assigned to different clusters.
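For a given mixed dataset $X$ consisting of $m$ objects, denoted as $\{x_1, x_2, \ldots, x_m\}$, suppose $X$ has $d_c$ categorical attributes and $d_u$ numerical attributes. Then, each object can be written as $x_i = [x_i^c, x_i^u]$, where $x_i^c = [x_{i,1}^c, x_{i,2}^c, \ldots, x_{i,d_c}^c]$ holds its categorical values and $x_i^u = [x_{i,1}^u, x_{i,2}^u, \ldots, x_{i,d_u}^u]$ holds its numerical values. The requirement is to cluster the dataset $X$ into $k$ different clusters, denoted as $C_1, C_2, \ldots, C_k$. The optimal partition matrix $T^*$ can be found through the following objective function:
$$T^* = \arg\max_{T} \sum_{j=1}^{k} \sum_{i=1}^{m} t_{ij}\, s(x_i, C_j) \tag{1}$$
where $s(x_i, C_j)$ is the similarity between object $x_i$ and cluster $C_j$, and $T = (t_{ij})$ is an $m \times k$ partition matrix with $t_{ij} \in \{0, 1\}$ and $\sum_{j=1}^{k} t_{ij} = 1$, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, k$. Here, $t_{ij} = 1$ indicates that object $x_i$ is assigned to cluster $j$.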
According to Equation (1), the clusters can be obtained as long as the metric function of the similarity between object $x_i$ and cluster $C_j$ is determined. Because the information implied by each attribute is different, its contribution to the clustering result is also different. We therefore define a new similarity in which each attribute is assigned a weight, denoted as $w_r$, satisfying $0 \le w_r \le 1$ and $\sum_{r=1}^{d_c + d_u} w_r = 1$. Then the similarity between object $x_i$ and cluster $C_j$ can be measured by the following equation:
$$s(x_i, C_j) = \sum_{r=1}^{d_c} w_r^c\, s^c(x_{i,r}^c, C_j) + \sum_{r=1}^{d_u} w_r^u\, s^u(x_{i,r}^u, C_j) \tag{2}$$
where $w_r^c$ and $s^c(x_{i,r}^c, C_j)$ are the weight and similarity on the categorical attribute, respectively, and $w_r^u$ and $s^u(x_{i,r}^u, C_j)$ are the weight and similarity on the numerical attribute, respectively. Here, $s^c(x_i^c, C_j) = \sum_{r=1}^{d_c} w_r^c\, s^c(x_{i,r}^c, C_j)$ represents the similarity on categorical attributes and $s^u(x_i^u, C_j) = \sum_{r=1}^{d_u} w_r^u\, s^u(x_{i,r}^u, C_j)$ represents the similarity on numerical attributes. In the following sections, we study how to calculate the weight and similarity on each attribute.

Similarity Measurement for Categorical Attributes
For categorical attributes, each pair of distinct values chosen from the value domain is considered to have the same distance, as the values do not have a natural ordering. By contrast, each pair of values of a numerical attribute has a numerical distance. Because of this difference, it is not appropriate to use the Euclidean distance to evaluate the similarity on categorical attributes. Instead, we adopt the frequency with which the value $x_{i,r}^c$ appears in the cluster $C_j$ for the categorical attribute $A_r^c$, where $A_r^c$ ($r = 1, 2, \ldots, d_c$) represents the $r$th categorical attribute.
Definition 1. The similarity between a categorical attribute value $x_{i,r}^c$ and cluster $C_j$, where $i \in \{1, 2, \ldots, m\}$, $r \in \{1, 2, \ldots, d_c\}$, $j \in \{1, 2, \ldots, k\}$, is defined as
$$s^c(x_{i,r}^c, C_j) = \frac{\sigma_{A_r^c = x_{i,r}^c}(C_j)}{\sigma_{A_r^c \neq \mathrm{NULL}}(C_j)} \tag{3}$$
where $\sigma_{A_r^c = x_{i,r}^c}(C_j)$ represents the number of objects in cluster $C_j$ whose value for the categorical attribute $A_r^c$ is equal to $x_{i,r}^c$, NULL means empty, and $\sigma_{A_r^c \neq \mathrm{NULL}}(C_j)$ represents the number of objects in cluster $C_j$ whose value for the categorical attribute $A_r^c$ is not empty.
From Definition 1, we can find the following properties: (1) $0 \le s^c(x_{i,r}^c, C_j) \le 1$; (2) $s^c(x_{i,r}^c, C_j) = 1$ only if all the non-empty values of $A_r^c$ of the objects belonging to cluster $C_j$ are equal to $x_{i,r}^c$.
Optimizing attribute weights can improve the clustering performance. In information theory, the inhomogeneity degree of the dataset with respect to an attribute can be used to measure the significance of this attribute. In addition, according to Measure III proposed in [21], the higher the information content of an attribute, the higher the inhomogeneity degree of this attribute.
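The frequency-based similarity of Definition 1 can be sketched in a few lines. This is a minimal illustration, assuming missing values are represented by `None` and that the similarity is taken as 0 when every value in the cluster is missing (a case the definition does not cover); the names are hypothetical.

```python
def categorical_similarity(value, cluster_values):
    """Fraction of the non-missing values of one attribute in a cluster
    that are equal to `value`."""
    non_null = [v for v in cluster_values if v is not None]
    if not non_null:
        # Assumption of this sketch: undefined ratio is treated as 0.
        return 0.0
    return sum(1 for v in non_null if v == value) / len(non_null)


# Values of one categorical attribute for the five objects of a cluster.
cluster_column = ["a", "a", "b", None, "a"]
s_a = categorical_similarity("a", cluster_column)  # 3 of 4 non-missing values
s_c = categorical_similarity("c", cluster_column)  # value never occurs
```

The result always lies between 0 and 1 and reaches 1 only when every non-missing value in the cluster equals the queried value.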

Definition 2. Since the value domain of each attribute is definite, the values of each attribute can be regarded as discrete and independent. The significance of an arbitrary categorical attribute $A$ in dataset $X$ can be quantified by the following entropy metric:
$$H_A = -\sum_{g=1}^{h} p(a_g) \log p(a_g) \tag{4}$$
where $A$ has a value domain, denoted as $\mathrm{dom}(A)$, which consists of all the possible values that attribute $A$ can take, and $\mathrm{dom}(A)$ can be represented as $\{a_1, a_2, \ldots, a_h\}$, where $a_g$ is a value of attribute $A$, $a_g \in \mathrm{dom}(A)$, $g = 1, 2, \ldots, h$, and $p(a_g)$ is the probability of $a_g$ in dataset $X$ for attribute $A$.
According to Equation (4), an attribute with more varying values has higher significance. However, in practice, an attribute with too many different values may contribute little to clustering; for example, the instance ID number is unique for each instance, yet this information is useless for clustering analysis [12]. Thus, Equation (4) is modified into Equation (5):
$$H'_A = -\frac{1}{h} \sum_{g=1}^{h} p(a_g) \log p(a_g) \tag{5}$$
Then, the weight of each categorical attribute based on information entropy is defined as in Equation (6):
$$w_r^c = \frac{H'_{A_r^c}}{\sum_{t=1}^{d_c} H'_{A_t^c} + \sum_{t=1}^{d_u} H'_{A_t^u}}, \quad r = 1, 2, \ldots, d_c \tag{6}$$
where $\sum_{t=1}^{d_u} H'_{A_t^u}$ denotes the sum of the modified information entropy of all the numerical attributes, which will be described in detail in the next section. Therefore, the metric function of similarity between object $x_i^c$ and cluster $C_j$ on categorical attributes is modified as Equation (7):
$$s^c(x_i^c, C_j) = \sum_{r=1}^{d_c} w_r^c\, s^c(x_{i,r}^c, C_j) \tag{7}$$
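The entropy-based weighting can be sketched as follows. Treat the normalization as an assumption of this sketch: the modified entropy is taken to be the Shannon entropy divided by the number of distinct values h, so that an ID-like attribute is penalized. The function names, the uniform fallback for an all-zero total, and the example columns are hypothetical; the natural logarithm is used, which does not affect the weights because they are ratios.

```python
import math
from collections import Counter


def modified_entropy(values):
    """Shannon entropy of one attribute divided by its number of distinct
    values h (assumed form of the modification)."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    n, h = len(non_null), len(counts)
    if h == 0:
        return 0.0
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / h


def entropy_weights(columns):
    """One weight per attribute, proportional to its modified entropy,
    so that all weights sum to 1."""
    scores = [modified_entropy(col) for col in columns]
    total = sum(scores)
    if total == 0:
        # Assumption: fall back to uniform weights if no attribute varies.
        return [1.0 / len(columns)] * len(columns)
    return [s / total for s in scores]


# Hypothetical attributes: balanced binary, ID-like, and constant.
binary = ["a", "b"] * 4
id_like = [str(i) for i in range(8)]
constant = ["x"] * 8
weights = entropy_weights([binary, id_like, constant])
```

In this toy example the balanced binary attribute receives the largest weight, the ID-like attribute a smaller one, and the constant attribute a weight of zero.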

Similarity Measurement for Numerical Attributes
Since the entropy-based weighting strategy proposed in Section 2.2 is not directly applicable to numerical attributes, we first discretize the numerical data. Then, the similarity measurement is applied to the discretized data, which are now categorical. Discretization of numerical data is gaining more attention from the machine learning community [22]. Discretization of a given continuous attribute is also called quantization, which divides the range of the attribute into intervals. Each interval is marked with an interval label, and the interval labels replace the original continuous data. Discretization reduces the number of continuous attribute values [23], thereby simplifying the original data. Discretization also makes it possible for categorical clustering methods to be applied to numerical or mixed datasets. There are many methods for numerical dataset discretization, such as discretization by intuitive division, histogram analysis, cluster analysis and entropy-based discretization. This section defines a way to automatically discretize numerical data by cluster analysis so that numerical data are transformed into categorical data. This method also provides a measure to find the optimal number of clusters for discretizing the original data.
In order to transform numerical data into categorical data, we transform the numerical data attribute by attribute. Formally, let $X_l = [x_{1,l}^u, x_{2,l}^u, \ldots, x_{m,l}^u]$ be a single numerical attribute in $X$. $X_l$ is transformed into the categorical values $\tilde{X}_l = [\tilde{x}_{1,l}^u, \tilde{x}_{2,l}^u, \ldots, \tilde{x}_{m,l}^u]$ ($l = 1, 2, \ldots, d_u$). As a result, each point $x_{i,l}^u$ is mapped to a categorical value $\tilde{x}_{i,l}^u$. Before categorizing numerical data by applying a clustering method to the data, the optimal number of categories is required, which is critical for the success of the categorization process.
Different methods have been proposed to find the optimal number of clusters for numerical attribute data [24,25]. The most common way is to apply a clustering algorithm to the data and calculate the cluster validity index. This process is repeated with an increasing number of clusters until the index achieves its first local maximum. The number of clusters corresponding to the first local maximum is chosen as the optimal number of categories.
Let $q$ be the number of clusters, which is unknown at first, and let $f_q$ be a clustering function that assigns each $x_{i,l}^u \in X_l$ ($i = 1, 2, \ldots, m$) to one of the $q$ clusters in $Z_l$, where $Z_l = \{z_1^l, z_2^l, \ldots, z_q^l\}$ and $z_j^l = \{x_{i,l}^u \mid f_q(x_{i,l}^u) \in z_j^l,\ i = 1, 2, \ldots, m,\ z_j^l \in Z_l\}$. The total sum of squares of $X_l$ is defined as $S^l = \sum_{i=1}^{m} (x_{i,l}^u - \bar{x}_l)(x_{i,l}^u - \bar{x}_l)^T$, where $\bar{x}_l = \frac{1}{m} \sum_{i=1}^{m} x_{i,l}^u$ is the mean of $X_l$. The within-cluster sum of squares is defined as $S_w^l(q) = \sum_{j=1}^{q} \sum_{x_l \in z_j^l} (x_l - u_j^l)(x_l - u_j^l)^T$, where the mean of each cluster $u_j^l$ is defined as $u_j^l = \frac{1}{|z_j^l|} \sum_{x_l \in z_j^l} x_l$. $S_w^l(q)$ denotes the sum of squared deviations from each point to the center of its associated cluster, and the $S_w^l(q)$ of a good clustering should be small. The between-cluster sum of squares is defined as $S_b^l(q) = \sum_{j=1}^{q} |z_j^l| (u_j^l - \bar{x}_l)(u_j^l - \bar{x}_l)^T$, which denotes the sum of the weighted squared distances between the center of each of the $q$ clusters and the center of the data, and the $S_b^l(q)$ of a good clustering should be large. It is clear that $S^l = S_w^l(q) + S_b^l(q)$; thus, the total sum of squares equals the sum of the within-cluster sum of squares and the between-cluster sum of squares.
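The decomposition of the total sum of squares into its within-cluster and between-cluster parts can be checked numerically. The sketch below is a minimal illustration for a single numerical attribute; the function name and the toy values and labels are hypothetical.

```python
def sums_of_squares(values, labels):
    """Return (total, within-cluster, between-cluster) sums of squares
    for one numerical attribute and a given cluster assignment."""
    m = len(values)
    grand_mean = sum(values) / m
    clusters = {}
    for v, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(v)

    total = sum((v - grand_mean) ** 2 for v in values)
    within = 0.0
    between = 0.0
    for members in clusters.values():
        center = sum(members) / len(members)
        within += sum((v - center) ** 2 for v in members)
        between += len(members) * (center - grand_mean) ** 2
    return total, within, between


# Two well-separated groups of three points each.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [0, 0, 0, 1, 1, 1]
total, within, between = sums_of_squares(values, labels)
```

For this assignment the within-cluster part is small and the between-cluster part is large, which is the pattern expected of a good clustering.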
The Calinski-Harabasz index is adopted to evaluate the clustering validity, which is defined as $S_{q,m}^l = \frac{(m - q)\, S_b^l(q)}{(q - 1)\, S_w^l(q)}$. The proof of the effectiveness of the Calinski-Harabasz index is shown in [13]. We applied a clustering method $f_q$ to the data $X_l$ and calculated the corresponding Calinski-Harabasz index $S_{q,m}^l$ of the clusters for $q = 2, 3, \ldots$. When the validity index $S_{q,m}^l$ achieved the first local maximum, we chose the corresponding $q$ as the optimal number of categories, denoted as $q_{best}^l$. To demonstrate the automatic categorization process, an example of the Calinski-Harabasz index calculation results is shown in Figure 1. In this example, the k-means method was chosen as $f_q$. For $q = 2, 3, \ldots, 100$, the k-means method was applied to the data and the validity index of the corresponding clustering result was calculated. The first local maximum of the Calinski-Harabasz index is found at $q = 8$; therefore, $q_{best}^l = 8$. The automatic categorization process is summarized in Algorithm 1.

Algorithm 1 Automatic categorization for numerical attributes.
Input: $X_l = [x_{1,l}^u, x_{2,l}^u, \ldots, x_{m,l}^u]$: the $l$th numerical attribute in the dataset $X$; $f_q$: a clustering function that partitions $X_l$ into $q$ clusters and returns the corresponding assignments; $q_{max}$: maximum number of categories to examine;
Output: $\tilde{X}_l = [\tilde{x}_{1,l}^u, \tilde{x}_{2,l}^u, \ldots, \tilde{x}_{m,l}^u]$: the categorical values of $X_l$;
1: for $q = 2$ to $q_{max}$ do
2:     $S(q)$ = CalinskiHarabasz($f_q(X_l)$, $X_l$) (the Calinski-Harabasz index of the clustering result);
3: end for
4: $q_{best} = \min \{ q \in \{2, 3, \ldots, q_{max}\} : \mathrm{localMax}(S(q)) = \mathrm{True} \}$ (the first $q$ for which $S(q)$ achieves a local maximum);
5: $\tilde{X}_l = f_{q_{best}}(X_l)$;
6: return $\tilde{X}_l$
When the optimal number of categories $q_{best}^l$ is found, each $x_{i,l}^u \in X_l$, $i = 1, 2, \ldots, m$, is allocated to one of the $q_{best}^l$ clusters $z_j^l \in Z_l$ by the clustering method $f_{q_{best}^l}$, and the corresponding categorical value of $x_{i,l}^u$ is set as $j$. This process repeats for each numerical attribute $X_l$ in dataset $X$, $l = 1, 2, \ldots, d_u$. After this automatic categorization process, we have found the optimal number of categories of each numerical attribute and transformed the original numerical data into categorical data, with $\tilde{x}_i^u = [\tilde{x}_{i,1}^u, \tilde{x}_{i,2}^u, \ldots, \tilde{x}_{i,d_u}^u]$ ($i = 1, 2, \ldots, m$).
Since the numerical data of the original dataset $X$ are transformed into categorical data, we can apply the similarity measurement for categorical data to the transformed data. The weight of each numerical attribute is calculated based on Equation (8):
$$w_r^u = \frac{H'_{A_r^u}}{\sum_{t=1}^{d_c} H'_{A_t^c} + \sum_{t=1}^{d_u} H'_{A_t^u}}, \quad r = 1, 2, \ldots, d_u \tag{8}$$
where $A_r^u$ ($r = 1, 2, \ldots, d_u$) represents each attribute of the transformed data and $H'_A$ denotes the modified information entropy of attribute $A$ given by Equation (5). Therefore, the metric function of similarity between object $x_i^u$ and cluster $C_j$ on numerical attributes is defined as:
$$s^u(x_i^u, C_j) = \sum_{r=1}^{d_u} w_r^u\, s^u(x_{i,r}^u, C_j) \tag{9}$$
where $s^u(x_{i,r}^u, C_j)$ is computed as in Definition 1 on the transformed value $\tilde{x}_{i,r}^u$.
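The automatic categorization of one numerical attribute can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: `kmeans_1d` is a tiny one-dimensional k-means with deterministic, evenly spaced initial centers (chosen here for reproducibility), the index is computed over non-empty clusters only, and a boundary value counts as a local maximum if it exceeds its single neighbor. All names are hypothetical.

```python
def kmeans_1d(values, q, iters=100):
    """Minimal 1-D k-means with evenly spaced initial centers."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * (j + 0.5) / q for j in range(q)]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(q), key=lambda j: (v - centers[j]) ** 2)
                      for v in values]
        if new_labels == labels:          # assignments stable: converged
            break
        labels = new_labels
        for j in range(q):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:                   # empty clusters keep their center
                centers[j] = sum(members) / len(members)
    return labels


def calinski_harabasz(values, labels):
    """Calinski-Harabasz index of a 1-D clustering (non-empty clusters only)."""
    m = len(values)
    grand_mean = sum(values) / m
    clusters = {}
    for v, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(v)
    q = len(clusters)
    if q < 2:
        return 0.0
    within = between = 0.0
    for members in clusters.values():
        center = sum(members) / len(members)
        within += sum((v - center) ** 2 for v in members)
        between += len(members) * (center - grand_mean) ** 2
    if within == 0.0:
        return float("inf")
    return (between / (q - 1)) / (within / (m - q))


def first_local_max(scores):
    """Smallest q (scores[0] belongs to q = 2) whose score is a local maximum."""
    for i, s in enumerate(scores):
        left_ok = i == 0 or s > scores[i - 1]
        right_ok = i == len(scores) - 1 or s >= scores[i + 1]
        if left_ok and right_ok:
            return i + 2


def categorize(values, q_max):
    """Return (q_best, categorical labels) for one numerical attribute."""
    labelings = [kmeans_1d(values, q) for q in range(2, q_max + 1)]
    scores = [calinski_harabasz(values, labels) for labels in labelings]
    q_best = first_local_max(scores)
    return q_best, labelings[q_best - 2]


# Toy attribute with three visible groups.
column = [1.0, 1.2, 1.4, 4.0, 4.2, 4.4, 9.0, 9.2, 9.4]
q_best, categories = categorize(column, q_max=5)
```

On this toy attribute the index rises sharply from two to three clusters and does not improve afterwards, so three categories are selected.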

Similarity Measurement for Mixed Data
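Combining Sections 2.2 and 2.3, the similarity measurement for mixed data is defined as:
$$s(x_i, C_j) = s^c(x_i^c, C_j) + s^u(x_i^u, C_j) = \sum_{r=1}^{d_c} w_r^c\, s^c(x_{i,r}^c, C_j) + \sum_{r=1}^{d_u} w_r^u\, s^u(x_{i,r}^u, C_j) \tag{10}$$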

Iterative Clustering Algorithm
Based on Equation (10), the application of the similarity measurement with entropy-based weighting to the k-means framework is summarized in Algorithm 2.
Steps 1-3 utilize the automatic categorization process to obtain transformed categorical datasets based on Algorithm 1. Since the attributes of the transformed dataset are all categorical, the weight of each attribute can be calculated with the entropy-based weighting strategy, and Steps 4-9 show the process. Steps 10-21 are the iterative process that applies the similarity measurement based on Equation (10) into the k-means algorithm framework to address the transformed dataset.

Algorithm 2 Iterative clustering algorithm with entropy-based weighting.
Input: $X = \{x_1, x_2, \ldots, x_m\}$ (dataset to cluster with $d_c$ categorical attributes and $d_u$ numerical attributes); $k$ (number of clusters); $f_q$ (a clustering function that partitions $X_l$ into $q$ clusters and returns the corresponding assignments); $q_{max}$ (maximum number of categories to examine);
Output: $idx = \{idx_1, idx_2, \ldots, idx_m\}$ (an assignment of each point in $X$ to one of $k$ clusters);
1: for $l = 1$ to $d_u$ do
2:     $\tilde{X}_l$ = Categorize($X_l$, $f_q$, $q_{max}$) (automatic categorization of $X_l$);
3: end for
4: for $r = 1$ to $d_c$ do
5:     calculate $w_r^c$ according to Equation (6) (the importance of each categorical attribute);
6: end for
7: for $r = 1$ to $d_u$ do
8:     calculate $w_r^u$ according to Equation (8) (the importance of each numerical attribute);
9: end for
10: Set $idx = \{0, 0, \ldots, 0\}$ and select $k$ initial objects randomly as the $k$ initial centroids, one for each cluster;
11: repeat
12:     noChange = true;
13:     for $i = 1$ to $m$ do
14:         $idx_i^{new} = \arg\max_{j \in \{1, \ldots, k\}} s(x_i, C_j)$;
15:         if $idx_i^{new} \neq idx_i$ then
16:             noChange = false; $idx_i = idx_i^{new}$;
17:             Update the information of clusters $C$, including the frequency of each categorical value;
18:         end if
19:     end for
20: until (noChange = true)
21: return $idx$
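The iterative part of Algorithm 2 can be sketched as follows for a dataset whose attributes are already categorical (that is, after the automatic categorization) and whose weights have already been computed. This is a simplified illustration, not the authors' code: initial objects are passed explicitly instead of being drawn at random so that the run is reproducible, missing values are assumed absent, and all names and the toy data are hypothetical.

```python
from collections import Counter


def similarity(obj, counts, size, weights):
    """Weighted sum over attributes of the relative frequency of the
    object's value inside one cluster."""
    if size == 0:
        return 0.0
    return sum(w * counts[r][obj[r]] / size for r, w in enumerate(weights))


def iterative_clustering(data, k, weights, initial_indices, max_iter=100):
    m, d = len(data), len(weights)
    counts = [[Counter() for _ in range(d)] for _ in range(k)]  # value frequencies
    sizes = [0] * k
    idx = [-1] * m                       # -1 marks "not yet assigned"

    def move(i, old, new):
        """Move object i from cluster `old` (or -1) to cluster `new`."""
        for r in range(d):
            if old >= 0:
                counts[old][r][data[i][r]] -= 1
            counts[new][r][data[i][r]] += 1
        if old >= 0:
            sizes[old] -= 1
        sizes[new] += 1
        idx[i] = new

    for j, i in enumerate(initial_indices):   # seed each cluster with one object
        move(i, -1, j)

    for _ in range(max_iter):
        no_change = True                       # reset at the start of every pass
        for i in range(m):
            best = max(range(k), key=lambda j: similarity(
                data[i], counts[j], sizes[j], weights))
            if best != idx[i]:
                no_change = False
                move(i, idx[i], best)          # update cluster frequencies
        if no_change:
            break
    return idx


# Toy categorical dataset with two attributes and equal weights.
data = [("a", "x"), ("a", "x"), ("a", "y"),
        ("b", "z"), ("b", "z"), ("b", "y")]
idx = iterative_clustering(data, k=2, weights=[0.5, 0.5], initial_indices=[0, 3])
```

Note that the change flag is reset inside the loop on every pass; if it were set only once before the loop, the stopping condition could never become true again after the first reassignment.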

Results and Discussion
To test the effectiveness of the similarity measurement with the entropy-based weighting proposed in this paper, two different types of datasets, mixed and numerical datasets, were selected from the UCI Machine Learning Data Repository [26], and most datasets were collected from the fields of biology and medicine. The iterative clustering algorithm based on the proposed similarity measurement was compared with existing clustering algorithms, including OCIL [12], K-Prototype [9] and k-means [4]. k-means was used for datasets made of numerical variables only. In the experiments, the clustering accuracy [27] was adopted to evaluate these methods. The clustering accuracy is defined as
$$AC = \frac{\sum_{i=1}^{m} \delta(l_i, map(idx_i))}{m}$$
where $m$ denotes the number of instances of the dataset, $l_i$ denotes the provided label, $idx_i$ denotes the obtained cluster label, $map(idx_i)$ is a mapping function that maps $idx_i$ to the equivalent label from the data corpus, and the function $\delta(l_i, map(idx_i)) = 1$ only if $l_i = map(idx_i)$; otherwise, the value is 0. Correspondingly, the clustering error rate is defined as $error = 1 - AC$.
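The clustering accuracy can be computed with a short sketch. How the mapping function is obtained is not specified above; this illustration assumes it is the one-to-one mapping from clusters to labels that maximizes the number of matches and finds it by trying every candidate, which is only practical for a small number of clusters. The function name and the toy labels are hypothetical.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids):
    """Accuracy under the best one-to-one mapping of clusters to labels."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than classes: no one-to-one mapping")
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))        # candidate map(idx)
        correct = sum(1 for l, c in zip(true_labels, cluster_ids)
                      if mapping[c] == l)
        best = max(best, correct)
    return best / len(true_labels)


# Toy example: cluster 1 corresponds to "benign", cluster 0 to "malignant",
# and the last object is misassigned.
labels = ["benign", "benign", "malignant", "malignant", "malignant"]
ids = [1, 1, 0, 0, 1]
ac = clustering_accuracy(labels, ids)
error = 1 - ac
```

For larger numbers of clusters, an assignment solver such as the Hungarian algorithm would replace the exhaustive search.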
In the experiments, considering that the clustering results are affected by the selected initial centroids, we set the same initial centroids for all methods during each test, and the following experimental results were averaged from 100 random runs. In addition, k-means was chosen as the clustering method in the automatic categorization process that transforms numerical attributes into categorical attributes. Before using k-means method to cluster, the original data were normalized to be between 0 and 1.

Experiments on Mixed Datasets
In this section, we investigated the performance of the iterative clustering algorithm based on the proposed similarity measurement on mixed datasets. Table 1 shows the information of each dataset. Note that the second column presents the number of samples, the third column presents the number of the two types of attributes, and the last column presents the probability distribution of samples in different classes.
To evaluate the performance of the iterative clustering algorithm based on the proposed similarity measurement, we compare its clustering results with OCIL and K-Prototype. The average value and standard deviation of the clustering error of these clustering algorithms are summarized in Table 2. In the experiments, the weight parameter γ was set to 1.5 for the K-Prototypes algorithm.
From Table 2, it can be observed that the iterative clustering algorithm based on the proposed similarity measurement outperforms the OCIL and K-Prototype methods on the six datasets, although the ratios of the numbers of categorical attributes to numerical attributes differ greatly, as shown in Table 1. Compared to the other two methods, the iterative clustering algorithm improves the accuracy of the clustering results by 2.13% and 4.28%, respectively. On the Heart dataset in particular, the iterative clustering algorithm improves the accuracy by 2.85% and 5.86%, respectively. This result indicates that the proposed similarity measurement is applicable to mixed datasets with varying compositions of attribute types and does not need any parameter to weight the two types of attributes. Furthermore, for datasets that have very uneven class distributions, the proposed similarity measurement can also achieve adequate clustering results.
To study why the iterative clustering algorithm based on the proposed similarity measurement outperforms the OCIL and K-Prototype methods, we analyzed the correlation between each attribute and the label attribute in the dataset by calculating the correlation coefficients in statistics. Since the label attribute is a categorical attribute, the Pearson correlation coefficient and the Spearman correlation coefficients are not suitable. The Kendall correlation coefficient requires that the categorical attribute be ordered; therefore, it also cannot be used to calculate the correlation between the categorical attribute and the label attribute. Here, we adopt the ReliefF algorithm, which can estimate the quality of dependencies between each attribute and label attribute [28].
To see whether the correlation between the dependencies and weights affects the clustering results, we calculate the Pearson correlation coefficient between the dependencies calculated by the ReliefF algorithm and the weights calculated by Equations (6) and (8) in each dataset. The result is shown in Table 3. Since the Dermatology dataset as well as the Zoo dataset have only one numerical attribute, calculating the correlation for numerical attributes is meaningless for them.
From Tables 2 and 3, it can be seen that the Dermatology dataset, with a good clustering result, has a strong correlation between dependencies and weights for categorical attributes. In addition, the Australian dataset has a strong correlation not only for categorical attributes but also for numerical attributes; this dataset also has a good clustering result. However, the Hepatitis dataset has a strong correlation for categorical attributes but a weak correlation for numerical attributes. There are only six numerical attributes and 13 categorical attributes in the Hepatitis dataset, so the influence of the categorical attributes is much greater than that of the numerical attributes. Therefore, the clustering result of the Hepatitis dataset is not good, which does not contradict the theory in this article. Overall, a good clustering result may be obtained due to the reasonable weight assigned to each attribute by the proposed similarity measurement.
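The correlation analysis can be sketched with a plain Pearson coefficient. The numbers below are made-up placeholders, not values from Table 3 or from any dataset, and the variable names are hypothetical; in practice the first list would hold the ReliefF dependency estimates and the second the entropy-based weights of the same attributes.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Illustrative placeholder values for four attributes.
dependencies = [0.10, 0.30, 0.05, 0.20]   # stand-in for ReliefF estimates
weights = [0.20, 0.40, 0.10, 0.30]        # stand-in for entropy-based weights
r = pearson(dependencies, weights)
```

A coefficient close to 1 means that attributes judged relevant to the label also receive large weights, which is the situation associated above with good clustering results.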

Experiments on Numerical Datasets
Then, we further investigated the performance of the proposed similarity measurement on pure numerical datasets. Table 4 shows the information of six numerical datasets, including the number of samples, attributes and classes, and the class distribution. To evaluate the performance of the proposed similarity measurement applied to k-means on numerical datasets, we also conducted experiments comparing it with the most classical numerical data clustering algorithm, k-means. The two clustering algorithms were applied to the different numerical datasets; Table 5 shows the mean and variance of the clustering error of each algorithm. In addition, before using the k-means method to cluster, the dataset was normalized to between 0 and 1.
It can be seen that, except for the Wine dataset and the Fertility dataset, the proposed similarity measurement applied to k-means outperforms the k-means method on the other datasets. On average, the clustering accuracy of the iterative clustering algorithm based on the proposed similarity measurement is 6.09% higher than that of k-means; the gain is largest for the Mass dataset, where the accuracy is 26.69% higher than that of k-means.
Similarly, the correlation coefficient between dependencies and weights was calculated to analyze why the iterative clustering algorithm outperforms the k-means method, and the correlation coefficient of each dataset is shown in Table 6. It can be found that the Mass dataset with the best clustering result has the highest correlation coefficient. Perhaps the weights of this dataset are well allocated according to the contribution of each clustering attribute.

Conclusions
In this paper, a similarity measurement with entropy-based weighting is proposed for mixed datasets with numerical and categorical attributes. For categorical datasets, a similarity metric is designed by assigning different weights to each attribute based on information entropy theory. For numerical datasets, the original high-dimensional numerical data are transformed to categorical data by an automatic categorization technique, so the similarity metric for categorical datasets can be applied to numerical datasets. Then, a similarity measurement for mixed datasets was obtained. Extensive experimental results show that the iterative clustering algorithm based on the proposed similarity measurement can achieve higher clustering accuracy and is superior to the existing clustering algorithms on datasets from UCI. The results also validate the feasibility of handling different types of attributes and verify that various attributes contribute differently in similarity measurements when clustering.
Data Availability Statement: Not applicable.

Conflicts of Interest: The authors declare no conflict of interest.