Article

Nonparametric Clustering of Mixed Data Using Modified Chi-Squared Tests

Department of Mathematics and Statistics, York University, Toronto, ON M3J 1P3, Canada
* Author to whom correspondence should be addressed.
Entropy 2022, 24(12), 1749; https://doi.org/10.3390/e24121749
Submission received: 4 October 2022 / Revised: 27 November 2022 / Accepted: 28 November 2022 / Published: 29 November 2022
(This article belongs to the Special Issue Recent Advances in Statistical Theory and Applications)

Abstract

We propose a non-parametric method to cluster mixed data containing both continuous and discrete random variables. The product space of the continuous and discrete sample spaces is transformed into a new product space through adaptive quantization of the continuous part. Cluster patterns on the product space are detected locally using a weighted modified chi-squared test. Our algorithm does not require any user input since the number of clusters is determined automatically by the data. Simulation studies and real data analysis show that our proposed method outperforms the benchmark method, AutoClass, in various settings.

1. Introduction

Mixed data, which contain both continuous and discrete variables, are abundant in scientific research, especially in medical and biological studies. An effective clustering method for mixed data should partition a large complex data set into homogeneous subgroups that are manageable for statistical inference. Clustering methods thus have a wide range of applications in almost all scientific studies, including financial risk analysis, genetic analysis, and medical studies, and they are essential tools for analyzing large data sets.
Most clustering methods in the literature have focused on either continuous data or categorical data alone. The K-means algorithm has been widely used in industrial applications for a long time; detailed descriptions and discussions can be found in Kaufman and Rousseeuw (2009) [1]. Non-Euclidean distances such as the Manhattan distance or the Mahalanobis distance have also been used. Model-based clustering methods for continuous data have been proposed in the literature, see for example Banfield and Raftery (1993) [2]. One of the most prominent parametric clustering methods based on a mixture model was proposed by Bradley et al. (1998) [3]; the number of clusters and outliers can be handled simultaneously by the mixture model. Fraley and Raftery (1998) [4] propose choosing the number of clusters automatically using the model-based clustering method. For clustering categorical data, there are far fewer reliable methods. The K-modes algorithm was proposed by Huang (1998) [5] to extend K-means to categorical data. The AutoClass method proposed by Cheeseman and Stutz (1996) [6] is a well-known clustering method: it takes a data set containing both real-valued and discrete-valued attributes and automatically computes the number of clusters and the group memberships. This method has been used by NASA and helped to find infrared stars in the IRAS Low-Resolution Spectral catalog and to discover classes of proteins (Cheeseman and Stutz 1996 [6]).
In clustering mixed data, the main difficulty lies in the fact that the continuous and categorical sample spaces are intrinsically different. Although both can be made into metric spaces, the continuous sample space resides on a differentiable manifold while the categorical one is defined entirely on a lattice. Attempts have been made in the literature to combine the two spaces by using a global and general distance function (Ahmad and Dey 2007 [7]). This naive approach ignores the fact that the two sample spaces are topologically incompatible. Another approach is to apply different clustering algorithms to the continuous and categorical portions separately and combine the results. This approach, however, severs the intrinsic connection between the continuous and categorical parts of one record: each record is often assigned to different clusters for its continuous and categorical parts, and it is hard to reconcile this except by expanding the total number of clusters. Not only does this produce a larger than necessary number of clusters for the entire data set, but a true cluster is also often split across many small clusters, rendering the results inaccurate. Alternatively, AutoClass combines information across probability spaces; however, the effectiveness of AutoClass depends on the validity of the assumed parametric model. Zhang et al. (2006) [8] showed that both K-modes and AutoClass do not perform very well when applied to benchmark categorical data sets from the UCI machine learning repository. Therefore, there is a need for a non-parametric clustering method for mixed data.
We extend the work of Zhang et al. (2006) [8] to cluster mixed data by using adaptive quantization of the continuous sample space. The quantization process was developed in the 1950s; it partitions the sample space through a discrete-valued map (Gersho and Gray 1992) [9]. In multivariate cases, quantization is known as vector quantization, and it is the fundamental process for converting analog signals or information into digital form (Gersho and Gray, 1992) [9]. It has been used in studying pricing in finance as well as in engineering. Theoretical properties of quantization of probability distributions can be found in Graf and Luschgy (2000) [10]. The clustering of mixed data is then performed on the quantized product space. The key idea is inspired by the fact that any manifold can be locally modeled by a Euclidean space. Therefore, each neighborhood in the transformed product space can be locally characterized as a fine grid endowed with a Hamming Distance. The Hamming Distance is widely used in information and coding theory (Roman 1992 [11]; Laboulais et al., 2002 [12]). The statistical significance of a detected cluster is determined by a weighted local Chi-squared test. The advantage of our proposed method over AutoClass is demonstrated in simulations and by using two benchmark data sets from the UCI machine learning repository.
This paper is organized as follows. The methodology is introduced in Section 2. The clustering algorithm is presented in Section 3. Simulation and real data analysis results are provided in Section 4.

2. Clustering Methodology

In this section, we introduce quantization of the mixed sample space on which we adopt the Hamming Distance function to measure the relative positions of two data points. We also define a distance vector and an optimal separation point which are essential to measuring spatial patterns as well as the size of any detected clusters. Separation points are introduced to extract detected cluster patterns.

2.1. Joint Sample Space of Mixed Data

Consider a general data structure for a mixed data set with p nominal categorical attributes and q continuous attributes. The categorical sample space is defined on $\Omega_p$ and the continuous one on $\Omega_q$, so the sample space for mixed data is the product space $\Omega_p \times \Omega_q$. The sample size is denoted by n.
The categorical part of the mixed data is represented by $X = (X_{ij})$, with $i = 1, 2, \ldots, n$ and $j = 1, \ldots, p$. Row and column vectors of the categorical portion are denoted by $X_{i[\cdot]}$ and $X_{[\cdot]j}$, respectively. The $j$th categorical attribute takes $m_j$ levels defined by the set $A_j = (a_{j1}, \ldots, a_{jm_j})$, $j = 1, \ldots, p$.
We denote the continuous part of a mixed sample of size n by $Z = (Z_{ik})$, with $i = 1, 2, \ldots, n$ and $k = 1, \ldots, q$. Row and column vectors of the continuous portion are denoted by $Z_{i[\cdot]}$ and $Z_{[\cdot]k}$, respectively. The $k$th attribute is a continuous random variable.

2.2. Quantization of Continuous Sample Space

Continuous data and discrete data are fundamentally different. Although the description provided by the continuous portion can be very detailed, it may carry excessive information that is not important for the clustering purpose. Furthermore, any pattern derived from the categorical part is based on a much coarser topology than the continuous counterpart. Since it is impossible to define a meaningful and objective manifold from a coarse data structure, the continuous part must be mapped onto a grid that is compatible with the relatively coarse topology of the categorical one.
The quantization is achieved in two steps. First, for each observed realization $z_{ik}$, the continuous data are mapped onto the unit interval $[0, 1]$ by applying the following formula:
$$\tilde{z}_{ik} = \frac{z_{ik} - z_{\min,k}}{z_{\max,k} - z_{\min,k}}, \quad k = 1, \ldots, q; \; i = 1, \ldots, n,$$
where $z_{\min,k}$ and $z_{\max,k}$ represent the minimum and maximum values of the $k$th column. Second, each standardized observation is mapped, or quantized, into a discrete value with M levels as follows:
$$Q(\tilde{z}_{ik}) = m, \quad \text{if } (m-1)/M \le \tilde{z}_{ik} < m/M,$$
where $m = 1, 2, \ldots, M$ and M can be any positive integer. Different numerical values of M can affect the quality of the quantization and consequently the clustering result. A finer quantization grid might not be more useful and can be more computationally intensive than a coarse one.
The number of levels M can be difficult for a user to specify without prior information. Thus, we propose to choose the level M adaptively using F statistics based on the clustering results.
For any fixed numerical value of M, we perform an ANOVA test treating each cluster as a separate group according to the generated cluster memberships. The F-statistic associated with the ANOVA test is recorded for each value of M, and we set the optimal M to the value with the largest recorded F-statistic. Numerical results on the quantization level are illustrated in Section 4.1.
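To make the two-step quantization and the adaptive choice of M concrete, the following minimal Python sketch implements the min-max scaling, the level mapping, and the F-statistic-based selection. The function names (quantize, choose_level), the candidate range of levels, and the pooled one-way ANOVA over all continuous attributes are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
from scipy import stats

def quantize(Z, M):
    """Min-max scale each continuous column to [0, 1], then map to levels 1..M."""
    Z = np.asarray(Z, dtype=float)
    z_min, z_max = Z.min(axis=0), Z.max(axis=0)
    Z_tilde = (Z - z_min) / (z_max - z_min)        # standardized values in [0, 1]
    Q = np.floor(Z_tilde * M).astype(int) + 1      # (m-1)/M <= z < m/M  ->  level m
    return np.clip(Q, 1, M)                        # z = 1 is placed in the top level

def choose_level(Z, cluster_fn, levels=range(5, 21)):
    """Pick M that maximizes the one-way ANOVA F statistic of the resulting clustering."""
    best_M, best_F = None, -np.inf
    for M in levels:
        labels = cluster_fn(quantize(Z, M))        # user-supplied clustering routine
        groups = [np.asarray(Z)[labels == g].ravel() for g in np.unique(labels)]
        if len(groups) < 2:
            continue                               # the F statistic needs at least two groups
        F, _ = stats.f_oneway(*groups)
        if F > best_F:
            best_M, best_F = M, F
    return best_M, best_F
```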

2.3. Distance Vectors on Quantized Product Space

We use the Hamming Distance (HD) to measure the relative separation of two categorical data points. More specifically, for any two positions in the categorical sample space $\Omega_p$, $Q_{h[\cdot]} = (Q_{h[1]}, \ldots, Q_{h[p]})$ and $Q_{i[\cdot]} = (Q_{i[1]}, \ldots, Q_{i[p]})$, the HD between $Q_{hj}$ and $Q_{ij}$ on the $j$th attribute is
$$d(Q_{hj}, Q_{ij}) = \begin{cases} 0 & \text{if } Q_{hj} = Q_{ij}, \\ 1 & \text{if } Q_{hj} \neq Q_{ij}. \end{cases}$$
Further, we define the distance between the two positions as the summation of the distances over all pairs of components:
$$HD(Q_{h[\cdot]}, Q_{i[\cdot]}) = \sum_{j=1}^{p} d(Q_{hj}, Q_{ij}).$$
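As a small illustration (the helper name hamming_distance is ours), the attribute-wise comparison and summation above amount to counting disagreeing coordinates:

```python
import numpy as np

def hamming_distance(a, b):
    """HD between two positions: the number of attributes on which they disagree."""
    a, b = np.asarray(a), np.asarray(b)
    return int(np.sum(a != b))

# Example with p = 4 categorical attributes: the positions differ only on the third one.
print(hamming_distance(["a", "b", "c", "d"], ["a", "b", "x", "d"]))  # -> 1
```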
After quantization, the new product space resides on a high-dimensional grid. Since a grid has no natural origin, we define a reference point $(S, T)$ in the quantized product space, with $S = (s_1, \ldots, s_p)$ and $T = (t_1, \ldots, t_q)$. For the categorical portion, $HD_C(X_i, S)$ can take values ranging from 0 to p, and for the quantized continuous data, $HD_Q(Z_i, T)$ can take values ranging from 0 to q.
We then define the Distance Vector (DV) based on the Hamming distance for the categorical and quantized continuous portions, respectively. We define two vectors that record the frequencies of each categorical and quantized continuous distance value: a $(p+1)$-element vector $DV_C(S)$ for the categorical data and a $(q+1)$-element vector $DV_Q(T)$ for the quantized part. More specifically, $DV_C$ is defined as
$$DV_C(S) = (DV_C[0](S), DV_C[1](S), \ldots, DV_C[p](S)),$$
and $DV_Q$ is defined as
$$DV_Q(T) = (DV_Q[0](T), DV_Q[1](T), \ldots, DV_Q[q](T)).$$
The $j$th component of $DV_C$ and the $h$th component of $DV_Q$ are given by
$$DV_C[j](S) = \sum_{i=1}^{n} I[HD_C(X_{i[\cdot]}, S) = j], \quad j = 0, 1, \ldots, p;$$
$$DV_Q[h](T) = \sum_{i=1}^{n} I[HD_Q(Q_{i[\cdot]}, T) = h], \quad h = 0, 1, \ldots, q;$$
where $I(A)$ is an indicator function that takes the value 1 when event A occurs and 0 otherwise.
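A distance vector is therefore just a histogram of Hamming distances to the reference point. The sketch below (with the illustrative name distance_vector) computes $DV_C(S)$ from the categorical matrix X; the same function applied to the quantized matrix gives $DV_Q(T)$.

```python
import numpy as np

def distance_vector(data, ref):
    """DV w.r.t. a reference point: entry j counts the records at Hamming distance j."""
    data, ref = np.asarray(data), np.asarray(ref)
    p = data.shape[1]
    hd = (data != ref).sum(axis=1)           # HD of every record to the reference point
    return np.bincount(hd, minlength=p + 1)  # frequencies for distances 0, 1, ..., p
```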
If there is no cluster pattern at all, we would expect a uniform distribution over all possible positions, so that a randomly chosen data point is equally likely to occupy any position in the joint sample space. The DV vectors under the uniform distribution are referred to as uniform distance vectors (UDV). Thus, a UDV records the expected frequencies under the null hypothesis that there are no clustering patterns in the data. Let X be the categorical portion and Z be the continuous portion of a sample of size n, with each observation having an equal probability of locating at any position in the space $\Omega_p \times \Omega_q$. The expected values of the DV vectors under the null hypothesis are denoted by $UDV_C$, with $U = (U_0, \ldots, U_p)$, for the categorical data and $UDV_Q$, with $V = (V_0, \ldots, V_q)$, for the continuous data, respectively.
Zhang et al. (2006) [8] provide the exact form $UDV_C = \frac{n}{M_1} U$, where $M_1 = \prod_{j=1}^{p} m_j$, $m_j$ is the number of states in the set $A_j$ for the $j$th attribute, and $U = (U_0, U_1, \ldots, U_p)$ with
$$U_0 = 1, \quad U_1 = (m_1 - 1) + (m_2 - 1) + \cdots + (m_p - 1), \quad U_2 = \sum_{i < j \le p} (m_i - 1)(m_j - 1), \quad \ldots, \quad U_p = (m_1 - 1)(m_2 - 1) \cdots (m_p - 1).$$
Similarly, we obtain the exact form of $UDV_Q$ for the quantized continuous part of the data: $UDV_Q = \frac{n}{M_2} V$, where $M_2 = \prod_{j=1}^{q} l_j$, $l_j$ is the number of quantization levels for the $j$th continuous attribute, and $V = (V_0, V_1, \ldots, V_q)$ with
$$V_0 = 1, \quad V_1 = (l_1 - 1) + (l_2 - 1) + \cdots + (l_q - 1), \quad V_2 = \sum_{i < j \le q} (l_i - 1)(l_j - 1), \quad \ldots, \quad V_q = (l_1 - 1)(l_2 - 1) \cdots (l_q - 1).$$
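The components $U_j$ (and analogously $V_j$) are the elementary symmetric polynomials in $(m_1 - 1), \ldots, (m_p - 1)$, so they can be read off as the coefficients of $\prod_j \big(1 + (m_j - 1)t\big)$. The sketch below uses this observation; the function name uniform_distance_vector is ours.

```python
import numpy as np

def uniform_distance_vector(n, levels):
    """UDV for attributes with the given numbers of levels (the m_j or l_j)."""
    coeffs = np.array([1.0])
    for m in levels:
        coeffs = np.convolve(coeffs, [1.0, m - 1.0])   # multiply the polynomial by 1 + (m-1)t
    U = coeffs                                         # U_0, U_1, ..., U_p
    return n * U / np.prod(np.asarray(levels, dtype=float))

# Example: p = 3 attributes with 4, 5, and 6 levels and n = 200 observations;
# the entries are the expected frequencies of Hamming distances 0, 1, 2, 3 and sum to n.
print(uniform_distance_vector(200, [4, 5, 6]))
```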

2.4. Optimal Separation Point

If the initial starting point is chosen to be the center of one particular cluster, then the frequency of the HD should show a decreasing pattern in a local region, as the HD function records the frequency of data points from the center of the cluster outwards. Small local bumps in the early part of the HD curve are expected if the initial starting point deviates slightly from the cluster center. The recorded frequencies may increase afterwards when the function begins to record distances to another cluster. Therefore, the valley area indicates a natural place to separate one cluster from the rest. Separation points are defined for this identification purpose.
Assume that the categorical data X and quantized continuous data Z are not uniformly distributed in the sample space $\Omega_p \times \Omega_q$. Let $\{DV_C(S) = (DV_C[0](S), DV_C[1](S), \ldots, DV_C[p](S))^T, \; S \in \Omega_p\}$ be the collection of all $(p+1)$-element $DV_C$ vectors in the space $\Omega_p$ and $\{DV_Q(T) = (DV_Q[0](T), DV_Q[1](T), \ldots, DV_Q[q](T))^T, \; T \in \Omega_q\}$ be the collection of all $(q+1)$-element $DV_Q$ vectors in the space $\Omega_q$, and let $U = (U_0, U_1, \ldots, U_p)^T$ and $V = (V_0, V_1, \ldots, V_q)^T$ be the $UDV_C$ and $UDV_Q$ vectors defined in the previous subsection. For any given categorical distance value $j_C = 0, 1, \ldots, p$ and quantized continuous distance value $j_Q = 0, 1, \ldots, q$, there always exists at least one position $(S, T) \in \Omega_p \times \Omega_q$ such that the frequency at this distance value is larger than the corresponding component $U_{j_C}$ of the $UDV_C$ vector and $V_{j_Q}$ of the $UDV_Q$ vector.
In order to compare $DV_C$ with $UDV_C$ and $DV_Q$ with $UDV_Q$, we introduce a selection criterion for an optimal cut-off r. The categorical cut-off point was defined and proved by Zhang et al. (2006) [8]; because our quantized continuous data behave as categorical data, we extend that concept to the quantized portion of the data. If a cluster structure is present, the early segment of a $DV_C$ and $DV_Q$ with respect to a data center should contain substantially larger frequencies than the corresponding frequencies of the $UDV_C$ and $UDV_Q$ vectors. Therefore, the range over which the frequencies of $DV_C$ and $DV_Q$ are consistently larger than those of the $UDV_C$ and $UDV_Q$ vectors gives a reasonable indication of r. This leads to an optimal $r_C$ for the categorical portion of the data:
$$r_C(S) = \min_{j_C > 0} \left\{ j_C \;\middle|\; \frac{DV_C[j_C](S)}{U_{j_C}} < 1 \right\} - 1, \quad S \in \Omega_p.$$
Similarly, the optimal $r_Q$ for the quantized portion of the data is
$$r_Q(T) = \min_{j_Q > 1} \left\{ j_Q \;\middle|\; \frac{DV_Q[j_Q](T)}{V_{j_Q}} < 1 \right\} - 1, \quad T \in \Omega_q.$$
The two quantities are used to identify relatively dense regions in the space of mixed data and help us to extract clusters accurately. Zhang et al. (2006) [8] give a detailed explanation of the intuition behind the radius, which is the maximum distance of the data points in a cluster to its center.
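As an illustration of the cut-off rule, the sketch below (the name optimal_cutoff is ours) scans the observed-to-expected frequency ratios and returns the last distance before the ratio first drops below 1; the fallback when no such distance exists is our assumption.

```python
import numpy as np

def optimal_cutoff(dv, udv):
    """Smallest distance j > 0 with DV[j]/UDV[j] < 1, minus 1."""
    ratio = np.asarray(dv, dtype=float)[1:] / np.asarray(udv, dtype=float)[1:]
    below = np.nonzero(ratio < 1)[0]
    if below.size == 0:
        return len(dv) - 1        # DV dominates UDV everywhere (assumed fallback)
    return int(below[0])          # index i of dv[1:] corresponds to j = i + 1, so r = j - 1 = i
```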

3. Algorithm

There are two key parts of the algorithm. Firstly, we detect whether there exist any statistically significant clustering patterns. We propose a weighted local Chi-squared test to determine if the observed distance vectors differ significantly from the uniform distance vectors associated with no cluster pattern. Secondly, if the patterns are significant, we further extract the clusters based on the optimal separation strategies described in the previous section.
We consider the null hypothesis $H_0$: there is no clustering pattern in the data set. The weighted local Chi-squared test statistic $\chi^2_w(S, T)$ is defined as
$$\chi^2_w(S, T) = \frac{1}{\frac{1}{p} + \frac{1}{q}} \left[ \frac{1}{p} \chi^2_C(S) + \frac{1}{q} \chi^2_Q(T) \right] = \frac{pq}{p+q} \cdot \frac{1}{p} \chi^2_C(S) + \frac{pq}{p+q} \cdot \frac{1}{q} \chi^2_Q(T) = \frac{q}{p+q} \chi^2_C(S) + \frac{p}{p+q} \chi^2_Q(T), \quad (S, T) \in \Omega_p \times \Omega_q.$$
The weighted local Chi-squared statistic $\chi^2_w(S, T)$ is constructed to address the unequal numbers of variables in the continuous and categorical parts. A larger number of variables tends to produce a larger numerical value of the modified $\chi^2_C$ or $\chi^2_Q$. Therefore, each modified Chi-squared statistic is normalized by the corresponding number of variables for the categorical and continuous parts, respectively. To ensure that the two weights sum to 1, we further divide the sum of the two normalized modified Chi-squared statistics by the total of the two normalizing factors, which equals $1/p + 1/q$.
Here the categorical part $\chi^2_C(S)$ takes the form
$$\chi^2_C(S) = \sum_{j=0}^{r_C} \frac{\big(DV_C[j](S) - U_j\big)^2}{U_j} + \frac{\Big( \sum_{j=0}^{r_C} DV_C[j](S) - \sum_{j=0}^{r_C} U_j \Big)^2}{\sum_{j=r_C+1}^{p} U_j},$$
and the quantized continuous part $\chi^2_Q(T)$ takes the form
$$\chi^2_Q(T) = \sum_{j=1}^{r_Q} \frac{\big(DV_Q[j](T) - V_j\big)^2}{V_j} + \frac{\Big( \sum_{j=1}^{r_Q} DV_Q[j](T) - \sum_{j=1}^{r_Q} V_j \Big)^2}{\sum_{j=r_Q+1}^{q} V_j},$$
where p and q are the numbers of attributes in the categorical and continuous data, respectively.
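The following sketch shows how the pieces fit together: a modified Chi-squared value with one pooled tail term beyond the cut-off, and the weighted combination above. The helper names are ours, the summation is written from index 0 for both parts (the quantized part in the paper starts at index 1), and the guard for an empty tail is our assumption.

```python
import numpy as np

def modified_chi2(dv, udv, r):
    """Per-distance terms up to the cut-off r plus one pooled term for the tail."""
    dv, udv = np.asarray(dv, dtype=float), np.asarray(udv, dtype=float)
    head = np.sum((dv[: r + 1] - udv[: r + 1]) ** 2 / udv[: r + 1])
    tail_expected = udv[r + 1:].sum()
    if tail_expected == 0:                       # nothing left beyond the cut-off
        return head
    tail = (dv[: r + 1].sum() - udv[: r + 1].sum()) ** 2 / tail_expected
    return head + tail

def weighted_chi2(chi2_C, chi2_Q, p, q):
    """Weights q/(p+q) and p/(p+q) offset the unequal numbers of attributes."""
    return (q / (p + q)) * chi2_C + (p / (p + q)) * chi2_Q
```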
If the detected pattern passes the statistical test, we proceed to extract a cluster by determining the cluster center C and estimating the cluster radius R for the mixed data. The cluster center C is chosen as the position where $\chi^2_w$ attains its maximum value:
$$C = \arg\max_{(S, T)} \chi^2_w(S, T).$$
Zhang et al. (2006) [8] define the radius as the maximum distance of the data points in a cluster to its center; it is the distance at which the DV has its first local minimum. Therefore, the categorical radius $R_C(C)$ is defined as
$$R_C(C) = \min_{0 < j < p} \left\{ j \;\middle|\; DV_C[j](C) < \min\big(DV_C[j-1](C), DV_C[j+1](C)\big) \right\} - 1.$$
For the quantized continuous part of the data, the optimal cut-off point is used as the quantized continuous radius $R_Q(C)$.
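A direct reading of the radius definition gives the short sketch below (the name cluster_radius and the fallback when no interior local minimum exists are ours):

```python
def cluster_radius(dv):
    """Distance of the first strict local minimum of the DV curve, minus 1."""
    for j in range(1, len(dv) - 1):
        if dv[j] < min(dv[j - 1], dv[j + 1]):
            return j - 1
    return len(dv) - 1   # assumed fallback: no interior local minimum was found
```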
The step-by-step description of our method is as follows (a compact implementation sketch is given after the list).
Step 1.
For each position S, calculate the HD in the categorical data and obtain $DV_C$.
Step 2.
Standardize the continuous data and quantize the standardized data at a selected level. For each position T, calculate the Hamming distance for the quantized continuous data to obtain $DV_Q$.
Step 3.
Compare $DV_C$ and $DV_Q$ with the corresponding expected values $UDV_C$ and $UDV_Q$.
Step 4.
Determine the cut-off points $r_C(S)$ and $r_Q(T)$ for the categorical and quantized continuous data, respectively; then calculate the corresponding modified Chi-squared statistics $\chi^2_C(S)$ and $\chi^2_Q(T)$ and obtain the weighted local Chi-squared test statistic
$$\chi^2_w(S, T) = \frac{q}{p+q} \chi^2_C(S) + \frac{p}{p+q} \chi^2_Q(T).$$
Step 5.
Select the largest weighted local Chi-squared test statistic $\chi^2_w(S, T)$ and compare it with the critical value $\chi^2_{(0.05)}$ at the right tail. If $\max \chi^2_w(S, T)$ is smaller than $\chi^2_{(0.05)}$, stop the algorithm; otherwise, continue to Step 6.
Step 6.
Assign the position with the largest test statistic $\chi^2_w(S, T)$ as a center. The categorical and continuous data share the same center position but keep their own data points.
Step 7.
Calculate the categorical radius $R_C$ and the continuous radius $R_Q$; label all data points within the radii as members of the cluster; record the corresponding $\chi^2_C(S)$ and $\chi^2_Q(T)$; remove these points from the current data set.
Step 8.
Repeat Steps 1 to 7 until no more significant clusters are detected.
Step 9.
Prune the membership assignments by calculating the minimum distance from each data point to the center positions. If the categorical and continuous parts of a record are assigned to different clusters, compare their p-values calculated from $\chi^2_C(S)$ and $\chi^2_Q(T)$ and re-assign the membership of the part with the larger p-value to follow the part with the smaller p-value.
Step 10.
Compute the F-test statistic to choose the best quantization level and take the corresponding clustering result as the final result.
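For orientation, the loop below strings the earlier sketches (distance_vector, uniform_distance_vector, optimal_cutoff, modified_chi2, weighted_chi2, cluster_radius) into one detection-extraction pass. Several choices are simplifying assumptions on our part: candidate centers are restricted to observed data points rather than all grid positions, the Chi-squared critical value uses p + q degrees of freedom (the paper does not state the degrees of freedom), membership requires falling within both radii, and the pruning and level-selection steps (Steps 9 and 10) are omitted.

```python
import numpy as np
from scipy.stats import chi2

def cluster_mixed(X, Q, alpha=0.05):
    """Simplified detection-extraction loop over a categorical matrix X and a quantized matrix Q."""
    X, Q = np.asarray(X), np.asarray(Q)
    p, q = X.shape[1], Q.shape[1]
    labels = np.full(X.shape[0], -1)             # -1 marks points not yet assigned
    active = np.arange(X.shape[0])
    cluster_id = 0
    levels_C = [len(np.unique(X[:, j])) for j in range(p)]
    levels_Q = [len(np.unique(Q[:, k])) for k in range(q)]
    while active.size > 0:
        n = active.size
        udv_C = uniform_distance_vector(n, levels_C)
        udv_Q = uniform_distance_vector(n, levels_Q)
        best = None
        for i in active:                         # Steps 1-5: scan candidate centers
            dv_C = distance_vector(X[active], X[i])
            dv_Q = distance_vector(Q[active], Q[i])
            stat = weighted_chi2(
                modified_chi2(dv_C, udv_C, optimal_cutoff(dv_C, udv_C)),
                modified_chi2(dv_Q, udv_Q, optimal_cutoff(dv_Q, udv_Q)), p, q)
            if best is None or stat > best[0]:
                best = (stat, i, dv_C, dv_Q)
        stat, center, dv_C, dv_Q = best
        if stat < chi2.ppf(1 - alpha, df=p + q): # assumed degrees of freedom
            break                                # no significant pattern remains
        r_C = cluster_radius(dv_C)               # Steps 6-7: label members and remove them
        r_Q = optimal_cutoff(dv_Q, udv_Q)
        hd_C = (X[active] != X[center]).sum(axis=1)
        hd_Q = (Q[active] != Q[center]).sum(axis=1)
        members = active[(hd_C <= r_C) & (hd_Q <= r_Q)]
        labels[members] = cluster_id
        active = np.setdiff1d(active, members)
        cluster_id += 1
    return labels
```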

4. Results

We conduct simulation studies and real data analysis to examine the performance of our proposed method. Classification rates and information gains are calculated to compare the performance of our proposed method with AutoClass.

4.1. Simulation Studies

In this section, we compare our method with AutoClass under various simulation settings. The simulation results are shown in Table 1, Table 2, Table 3 and Table 4. All attributes are generated independently. The simulation settings are as follows (a data-generation sketch is given after the list):
  • Set the number of categorical attributes $p = 10$, where each attribute takes $m_j$ levels randomly selected from the set $\{4, 5, 6\}$; set the number of continuous attributes $q = 9$.
  • Set the number of clusters $K_C = K_Q = 3$ or $K_C = K_Q = 5$, with cluster centers denoted by $C_k = (c_{k,1}, \ldots, c_{k,10})$, $k = 1, \ldots, K_C$. For the categorical centers, ensure that the Hamming distance between any two centers is at least 5. For the continuous portion of the data, choose the cluster means as 2, 8, and 16 for 3 clusters, or 2, 8, 16, 20, and 35 for 5 clusters.
  • Set the sample size N = 200 with cluster sizes $n_1 = 130$, $n_2 = 45$, and $n_3 = 25$; or N = 1000 with cluster sizes $n_1 = 500$, $n_2 = 200$, $n_3 = 100$, $n_4 = 100$, and $n_5 = 100$; or N = 10,000 with cluster sizes $n_1 = 5500$, $n_2 = 3000$, and $n_3 = 1500$.
  • For the categorical data in the $k$th cluster with center $C_k$, generate $n_k$ 10-attribute vectors independently; each attribute is drawn from a multinomial distribution in which the center category has probability 0.7 and the remaining categories each have probability $0.3/(m_j - 1)$. For the continuous data, the $n_k$ 9-attribute vectors consist of 9 independent normal random variables with $\mu = C_k$ and $\sigma^2$ equal to 0.25, 0.5, or 1.
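The sketch below generates one cluster under these settings; the function name simulate_cluster is ours, and categorical levels are coded as the integers 0, ..., m_j - 1 with center_cat giving the center's category on each attribute.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_cluster(center_cat, levels, mean, sigma, n_k, q=9):
    """Generate n_k mixed records around one cluster center."""
    p = len(center_cat)
    X = np.empty((n_k, p), dtype=int)
    for j in range(p):
        probs = np.full(levels[j], 0.3 / (levels[j] - 1))  # non-center categories
        probs[center_cat[j]] = 0.7                         # center category probability
        X[:, j] = rng.choice(levels[j], size=n_k, p=probs)
    Z = rng.normal(loc=mean, scale=sigma, size=(n_k, q))   # q independent normal attributes
    return X, Z

# Example: one cluster of size 130 with continuous mean 2 and variance 0.25 (std 0.5).
X1, Z1 = simulate_cluster(center_cat=[0] * 10, levels=[4] * 10, mean=2, sigma=0.5, n_k=130)
```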
In our numerical results, the average classification rate (CR) and information gain rate (IGR), with their corresponding standard deviations, are used to evaluate each method's performance. The CR measures the accuracy with which an algorithm assigns data points to the correct clusters. With K given clusters, the CR is defined by
$$CR(K) = \frac{\sum_{k=1}^{K} \tilde{n}_k}{n},$$
where n is the total number of data points and $\tilde{n}_k$ is the number of data points correctly assigned to cluster k by the algorithm. Obviously, $0 \le CR(K) \le 1$, and a larger $CR(K)$ value indicates better clustering performance. The information gain is an alternative criterion for assessing the performance of a clustering algorithm; it is the cluster purity proposed by Bradley et al. (1998) [3]. Cluster purity essentially measures the information gain, which is the difference between the total entropy and the weighted entropy for a given data partition, namely
$$IG(K) = \text{total entropy} - \text{weighted entropy}(K),$$
where the weighted entropy is calculated by
$$\text{weighted entropy}(K) = \sum_{k=1}^{K} \frac{n_k}{n} \times \text{cluster entropy}(k),$$
with
$$\text{cluster entropy}(k) = -\sum_{l=1}^{L} \frac{\tilde{n}_{lk}}{n_k} \log_2 \frac{\tilde{n}_{lk}}{n_k},$$
where $\tilde{n}_{lk}$ is the number of data points with true label l in cluster k, $n_k$ is the number of data points in cluster k, and L is the known number of classes. In this paper, we use the ratio IG(K)/total entropy, called the information gain rate (IGR), which, like the classification rate, lies between 0 and 1. It is necessary to point out that in some situations the information gain can be misleading. For example, in our simulation studies, an IG equal to 1 appears to indicate perfect clustering, yet it can occur when each true cluster is split into two clusters, which is an incorrect classification. This misleading situation occurs in Table 2 and Table 4.
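The two evaluation measures can be computed as in the sketch below; the function names are ours, and matching each detected cluster to its majority true class when counting $\tilde{n}_k$ is an assumption, since the paper does not spell out the matching rule. True labels are assumed to be coded as non-negative integers.

```python
import numpy as np

def classification_rate(true_labels, pred_labels):
    """CR: each detected cluster is credited with its majority true class."""
    true_labels, pred_labels = np.asarray(true_labels), np.asarray(pred_labels)
    correct = 0
    for k in np.unique(pred_labels):
        members = true_labels[pred_labels == k]
        correct += np.bincount(members).max()      # assumed definition of n_k tilde
    return correct / len(true_labels)

def information_gain_rate(true_labels, pred_labels):
    """IGR = (total entropy - weighted entropy) / total entropy."""
    def entropy(labels):
        freq = np.bincount(labels) / len(labels)
        freq = freq[freq > 0]
        return -np.sum(freq * np.log2(freq))
    true_labels, pred_labels = np.asarray(true_labels), np.asarray(pred_labels)
    total = entropy(true_labels)
    weighted = sum(
        np.mean(pred_labels == k) * entropy(true_labels[pred_labels == k])
        for k in np.unique(pred_labels))
    return (total - weighted) / total
```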
Table 1 shows the selection of quantization levels for the continuous portion of the data. As mentioned in Section 2.2, we use the largest F value to choose the quantization level, which also gives the best classification rate. Table 2, Table 3 and Table 4 provide results from simulated data under various settings of sample size, number of clusters, and cluster sizes. The number of replications is 500 for Table 2 and Table 3, and 100 for Table 4. Table 2 is obtained by analyzing simulated data with a sample size of 200 and 3 clusters of sizes 130, 45, and 25. The simulated data for Table 3 have a sample size of 1000 and 5 clusters of sizes 500, 200, 100, 100, and 100, respectively. Table 4 provides results from simulated data with a sample size of 10,000 and 3 clusters of sizes 5500, 3000, and 1500, respectively.
As shown in Table 2, Table 3 and Table 4, our proposed algorithm consistently has a higher classification rate than AutoClass in all three settings. Across the three variance settings, the mean classification rates and information gain rates of each algorithm remain close to one another and can even be identical. Table 3 shows that our algorithm has higher IG rates than AutoClass. In Table 2 and Table 4, our algorithm has IG rates ranging from 0.8923 to 0.9333. Although AutoClass achieves an IGR of one in some cases, this does not imply perfect clustering, because AutoClass tends to split each true cluster into unnecessarily many clusters. Hence, overall, all tables show that our algorithm performs better than AutoClass in terms of CR and IGR.

4.2. Real Data Analysis

We apply our method to three real data sets. The first two, the Heart Data Set and the Australian Credit Approval Data Set, can be downloaded from the UCI Machine Learning Repository website. The third data set was collected by the RAND Center at the University of Michigan.
The Heart Data and Australian Credit Approval Data are downloaded from the Machine Learning Repository at the University of California, Irvine. The Heart data contain 7 categorical attributes, 6 continuous attributes, and 270 observations, and the memberships of the observations are provided. There are 2 clusters, absence and presence, with cluster sizes 120 and 150, respectively. The Australian Credit Approval Data Set has 8 categorical attributes and 6 continuous attributes and contains two clusters, positive and negative, with corresponding cluster sizes 307 and 383. We compared our method with AutoClass. Table 5 shows the results for these two real data sets. The table shows that our method correctly identified the number of clusters for both data sets, whereas AutoClass did not. In addition, our method has a higher classification rate than AutoClass: 81.48% for the Heart data and 73.62% for the Credit data, compared with 44.44% and 52.17% for AutoClass.
We also apply our proposed method to the Health and Retirement Study (HRS) data set, in which information about health, financial situation, family structure, and health factors was collected by the RAND Center at the University of Michigan. We focus on analyzing the depression status recorded in the data set. Depression among children and adolescents is common but frequently unrecognized. The clinical spectrum of depression can range from simple sadness to major depressive disorders, and a depression diagnosis is often difficult to make because clinical depression can manifest in many different ways. Observable or behavioral symptoms of clinical depression may be minimal despite a person's mental turmoil. The general population can then be partitioned naturally into two groups: depressed and non-depressed individuals. We choose this scenario as the third test case for our clustering algorithm and compare its performance with that of AutoClass.
We perform clustering based on six health factors: Smoking, Restless Sleep, High Blood Pressure, Frequent Vigorous Physical Activity, Difficulty in Walking, and Age (in months). Depression status is recorded as a binary response variable, with 16,250 non-depressed and 2608 depressed individuals. The categorical variables Smoking and Restless Sleep take binary values; Difficulty in Walking takes values 0, 1, 2, or 9; Frequent Vigorous Physical Activity takes values 1, 2, 3, 4, or 5; High Blood Pressure takes values 0, 1, 3, or 4; and the continuous variable Age (in months) ranges from 224 to 1232 with a mean of 801. We include only individuals for whom all of the factors were recorded, giving 18,858 people in the analysis. Our clustering method correctly identified two clusters, whereas AutoClass detected nine clusters. Table 6 and Table 7 report the confusion matrices obtained by our method and AutoClass, respectively. In the non-depressed group, our method correctly detected 86.75% of individuals, and in the depressed group, 30.98% of individuals were correctly detected. Since AutoClass finds 9 clusters, a fair comparison is not feasible; we therefore describe the nine clusters declared by AutoClass for the sake of completeness. Table 7 lists the nine groups produced by AutoClass and the number of truly depressed and non-depressed individuals in each group. Since the depressed group is much smaller than the non-depressed group, the information gain is not a suitable measure, because the percentage of depressed individuals is always small in comparison with the non-depressed group. The information gain for both our method and AutoClass is small and deemed not informative.

5. Discussion

We have proposed a clustering method that uses statistical distances and tests. Numerical results show that the proposed method outperforms the AutoClass algorithm in terms of classification rate and entropy-based measures. The proposed method does not employ a global distance function or a parametric model. For future work, we could consider extending the proposed method to cluster spatial and temporal data.

6. Conclusions

Mixed data are prolific in scientific research in areas such as business, engineering, and the life sciences. It is imperative to develop a method that can cluster mixed data in order to discover true and significant underlying structures of a data set and classify observations into different subsets. We propose a non-parametric method that uses a local weighted chi-squared statistic to determine the underlying clusters. The proposed algorithm does not require any model assumption on the attributes or any expensive numerical optimization procedure. Because the proposed algorithm extracts clusters sequentially, one cluster at each iteration, it does not need a convergence criterion. The algorithm terminates when all data points have been used and no more cluster centers can be detected. Consequently, our algorithm automatically produces the number of clusters, and the resulting partition is unique. When compared with the benchmark clustering algorithm for mixed data, AutoClass, we find that our algorithm outperforms AutoClass in various settings and produces similar accuracy in others.

Author Contributions

Conceptualization, X.W., X.G. and Y.X.; methodology, X.W., X.G. and Y.X.; formal analysis, Y.X. and X.W.; writing—original draft preparation, Y.X.; writing—review and editing, X.W. and X.G.; supervision, X.W. and X.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable for studies not involving humans or animals.

Data Availability Statement

1. Heart disease data: https://archive.ics.uci.edu/ml/datasets/Heart+Disease (accessed on 19 November 2022). 2. Australian Credit Approval: https://archive.ics.uci.edu/ml/datasets/statlog+(australian+credit+approval) (accessed on 19 November 2022). 3. RAND HRS data (Version O): https://hrsdata.isr.umich.edu/data-products/rand-hrs-archived-data-products (accessed on 19 November 2022).

Acknowledgments

Yawen X. acknowledges Xiaogang W.'s and Xin G.'s supervision and support.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
HD   Hamming Distance
DV   Distance Vector
UDV  Uniform Distance Vector
CR   Classification Rate
IG   Information Gain
IGR  Information Gain Rate

References

  1. Kaufman, L.; Rousseeuw, P.J. Finding Groups in Data: An Introduction to Cluster Analysis; John Wiley & Sons: New York, NY, USA, 2009.
  2. Banfield, J.D.; Raftery, A.E. Model-based Gaussian and non-Gaussian clustering. Biometrics 1993, 49, 803–821.
  3. Bradley, P.S.; Fayyad, U.M.; Reina, C.A. Scaling EM (Expectation-Maximization) Clustering to Large Databases; Microsoft Research: Redmond, WA, USA, 1998; pp. 0–25.
  4. Fraley, C.; Raftery, A.E. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis. Comput. J. 1998, 41, 578–588.
  5. Huang, Z. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov. 1998, 2, 283–304.
  6. Cheeseman, P.C.; Stutz, J.C. Bayesian classification (AutoClass): Theory and results. Adv. Knowl. Discov. Data Min. 1996, 180, 153–180.
  7. Ahmad, A.; Dey, L. A K-mean clustering algorithm for mixed numeric and categorical data. Data Knowl. Eng. 2007, 63, 503–527.
  8. Zhang, P.; Wang, X.; Song, P.X.-K. Clustering Categorical Data Based on Distance Vectors. J. Am. Stat. Assoc. 2006, 101, 355–367. Available online: http://www.jstor.org/stable/30047463 (accessed on 19 November 2022).
  9. Gersho, A.; Gray, R.M. Vector Quantization and Signal Compression; Springer: Berlin, Germany, 1992; pp. 407–485.
  10. Graf, S.; Luschgy, H. Foundations of Quantization for Probability Distributions; Springer: Berlin/Heidelberg, Germany, 2000.
  11. Roman, S. Coding and Information Theory; Springer Science & Business Media: New York, NY, USA, 1992; p. 134.
  12. Laboulais, C.; Ouali, M.; Le Bret, M.; Gabarro-Arpa, J. Hamming distance geometry of a protein conformational space: Application to the clustering of a 4-ns molecular dynamics trajectory of the HIV-1 integrase catalytic core. Proteins 2002, 47, 169–179.
Table 1. Quantization levels. The means of F statistics, CR, and IG are obtained based on 500 replications.
Discretized Levels   Mean (F)    Mean (CR)   Mean (IGR)
5                     630.1573   0.8302      0.7130
6                    1523.4557   0.8455      0.7667
7                    1722.3260   0.8227      0.6960
8                    3223.9477   0.8635      0.7729
9                    3916.3388   0.8816      0.7958
10                   3708.5293   0.8682      0.7689
11                   6444.7055   0.9085      0.8573
12                   4778.9851   0.8893      0.8114
13                   4912.8477   0.8907      0.8116
14                   4262.3990   0.8907      0.8135
15                   4000.3948   0.8879      0.8095
16                   4234.9993   0.8863      0.7992
17                   3549.8632   0.8787      0.7853
18                   4042.0805   0.8785      0.7833
19                   3657.4556   0.8768      0.7785
20                   4303.8698   0.8872      0.8010
Table 2. Average CR and IGR with corresponding standard deviation for each method based on the simulated data of sample size 200 with 3 clusters; each cluster has sizes 130, 45, and 25, respectively. The mean values for each cluster are 2, 8, and 16 respectively. With the same set of means, the different variances, 0.25, 0.5, and 1 are compared. The number of replications is 500.
            Var = 0.25              Var = 0.5               Var = 1
            AutoClass   Ours        AutoClass   Ours        AutoClass   Ours
CR Mean     0.6424      0.9556      0.6335      0.9292      0.6325      0.9370
CR Std      0.0021      0.0035      0.0015      0.0069      0.0015      0.0060
IGR Mean    1.0000      0.8923      1.0000      0.9085      1.0000      0.9148
IGR Std     <0.0001     0.0148      <0.0001     0.0094      <0.0001     0.0070
Table 3. Average CR and IGR with corresponding standard deviation for each method based on the simulated data of sample size 1000 with 5 clusters; each cluster has sizes 500, 200, 100, 100, and 100, respectively. The mean values for each cluster are 2, 8, 16, 20, and 35, respectively. With the same set of means, the different variances, 0.25, 0.5, and 1 are compared. The number of replications is 500.
            Var = 0.25              Var = 0.5               Var = 1
            AutoClass   Ours        AutoClass   Ours        AutoClass   Ours
CR Mean     0.5638      0.8747      0.5598      0.8792      0.5615      0.8777
CR Std      0.0016      0.0185      0.0015      0.0179      0.0014      0.0189
IGR Mean    0.7337      0.9228      0.7338      0.9174      0.7338      0.9235
IGR Std     <0.0001     0.0021      <0.0001     0.0049      <0.0001     0.0037
Table 4. Average CR and IGR with corresponding standard deviation for each method based on the simulated data of sample size 10,000 with 3 clusters; each cluster has sizes 5500, 3000, and 1500, respectively. Continuous data are from a multivariate t-distribution with degrees of freedom 5, 15, and 30, respectively. With the same set of means, the different variances, 0.25, 0.5, and 1 are compared. The number of replications is 100.
            Var = 0.25              Var = 0.5               Var = 1
            AutoClass   Ours        AutoClass   Ours        AutoClass   Ours
CR Mean     0.8120      0.9689      0.8231      0.9689      0.8202      0.9641
CR Std      0.0019      0.0031      0.0023      0.0031      0.0033      0.0034
IGR Mean    1.0000      0.9333      1.0000      0.9333      1.0000      0.9323
IGR Std     <0.0001     0.0067      <0.0001     0.0067      <0.0001     0.0048
Table 5. Results from the two comparison methods on two real data sets. The Heart data have 2 clusters with a sample size of 270, and the Australian data have 2 clusters with a sample size of 690.
                      Heart                   Australian
                      AutoClass   Ours        AutoClass   Ours
CR                    0.4444      0.8148      0.5217      0.7362
IGR                   0.2754      0.6975      0.2761      0.8314
Number of clusters    5           2           7           2
Table 6. Confusion Matrix for our method.
                         Our Method
                         Non-Depressed   Depressed   Total
True   Non-depressed     14,097          2153        16,250
       Depressed          1800            808         2608
       Total             15,897          2961        18,858
Table 7. Confusion Matrix for AutoClass.
                        AutoClass
                        Clst1   Clst2   Clst3   Clst4   Clst5   Clst6   Clst7   Clst8   Clst9   Total
True   Non-depressed    3117    2362    1457    2039    1749    2032    1915    1201    378     16,250
       Depressed        216     217     781     335     461     158     144     223     73      2608
       Total            3333    2579    2238    2374    2210    2190    2059    1424    451     18,858
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
