Nonparametric Clustering of Mixed Data Using Modified Chi-Squared Tests

We propose a non-parametric method to cluster mixed data containing both continuous and discrete random variables. The product space of the continuous and discrete sample spaces is transformed into a new product space based on adaptive quantization of the continuous part. Cluster patterns on the product space are detected locally using a weighted modified chi-squared test. Our algorithm requires no user input, since the number of clusters is determined automatically from the data. Simulation studies and real data analyses show that our proposed method outperforms the benchmark method, AutoClass, in various settings.


Introduction
Mixed data are abundant in scientific research, especially in medical and biological studies. An effective clustering method for mixed data partitions a large and complex data set into manageable and homogeneous subgroups. It thus has a wide range of applications across scientific fields, including financial data analysis, personalized medicine and studies of climate change.
Most clustering methods in the literature have focused on numerical data. The K-means algorithm has been widely used in industry for a long time; detailed descriptions and discussions can be found in Kaufman and Rousseeuw (2005). To capture intrinsic geometric properties, a suitable distance function such as the Manhattan or Mahalanobis distance can be used when the underlying sample space is believed to be non-Euclidean. The K-modes algorithm of Huang (1997) extends this geometric approach to categorical data. However, this has proven not very successful for categorical data, as demonstrated in Zhang et al. (2005). The geometric and topological natures of continuous and categorical sample spaces are intrinsically different: the former can be endowed with a differentiable manifold structure, while the latter is defined entirely on a lattice with discontinuous functions. Even when suitable distance functions are available for the continuous and discrete portions, a challenging question is how to combine the metrics from a continuous and a discrete sample space. A naive approach is to take a convex combination of the two metrics, which implies that the product space of continuous and discrete data can be metrized in this fashion. The major difficulty is how to choose the weights without introducing significant local or global distortions.
Alternatively, a parametric model based on Gaussian mixtures can be used for continuous data; see Banfield and Raftery (1993). One of the most prominent methods is that of Bradley et al. (1998), which scales to large disk-resident data sets.
The number of clusters and outliers can be handled simultaneously. Fraley and Raftery (1998) propose choosing the number of clusters automatically for model-based clustering. For clustering mixed data, the AutoClass method proposed by Cheeseman and Stutz (1995) is well known and can be considered the benchmark model-based clustering method in this class. AutoClass takes a database containing both real- and discrete-valued attributes and automatically finds both the number of clusters and the groups. The method has been widely used at NASA, where it helped find infrared stars in the IRAS Low Resolution Spectral catalogue and discover classes of proteins.
Instead of a parametric model, we propose a non-parametric clustering method that assumes neither a global distance function nor any knowledge of the functional form of the joint probability density function. The key idea is inspired by the fact that any complicated manifold can be approximated "locally" by a manifold with simpler structure. For example, it is well known that a differentiable manifold is locally homeomorphic to R^m. For categorical data, we suppose that a neighborhood on a lattice can be sufficiently characterized by the Hamming distance. The Hamming distance is widely used in information and coding theory; see Roman (1992).

Method
In this section, we introduce the mixed sample space, on which we adopt the Hamming and Euclidean distance functions to measure the relative positions of two data points. We define the HD vector, the ED vector and the optimal separation point, which are the essential components of the proposed weighted local chi-square test for clustering.

Joint Sample Space of Mixed Data
Now consider a general setup for mixed data with p nominal categorical attributes and q continuous attributes of interest. The jth categorical attribute has m_j levels, defined by the set {1, 2, ..., m_j}. Let X_i = (x_i1, ..., x_ip)^t be the vector of the observed states of the p attributes for subject i. The categorical sample space Ω_p is defined as the collection of all possible p-dimensional vectors of states. Let Z_i = (z_i1, ..., z_iq)^t be the vector of the observed values of the q continuous attributes for subject i. The continuous sample space is defined as Ω_q = R^q.
The mixed data consist of (X, Z) with overall space Ω = Ω_p ⊗ Ω_q.

Distance Vectors
We use the Hamming distance (HD) to measure the relative positions of two categorical data points and the Euclidean distance (ED) to measure that of two continuous data points.
To be more specific, for any two positions X_h and X_i in the categorical sample space Ω_p, the Hamming distance (HD) on the jth attribute is

HD_j(X_h, X_i) = I(x_hj ≠ x_ij),

and the distance between the two positions is the sum of the componentwise distances,

HD(X_h, X_i) = Σ_{j=1}^p HD_j(X_h, X_i).

For continuous data, the Euclidean distance (ED) between two positions Z_h and Z_i is defined as

ED(Z_h, Z_i) = ( Σ_{j=1}^q (z_hj − z_ij)^2 )^{1/2}.

We now introduce the HD vector and the ED vector. Let (S, T) be a reference position in the sample space with S = (s_1, ..., s_p)^t ∈ Ω_p and T ∈ Ω_q. We measure the distance of all data points to the selected reference point. For the categorical portion, HD(X_i, S) can take values ranging from 0 to p. We define the HD vector to record the frequencies of each distance value. Namely, the HD vector U(S) is a (p+1)-element vector (U_0(S), U_1(S), ..., U_p(S))^t whose jth component is

U_j(S) = Σ_{i=1}^n I(HD(X_i, S) = j),

where I(A) is the indicator function that takes value 1 when event A happens and value 0 otherwise. For the continuous portion of the data, in order to construct a frequency vector of ED(Z_i, T), we need to choose a bin size. The choice of bin size is user defined; in practice, we find that choosing the number of bins l equal to 10 gives satisfactory empirical results.
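As a concrete illustration, the HD and ED vectors can be computed as follows. This is a minimal Python sketch; the function names and the use of equal-width bins over the observed ED range are our own assumptions, since the text fixes only the number of bins (l = 10):

```python
import numpy as np

def hd_vector(X, S):
    """Frequency vector of Hamming distances from each categorical row of X
    to the reference position S: a (p+1)-vector U with U[j] = number of
    observations at Hamming distance j from S."""
    X = np.asarray(X)
    p = X.shape[1]
    dists = (X != np.asarray(S)).sum(axis=1)      # HD(X_i, S) in {0, ..., p}
    return np.bincount(dists, minlength=p + 1)

def ed_vector(Z, T, l=10):
    """Frequency vector of Euclidean distances from each continuous row of Z
    to the reference position T, binned into l equal-width bins (the binning
    rule is an assumption; only l = 10 is specified in the text)."""
    Z = np.asarray(Z, dtype=float)
    dists = np.linalg.norm(Z - np.asarray(T, dtype=float), axis=1)
    edges = np.linspace(0.0, dists.max() + 1e-12, l + 1)
    counts, _ = np.histogram(dists, bins=edges)
    return counts
```

For example, `hd_vector([[1, 2], [1, 3], [2, 3]], [1, 2])` returns `[1, 1, 1]`: one observation at each Hamming distance 0, 1 and 2 from the reference position.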
The ED vector V(T) = (V_1(T), ..., V_l(T))^t, whose jth component V_j(T) is the frequency of observations whose distance ED(Z_i, T) falls in the jth bin. In order to use the HD vector and ED vector to detect possible clusters, we define reference, or null, HD and ED vectors for the case where there is no clustering pattern in the mixed sample space Ω_p ⊗ Ω_q. If there is indeed no pattern, then a randomly chosen data point is equally likely to take any possible position in the joint sample space. The resulting HD vector is called the uniform HD vector (UHD) and the resulting ED vector the uniform ED vector (UED); they record the expected frequencies under the null hypothesis that there is no clustering pattern in the data. Let X be the categorical portion and Z the continuous portion of the data from a sample of size n, with each observation having an equal probability of locating at any position of the space Ω_p ⊗ Ω_q. The expected values of the HD vector and ED vector under the null hypothesis, denoted by ε = (ε_0, ..., ε_p)^t and ν = (ν_1, ..., ν_l)^t, are called the UHD vector and UED vector, respectively. Zhang et al. (2005) proved that the UHD takes the form ε = (n/M) U*, where M = Π_{j=1}^p m_j is the number of positions in Ω_p and U*_j counts the positions at Hamming distance j from S. For continuous data, the exact distribution of the UED vector is not tractable, so we obtain this vector by computer simulation: we simulate random data points with q independent continuous attributes, and the UED vector is the sampling frequency vector of the ED vector from simulations corresponding to the null hypothesis that there is no more than one cluster. Figure 1 plots the UED vector obtained from a simulated null data set with no clusters together with the ED vector obtained from a simulated data set with clustering structure.
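The Monte Carlo estimate of the UED vector can be sketched as follows. Both the null generator (uniform draws on [0, 1]^q) and the choice of the data centroid as the reference point are our own assumptions, since the text leaves these simulation details unspecified:

```python
import numpy as np

def uniform_ed_vector(n, q, l=10, n_sim=200, rng=None):
    """Monte Carlo estimate of the UED vector: average binned ED frequencies
    when n points with q independent attributes carry no cluster structure."""
    rng = np.random.default_rng(rng)
    total = np.zeros(l)
    for _ in range(n_sim):
        Z = rng.uniform(size=(n, q))        # assumed null generator
        T = Z.mean(axis=0)                  # assumed reference position
        d = np.linalg.norm(Z - T, axis=1)
        edges = np.linspace(0.0, d.max() + 1e-12, l + 1)
        counts, _ = np.histogram(d, bins=edges)
        total += counts
    return total / n_sim                    # expected frequencies nu_1, ..., nu_l
```

Each component of the returned vector estimates the null expected frequency ν_j against which the observed ED vector is compared.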

Optimal Separation Point
If the initial starting point is chosen to be the center of one particular cluster, then the ED frequencies should show a generally decreasing pattern, since the ED vector records the frequencies of data points moving outwards from the cluster center. Small local bumps in the early part of the ED curve are expected if the initial starting point deviates slightly from the cluster center. Any substantial reversal of the decreasing trend produces a valley area on the ED curve, as can be seen in Figure 2. This may indicate distances that correspond to boundary points of the current cluster. The recorded frequencies may increase again when the vector starts recording distances to another cluster. Therefore, the valley area is an ideal place to separate the data points of the current cluster from the rest.
Assume that the categorical data X and continuous data Z are not uniformly distributed on the sample space Ω_p ⊗ Ω_q. Let U(S) = (U_0(S), U_1(S), ..., U_p(S))^t, S ∈ Ω_p, be the collection of all (p+1)-element HD vectors on the space Ω_p and V(T) = (V_1(T), ..., V_l(T))^t, T ∈ Ω_q, the collection of all l-element ED vectors on the space Ω_q, and let ε = (ε_0, ε_1, ..., ε_p)^t be the UHD vector and ν = (ν_1, ν_2, ..., ν_l)^t the UED vector defined in the above subsection. For a given distance value j = 0, 1, ..., p on the categorical side, or j = 1, 2, ..., l on the continuous side, there always exists at least one position (S, T) ∈ Ω_p ⊗ Ω_q such that the frequency at this distance value is larger than the corresponding component ε_j of the UHD vector or ν_j of the UED vector. In order to compare the HD vector with the UHD vector, and the ED vector with the UED vector, we introduce a selection criterion for an optimal separation or cut-off point r*. The categorical cut-off was defined and studied by Zhang et al. (2005); we extend their approach to the continuous portion of the data. If cluster structure is present, the early segment of an HD vector and ED vector with respect to a cluster center should contain substantially larger frequencies than the corresponding components of the UHD and UED vectors, respectively.
When the observed distance vectors intersect and fall below the UHD or UED vectors, valley areas are created, and they provide good hints about the locations of optimal separation points. This leads to the optimal cut-off for the categorical portion of the data,

r*_c = min{ j : U_j(S) < ε_j },

and similarly for the continuous portion,

r*_d = min{ j : V_j(T) < ν_j }.

The vertical line in Figure 2 marks the selected optimal separation point for the continuous data, where the two curves first intersect.
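The first-crossing rule above can be sketched as follows; the fallback to the last index when the observed vector never falls below the null vector is our own assumption:

```python
import numpy as np

def optimal_cutoff(observed, uniform):
    """First index at which the observed distance vector falls below the
    uniform (null) vector -- the start of the 'valley' used as the optimal
    separation point r*.  Falls back to the last index if no crossing occurs."""
    observed = np.asarray(observed, dtype=float)
    uniform = np.asarray(uniform, dtype=float)
    below = np.nonzero(observed < uniform)[0]
    return int(below[0]) if below.size else len(observed) - 1
```

For example, with observed frequencies (30, 20, 5, 10, 8) against a flat null of 12, the curves first cross below the null at index 2, which is returned as the cut-off.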

Algorithm
There are two key steps in the method. First, we detect whether any statistically significant clustering pattern exists: we use the weighted local chi-square test to determine whether the observed distance vectors differ significantly from the uniform distance vectors associated with no cluster pattern. Second, if the pattern is significant, we extract the clusters based on the optimal separation strategies described in the previous section.
We consider the null hypothesis H_0: there is no clustering pattern in the data set.
The weighted local chi-squared statistic χ²_w is defined as

χ²_w = ( p χ²_c(r*_c; S) + q χ²_d(r*_d; T) ) / (p + q),

where the categorical part takes the form

χ²_c(r*_c; S) = Σ_{j=0}^{r*_c} (U_j(S) − ε_j)² / ε_j,

the continuous part takes the form

χ²_d(r*_d; T) = Σ_{j=1}^{r*_d} (V_j(T) − ν_j)² / ν_j,

and p and q are the numbers of attributes in the categorical and continuous data, respectively. After the statistical test returns a significant result, we proceed to extract clusters by determining the cluster center C and estimating the cluster radius R for the mixed data. The cluster center C is chosen where χ²_w attains its maximum value. Determining the cluster size is the next key step in the cluster extraction process. We use the term radius for the size of a cluster: following Zhang et al. (2005), the radius bounds the distance of the data points in a cluster to its center and is taken to be the distance at which the HD vector has its very first local minimum. Accordingly, the categorical radius is

R_c(C) = min{ j : U_j(C) ≤ U_{j+1}(C) }.

For the continuous part of the data, only the values before the cut-off point are relevant for selecting the radius. The detailed procedure is as follows.

Step 1. For each position (S, T), calculate the Hamming distances (HD) in the categorical sample space and the Euclidean distances (ED) in the continuous sample space.

Step 2. Based on the HD and ED values, compute the HD and ED vectors and compare them with the UHD and UED vectors.

Step 3. Determine the cut-offs r*_c(S) and r*_d(T) for the categorical and continuous data, respectively; calculate the corresponding modified chi-squared statistics χ²_c(S) and χ²_d(T) and obtain the test statistic, the weighted local chi-square χ²_w(S, T).

Step 4. Select the largest test statistic max χ²_w and compare it with the critical value χ²(0.05). If the maximum is smaller than χ²(0.05), stop the algorithm; otherwise, continue to Step 5.

Step 5. Assign the position with the largest test statistic as the cluster center C.

Step 6. Calculate the categorical radius R_c(C) and the continuous radius R_d(C); label all data points within the radii as belonging to the cluster and remove them from the current data set.

Step 7. Repeat Steps 1 to 6 until no more significant clusters are detected.
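The test statistics of Step 3 can be sketched as follows. The attribute-count weights p/(p+q) and q/(p+q) used to combine the two parts are our own assumption, as the exact weighting formula is not stated in full:

```python
import numpy as np

def local_chi_square(freq, null_freq, r_star):
    """Modified chi-squared statistic over the first r*+1 components of an
    observed distance vector against its uniform (null) counterpart."""
    f = np.asarray(freq, dtype=float)[: r_star + 1]
    e = np.asarray(null_freq, dtype=float)[: r_star + 1]
    return float(((f - e) ** 2 / e).sum())

def weighted_local_chi_square(chi2_cat, chi2_cont, p, q):
    """Combine the categorical and continuous statistics; the attribute-count
    weighting p/(p+q), q/(p+q) is an assumption."""
    return (p * chi2_cat + q * chi2_cont) / (p + q)
```

For instance, with U = (10, 8, 2), ε = (5, 5, 5) and r* = 1, the local statistic is (10 − 5)²/5 + (8 − 5)²/5 = 6.8.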

Numerical Results
We carry out simulation studies to examine the performance of our proposed method.
Classification rates and information gains are calculated to compare the performance of our proposed method and the AutoClass algorithm. For simplicity, we assume all attributes in the mixed data are independent. The simulation settings are as follows: 1. Set the number of categorical attributes p = 10, with each attribute taking m_j levels randomly selected from the set {4, 5, 6}; set the number of continuous attributes q = 10. Tables 1 to 4 provide results from the simulation experiments with 500 replications. Averages of classification rates (CR) and information gains (IG) with their corresponding standard deviations are used to evaluate the two methods' performance.
Tables 1 to 4 show the results from simulated data under various settings of sample size, number of clusters and cluster sizes. It can be seen from Table 1 that our proposed algorithm has a relatively higher classification rate and information gain with correspondingly lower standard deviations, especially on the continuous portion of the data. Table 2 shows that, compared to AutoClass, our method yields significantly higher CR with lower standard deviations for both categorical and continuous data; for instance, for Categorical 1, AutoClass gives 75% CR with a 14% standard deviation, while ours gives 96% CR with a 5% standard deviation. Tables 3 and 4 show the same patterns. In summary, Tables 1 to 4 show that our proposed algorithm consistently achieves higher classification rates and information gains with correspondingly lower standard deviations.
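The two evaluation metrics can be computed as in the following sketch. The precise definitions are our assumptions, chosen because they are the common ones: CR as majority-label accuracy per cluster (cluster ids are arbitrary) and IG as the entropy reduction H(class) − H(class | cluster):

```python
import math
from collections import Counter, defaultdict

def _members(true_labels, clusters):
    """Group true labels by assigned cluster id."""
    groups = defaultdict(list)
    for t, c in zip(true_labels, clusters):
        groups[c].append(t)
    return groups

def classification_rate(true_labels, clusters):
    """Fraction of points whose cluster's majority true label matches their own."""
    groups = _members(true_labels, clusters)
    correct = sum(Counter(v).most_common(1)[0][1] for v in groups.values())
    return correct / len(true_labels)

def entropy(labels):
    """Shannon entropy (bits) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(true_labels, clusters):
    """H(class) - H(class | cluster): entropy reduction from the clustering."""
    n = len(true_labels)
    groups = _members(true_labels, clusters)
    cond = sum(len(v) / n * entropy(v) for v in groups.values())
    return entropy(true_labels) - cond
```

A perfect clustering of a balanced two-class sample gives CR = 1 and IG equal to the full class entropy; a single catch-all cluster gives IG = 0.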

Discussion
We have proposed a non-parametric clustering method based on a weighted modified chi-squared test. Numerical results show that the proposed method outperforms the AutoClass algorithm in classification rate and entropy measure across various simulation settings. The proposed method is most useful when neither a distance function nor a parametric model can be assumed. We will extend the proposed method to cluster spatial and temporal data.

Figure 1: Plots for ED Vector and UED Vector.

Figure 3 gives the empirical CDF plot of the ED values, which jumps at certain points. The first jump point is used as the value of the continuous radius. More specifically, during each extraction iteration, we remove the extracted data points from the remainder of the clustering process before calculating the distances from subjects to a fixed reference position.

Figure 3: Determining the radius R_d for the continuous portion of the data.

2. Continuous cases 1, 2 and 3 are generated from 10 independent normal distributions with the same means but variances of 0.25, 0.5 and 1, respectively.
3. Set the number of clusters K_c = K_d = 3 or 5 with various cluster sizes; the number of replications is 500. For categorical data, in the kth cluster with center C_k, generate n_k 10-attribute vectors independently; more specifically, generate each attribute from a multinomial distribution with probability 0.7 on the center state and the remaining probabilities identically equal to 0.3/(m_j − 1). For continuous data, the n_k 10-attribute vectors consist of 10 independent normal random variables with µ = C_k and σ² equal to 0.25, 0.5 and 1, respectively.
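The per-cluster generator described in these settings can be sketched as follows (the function and argument names are hypothetical):

```python
import numpy as np

def simulate_cluster(n_k, center_cat, levels, center_cont, sigma2, rng=None):
    """Generate one cluster of mixed data: each categorical attribute equals
    its center state with probability 0.7, the remaining mass 0.3 being split
    evenly over the other m_j - 1 levels; each continuous attribute is normal
    with mean at the cluster center and variance sigma2."""
    rng = np.random.default_rng(rng)
    X = np.empty((n_k, len(center_cat)), dtype=int)
    for j, (c, m_j) in enumerate(zip(center_cat, levels)):
        probs = [0.3 / (m_j - 1)] * m_j
        probs[c] = 0.7
        X[:, j] = rng.choice(m_j, size=n_k, p=probs)
    Z = rng.normal(loc=center_cont, scale=np.sqrt(sigma2),
                   size=(n_k, len(center_cont)))
    return X, Z
```

Calling this once per cluster with the chosen centers and sizes, then stacking the results, yields one simulated mixed data set.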
The Hamming distance is also discussed in Laboulias et al. (2002). It measures only how many attributes differ, without imposing any order on the magnitude of the observed difference. When the true manifold can be approximated locally by the product space of two manifolds endowed with the Euclidean and Hamming distances, a statistical test can be designed based on a weighted local chi-squared test. This idea of a local test for clustering was first proposed by Zhang et al. (2005).

Table 1 is obtained from data with a sample size of 200 and 3 clusters of sizes 100, 75 and 25, respectively. Table 2 is obtained from simulated data with sample size 200, 3 clusters and cluster sizes of 130, 45 and 25, respectively. The simulated data for Tables 3 and 4 have sample size 100 and 5 clusters, with cluster sizes of 40, 25, 15, 10 and 10 for Table 3 and 35, 25, 20, 10 and 10 for Table 4.

Table 1 :
Average Classification Rates (CR) and Information Gains (IG) with corresponding standard deviations for each method. The sample size is 200 with 3 clusters. Each cluster has size 100, 75 and 25, respectively. The number of replications is 500.

Table 2 :
Average CR and IG with corresponding standard deviations for each method. The sample size is 200 with 3 clusters. Each cluster has size 130, 45 and 25, respectively.

Table 3 :
Average CR and IG with corresponding standard deviations for each method. The sample size is 100 with 5 clusters. Each cluster has size 40, 25, 15, 10 and 10, respectively.

Table 4 :
Average CR and IG with corresponding standard deviations for each method. The sample size is 100 with 5 clusters. Each cluster has size 35, 25, 20, 10 and 10, respectively. The number of replications is 500.