A Comparison of Information Criteria in Clustering Based on Mixture of Multivariate Normal Distributions

Abstract: Cluster analysis based on a mixture of multivariate normal distributions is commonly used for clustering multidimensional data sets. Model selection is one of the most important problems in such analysis: it involves determining the number of components (clusters) and selecting an appropriate covariance structure. In this study, the efficiency of the information criteria commonly used for model selection is examined. The effectiveness of each criterion is assessed by its success in selecting the number of components and in selecting an appropriate covariance matrix.


Introduction
Models for mixtures of distributions, first discussed by Newcomb [1] and Pearson [2], are currently very popular in clustering. Wolfe [3,4] and Day [5] proposed multivariate normal mixture models for cluster analysis. When each component of the data set is modeled by a multivariate normal distribution, the most important problems in clustering are choosing the number of components and identifying the structure of the covariance matrix. Oliveira-Brochado and Martins [6] examined the information criteria used to determine the number of components in a mixture model. Despite the many criteria available for this purpose, they do not always give accurate results; in particular, on real data sets with a known number of clusters, different criteria give different results. In this study, commonly used criteria for determining the number of clusters, namely the Akaike Information Criterion (AIC), the corrected Akaike Information Criterion (AICc), the Bayesian Information Criterion (BIC), the Classification Likelihood Criterion (CLC), the Approximate Weight of Evidence Criterion (AWE), the Normalized Entropy Criterion (NEC), the Kullback Information Criterion (KIC), the corrected Kullback Information Criterion (KICc), and an approximation of the Kullback Information Criterion (AKICc), are compared according to their success in selecting the number of components, in selecting an appropriate covariance matrix, and in classification accuracy (CA).

Clustering Based on Multivariate Finite Mixture Distributions
Mixture cluster analysis based on a mixture of multivariate distributions assumes that the data to be clustered come from several subgroups, or clusters, each with a distinct multivariate distribution. In mixture cluster analysis, each cluster is represented mathematically by a parametric distribution, such as a multivariate normal distribution, and the entire data set is modeled by a mixture of these distributions.
Assume that there are n observations with p dimensions, so that the observed random sample is expressed as y = (y_1^T, ..., y_n^T)^T. The probability density function of a finite mixture distribution model is given by [7]

f(y_j; Ψ) = Σ_{i=1}^{g} π_i f_i(y_j; θ_i),  (1)

where the f_i(y_j; θ_i) are the probability density functions of the components and the π_i are the mixing proportions or weights, with 0 ≤ π_i ≤ 1 and Σ_{i=1}^{g} π_i = 1 (i = 1, ..., g). The parameter vector Ψ = (π, θ) contains all of the parameters of the mixture model; here θ = (θ_1, θ_2, ..., θ_g) denotes the unknown parameters of the probability density function of the ith component (subgroup or cluster). In Equation (1), the number of components, or clusters, is represented by g.
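For illustration only (the study provides no code), Equation (1) can be sketched in Python for univariate normal components; the function names `normal_pdf` and `mixture_pdf` are hypothetical:

```python
import math

def normal_pdf(y, mu, sigma):
    # Univariate normal density f_i(y; theta_i), with theta_i = (mu, sigma).
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(y, weights, params):
    # f(y; Psi) = sum_i pi_i f_i(y; theta_i), with 0 <= pi_i <= 1 and sum_i pi_i = 1.
    assert abs(sum(weights) - 1.0) < 1e-9, "mixing proportions must sum to one"
    return sum(w * normal_pdf(y, mu, s) for w, (mu, s) in zip(weights, params))
```

With g = 1 the mixture reduces to a single normal density; for multivariate components only the component density changes.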
The mixture likelihood approach can be used to estimate the parameters of the mixture model. This approach assumes that the probability function is a sum of weighted component densities. If the mixture likelihood approach is used for clustering, the clustering problem becomes one of estimating the parameters of a mixture distribution model. The likelihood function is given as follows [8]:

L(Ψ) = Π_{j=1}^{n} Σ_{i=1}^{g} π_i f_i(y_j; θ_i).

The most widely used approach for parameter estimation is the Expectation-Maximization (EM) algorithm [9].
In the EM framework, the data y = (y_1^T, y_2^T, ..., y_n^T)^T are considered incomplete because the associated component-label vectors z_1, z_2, ..., z_n are not observed. The component-label variables z_ij are therefore introduced, where z_ij is defined to be one or zero according to whether y_j did or did not arise from the ith component of the mixture model (i = 1, 2, ..., g; j = 1, 2, ..., n). The complete-data vector is represented as y_c = (y^T, z^T)^T, where z = (z_1^T, ..., z_n^T)^T is the unobservable vector of component-indicator variables. The log-likelihood function for the complete data is

logL_c(Ψ) = Σ_{i=1}^{g} Σ_{j=1}^{n} z_ij { log π_i + log f_i(y_j; θ_i) }.  (5)

EM Algorithm
The EM algorithm is applied to this problem by treating the z_ij as missing data. In this section, the E and M steps of the EM algorithm are described for mixture distribution models [7].
E step: Since the complete-data log-likelihood is linear in the unobserved labels z_ij, the E step only requires the conditional expected values of the indicator variables Z_ij given the observed data y, where Z_ij is the random variable corresponding to z_ij. An initial value Ψ^(0) is assigned to the parameter vector Ψ. In the first loop of the EM algorithm, given y, the conditional expected value of logL_c(Ψ) is calculated at the initial value Ψ^(0):

Q(Ψ; Ψ^(0)) = E_{Ψ^(0)} { logL_c(Ψ) | y }.  (6)

In the (k + 1)th loop of the E step, the quantity Q(Ψ; Ψ^(k)) must be computed, where Ψ^(k) is the value of Ψ obtained at the kth step of EM. In the E step of the (k + 1)th loop, for i = 1, 2, ..., g and j = 1, 2, ..., n, the following is calculated:

τ_i(y_j; Ψ^(k)) = π_i^(k) f_i(y_j; θ_i^(k)) / Σ_{h=1}^{g} π_h^(k) f_h(y_j; θ_h^(k)).  (7)
Here, τ_i(y_j; Ψ^(k)) is the posterior probability of membership of pattern y_j in component i. Given y, using the expression in Equation (7), the conditional expectation of the complete-data log-likelihood in Equation (5) can be calculated as

Q(Ψ; Ψ^(k)) = Σ_{i=1}^{g} Σ_{j=1}^{n} τ_i(y_j; Ψ^(k)) { log π_i + log f_i(y_j; θ_i) }.

M step: In the (k + 1)th loop of the EM algorithm, the estimate Ψ^(k+1) of Ψ, defined on the parameter space Ω, is the value that maximizes Q(Ψ; Ψ^(k)). In the finite mixture model, the current estimate π_i^(k+1) of π_i is computed independently of the updated vector ξ of unknown parameters in the component densities.
If the z_ij were observed, the maximum likelihood estimate of π_i for the complete data would be

π̂_i = Σ_{j=1}^{n} z_ij / n.  (9)

In the E step of the EM algorithm, the values τ_i(y_j; Ψ^(k)) are used in place of the unobserved z_ij. Accordingly, the current estimate π_i^(k+1) of π_i is calculated by substituting τ_i(y_j; Ψ^(k)) for z_ij in Equation (9):

π_i^(k+1) = Σ_{j=1}^{n} τ_i(y_j; Ψ^(k)) / n.

In the (k + 1)th iteration of the M step, the current value ξ^(k+1) of ξ is defined as the maximizer of Q(Ψ; Ψ^(k)) with respect to ξ. The E and M steps are repeated until the convergence criterion of the EM algorithm is satisfied. As a convenient stopping rule, the algorithm is terminated when the difference L(Ψ^(k+1)) − L(Ψ^(k)) is sufficiently small or stable.
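The E and M steps above can be sketched end-to-end for a two-component univariate normal mixture (a minimal illustration under this setup, not the authors' implementation; all names are hypothetical):

```python
import math
import random

def normal_pdf(y, mu, sigma):
    # Univariate normal density.
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def em_gmm_1d(data, n_iter=200, tol=1e-8):
    """EM for a 2-component univariate normal mixture.
    Returns (weights, means, sds) once the log-likelihood stabilizes."""
    # Crude initialisation from the data range.
    mus = [min(data), max(data)]
    sds = [1.0, 1.0]
    pis = [0.5, 0.5]
    prev_ll = -math.inf
    for _ in range(n_iter):
        # E step: posterior membership probabilities tau_ij (Equation (7)).
        taus = []
        for y in data:
            dens = [pis[i] * normal_pdf(y, mus[i], sds[i]) for i in range(2)]
            tot = sum(dens)
            taus.append([d / tot for d in dens])
        # M step: update pi_i, mu_i, sigma_i from the responsibilities.
        for i in range(2):
            n_i = sum(t[i] for t in taus)
            pis[i] = n_i / len(data)
            mus[i] = sum(t[i] * y for t, y in zip(taus, data)) / n_i
            var = sum(t[i] * (y - mus[i]) ** 2 for t, y in zip(taus, data)) / n_i
            sds[i] = math.sqrt(max(var, 1e-12))
        # Stopping rule: change in observed-data log-likelihood below tol.
        ll = sum(math.log(sum(pis[i] * normal_pdf(y, mus[i], sds[i])
                              for i in range(2))) for y in data)
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pis, mus, sds
```

On well-separated clusters the estimated means converge to the component means; in the multivariate case only the density and the covariance update change.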

The Mixture of Multivariate Normal Distribution
The mixture density function of the multivariate normal distribution is given by [7]

f(y_j; Ψ) = Σ_{i=1}^{g} π_i Φ(y_j; μ_i, Σ_i),

where Φ(y_j; μ_i, Σ_i) is the multivariate normal density function,

Φ(y_j; μ_i, Σ_i) = (2π)^{-p/2} |Σ_i|^{-1/2} exp{ -(1/2)(y_j - μ_i)^T Σ_i^{-1} (y_j - μ_i) }.

Here, μ_i is the mean vector and Σ_i is the covariance matrix, i = 1, 2, ..., g, and j = 1, 2, ..., n.
In this case, all unknown parameters of the model are written as Ψ = (π_1, ..., π_{g−1}, ξ^T)^T, where ξ consists of the component mean vectors μ = (μ_1, μ_2, ..., μ_g) and the component covariance matrices Σ = (Σ_1, Σ_2, ..., Σ_g) of the component densities in the mixture model. The posterior probability is given as

τ_i(y_j; Ψ^(k)) = π_i^(k) Φ(y_j; μ_i^(k), Σ_i^(k)) / Σ_{h=1}^{g} π_h^(k) Φ(y_j; μ_h^(k), Σ_h^(k)).

The maximum likelihood estimates of the updated mixing proportion π_i and mean vector μ_i in the (k + 1)th iteration of the M step are calculated, respectively, by

π_i^(k+1) = Σ_{j=1}^{n} τ_i(y_j; Ψ^(k)) / n,
μ_i^(k+1) = Σ_{j=1}^{n} τ_i(y_j; Ψ^(k)) y_j / Σ_{j=1}^{n} τ_i(y_j; Ψ^(k)).

The current estimate of the covariance matrix Σ_i of the component density is calculated via

Σ_i^(k+1) = Σ_{j=1}^{n} τ_i(y_j; Ψ^(k)) (y_j − μ_i^(k+1)) (y_j − μ_i^(k+1))^T / Σ_{j=1}^{n} τ_i(y_j; Ψ^(k)).
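The weighted M-step updates for μ_i and Σ_i can be illustrated directly from the responsibilities τ_ij (a hypothetical pure-Python helper, not code from the study; points are p-dimensional tuples):

```python
def m_step_component(data, tau_i):
    """Weighted-MLE updates for one component of a multivariate normal mixture:
    mu_i    = sum_j tau_ij y_j / sum_j tau_ij
    Sigma_i = sum_j tau_ij (y_j - mu_i)(y_j - mu_i)^T / sum_j tau_ij
    `data` is a list of p-dimensional points; `tau_i` their responsibilities."""
    n_i = sum(tau_i)
    p = len(data[0])
    # Weighted mean vector.
    mu = [sum(t * y[k] for t, y in zip(tau_i, data)) / n_i for k in range(p)]
    # Weighted covariance matrix (outer products of centred points).
    sigma = [[sum(t * (y[r] - mu[r]) * (y[c] - mu[c]) for t, y in zip(tau_i, data)) / n_i
              for c in range(p)] for r in range(p)]
    return mu, sigma
```

With all responsibilities equal to one, the update reduces to the ordinary sample mean and the (MLE) sample covariance of the component's points.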

Information Criteria for Model Selection in Model Based Clustering
Model selection is one of the most important problems in mixture cluster analysis based on the mixture of multivariate normal distributions. It includes determining the number of components (clusters) and selecting an appropriate covariance structure. Information criteria are often used for model selection in mixture cluster analysis. In the literature, information criteria are usually computed as

−2 logL(Ψ̂) + 2C,

where the first term measures the lack of fit and the second term, C, is a measure of model complexity, usually called the penalty term. The model that minimizes −2 logL(Ψ̂) + 2C is selected as the best model. Some commonly used information criteria in the literature are given below [6,7,10]:

‚ If the number of parameters in the model is denoted by d, Akaike's Information Criterion (AIC) is defined as

AIC = −2 logL(Ψ̂) + 2d.

The model with the minimum AIC score is selected as the best model [11].
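As a sketch, AIC is a one-line computation from the maximized log-likelihood (a hypothetical helper, not code from the study):

```python
def aic(loglik, d):
    # AIC = -2 logL(Psi_hat) + 2 d; smaller values indicate a better model.
    return -2.0 * loglik + 2.0 * d
```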

‚ When d is large relative to the sample size n (including when n is small, for any d), a small-sample version, AICc, is used. AICc is defined as

AICc = −2 logL(Ψ̂) + 2dn / (n − d − 1).

The model that yields the minimum AICc score is selected as the best model [12].
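A sketch of the standard small-sample correction (algebraically equal to AIC + 2d(d + 1)/(n − d − 1); hypothetical helper):

```python
def aicc(loglik, d, n):
    # AICc = -2 logL + 2 d n / (n - d - 1); requires n > d + 1.
    return -2.0 * loglik + 2.0 * d * n / (n - d - 1)
```

As n grows, AICc approaches AIC.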

‚ With d the number of parameters in the mixture model and n the number of observations, the Bayesian Information Criterion (BIC) is calculated as

BIC = −2 logL(Ψ̂) + d log(n).

The model that gives the minimum BIC score is selected as the best model [13].
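A corresponding sketch for BIC (hypothetical helper):

```python
import math

def bic(loglik, d, n):
    # BIC = -2 logL + d log(n); the penalty grows with the sample size.
    return -2.0 * loglik + d * math.log(n)
```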

‚ The Hathaway [14] mixture log-likelihood is formulated as

logL(Ψ) = logL_c(Ψ) + EN(τ),  (22)

where the entropy of the fuzzy classification matrix C = (τ_ij) is defined as

EN(τ) = −Σ_{i=1}^{g} Σ_{j=1}^{n} τ_ij log τ_ij.  (23)

The Classification Likelihood Criterion (CLC) is defined as

CLC = −2 logL(Ψ̂) + 2 EN(τ̂).  (24)

The model that gives the minimum CLC score is selected as the best model [15].
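Equations (23) and (24) can be sketched as follows (hypothetical helpers; `tau` is the matrix of posterior membership probabilities):

```python
import math

def entropy(tau):
    # EN(tau) = -sum_ij tau_ij log tau_ij; zero for a hard classification.
    return -sum(t * math.log(t) for row in tau for t in row if t > 0.0)

def clc(loglik, tau):
    # CLC = -2 logL(Psi_hat) + 2 EN(tau_hat).
    return -2.0 * loglik + 2.0 * entropy(tau)
```

A sharp (near 0/1) classification contributes almost no entropy penalty, while heavily overlapping clusters are penalized.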

‚ The Approximate Weight of Evidence (AWE) criterion is expressed as

AWE = −2 logL_c(Ψ̂) + 2d(3/2 + log n).

The model that gives the minimum AWE score is selected as the best model [16].

‚ The Normalized Entropy Criterion (NEC) is given by [17]

NEC_g = EN(τ̂) / ( logL(Ψ̂) − logL(Ψ̂*) ),

where Ψ̂* is the maximum likelihood estimate of Ψ when g = 1. The number of components g that minimizes NEC is selected as the number of clusters. When g = 1, the entropy takes the value of zero; for this case, Biernacki et al. [18] suggested selecting the minimum value of NEC over g > 1, provided NEC < 1.
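The NEC ratio above is a direct computation once the two log-likelihoods and the entropy are available (hypothetical helper):

```python
def nec(entropy_g, loglik_g, loglik_1):
    # NEC_g = EN(tau_hat) / (logL_g - logL_1); defined only for g > 1,
    # where loglik_1 is the maximized log-likelihood of the one-component model.
    return entropy_g / (loglik_g - loglik_1)
```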

‚ Cavanaugh [19] proposed an asymptotically unbiased estimator of the Kullback information, the Kullback Information Criterion (KIC), defined as

KIC = −2 logL(Ψ̂) + 3(d + 1).
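A sketch of KIC (hypothetical helper):

```python
def kic(loglik, d):
    # KIC = -2 logL + 3 (d + 1); a slightly heavier penalty than AIC's 2d.
    return -2.0 * loglik + 3.0 * (d + 1)
```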

‚ The bias-corrected Kullback information criterion (KICc) and an approximation of it (AKICc) are given in [20,21]. KICc is defined as

KICc = −2 logL(Ψ̂) + 2(d + 1)n / (n − d − 2) − n ψ((n − d)/2) + n log(n/2).

Here, d is the number of parameters in the model, n is the sample size, and ψ(·) is the digamma (psi) function; AKICc replaces the digamma term with its asymptotic approximation.
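A sketch of KICc as written above; to stay free of external libraries, ψ is approximated here by a central difference of `math.lgamma` (an assumption of this illustration, adequate for moderate arguments):

```python
import math

def digamma(x, h=1e-6):
    # psi(x) approximated as the derivative of log-Gamma via central differences.
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)

def kicc(loglik, d, n):
    # KICc = -2 logL + 2 (d+1) n / (n-d-2) - n psi((n-d)/2) + n log(n/2);
    # requires n > d + 2.
    return (-2.0 * loglik + 2.0 * (d + 1) * n / (n - d - 2)
            - n * digamma((n - d) / 2.0) + n * math.log(n / 2.0))
```

For large n the digamma and logarithm terms nearly cancel, so KICc approaches KIC.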

Application and Results
In this section, the performances of the information criteria used to determine the number of clusters are compared. Moreover, the efficiency of different types of covariance matrices is investigated in model-based clustering. The comparison of the information criteria is performed in two settings. First, commonly used real data sets are used. Second, synthetic data sets are generated using the properties of these real data sets and used for comparison.
The properties of the real data sets are given in Table 1. The computed information criteria for each data set are provided in Tables 2-8. The appropriate number of clusters is determined as the value that minimizes each information criterion. According to Table 2, the number of clusters of the Liver Disorders data set is correctly determined by AWE, BIC, KICc, and NEC. In Table 3, AICc and KIC accurately determine the number of clusters of the Iris data set. In Table 4, the number of clusters of the Wine data set is correctly determined by AIC and KIC.
According to Table 5, the number of clusters of the Ruspini [23] data set is correctly determined by AICc, AKICc, BIC, and KICc. In Table 6, the number of clusters of the Vehicle Silhouettes data set is correctly determined by AIC, BIC, CLC, and KIC. According to Table 7, the number of clusters of the Landsat Satellite data set is correctly determined by AIC and KIC. In Table 8, the number of clusters of the Image Segmentation data set is correctly determined by all information criteria.
In Tables 2-8, the performance of each information criterion varies across data sets. In order to draw general conclusions, a simulation study is provided. Using the properties of each real data set (Liver, Iris, Wine, Ruspini, Vehicle, Landsat, and Image), 1000 synthetic data sets are generated per real data set, and the accuracy of cluster-number determination is computed for each information criterion. The results are given in Table 9 and Figure 1. According to the simulation results, the best results are obtained with KIC.
The efficiency of different types of covariance structures in mixture clustering based on a mixture of multivariate normal distributions is investigated.
According to the number of clusters in each data set, classification accuracy and information criteria are computed for each covariance structure. The results are given in Table 10.

Conclusions
In this study, we compared the effectiveness of information criteria in clustering analysis based on the mixture of multivariate normal distributions. In the simulation study, KIC gave better results than the other information criteria in determining the number of clusters. The efficiency of different types of covariance matrices was also investigated in model-based clustering; the best results were obtained by using a separate covariance matrix for each subgroup (Type III).


Figure 1. According to the synthetic data sets: (a) the percentage of success in determining the number of clusters for the best six information criteria; (b) the average success in determining the number of clusters for all information criteria.


Figure 2. The efficiency of different covariance types in mixture clustering, according to classification accuracy.


Table 1.
Descriptions of real data sets.

Table 2.
Information criteria results in the determination of the number of clusters for the Liver Disorders data set.
Note: * True value of g or value of g given by the criterion. AIC: Akaike information criterion; AICc: corrected Akaike information criterion; AKICc: approximation of the Kullback information criterion; AWE: approximate weight of evidence criterion; BIC: Bayesian information criterion; CLC: classification likelihood criterion; KIC: Kullback information criterion; KICc: corrected Kullback information criterion; NEC: normalized entropy criterion.

Table 3.
Information criteria results in the determination of the number of clusters for the Iris data set.
Note: * True value of g or value of g given by criterion.

Table 4.
Information criteria results in the determination of the number of clusters for the Wine data set.
Note: * True value of g or value of g given by the criterion. KICc could not be calculated because d is greater than n.

Table 5.
Information criteria results in the determination of the number of clusters for the Ruspini data set.
Note: * True value of g or value of g given by criterion.

Table 6.
Information criteria results in the determination of the number of clusters for the Vehicle Silhouettes data set.
Note: * True value of g or value of g given by the criterion. KICc (g = 5) could not be calculated because d is greater than n.

Table 7.
Information criteria results in the determination of the number of clusters for the Landsat Satellite data set.
Note: * True value of g or value of g given by the criterion. AWE and NEC found g = 2.

Table 8.
Information criteria results in the determination of the number of clusters for the Image Segmentation data set.
Note: * True value of g or value of g given by criterion.


Table 9.
The accuracy of determining the cluster numbers from the information criteria according to synthetic data sets.
Note: The best performance is indicated in bold.


Table 10.
Classification accuracy (CA) and information criteria results for real data sets, according to different types of covariance structures.