Abstract
Clustering analysis based on a mixture of multivariate normal distributions is commonly used for clustering multidimensional data sets. Model selection is one of the most important problems in this setting: it involves determining the number of components (clusters) and selecting an appropriate covariance structure. In this study, the efficiency of information criteria commonly used in model selection is examined. Their effectiveness is assessed by how successfully they select the number of components and an appropriate covariance matrix.
1. Introduction
Models for mixtures of distributions, first discussed by Newcomb [1] and Pearson [2], are currently very popular in clustering. Wolfe [3,4] and Day [5] proposed multivariate normal mixture models for cluster analysis. When each component of the data set is modeled by a multivariate normal distribution, the most important problems in clustering are choosing the number of components and identifying the structure of the covariance matrix. Oliveira-Brochado and Martins [6] reviewed the information criteria used to determine the number of components in a mixture model. Although many criteria have been proposed for this purpose, they do not always give accurate results; in particular, on real data sets with a known number of clusters, different criteria often point to different numbers of components. In this study, commonly used criteria for determining the number of clusters, namely the Akaike Information Criterion (AIC), the corrected Akaike Information Criterion (AICc), the Bayesian Information Criterion (BIC), the Classification Likelihood Criterion (CLC), the Approximate Weight of Evidence criterion (AWE), the Normalized Entropy Criterion (NEC), the Kullback Information Criterion (KIC), the corrected Kullback Information Criterion (KICc), and an approximation of the corrected Kullback Information Criterion (AKICc), are compared according to their success in selecting the number of components, in selecting an appropriate covariance matrix, and in terms of classification accuracy (CA).
2. Clustering Based on Multivariate Finite Mixture Distributions
Mixture cluster analysis based on a mixture of multivariate distributions assumes that the data to be clustered come from several subgroups or clusters with distinct multivariate distributions. In mixture cluster analysis, each cluster is represented mathematically by a parametric distribution, such as the multivariate normal distribution, and the entire data set is modeled by a mixture of these distributions.
Assume that there are $n$ observations in $p$ dimensions, so that the observed random sample is expressed as $\mathbf{x} = (\mathbf{x}_1^{\top}, \dots, \mathbf{x}_n^{\top})^{\top}$. The probability density function of a finite mixture distribution model is given by [7]
$$f(\mathbf{x}_j; \Psi) = \sum_{i=1}^{g} \pi_i f_i(\mathbf{x}_j; \theta_i), \qquad (1)$$
where $f_i(\mathbf{x}_j; \theta_i)$ are the probability density functions of the components and $\pi_i$ are the mixing proportions or weights, with $0 \le \pi_i \le 1$ and $\sum_{i=1}^{g} \pi_i = 1$. The parameter vector $\Psi = (\pi_1, \dots, \pi_{g-1}, \theta_1^{\top}, \dots, \theta_g^{\top})^{\top}$ contains all of the parameters of the mixture model, where $\theta_i$ denotes the unknown parameters of the probability density function of the $i$th component (subgroup or cluster). In Equation (1), the number of components or clusters is represented by $g$.
The mixture likelihood approach can be used for estimation of the parameters in the mixture models. This approach assumes that the density of the data can be written as a weighted sum of component densities. If the mixture likelihood approach is used for clustering, the clustering problem becomes a problem of estimating the parameters of a mixture distribution model. The log-likelihood function to be maximized is given as follows [8]
$$\log L(\Psi) = \sum_{j=1}^{n} \log \left\{ \sum_{i=1}^{g} \pi_i f_i(\mathbf{x}_j; \theta_i) \right\}. \qquad (2)$$
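As an illustration of Equation (2), the following sketch (not part of the original paper) evaluates the mixture log-likelihood of a multivariate normal mixture with NumPy and SciPy; the two-component parameter values at the end are purely hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_log_likelihood(X, weights, means, covs):
    """Equation (2): sum_j log( sum_i pi_i * f_i(x_j; theta_i) ) for normal components."""
    # Weighted density of every observation under every component, shape (n, g).
    weighted = np.column_stack([
        w * multivariate_normal.pdf(X, mean=m, cov=c)
        for w, m, c in zip(weights, means, covs)
    ])
    return np.sum(np.log(weighted.sum(axis=1)))

# Hypothetical two-component example in two dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 100),
               rng.multivariate_normal([3, 3], np.eye(2), 100)])
print(mixture_log_likelihood(X, [0.5, 0.5], [[0, 0], [3, 3]], [np.eye(2), np.eye(2)]))
```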
The most widely used approach for parameter estimation is the Expectation-Maximization (EM) algorithm [9].
In the EM framework, the data are considered incomplete because the associated component-label vectors are not observed. The component-label vectors $\mathbf{z}_1, \dots, \mathbf{z}_n$ are therefore introduced, where the element $z_{ij}$ is defined to be one or zero according to whether $\mathbf{x}_j$ did or did not arise from the $i$th component of the mixture model ($i = 1, \dots, g$; $j = 1, \dots, n$). The complete-data vector is represented as follows
$$\mathbf{x}_c = (\mathbf{x}_1^{\top}, \dots, \mathbf{x}_n^{\top}, \mathbf{z}^{\top})^{\top}, \qquad (3)$$
where
$$\mathbf{z} = (\mathbf{z}_1^{\top}, \dots, \mathbf{z}_n^{\top})^{\top} \qquad (4)$$
is the unobservable vector of component-indicator variables. The log-likelihood function for the complete data is shown as
$$\log L_c(\Psi) = \sum_{i=1}^{g} \sum_{j=1}^{n} z_{ij} \left\{ \log \pi_i + \log f_i(\mathbf{x}_j; \theta_i) \right\}. \qquad (5)$$
3. EM Algorithm
The EM algorithm is applied to this problem by treating the $z_{ij}$ as missing data. In this section, the E and M steps of the EM algorithm are described for the mixture distribution models [7].
E step: Since the complete-data log-likelihood $\log L_c(\Psi)$ is linear in the unobserved labels $z_{ij}$, the E step only requires the conditional expectations of the component-label variables $Z_{ij}$ given the observed data $\mathbf{x}$, where $Z_{ij}$ is the random variable corresponding to $z_{ij}$. An initial value $\Psi^{(0)}$ is assigned to the parameter vector $\Psi$. In the first iteration of the EM algorithm, the conditional expectation of $\log L_c(\Psi)$ given $\mathbf{x}$ is computed using this initial value $\Psi^{(0)}$.
In the E step of the (k + 1)th iteration of the EM algorithm, the quantity $Q(\Psi; \Psi^{(k)}) = E_{\Psi^{(k)}}\{\log L_c(\Psi) \mid \mathbf{x}\}$ must be computed, where $\Psi^{(k)}$ is the value of the parameter vector obtained in the kth iteration. Because $\log L_c(\Psi)$ is linear in the $z_{ij}$, this E step reduces to evaluating
$$E_{\Psi^{(k)}}\{Z_{ij} \mid \mathbf{x}\} = P_{\Psi^{(k)}}\{Z_{ij} = 1 \mid \mathbf{x}\} = \tau_i(\mathbf{x}_j; \Psi^{(k)}), \qquad (6)$$
which is calculated by the formula below:
$$\tau_i(\mathbf{x}_j; \Psi^{(k)}) = \frac{\pi_i^{(k)} f_i(\mathbf{x}_j; \theta_i^{(k)})}{\sum_{h=1}^{g} \pi_h^{(k)} f_h(\mathbf{x}_j; \theta_h^{(k)})}. \qquad (7)$$
Here, $\tau_i(\mathbf{x}_j; \Psi^{(k)})$ is the posterior probability that pattern $\mathbf{x}_j$ belongs to the $i$th component (segment) of the mixture. Once $\tau_i(\mathbf{x}_j; \Psi^{(k)})$ is obtained from Equation (7), the conditional expectation of the complete-data log-likelihood in Equation (5) can be calculated as follows
$$Q(\Psi; \Psi^{(k)}) = \sum_{i=1}^{g} \sum_{j=1}^{n} \tau_i(\mathbf{x}_j; \Psi^{(k)}) \left\{ \log \pi_i + \log f_i(\mathbf{x}_j; \theta_i) \right\}. \qquad (8)$$
M step: In the (k + 1)th iteration of the EM algorithm, the value $\Psi^{(k+1)}$ in the parameter space $\Omega$ that maximizes $Q(\Psi; \Psi^{(k)})$ is calculated. In the finite mixture distribution model, the updated estimates of the mixing proportions $\pi_i$ are obtained independently of the updated vector of unknown parameters $\theta_i$ of the component densities.
If the $z_{ij}$'s were observed, the complete-data maximum likelihood estimate of $\pi_i$ would be
$$\hat{\pi}_i = \frac{\sum_{j=1}^{n} z_{ij}}{n}. \qquad (9)$$
Because the E step replaces the complete-data log-likelihood by its conditional expectation, the posterior probabilities $\hat{\tau}_{ij}^{(k)} = \tau_i(\mathbf{x}_j; \Psi^{(k)})$ take the place of the unobserved $z_{ij}$. Accordingly, the current estimate of $\pi_i$ is obtained by substituting $\hat{\tau}_{ij}^{(k)}$ for $z_{ij}$ in Equation (9), as shown below:
$$\pi_i^{(k+1)} = \frac{\sum_{j=1}^{n} \hat{\tau}_{ij}^{(k)}}{n}. \qquad (10)$$
In the M step of the (k + 1)th iteration of the EM algorithm, the updated estimate $\Psi^{(k+1)}$ is defined as
$$\Psi^{(k+1)} = \arg\max_{\Psi \in \Omega} Q(\Psi; \Psi^{(k)}). \qquad (11)$$
The E and M steps are repeated until the convergence criterion of the EM algorithm is satisfied. As a convenient stopping rule, the algorithm is terminated when the difference $\log L(\Psi^{(k+1)}) - \log L(\Psi^{(k)})$ becomes sufficiently small or stable.
4. The Mixture of Multivariate Normal Distribution
The mixture density of multivariate normal distributions is given by [7]
$$f(\mathbf{x}_j; \Psi) = \sum_{i=1}^{g} \pi_i \, \phi(\mathbf{x}_j; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i), \qquad (12)$$
where $\phi(\mathbf{x}_j; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$ is the multivariate normal density function, such that
$$\phi(\mathbf{x}_j; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) = (2\pi)^{-p/2} \, |\boldsymbol{\Sigma}_i|^{-1/2} \exp\left\{ -\tfrac{1}{2} (\mathbf{x}_j - \boldsymbol{\mu}_i)^{\top} \boldsymbol{\Sigma}_i^{-1} (\mathbf{x}_j - \boldsymbol{\mu}_i) \right\}. \qquad (13)$$
Here, $\boldsymbol{\mu}_i$ is the mean vector and $\boldsymbol{\Sigma}_i$ is the covariance matrix of the $i$th component, $i = 1, \dots, g$ and $j = 1, \dots, n$. In this case, all unknown parameters of the model are collected in $\Psi = (\pi_1, \dots, \pi_{g-1}, \theta^{\top})^{\top}$, where $\theta$ consists of the component mean vectors and the distinct elements of the component covariance matrices of the component probability density functions in the mixture distribution model. The posterior probability is given as
$$\tau_i(\mathbf{x}_j; \Psi^{(k)}) = \frac{\pi_i^{(k)} \phi(\mathbf{x}_j; \boldsymbol{\mu}_i^{(k)}, \boldsymbol{\Sigma}_i^{(k)})}{\sum_{h=1}^{g} \pi_h^{(k)} \phi(\mathbf{x}_j; \boldsymbol{\mu}_h^{(k)}, \boldsymbol{\Sigma}_h^{(k)})}. \qquad (14)$$
The maximum likelihood estimators of the updated mixing proportions $\pi_i$ and mean vectors $\boldsymbol{\mu}_i$ in the M step of the (k + 1)th iteration are calculated, respectively, by
$$\pi_i^{(k+1)} = \frac{\sum_{j=1}^{n} \tau_i(\mathbf{x}_j; \Psi^{(k)})}{n}, \qquad (15)$$
$$\boldsymbol{\mu}_i^{(k+1)} = \frac{\sum_{j=1}^{n} \tau_i(\mathbf{x}_j; \Psi^{(k)}) \, \mathbf{x}_j}{\sum_{j=1}^{n} \tau_i(\mathbf{x}_j; \Psi^{(k)})}. \qquad (16)$$
The current estimate of the covariance matrix $\boldsymbol{\Sigma}_i$ of the $i$th component density is calculated via the following formula
$$\boldsymbol{\Sigma}_i^{(k+1)} = \frac{\sum_{j=1}^{n} \tau_i(\mathbf{x}_j; \Psi^{(k)}) \, (\mathbf{x}_j - \boldsymbol{\mu}_i^{(k+1)})(\mathbf{x}_j - \boldsymbol{\mu}_i^{(k+1)})^{\top}}{\sum_{j=1}^{n} \tau_i(\mathbf{x}_j; \Psi^{(k)})}. \qquad (17)$$
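Putting the E step of Equation (14) and the M step of Equations (15)-(17) together gives a compact EM loop. The sketch below, which assumes the NumPy/SciPy imports from the earlier sketch and user-supplied starting values, is a minimal illustration rather than the exact implementation used in the study.

```python
def em_gaussian_mixture(X, weights, means, covs, max_iter=200, tol=1e-6):
    """Minimal EM loop for a mixture of multivariate normals, Equations (14)-(17)."""
    n, p = X.shape
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E step, Equation (14): posterior membership probabilities tau_ij.
        dens = np.column_stack([w * multivariate_normal.pdf(X, mean=m, cov=c)
                                for w, m, c in zip(weights, means, covs)])
        tau = dens / dens.sum(axis=1, keepdims=True)           # shape (n, g)

        # M step, Equations (15)-(17): update weights, means and covariances.
        n_i = tau.sum(axis=0)                                   # effective cluster sizes
        weights = n_i / n
        means = [tau[:, i] @ X / n_i[i] for i in range(len(n_i))]
        covs = []
        for i in range(len(n_i)):
            diff = X - means[i]
            covs.append((tau[:, i][:, None] * diff).T @ diff / n_i[i])

        # Stopping rule: terminate when the log-likelihood change becomes small.
        ll = np.sum(np.log(dens.sum(axis=1)))
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return weights, means, covs, tau, ll
```

In practice the starting means might be k-means centroids or randomly chosen observations; as with any EM run, the result depends on the initial values, so several restarts are advisable.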
5. Information Criteria for Model Selection in Model Based Clustering
Model selection is one of the most important problems in mixture cluster analysis based on the mixture of multivariate normal distributions. It includes the determination of the number of components (clusters) and the selection of an appropriate covariance structure. Information criteria are often used for this purpose. In the literature, information criteria are usually written as minus twice the maximized log-likelihood plus a bias-correction (penalty) term,
$$IC = -2 \log L(\hat{\Psi}) + C. \qquad (18)$$
Here, the first term measures the lack of fit of the model and the second term, $C$, is a measure of complexity; $C$ is usually called the penalty term. The model that minimizes the criterion is selected as the best model. Some information criteria commonly used in the literature are given below [6,7,10]; a computational sketch of several of them follows the list:
- If the number of parameters in the model is denoted by $d$, Akaike's Information Criterion (AIC) is defined as
$$\mathrm{AIC} = -2 \log L(\hat{\Psi}) + 2d. \qquad (19)$$
The model that yields the minimum AIC score can be selected as the best model [11].
- When $d$ is large relative to the sample size $n$ (which includes the case where $n$ is small, for any $d$), a small-sample corrected version, AICc, is used. It is defined as
$$\mathrm{AICc} = -2 \log L(\hat{\Psi}) + 2d + \frac{2d(d+1)}{n - d - 1}. \qquad (20)$$
The model that yields the minimum AICc score can be selected as the best model [12].
- If $d$ denotes the number of parameters in the mixture distribution model and $n$ the number of observations, the Bayesian Information Criterion (BIC) can be calculated as
$$\mathrm{BIC} = -2 \log L(\hat{\Psi}) + d \log n. \qquad (21)$$
The model that yields the minimum BIC score can be selected as the best model [13].
- Hathaway [14] showed that the mixture log-likelihood can be formulated as
$$\log L(\Psi) = \sum_{i=1}^{g} \sum_{j=1}^{n} \tau_{ij} \log \left\{ \pi_i f_i(\mathbf{x}_j; \theta_i) \right\} + \mathrm{EN}(\tau), \qquad (22)$$
where the term in Equation (23) is defined as
$$\mathrm{EN}(\tau) = - \sum_{i=1}^{g} \sum_{j=1}^{n} \tau_{ij} \log \tau_{ij}, \qquad (23)$$
the entropy of the fuzzy classification matrix $\tau = (\tau_{ij})$. The Classification Likelihood Criterion (CLC) is defined as
$$\mathrm{CLC} = -2 \log L(\hat{\Psi}) + 2 \, \mathrm{EN}(\hat{\tau}). \qquad (24)$$
The model that yields the minimum CLC score can be selected as the best model [15].
- The Approximate Weight of Evidence criterion (AWE) is expressed as
$$\mathrm{AWE} = -2 \log L_c(\hat{\Psi}) + 2d \left( \tfrac{3}{2} + \log n \right), \qquad (25)$$
where $\log L_c$ denotes the classification (complete-data) log-likelihood. The model that yields the minimum AWE score can be selected as the best model [16].
- The Normalized Entropy Criterion (NEC) is given below [17]
$$\mathrm{NEC}(g) = \frac{\mathrm{EN}(\hat{\tau})}{\log L(\hat{\Psi}_g) - \log L(\hat{\Psi}_1)}, \qquad (26)$$
where $\log L(\hat{\Psi}_1)$ is the maximized log-likelihood for a single component (g = 1). The number of components with the minimum NEC is selected as the number of clusters. When g = 1, the entropy takes the value of zero and NEC is not defined. For this case, Biernacki et al. [18] suggested choosing the number of components g > 1 that minimizes NEC, provided this minimum satisfies NEC < 1; otherwise g = 1 is retained.
- Cavanaugh [19] proposed the Kullback Information Criterion (KIC), a large-sample criterion based on an asymptotically unbiased estimator of Kullback's symmetric divergence. KIC is defined as
$$\mathrm{KIC} = -2 \log L(\hat{\Psi}) + 3(d + 1). \qquad (27)$$
The model that yields the minimum KIC score can be selected as the best model.
- A bias-corrected version of the Kullback information criterion (KICc) and an approximation of the Kullback information criterion (AKICc) are given in [20,21]. In these criteria, $d$ is the number of parameters in the model, $n$ is the sample size, and $\psi(\cdot)$ is the digamma (psi) function.
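The closed-form criteria listed above are simple functions of the maximized log-likelihood, the number of free parameters $d$, the sample size $n$, and the entropy of the posterior matrix. The sketch below (not taken from the paper) computes AIC, AICc, BIC, KIC, CLC, and NEC from these quantities; the digamma-based KICc and AKICc are left out of the sketch.

```python
import numpy as np

def entropy(tau, eps=1e-12):
    """EN(tau) of Equation (23): entropy of the fuzzy classification matrix."""
    return -np.sum(tau * np.log(tau + eps))

def information_criteria(log_lik, d, n, tau, log_lik_1=None):
    """AIC, AICc, BIC, KIC, CLC and (when available) NEC for one fitted mixture."""
    crit = {
        "AIC":  -2.0 * log_lik + 2.0 * d,
        "AICc": -2.0 * log_lik + 2.0 * d + 2.0 * d * (d + 1) / (n - d - 1),
        "BIC":  -2.0 * log_lik + d * np.log(n),
        "KIC":  -2.0 * log_lik + 3.0 * (d + 1),
        "CLC":  -2.0 * log_lik + 2.0 * entropy(tau),
    }
    # NEC needs the single-component log-likelihood and is defined only for g > 1.
    if log_lik_1 is not None and log_lik > log_lik_1:
        crit["NEC"] = entropy(tau) / (log_lik - log_lik_1)
    return crit
```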
6. Application and Results
In this section, the performances of the information criteria used for determining the number of clusters are compared, and the efficiency of different types of covariance matrices in model-based clustering is investigated. The comparison of the information criteria is performed in two settings. First, commonly used real data sets are considered. Second, synthetic data sets generated from the properties of these real data sets are used for comparison.
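One way to carry out such a comparison, assuming scikit-learn's GaussianMixture as the EM implementation and the information_criteria helper sketched in Section 5, is to fit the mixture for a range of candidate component numbers and record, for each criterion, the value of g at which it is minimized; this is a sketch of the protocol, not the code used in the study.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_number_of_clusters(X, g_max=6, random_state=0):
    """Fit mixtures for g = 1..g_max and pick, per criterion, the g giving the minimum."""
    n, p = X.shape
    scores, log_lik_1 = {}, None
    for g in range(1, g_max + 1):
        gm = GaussianMixture(n_components=g, covariance_type="full",
                             random_state=random_state).fit(X)
        log_lik = gm.score(X) * n                      # maximized log-likelihood
        d = (g - 1) + g * p + g * p * (p + 1) // 2     # free parameters, full covariances
        if g == 1:
            log_lik_1 = log_lik
        scores[g] = information_criteria(log_lik, d, n, gm.predict_proba(X), log_lik_1)
    criteria = scores[g_max].keys()
    return {c: min((g for g in scores if c in scores[g]), key=lambda g: scores[g][c])
            for c in criteria}
```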
The properties of real data sets are given in Table 1. Moreover, the computed information criteria for each different data set are provided in Table 2, Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8.
Table 1.
Descriptions of real data sets.
Table 2.
Information criteria results in the determination of the number of clusters for the Liver Disorders data set.
Table 3.
Information criteria results in the determination of the number of clusters for the Iris data set.
Table 4.
Information criteria results in the determination of the number of clusters for the Wine data set.
Table 5.
Information criteria results in the determination of the number of clusters for the Ruspini data set.
Table 6.
Information criteria results in the determination of the number of clusters for the Vehicle Silhouettes data set.
Table 7.
Information criteria results in the determination of the number of clusters for the Landsat Satellite data set.
Table 8.
Information criteria results in the determination of the number of clusters for the Image Segmentation data set.
The appropriate number of clusters is determined as the value that minimizes each information criterion. According to Table 2, the number of clusters of the Liver Disorders data set is correctly determined by AWE, BIC, KICc, and NEC. In Table 3, AICc and KIC accurately determine the number of clusters of the Iris data set. In Table 4, the number of clusters of the Wine data set is correctly determined by AIC and KIC.
According to Table 5, the number of clusters of the Ruspini [23] data set is correctly determined via AICc, AKICc, BIC, and KICc. In Table 6, the number of clusters of the Vehicle Silhouettes data set is correctly determined by AIC, BIC, CLC, and KIC.
According to Table 7, the number of clusters for the Landsat Satellite data set is correctly determined via AIC and KIC. In Table 8, the number of clusters for the Image Segmentation data set is correctly determined by all information criteria.
As seen in Table 2, Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8, the performance of each information criterion varies from one data set to another. In order to draw more general conclusions, a simulation study is carried out. Using the properties of each real data set (Liver, Iris, Wine, Ruspini, Vehicle, Landsat, and Image), 1000 synthetic data sets are generated per data set; a sketch of one possible generation scheme is given after Figure 1. The accuracy in determining the number of clusters is then computed for each information criterion. The results are given in Table 9 and Figure 1. According to the simulation results, the best results are obtained with KIC.
Table 9.
The accuracy of determining the cluster numbers from the information criteria according to synthetic data sets.
Figure 1.
According to synthetic data sets, (a) the percentage of success for the determination of the number of clusters from the best six information criteria and (b) the average of success in determining the number of clusters from the information criteria.
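The paper does not spell out the exact generation mechanism for the synthetic data sets; a plausible minimal sketch, under the assumption that each synthetic set is drawn from multivariate normals fitted separately to the classes of the corresponding real data set, is the following.

```python
import numpy as np

def synthetic_from_real(X, labels, rng):
    """One synthetic data set mimicking the per-class structure of (X, labels)."""
    parts, new_labels = [], []
    for c in np.unique(labels):
        Xc = X[labels == c]
        mean = Xc.mean(axis=0)
        cov = np.cov(Xc, rowvar=False)
        parts.append(rng.multivariate_normal(mean, cov, size=len(Xc)))
        new_labels.append(np.full(len(Xc), c))
    return np.vstack(parts), np.concatenate(new_labels)

# Hypothetical use: repeat 1000 times per real data set and record how often each
# information criterion recovers the true number of clusters.
# rng = np.random.default_rng(0)
# X_syn, y_syn = synthetic_from_real(X_real, y_real, rng)
```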
The efficiency of different types of covariance structures in mixture clustering based on a mixture of multivariate normal distributions is investigated.
For the known number of clusters of each data set, the classification accuracy and the information criteria are computed for each covariance structure. The results are given in Table 10.
Table 10.
Classification accuracy (CA) and information criteria results for real data sets, according to different types of covariance structures.
According to Table 10, the Type III structure, in which each subgroup has its own covariance matrix, generally performs best, both in terms of correct classification and in terms of the minimum information criterion values. The classification accuracy of mixture clustering based on a mixture of multivariate normal distributions for the different covariance types is given in Figure 2.
Figure 2.
The efficiency of different covariance types in the mixture clustering, according to the classification accuracy.
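For readers who wish to reproduce a covariance-structure comparison of this kind, scikit-learn's GaussianMixture offers the covariance_type options 'full', 'tied', 'diag', and 'spherical'. These do not necessarily correspond one-to-one to the covariance types of Table 10, but the unrestricted 'full' option gives each subgroup its own covariance matrix, which is the Type III structure referred to above. The sketch below computes BIC and a classification accuracy obtained by optimally matching cluster labels to integer-encoded class labels.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix
from sklearn.mixture import GaussianMixture

def clustering_accuracy(y_true, y_pred):
    """Classification accuracy after optimally matching clusters to classes."""
    cm = confusion_matrix(y_true, y_pred)          # assumes integer-encoded labels
    rows, cols = linear_sum_assignment(-cm)        # maximize correctly matched counts
    return cm[rows, cols].sum() / cm.sum()

def compare_covariance_types(X, y_true, g, random_state=0):
    """Fit a g-component mixture under each covariance structure; report BIC and CA."""
    results = {}
    for cov_type in ("full", "tied", "diag", "spherical"):
        gm = GaussianMixture(n_components=g, covariance_type=cov_type,
                             random_state=random_state).fit(X)
        results[cov_type] = {"BIC": gm.bic(X),
                             "CA": clustering_accuracy(y_true, gm.predict(X))}
    return results
```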
7. Conclusions
In this study, we compared the effectiveness of information criteria in clustering analysis based on the mixture of multivariate normal distributions. In the simulation study, KIC gave better results than the other information criteria in determining the number of clusters. The efficiency of different types of covariance matrices in model-based clustering was also investigated, and the best results were obtained by using a separate covariance matrix for each subgroup (Type III).
Acknowledgments
This research has been supported by the TUBITAK-BIDEB (2211) Ph.D. scholarship program. The author is grateful to the editors and the anonymous referees for their constructive comments and valuable suggestions, which have helped considerably to improve the paper.
Author Contributions
All authors have equally contributed to this paper. They have read and approved the final version of the manuscript.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Newcomb, S. A generalized theory of the combination of observations so as to obtain the best result. Am. J. Math. 1886, 8, 343–366. [Google Scholar] [CrossRef]
- Pearson, K. Contributions to the mathematical theory of evolution. Philos. Trans. R. Soc. Lond. A 1894, 185, 71–110. [Google Scholar] [CrossRef]
- Wolfe, J.H. A Computer Program for the Maximum Likelihood Analysis of Types; U.S. Naval Personnel Research Activity: San Diego, CA, USA, 1965. [Google Scholar]
- Wolfe, J.H. Normix: Computational Methods for Estimating the Parameters of Multivariate Normal Mixtures of Distributions; U.S. Naval Personnel Research Activity: San Diego, CA, USA, 1967. [Google Scholar]
- Day, N.E. Estimating the components of a mixture of normal distributions. Biometrika 1969, 56, 463–474. [Google Scholar] [CrossRef]
- Oliveira-Brochado, A.; Martins, F.V. Assessing the Number of Components in Mixture Models: A Review; Universidade do Porto, Faculdade de Economia do Porto: Porto, Portugal, 2005. [Google Scholar]
- Mclachlan, G.; Peel, D. Finite Mixture Models; John Wiley & Sons, Inc.: New York, NY, USA, 2000. [Google Scholar]
- Fraley, C. Algorithms for model-based Gaussian hierarchical clustering. SIAM J. Sci. Comput. 1998, 20, 270–281. [Google Scholar] [CrossRef]
- Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser B 1977, 39, 1–38. [Google Scholar]
- Servi, T. Multivariate Mixture Distribution Model Based Cluster Analysis. Ph.D. Thesis, University of Cukurova, Adana, Turkey, 2009. [Google Scholar]
- Akaike, H. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory; Petrov, B.N., Csaki, F., Eds.; Akademiai Kiado: Budapest, Hungary, 1973; pp. 267–281. [Google Scholar]
- Hurvich, C.M.; Tsai, C.L. Regression and time series model selection in small samples. Biometrika 1989, 76, 297–307. [Google Scholar] [CrossRef]
- Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464. [Google Scholar] [CrossRef]
- Hathaway, R.J. Another interpretation of the EM algorithm for mixture distributions. Stat. Probab. Lett. 1986, 4, 53–56. [Google Scholar] [CrossRef]
- Biernacki, C.; Govaert, G. Using the classification likelihood to choose the number of clusters. Comput. Sci. Stat. 1997, 29, 451–457. [Google Scholar]
- Banfield, J.D.; Raftery, A.E. Model-based Gaussian and non-Gaussian clustering. Biometrics 1993, 49, 803–821. [Google Scholar] [CrossRef]
- Celeux, G.; Soromenho, G. An entropy criterion for assessing the number of clusters in a mixture. J. Classif. 1996, 13, 195–212. [Google Scholar] [CrossRef]
- Biernacki, C.; Celeux, C.; Govaert, G. An improvement of the NEC criterion for assessing the number of clusters in a mixture model. Pattern Recognit. Lett. 1999, 20, 267–272. [Google Scholar] [CrossRef]
- Cavanaugh, J.E. A large-sample model selection criterion based on Kullback’s symmetric divergence. Stat. Probab. Lett. 1999, 42, 333–343. [Google Scholar] [CrossRef]
- Seghouane, A.-K.; Maiza, B. A small sample model selection criterion based on Kullback’s symmetric divergence. IEEE Trans. Signal Process. 2004, 52, 3314–3323. [Google Scholar] [CrossRef]
- Seghouane, A.-K.; Bekara, M.; Fleury, G. A criterion for model selection in the presence of incomplete data based on Kullback’s symmetric divergence. Signal Process. 2005, 85, 1405–1417. [Google Scholar]
- UCI Machine Learning Repository. Available online: http://archive.ics.uci.edu/ml (accessed on 6 May 2016).
- Ruspini, E.H. Numerical methods for fuzzy clustering. Inf. Sci. 1970, 2, 319–350. [Google Scholar] [CrossRef]
© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).