Article

An Approach for Determining the Number of Clusters in a Model-Based Cluster Analysis

1 Department of Statistics, Yildiz Technical University, 34220 Istanbul, Turkey
2 Department of Statistics, Necmettin Erbakan University, 42090 Konya, Turkey
* Author to whom correspondence should be addressed.
Entropy 2017, 19(9), 452; https://doi.org/10.3390/e19090452
Submission received: 12 July 2017 / Revised: 27 August 2017 / Accepted: 27 August 2017 / Published: 29 August 2017
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

Many methods have been proposed in the literature for determining the number of clusters in cluster analysis, which has applications across a broad range of sciences such as physics, chemistry, biology, engineering, and economics. The aim of this paper is to determine the number of clusters of a dataset in model-based clustering by using an Analytic Hierarchy Process (AHP). In this study, the AHP model has been created by using the information criteria Akaike’s Information Criterion (AIC), Approximate Weight of Evidence (AWE), Bayesian Information Criterion (BIC), Classification Likelihood Criterion (CLC), and Kullback Information Criterion (KIC). The performance of the proposed approach has been tested on common real and synthetic datasets. The proposed approach produces accurate results, and its estimates have been seen to be more accurate than those of the individual information criteria.

1. Introduction

Many clustering algorithms have been encountered in the literature. The clustering algorithms can be categorized into centroid-based clustering, connectivity-based clustering, model-based clustering, and so on [1]. Each of these algorithms is important in its own application area, and model-based clustering in particular has a very wide field of application; therefore, the present study focuses on model-based clustering in combination with the Analytic Hierarchy Process (AHP). Since the AHP is one of the most important multi-criteria decision-making (MCDM) methods [2] and determining the number of clusters can be modeled as an MCDM problem [3,4], this work considers the AHP, combined with model-based clustering, for deciding the number of clusters of a dataset.
Pearson [5] first introduced the idea of the mixture distribution model by studying a mixture of two univariate normal distributions with different means and variances. Later on, many related works [6,7,8] have been carried out. Model-based clustering based on a mixture of distributions has commonly been used in the clustering of datasets. Some of those who used the mixture of multivariate normal distributions in the cluster analysis are Wolfe [9,10], Day [11], and Binder [12]. To estimate parameters in the mixture distribution model, the Expectation-Maximization (EM) algorithm suggested by Dempster et al. [13] has been widely used [14,15].
Model-based clustering based on finite normal mixture models is the most commonly used approach [16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]. In estimating the number of clusters in model-based clustering, information criteria have widely been used [32,33,34,35,36,37,38,39,40,41,42]. Some of the common criteria in the literature are Akaike’s Information Criterion (AIC) [32], Approximate Weight of Evidence (AWE) [37], Bayesian Information Criterion (BIC) [33], Classification Likelihood Criterion (CLC) [39], Kullback Information Criterion (KIC) [40], etc. These information criteria may give different results in estimating the number of clusters of a dataset. For example, while the number of clusters in the Iris dataset [43] is 3, the number of clusters estimated by AIC, AWE, BIC, CLC, and KIC is 4, 2, 2, 4, and 3, respectively (Table 7). Iris is a multivariate dataset introduced by Ronald Fisher [43]; it consists of 50 samples from each of three species: setosa, virginica, and versicolor. Each of the criteria may thus produce a different number of clusters for the same dataset. To overcome this problem, model-based clustering and the AHP have been combined. The AHP model has been created by using the information criteria AIC, AWE, BIC, CLC, and KIC. To the best of the authors’ knowledge, this is the first time that the number of clusters of a dataset in model-based clustering has been determined by using the AHP. Thus, the combined influence of those information criteria is exploited. Satisfactory results have been obtained with the suggested model.
The rest of the paper is organized as follows. Section 2 describes the model-based clustering, the AHP model, and the proposed approach. Section 3 presents details of the experimental study and analyses the results. Finally, Section 4 presents our conclusions and recommendation.

2. Materials and Methods

2.1. The Model-Based Clustering

The model-based clustering assumes that a dataset to be clustered consists of various clusters with different distributions. The entire dataset is modeled by a mixture of these distributions. The clustering assumes a set of n p-dimensional vectors $y_1, \ldots, y_n$ of observations from a multivariate mixture of a finite number of g components or clusters, each with some unknown mixing proportions or weights $\pi_1, \ldots, \pi_g$ [44]. The probability density function (PDF) of finite mixture distribution models can be given by
$$f(y_j; \Psi) = \sum_{i=1}^{g} \pi_i f_i(y_j; \theta_i) \qquad (1)$$
where $f_i(y_j;\theta_i)$ is the PDF of the i-th component. Here $0 \le \pi_i \le 1$ and $\sum_{i=1}^{g}\pi_i = 1$ ($i = 1, \ldots, g$ and $j = 1, 2, \ldots, n$). The parameter vector $\Psi = (\pi, \theta)$ contains all of the parameters of the mixture models. Here $\theta = (\theta_1, \theta_2, \ldots, \theta_g)$ denotes the unknown parameters of the component PDFs in the mixture models [17].
In the model-based clustering, the cluster analysis based on the mixture of multivariate normal distributions is the most commonly used. In this case, in Equation (1), the $f_i(y_j;\theta_i)$ are assumed to be multivariate normal density functions of the form
$$f_i(y_j;\theta_i) = \frac{1}{(2\pi)^{p/2}\,|\Sigma_i|^{1/2}}\; e^{-\frac{1}{2}(y_j-\mu_i)^{T}\Sigma_i^{-1}(y_j-\mu_i)} \qquad (2)$$
where $\mu_i$ and $\Sigma_i$ stand for the mean vector and the covariance matrix of the i-th component, respectively ($i = 1, 2, \ldots, g$). Here, $\theta$ collects the component mean vectors $\mu = (\mu_1, \mu_2, \ldots, \mu_g)$ and the component covariance matrices $\Sigma = (\Sigma_1, \Sigma_2, \ldots, \Sigma_g)$ of the component PDFs in the mixture distribution model [17].
The mixture likelihood approach has been used for estimating the parameters in the mixture models. This approach assumes that the PDF is the sum of the weighted component densities. If the mixture likelihood approach is used for clustering, the clustering problem becomes one of estimating the parameters of the assumed mixture distribution model. The likelihood function can then be given as follows [45]:
$$L(\Psi) = \prod_{j=1}^{n} \sum_{i=1}^{g} \pi_i f_i(y_j; \theta_i). \qquad (3)$$
The most widely used approach for parameter estimation is the EM algorithm [17].
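As an illustration of this estimation step, the following minimal sketch (not the authors' code) fits a finite normal mixture by the EM algorithm using scikit-learn's GaussianMixture; the dataset (Iris) and the choice of g are illustrative assumptions.

```python
# Minimal sketch: fit the g-component normal mixture of Equations (1)-(3) by EM.
# The dataset (Iris) and g = 3 are illustrative assumptions, not the paper's code.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

y = load_iris().data                        # n x p observation matrix
g = 3                                       # candidate number of clusters

model = GaussianMixture(n_components=g, covariance_type="full",
                        max_iter=500, random_state=0).fit(y)

pi_hat = model.weights_                     # mixing proportions pi_1, ..., pi_g
mu_hat = model.means_                       # component mean vectors
Sigma_hat = model.covariances_              # component covariance matrices
logL = model.score(y) * y.shape[0]          # log-likelihood log L(Psi_hat)
print(pi_hat.round(3), round(logL, 2))
```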
Determination of the number of clusters is one of the most important problems in the cluster analysis. The information criteria for the number of clusters have often been used in model-based clustering. The criteria to be used in this study are given in Table 1.
A model that gives the minimum value of the criteria AIC, AWE, BIC, CLC, and KIC is selected as the best model. In Table 1, the log-likelihood function for the completed data is written as $\log L_c(\Psi) = \log L(\Psi) - EN(\tau)$, where $EN(\tau) = -\sum_{i=1}^{g}\sum_{j=1}^{n} \tau_{ij}\log\tau_{ij}$ is the entropy of the related classification matrix [46].
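Under these definitions, the criteria of Table 1 can be evaluated from a fitted mixture as in the following minimal sketch; it assumes the fitted model, data y, and g of the previous snippet, counts the parameters d for unrestricted (full) covariance matrices, and uses the completed-data log-likelihood log L_c = log L − EN(τ̂).

```python
# Minimal sketch of the Table 1 criteria for a fitted GaussianMixture.
# Assumes `model`, `y` and `g` from the previous snippet; d counts the free
# parameters of a g-component mixture with full covariance matrices.
import numpy as np

n, p = y.shape
d = (g - 1) + g * p + g * p * (p + 1) // 2       # weights + means + covariances

logL = model.score(y) * n                         # log L(Psi_hat)
tau = model.predict_proba(y)                      # posterior probabilities tau_ij
EN = -np.sum(tau * np.log(np.clip(tau, 1e-300, None)))   # entropy EN(tau_hat)
logLc = logL - EN                                 # completed-data log-likelihood

AIC = -2 * logL + 2 * d
AWE = -2 * logLc + 2 * d * (3 / 2 + np.log(n))
BIC = -2 * logL + d * np.log(n)
CLC = -2 * logL + 2 * EN
KIC = -2 * logL + 3 * (d + 1)
print(AIC, AWE, BIC, CLC, KIC)
```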

2.2. The Analytic Hierarchy Process (AHP)

The AHP was developed by Saaty [2]. It is one of the most widely used multiple-criteria decision-making tools and is a method for structuring, measurement, and synthesis [47]. The AHP is built on a hierarchical structure consisting of a goal, criteria, and alternatives [48], and it chooses the best one among the alternatives by taking the goal and the criteria into account [49]. In addition, the AHP is a mathematical approach that evaluates qualitative and quantitative variables together. The literature [50] shows that the AHP has been applied in various fields of science to problems such as selecting the best alternative, planning, and optimization. To solve decision-making problems using the AHP, the following steps are applied [48,49,50,51,52]:
Structuring: Initially, the goal, criteria, and alternatives are determined. Then, the hierarchy model is constructed at different levels according to the structure of the problem. A three-level hierarchy with k criteria and m alternatives is given in Figure 1.
Measurement: Firstly, a decision matrix is formed. The decision matrix contains the assessments of each alternative with respect to the decision criteria and is given in Table 2. Here, element $d_{ij}$ indicates the importance level of the i-th alternative with respect to the j-th criterion ($i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, k$).
Secondly, the pairwise comparison matrices of the criteria and of the alternatives for each criterion are produced. In general, a pairwise comparison matrix is constructed as in Table 3. Here, $a_{ij}$ stands for the degree of preference of the i-th criterion/alternative over the j-th criterion/alternative ($a_{ii} = 1$; $a_{ij} = 1/a_{ji}$), and $Sum_t$ is the sum of the t-th column of the pairwise comparison matrix.
Synthesis: To find the maximum eigenvalue ($\lambda_{max}$), consistency index (CI), consistency ratio (CR), and normalized eigenvector of each pairwise comparison matrix, the necessary calculations are performed. Note that CR = CI/RI is calculated for all pairwise comparison matrices. Here, $CI = (\lambda_{max} - r)/(r - 1)$ is the consistency index and RI is the random consistency index. As is well known from the literature [2,49,51,52], the average RI is tabulated in terms of the dimension of the matrix, r. If the CR value is less than 0.10, the matrices are considered consistent. As given in reference [49], if $\lambda_{max} = r$, then the pairwise comparison matrix is considered to be consistent.
After the consistency test, the following calculations are made. Firstly, the relative importance vector (RIV) of the criteria is determined using the pairwise comparison matrix. The RIV is the vector of row averages of the normalized matrix, $RIV = [Avg_1, Avg_2, \ldots]^T$. To obtain the normalized matrix, each element of each column in the pairwise comparison matrix is divided by the corresponding column sum; the normalized matrix is given in Table 4. The RIV of the alternatives for each criterion and the RIV of the criteria are calculated separately using the corresponding normalized matrices.
Finally, to calculate the composite relative importance vector (C-RIV), the matrix formed by the RIV of the alternatives for each criterion is multiplied by the RIV of the criteria. Thus, the C-RIV determines the overall ranking of the alternatives.
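A minimal sketch of the measurement and synthesis steps is given below; the 3 × 3 pairwise comparison matrix and the random index value RI = 0.58 (Saaty's tabulated value for r = 3) are illustrative assumptions.

```python
# Minimal sketch of the AHP steps: column-normalize a pairwise comparison
# matrix, take row averages as the RIV, and check consistency via CI and CR.
import numpy as np

A = np.array([[1.0, 3.0, 5.0],
              [1/3, 1.0, 2.0],
              [1/5, 1/2, 1.0]])           # pairwise comparisons, a_ij = 1/a_ji

riv = (A / A.sum(axis=0)).mean(axis=1)    # row averages of the normalized matrix

lam_max = np.mean((A @ riv) / riv)        # estimate of the principal eigenvalue
r = A.shape[0]
CI = (lam_max - r) / (r - 1)              # consistency index
RI = 0.58                                 # random consistency index for r = 3
CR = CI / RI                              # consistency ratio; CR < 0.10 is acceptable
print(riv.round(4), round(CR, 4))
```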

2.3. The Proposed Approach for the AHP Model and the Pairwise Comparison Matrix

Model-based clustering is currently a very popular statistical approach. Information criteria have commonly been used for determining the number of clusters in model-based clustering [42], and a number of criteria have been proposed for determining the number of clusters in a dataset. The current study proposes an approach for determining the number of clusters by combining model-based clustering with the AHP. The AHP model has been created by using the information criteria AIC, AWE, BIC, CLC, and KIC. Figure 2 describes the proposed approach for determining the number of clusters of a dataset in model-based clustering.
The proposed approach is summarized in the following steps:
  • Step 1. The hierarchical structure of the AHP has been created in Figure 3. In the figure, determination of the number of clusters is the goal, the AIC, AWE, BIC, CLC, KIC are the criteria, and 2, 3, 4, 5 are the alternatives.
  • Step 2. The dataset has been modeled as the mixture of a multivariate normal distribution for the different number of clusters in the model-based clustering. The mean vectors, the covariance matrices, the mixture proportions, and the likelihood function have been estimated by the EM algorithm.
  • Step 3. For each number of clusters, the values of the information criteria have been calculated. The decision matrix has been constructed using those values. Although a model that gives the lowest value of the information criteria in the model-based clustering is selected as the best model, in the AHP, the preferred case is the one with the highest value of the C-RIV. Therefore, the value of the information criteria has been reversed in the decision matrix; for example, the AIC is taken to be 1/AIC.
  • Step 4. The pairwise comparison matrices have been obtained by using the decision matrix.
  • Step 5. For each alternative, the C-RIV has been calculated.
  • Step 6. The alternative having the highest C-RIV value is the optimal number of clusters for the dataset.
To form the pairwise comparison matrix of the criteria, the study of Akogul and Erisoglu [53] has been used. In their study [53], the efficiency of the information criteria was examined. They also analyzed real datasets that are commonly used in clustering analysis. Those datasets have different characteristics such as sample size (i.e., 75, 150, 178, 345, 846, 2310 and 6435), number of clusters (i.e., 2, 3, 4, 6 and 7), and number of variables (i.e., 2, 4, 6, 13, 18, 19 and 36). The synthetic datasets were generated from the multivariate normal distribution by using the mean and covariance vectors of each dataset. Then, the number of clusters of the synthetic datasets were estimated by using the information criteria. This process was repeated 1000 times. Thus, the success of finding the right number of clusters in the dataset was computed for each information criterion. For all the synthetic datasets, in the corresponding study, the average of successes of the information criteria was given [53] as 43.6, 21.2, 47.4, 17.3, and 58.2 for AIC, AWE, BIC, CLC, and KIC, respectively.
In the work of Akogul and Erisoglu [53], the effectiveness of the information criteria was determined according to the success of finding the right number of clusters. In the current study, those successes have been used to determine the importance level of the criteria in the AHP model. To produce the pairwise comparison matrix of the criteria, the average success of the information criteria is considered, and the ratio of average successes is taken to be the degree of preference of one criterion over another. The proposed pairwise comparison matrix of the criteria and the RIV of the criteria are given in Table 5. For example, in Table 5, the value 2.0566 can be interpreted as the degree of preference of the AIC over the AWE: the average success of the AIC is 43.6, while that of the AWE is 21.2, so the AIC is about two times more successful than the AWE.
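The following minimal sketch (not the authors' code) puts Steps 3–6 together for the Iris example, using the criterion values of Table 7 and the criteria weights of Table 5; building each alternative pairwise comparison matrix from ratios of the reversed criterion values, a_ij = d_i/d_j, is an assumption that is consistent with the entries reported in Table 9.

```python
# Minimal sketch of Steps 3-6 of the proposed approach for the Iris dataset.
import numpy as np

alternatives = [2, 3, 4, 5]
# Information criterion values from Table 7; columns: AIC, AWE, BIC, CLC, KIC.
criteria_values = np.array([[487.11,  806.74, 574.42, 429.12, 519.11],
                            [449.15,  944.15, 581.61, 371.21, 496.15],
                            [448.86, 1126.55, 626.49, 358.29, 510.86],
                            [474.12, 1378.81, 696.90, 415.24, 551.12]])
D = 1.0 / criteria_values                        # Step 3: reversed values (1/AIC, ...)
w_criteria = np.array([0.2323, 0.1129, 0.2525, 0.0922, 0.3101])   # RIV of Table 5

def riv(pairwise):
    """Row averages of the column-normalized pairwise comparison matrix."""
    return (pairwise / pairwise.sum(axis=0)).mean(axis=1)

# Step 4: one pairwise comparison matrix per criterion with entries d_i / d_j.
riv_alt = np.column_stack([riv(np.outer(D[:, c], 1.0 / D[:, c]))
                           for c in range(D.shape[1])])

# Steps 5-6: composite RIV and the selected number of clusters.
c_riv = riv_alt @ w_criteria
print(c_riv.round(4))                            # approximately Table 10, max at g = 3
print("estimated number of clusters:", alternatives[int(np.argmax(c_riv))])
```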

3. Application and Results

3.1. Testing of the Proposed Approach for the Real Datasets

The performance of the proposed approach has been tested on common real datasets, namely, Chemical Diabetes [54], Crab [55], Liver Disorders [56], Ionosphere [57], Iris [43], Wine [58], Ruspini [59], E.coli [60], and Vehicle Silhouettes [61]. They have been provided by the UCI machine learning repository [62] and GitHub [63]. Their characteristics are exhibited in Table 6.
In this section, to determine the number of clusters in the Iris dataset, all calculations are presented step by step. For the other datasets, only the final results are presented, and the decision matrices are given in the Appendix. Their pairwise comparison matrices can be obtained by using the decision matrices.
The results of the information criteria in determining the number of clusters for the Iris dataset have been presented in Table 7. According to AIC, AWE, BIC, CLC, and KIC, the number of clusters in the Iris dataset has been estimated to be 4, 2, 2, 4, and 3, respectively.
To form the decision matrix, the values of the information criteria have been reversed (for example, AIC = 1/AIC). The decision matrix of the Iris dataset can be given in Table 8.
The pairwise comparison matrix and the RIV of each criterion, which are obtained by using the decision matrix, are given in Table 9.
The C-RIV has been presented in Table 10. In the table, the alternative value three is the best alternative because it has the maximum value of 0.2628 for the C-RIV. Thus, the number of clusters for the Iris dataset has been seen to be determined correctly.
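As a check on how the entries of Table 10 combine, the C-RIV of alternative three is the weighted sum of its RIVs with the criteria weights of Table 5: 0.2584 × 0.2323 + 0.2708 × 0.1129 + 0.2649 × 0.2525 + 0.2635 × 0.0922 + 0.2613 × 0.3101 ≈ 0.2628.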
The C-RIV for the real datasets has also been presented in Table 11. In the table, the number of clusters for the real datasets has been estimated correctly by using the proposed approach.

3.2. Testing of the Proposed Approach for the Synthetic Datasets

For the synthetic-1 dataset, we generate 1000 samples from a two-component bivariate normal mixture with the mixing proportions $\pi_1 = \pi_2 = 1/2$, the mean vectors $\mu_1 = [2, 4]^T$, $\mu_2 = [5, 6]^T$, and the covariance matrices $\Sigma_1 = [1, 0; 0, 1]$, $\Sigma_2 = [2, 0; 0, 0.5]$. Figure 4 shows the scatter plot and the PDF of the mixture model of the synthetic-1 dataset. The decision matrix of the synthetic-1 dataset has been presented in Table 12.
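A minimal sketch of this sampling scheme (with an arbitrary random seed; not the authors' code) is given below.

```python
# Minimal sketch: generate the synthetic-1 dataset, i.e., 1000 draws from the
# two-component bivariate normal mixture with the parameters stated above.
import numpy as np

rng = np.random.default_rng(0)               # seed chosen only for reproducibility
n = 1000
pi = np.array([0.5, 0.5])                    # mixing proportions
mu = [np.array([2.0, 4.0]), np.array([5.0, 6.0])]
Sigma = [np.array([[1.0, 0.0], [0.0, 1.0]]),
         np.array([[2.0, 0.0], [0.0, 0.5]])]

labels = rng.choice(2, size=n, p=pi)         # component membership of each draw
y = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in labels])
print(y.shape)                               # (1000, 2)
```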
The RIV of the alternatives for each criterion and the RIV of the criteria have been given in Table 13. To identify the best alternative, the C-RIV has been calculated using the corresponding values. In the table, the alternative value two is the best alternative because it has the maximum value, 0.2561, for the C-RIV. That is, the number of clusters has been seen to be determined correctly for the synthetic-1 dataset.
Moreover, this operation has been repeated 1000 times. The success rate of finding the right number of clusters in the synthetic-1 dataset has been computed for each information criterion. The proposed approach (100%) has been seen to be more accurate than the AIC (92%), the AWE (27%), the BIC (99%), the CLC (93%), and the KIC (98%).
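A minimal sketch of this repeated experiment is shown below; generate(), information_criteria(), and ahp_select() are hypothetical stand-ins for the data generation, criterion computation, and C-RIV selection illustrated in the earlier snippets.

```python
# Minimal sketch of the Monte Carlo evaluation: repeat the experiment, select
# the number of clusters by each criterion (minimum value) and by the proposed
# approach (maximum C-RIV), and record the proportion of correct selections.
# generate(), information_criteria() and ahp_select() are hypothetical helpers.
import numpy as np

def success_rates(n_repeats=1000, true_g=2, candidates=(2, 3, 4, 5)):
    hits_ic = np.zeros(5)                          # AIC, AWE, BIC, CLC, KIC
    hits_ahp = 0
    true_idx = candidates.index(true_g)
    for _ in range(n_repeats):
        y = generate()                             # one synthetic-1 sample
        values = np.array([information_criteria(y, g)   # rows: candidate g
                           for g in candidates])        # cols: the 5 criteria
        hits_ic += (np.argmin(values, axis=0) == true_idx)
        hits_ahp += (ahp_select(values, candidates) == true_g)
    return hits_ic / n_repeats, hits_ahp / n_repeats
```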
For the synthetic-2 dataset, we generate 1000 samples from a three-component bivariate normal mixture with the mixing proportions $\pi_1 = \pi_2 = \pi_3 = 1/3$, the mean vectors $\mu_1 = [1, 2]^T$, $\mu_2 = [1, 1]^T$, $\mu_3 = [0, 4]^T$, and the covariance matrices $\Sigma_1 = [1, 0; 0, 1]$, $\Sigma_2 = [0.5, 0.7; 0.7, 1.5]$, and $\Sigma_3 = [2, 0; 0, 2]$. Figure 5 shows the scatter plot and the PDF of the mixture model of the synthetic-2 dataset.
The decision matrix and the C-RIV of the synthetic-2 dataset have been given in Table 14 and Table 15. In Table 15, the alternative value three is the best alternative; namely, the number of clusters for the synthetic-2 dataset has been determined correctly. Similar to the previous calculations, this operation has been repeated 1000 times, and the success rate of finding the right number of clusters in the synthetic-2 dataset has been computed for each information criterion. The proposed approach (93%) has been seen to be better than the AIC (74%), the AWE (10%), the BIC (92%), the CLC (31%), and the KIC (86%).
For the synthetic-3 dataset, we again generate 1000 samples, this time from a four-component bivariate normal mixture with the mixing proportions $\pi_1 = \pi_2 = \pi_3 = \pi_4 = 0.25$, the mean vectors $\mu_1 = \mu_2 = [2, 2]^T$, $\mu_3 = [3, 1]^T$, $\mu_4 = [1, 3]^T$, and the covariance matrices $\Sigma_1 = [0.2, 0; 0, 0.2]$, $\Sigma_2 = [3, 2; 2, 7]$, $\Sigma_3 = [1, 0; 0, 4]$, and $\Sigma_4 = [1, 0; 0, 1]$. Figure 6 shows the scatter plot and the PDF of the mixture model of the synthetic-3 dataset.
The decision matrix and the C-RIV of the synthetic-3 dataset are given in Table 16 and Table 17. In Table 17, the alternative value four is the best alternative because it has the maximum value, 0.2526, of the C-RIV; the number of clusters for the synthetic-3 dataset has been determined correctly. Similarly, this operation has been repeated 1000 times, and the success rate of finding the right number of clusters in the synthetic-3 dataset has been computed for each information criterion. The proposed approach (92%) has been seen to be better than the AIC (80%), the AWE (4%), the BIC (89%), the CLC (65%), and the KIC (89%).
Table 18 summarizes the estimates of the number of clusters for all datasets produced by the information criteria and the proposed approach. The bottom row gives, for each information criterion and for the proposed approach, the number of datasets for which the number of clusters was determined correctly.

4. Conclusions and Recommendation

This paper has proposed combining the AHP and the information criteria AIC, AWE, BIC, CLC, and KIC to determine the number of clusters of a dataset in model-based clustering. The proposed approach has been seen to be more accurate than the corresponding individual information criteria, and it can be applied to a wide range of clustering algorithms. To carry out this study, the decision matrix has been created by using the information criterion values for each candidate number of clusters, and a pairwise comparison matrix of the criteria, based on their success rates, has been suggested to exploit the strengths of the information criteria. The proposed method is expected to be effective in analyzing data arising in various fields of science such as economics, biology, and engineering. For further studies, researchers may consider producing different decision and pairwise comparison matrices suited to their problems.

Acknowledgments

This research has been supported by the TUBITAK-BIDEB (2211) Ph.D. scholarship program. The authors are grateful to the anonymous referees for their constructive comments and valuable suggestions to improve this paper. The authors also wish to thank Murat SARI (Yildiz Technical University, Istanbul) for reading the manuscript and providing many useful suggestions.

Author Contributions

Serkan Akogul and Murat Erisoglu conceived of the research and wrote the paper. Both authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

The decision matrices of the real datasets are given in Table A1, Table A2, Table A3, Table A4, Table A5, Table A6, Table A7 and Table A8. Their pairwise comparison matrices can easily be obtained by using the decision matrices.
Table A1. The decision matrix for the Crab dataset (×10³).

Alternatives   AIC      AWE      BIC      CLC      KIC
2              0.3414   0.2878   0.3263   0.3428   0.3363
3              0.3424   0.2710   0.3200   0.3512   0.3349
4              0.3463   0.2554   0.3163   0.3588   0.3362
5              0.3550   0.2459   0.3165   0.3771   0.3420

Table A2. The decision matrix for the Liver dataset (×10³).

Alternatives   AIC      AWE      BIC      CLC      KIC
2              0.0678   0.0645   0.0668   0.0680   0.0675
3              0.0681   0.0630   0.0666   0.0683   0.0677
4              0.0685   0.0618   0.0665   0.0687   0.0679
5              0.0683   0.0603   0.0659   0.0688   0.0676

Table A3. The decision matrix for the Ionosphere dataset (×10³).

Alternatives   AIC      AWE      BIC      CLC      KIC
2              0.0816   0.0354   0.0584   0.1026   0.0739
3              0.0741   0.0267   0.0481   0.1030   0.0650
4              0.0829   0.0227   0.0459   0.1423   0.0686
5              0.0702   0.0184   0.0379   0.1260   0.0575

Table A4. The decision matrix for the Diabetes dataset (×10³).

Alternatives   AIC      AWE      BIC      CLC      KIC
2              0.1579   0.1501   0.1558   0.1591   0.1571
3              0.1602   0.1483   0.1569   0.1620   0.1590
4              0.1604   0.1444   0.1560   0.1623   0.1588
5              0.1607   0.1415   0.1552   0.1638   0.1587

Table A5. The decision matrix for the Wine dataset (×10³).

Alternatives   AIC      AWE      BIC      CLC      KIC
2              0.1915   0.1316   0.1699   0.2081   0.1840
3              0.2082   0.1193   0.1724   0.2389   0.1953
4              0.2104   0.1051   0.1643   0.2552   0.1932
5              0.2066   0.0926   0.1536   0.2635   0.1863

Table A6. The decision matrix for the Ruspini dataset (×10³).

Alternatives   AIC      AWE      BIC      CLC      KIC
2              0.7096   0.6600   0.6970   0.7209   0.7026
3              0.7300   0.6519   0.7096   0.7484   0.7195
4              0.7519   0.6442   0.7230   0.7784   0.7375
5              0.7562   0.6242   0.7196   0.7907   0.7383

Table A7. The decision matrix for the E.coli dataset (×10³).

Alternatives   AIC      AWE      BIC      CLC      KIC
2              0.2961   0.2335   0.2741   0.3082   0.2897
3              1.7566   0.5182   1.0228   2.7483   1.4721
4              2.1498   0.4388   0.9891   5.3666   1.6362
5              1.9438   0.3589   0.8349   6.0067   1.4359

Table A8. The decision matrix for the Vehicle dataset (×10³).

Alternatives   AIC      AWE      BIC      CLC      KIC
2              0.0124   0.0116   0.0121   0.0125   0.0123
3              0.0128   0.0116   0.0124   0.0130   0.0127
4              0.0130   0.0114   0.0124   0.0132   0.0129
5              0.0129   0.0110   0.0122   0.0132   0.0127

References

  1. Yu, H.; Liu, Z.; Wang, G. An automatic method to determine the number of clusters using decision-theoretic rough set. Int. J. Approx. Reason. 2014, 55, 101–115. [Google Scholar] [CrossRef]
  2. Saaty, T.L. The Analytic Hierarchy Process; McGraw-Hill: New York, NY, USA, 1980. [Google Scholar]
  3. Peng, Y.; Kou, G.; Wang, G.; Wu, W.; Shi, Y. Ensemble of software defect predictors: An AHP-based evaluation method. Int. J. Inf. Technol. Decis. Mak. 2011, 10, 187–206. [Google Scholar] [CrossRef]
  4. Peng, Y.; Zhang, Y.; Kou, G.; Shi, Y. A multicriteria decision making approach for estimating the number of clusters in a data set. PLoS ONE 2012, 7, e41713. [Google Scholar] [CrossRef] [PubMed]
  5. Pearson, K. Contributions to the mathematical theory of evolution. Philos. Trans. R. Soc. Lond. A 1894, 185, 71–110. [Google Scholar] [CrossRef]
  6. Rao, C.R. The utilization of multiple measurements in problems of biological classification. J. R. Stat. Soc. Ser. B 1948, 10, 159–203. [Google Scholar]
  7. Hasselblad, V. Estimation of parameters for a mixture of normal distributions. Technometrics 1966, 8, 431–444. [Google Scholar] [CrossRef]
  8. Hasselblad, V. Estimation of finite mixtures of distributions from the exponential family. J. Am. Stat. Assoc. 1969, 64, 1459–1471. [Google Scholar] [CrossRef]
  9. Wolfe, J.H. A Computer Program for the Maximum Likelihood Analysis of Types; Technical Report; Naval Personnel Research Activity: San Diego, CA, USA, 1965. [Google Scholar]
  10. Wolfe, J.H. NORMIX: Computational Methods for Estimating the Parameters of Multivariate Normal Mixtures of Distributions; Technical Report; Naval Personnel Research Activity: San Diego, CA, USA, 1967. [Google Scholar]
  11. Day, N.E. Estimating the components of a mixture of normal distributions. Biometrika 1969, 56, 463–474. [Google Scholar] [CrossRef]
  12. Binder, D.A. Bayesian cluster analysis. Biometrika 1978, 65, 31–38. [Google Scholar] [CrossRef]
  13. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 1977, 39, 1–38. [Google Scholar]
  14. Everitt, B. Maximum likelihood estimation of the parameters in a mixture of two univariate normal distributions; a comparison of different algorithms. Statistician 1984, 33, 205–215. [Google Scholar] [CrossRef]
  15. McLachlan, G.J.; Peel, D.; Basford, K.E.; Adams, P. The EMMIX software for the fitting of mixtures of normal and t-components. J. Stat. Softw. 1999, 4, 1–14. [Google Scholar] [CrossRef]
  16. Celeux, G.; Govaert, G. Gaussian parsimonious clustering models. Pat. Recognit. 1995, 28, 781–793. [Google Scholar] [CrossRef]
  17. McLachlan, G.; Peel, D. Finite Mixture Models; John Wiley & Sons: Hoboken, NJ, USA, 2004. [Google Scholar]
  18. Yeung, K.Y.; Fraley, C.; Murua, A.; Raftery, A.E.; Ruzzo, W.L. Model-based clustering and data transformations for gene expression data. Bioinformatics 2001, 17, 977–987. [Google Scholar] [CrossRef] [PubMed]
  19. Meilă, M.; Heckerman, D. An experimental comparison of model-based clustering methods. Mach. Learn. 2001, 42, 9–29. [Google Scholar] [CrossRef]
  20. Fraley, C.; Raftery, A.E. Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. 2002, 97, 611–631. [Google Scholar] [CrossRef]
  21. Biernacki, C.; Celeux, G.; Govaert, G. Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Comput. Stat. Data Anal. 2003, 41, 561–575. [Google Scholar] [CrossRef]
  22. Pernkopf, F.; Bouchaffra, D. Genetic-based EM algorithm for learning Gaussian mixture models. IEEE Trans. Pat. Anal. Mach. Intell. 2005, 27, 1344–1348. [Google Scholar] [CrossRef] [PubMed]
  23. Raftery, A.E.; Dean, N. Variable selection for model-based clustering. J. Am. Stat. Assoc. 2006, 101, 168–178. [Google Scholar] [CrossRef]
  24. Jain, A.K. Data clustering: 50 years beyond K-means. Pat. Recognit. Lett. 2010, 31, 651–666. [Google Scholar] [CrossRef]
  25. Browne, R.P.; McNicholas, P.D.; Sparling, M.D. Model-based learning using a mixture of mixtures of Gaussian and uniform distributions. IEEE Trans. Pat. Anal. Mach. Intell. 2012, 34, 814–817. [Google Scholar] [CrossRef] [PubMed]
  26. Yang, M.S.; Lai, C.Y.; Lin, C.Y. A robust EM clustering algorithm for Gaussian mixture models. Pat. Recognit. 2012, 45, 3950–3961. [Google Scholar] [CrossRef]
  27. Lee, S.X.; McLachlan, G.J. Model-based clustering and classification with non-normal mixture distributions. Stat. Methods Appl. 2013, 22, 427–454. [Google Scholar] [CrossRef]
  28. Bouveyron, C.; Brunet-Saumard, C. Model-based clustering of high-dimensional data: A review. Comput. Stat. Data Anal. 2014, 71, 52–78. [Google Scholar] [CrossRef] [Green Version]
  29. Kwedlo, W. A new random approach for initialization of the multiple restart EM algorithm for Gaussian model-based clustering. Pat. Anal. Appl. 2015, 18, 757–770. [Google Scholar] [CrossRef]
  30. Malsiner-Walli, G.; Frühwirth-Schnatter, S.; Grün, B. Model-based clustering based on sparse finite Gaussian mixtures. Stat. Comput. 2016, 26, 303–324. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  31. Marbac, M.; Biernacki, C.; Vandewalle, V. Model-based clustering of Gaussian copulas for mixed data. Commun. Stat. Theory Methods 2017. [Google Scholar] [CrossRef]
  32. Akaike, H. Information theory and an extension of the Maximum likelihood Principal. In Proceedings of the 2nd International Symposium on Information Theory, Tsahkadsor, Armenia, 2–8 September 1971. [Google Scholar]
  33. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464. [Google Scholar] [CrossRef]
  34. Bozdogan, H. On the information-based measure of covariance complexity and its application to the evaluation of multivariate linear models. Commun. Stat. Theory Methods 1990, 19, 221–278. [Google Scholar] [CrossRef]
  35. Bozdogan, H. Choosing the number of component clusters in the mixture-model using a new informational complexity criterion of the inverse-Fisher information matrix. In Information and Classification; Springer: Berlin, Germany, 1993; pp. 40–54. [Google Scholar]
  36. Bozdogan, H. Mixture-model cluster analysis using model selection criteria and a new informational measure of complexity. In Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach; Springer: Berlin, Germany, 1994; pp. 69–113. [Google Scholar]
  37. Banfield, J.D.; Raftery, A.E. Model-based Gaussian and non-Gaussian clustering. Biometrics 1993, 49, 803–821. [Google Scholar] [CrossRef]
  38. Celeux, G.; Soromenho, G. An entropy criterion for assessing the number of clusters in a mixture model. J. Classif. 1996, 13, 195–212. [Google Scholar] [CrossRef]
  39. Biernacki, C.; Govaert, G. Using the classification likelihood to choose the number of clusters. Comput. Sci. Stat. 1997, 29, 451–457. [Google Scholar]
  40. Cavanaugh, J.E. A large-sample model selection criterion based on Kullback’s symmetric divergence. Stat. Probab. Lett. 1999, 42, 333–343. [Google Scholar] [CrossRef]
  41. Smyth, P. Model selection for probabilistic clustering using cross-validated likelihood. Stat. Comput. 2000, 10, 63–72. [Google Scholar] [CrossRef]
  42. Oliveira-Brochado, A.; Martins, F.V. Assessing the Number of Components in Mixture Models: A Review; Technical Report; Universidade do Porto, Faculdade de Economia do Porto: Porto, Portugal, 2005. [Google Scholar]
  43. Fisher, R.A. The use of multiple measurements in taxonomic problems. Ann. Hum. Genet. 1936, 7, 179–188. [Google Scholar] [CrossRef]
  44. Erol, H. A model selection algorithm for mixture model clustering of heterogeneous multivariate data. In Proceedings of the 2013 IEEE International Symposium on Innovations in Intelligent Systems and Applications (INISTA), Albena, Bulgaria, 19–21 June 2013; pp. 1–7. [Google Scholar]
  45. Fraley, C. Algorithms for model-based Gaussian hierarchical clustering. SIAM J. Sci. Comput. 1998, 20, 270–281. [Google Scholar] [CrossRef]
  46. Hathaway, R.J. Another interpretation of the EM algorithm for mixture distributions. Stat. Probab. Lett. 1986, 4, 53–56. [Google Scholar] [CrossRef]
  47. Forman, E.H.; Gass, S.I. The analytic hierarchy process—An exposition. Oper. Res. 2001, 49, 469–486. [Google Scholar] [CrossRef]
  48. Saaty, R.W. The analytic hierarchy process—What it is and how it is used. Math. Model. 1987, 9, 161–176. [Google Scholar] [CrossRef]
  49. Saaty, T.L.; Vargas, L.G. Models, Methods, Concepts & Applications of the Analytic Hierarchy Process; Springer Science & Business Media: Berlin, Germany, 2012; Volume 175. [Google Scholar]
  50. Vaidya, O.S.; Kumar, S. Analytic hierarchy process: An overview of applications. Eur. J. Oper. Res. 2006, 169, 1–29. [Google Scholar] [CrossRef]
  51. Saaty, T.L. How to make a decision: The analytic hierarchy process. Eur. J. Oper. Res. 1990, 48, 9–26. [Google Scholar] [CrossRef]
  52. Muralidhar, K.; Santhanam, R.; Wilson, R.L. Using the analytic hierarchy process for information system project selection. Inf. Manag. 1990, 18, 87–95. [Google Scholar] [CrossRef]
  53. Akogul, S.; Erisoglu, M. A Comparison of Information Criteria in Clustering Based on Mixture of Multivariate Normal Distributions. Math. Comput. Appl. 2016, 21, 34. [Google Scholar] [CrossRef]
  54. Reaven, G.; Miller, R. An attempt to define the nature of chemical diabetes using a multidimensional analysis. Diabetologia 1979, 16, 17–24. [Google Scholar] [CrossRef] [PubMed]
  55. Campbell, N.; Mahon, R. A multivariate study of variation in two species of rock crab of the genus Leptograpsus. Aust. J. Zool. 1974, 22, 417–425. [Google Scholar] [CrossRef]
  56. Forsyth, R. PC/Beagle User’s Guide; BUPA Medical Research Ltd.: Hong Kong, China, 1990. [Google Scholar]
  57. Sigillito, V.G.; Wing, S.P.; Hutton, L.V.; Baker, K.B. Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Tech. Dig. 1989, 10, 262–266. [Google Scholar]
  58. Aeberhard, S.; Coomans, D.; De Vel, O. Comparison of Classifiers in High Dimensional Settings; Technical Report 92-02; Department of Computer Science and Department of Mathematics and Statistics, James Cook University of North Queensland: Townsville City, Australia, 1992. [Google Scholar]
  59. Ruspini, E.H. Numerical methods for fuzzy clustering. Inf. Sci. 1970, 2, 319–350. [Google Scholar] [CrossRef]
  60. Horton, P.; Nakai, K. A probabilistic classification system for predicting the cellular localization sites of proteins. In Proceedings of the International Conference on Intelligent Systems for Molecular Biology, St. Louis, MO, USA, 12–15 June 1996; pp. 109–115. [Google Scholar]
  61. Siebert, J.P. Vehicle Recognition Using Rule Based Methods; Turing Institute: London, UK, 1987. [Google Scholar]
  62. Lichman, M. UCI Machine Learning Repository. Available online: http://archive.ics.uci.edu/ml (accessed on 17 June 2013).
  63. GitHub. Available online: http://vincentarelbundock.github.io/Rdatasets/datasets.html/ (accessed on 17 June 2017).
Figure 1. A three-level hierarchy.
Figure 2. The proposed approach for determining the number of clusters.
Figure 3. The hierarchical structure of the AHP for the proposed approach.
Figure 4. Synthetic-1 dataset: (a) the scatter plot; (b) the PDF of the mixture model.
Figure 5. Synthetic-2 dataset: (a) the scatter plot; (b) the PDF of the mixture model.
Figure 6. Synthetic-3 dataset: (a) the scatter plot; (b) the PDF of the mixture model.
Table 1. The information criteria for the model selection.

Criteria   Formula                                         Reference
AIC        $-2\log L(\hat{\Psi}) + 2d$                     Akaike [32]
AWE        $-2\log L_c + 2d(3/2 + \log n)$                 Banfield and Raftery [37]
BIC        $-2\log L(\hat{\Psi}) + d\log(n)$               Schwarz [33]
CLC        $-2\log L(\hat{\Psi}) + 2EN(\hat{\tau})$        Biernacki and Govaert [39]
KIC        $-2\log L(\hat{\Psi}) + 3(d + 1)$               Cavanaugh [40]
n: The number of observations; d: The number of parameters in the model.
Table 2. The decision matrix.

               Criterion 1   Criterion 2   ...   Criterion k
Alternative 1  d_11          d_12          ...   d_1k
Alternative 2  d_21          d_22          ...   d_2k
...            ...           ...           ...   ...
Alternative m  d_m1          d_m2          ...   d_mk

Table 3. The pairwise comparison matrix.

      X_1     X_2     ...
X_1   a_11    a_12    ...
X_2   a_21    a_22    ...
...   ...     ...     ...
Sum   Sum_1   Sum_2   ...
X_t: t-th criterion (t = 1, 2, ..., k) / t-th alternative (t = 1, 2, ..., m).

Table 4. The normalized matrix.

      X_1          X_2          ...   Average
X_1   a_11/Sum_1   a_12/Sum_2   ...   Avg_1
X_2   a_21/Sum_1   a_22/Sum_2   ...   Avg_2
...   ...          ...          ...   ...
Table 5. The proposed pairwise comparison matrix of the criteria and the RIV of the criteria.

Criteria   AIC      AWE      BIC      CLC       KIC      RIV criteria *
AIC        1        2.0566   0.9198   2.5202    0.7491   0.2323
AWE        0.4862   1        0.4473   1.2254    0.3643   0.1129
BIC        1.0872   2.2358   1        2.7399    0.8144   0.2525
CLC        0.3968   0.8160   0.3650   1         0.2973   0.0922
KIC        1.3349   2.7453   1.2278   3.3642    1        0.3101
Sum        4.3050   8.8538   3.9599   10.8497   3.2251
* The relative importance vector (RIV) of the criteria.

Table 6. Descriptions of the real datasets.

Datasets              Sample Size   Number of Variables   Number of Clusters
Crab                  200           5                     2
Liver Disorders       345           6                     2
Ionosphere            351           34                    2
Chemical Diabetes     145           4                     3
Iris                  150           4                     3
Wine                  178           13                    3
Ruspini               75            2                     4
E.coli                336           8                     4
Vehicle Silhouettes   846           18                    4
Table 7. The results in determining the number of clusters for the Iris dataset.

Alternatives   AIC        AWE         BIC        CLC        KIC
2              487.11     806.74 *    574.42 *   429.12     519.11
3              449.15     944.15      581.61     371.21     496.15 *
4              448.86 *   1126.55     626.49     358.29 *   510.86
5              474.12     1378.81     696.90     415.24     551.12
* The minimum value of the information criteria.

Table 8. The decision matrix for the Iris dataset (×10³).

Alternatives   AIC      AWE      BIC      CLC      KIC
2              0.2053   0.1240   0.1741   0.2330   0.1926
3              0.2226   0.1059   0.1719   0.2694   0.2016
4              0.2228   0.8877   0.1596   0.2791   0.1957
5              0.2109   0.7253   0.1435   0.2408   0.1814
Table 9. The pairwise comparison matrix and the RIV of each criterion for the Iris dataset.

AIC   2        3        4        5        RIV AIC
2     1        0.9221   0.9215   0.9733   0.2383
3     1.0845   1        0.9994   1.0556   0.2584
4     1.0852   1.0006   1        1.0563   0.2586
5     1.0274   0.9473   0.9467   1        0.2448
Sum   4.1972   3.8700   3.8676   4.0852

AWE   2        3        4        5        RIV AWE
2     1        1.1703   1.3964   1.7091   0.3169
3     0.8545   1        1.1932   1.4604   0.2708
4     0.7161   0.8381   1        1.2239   0.2269
5     0.5851   0.6848   0.8170   1        0.1854
Sum   3.1557   3.6932   4.4066   5.3934

BIC   2        3        4        5        RIV BIC
2     1        1.0125   1.0906   1.2132   0.2682
3     0.9876   1        1.0772   1.1982   0.2649
4     0.9169   0.9284   1        1.1124   0.2459
5     0.8242   0.8346   0.8990   1        0.2211
Sum   3.7288   3.7755   4.0667   4.5239

CLC   2        3        4        5        RIV CLC
2     1        0.8651   0.8349   0.9676   0.2279
3     1.1560   1        0.9652   1.1186   0.2635
4     1.1977   1.0361   1        1.1589   0.2730
5     1.0334   0.8940   0.8629   1        0.2356
Sum   4.3871   3.7951   3.6630   4.2452

KIC   2        3        4        5        RIV KIC
2     1        0.9558   0.9841   1.0617   0.2497
3     1.0463   1        1.0297   1.1108   0.2613
4     1.0162   0.9712   1        1.0788   0.2538
5     0.9419   0.9003   0.9270   1        0.2352
Sum   4.0044   3.8272   3.9407   4.2512
Table 10. The C-RIV for the Iris dataset.

Alternatives   RIV AIC   RIV AWE   RIV BIC   RIV CLC   RIV KIC   C-RIV
2              0.2383    0.3169    0.2682    0.2279    0.2497    0.2573
3              0.2584    0.2708    0.2649    0.2635    0.2613    0.2628 *
4              0.2586    0.2269    0.2459    0.2730    0.2538    0.2516
5              0.2448    0.1854    0.2211    0.2356    0.2352    0.2283
RIV criteria   0.2323    0.1129    0.2525    0.0922    0.3101
* The maximum value of the C-RIV.

Table 11. The C-RIV of the real datasets for the proposed approach.

Alternatives   Crab       Liver      Ionosphere   Diabetes   Iris       Wine       Ruspini    E.coli     Vehicle
2              0.2517 *   0.2507 *   0.2841 *     0.2490     0.2573     0.2476     0.2436     0.0709     0.2451
3              0.2491     0.2502     0.2449       0.2513 *   0.2628 *   0.2578 *   0.2486     0.2989     0.2513
4              0.2481     0.2504     0.2559       0.2502     0.2516     0.2524     0.2541 *   0.3325 *   0.2534 *
5              0.2511     0.2487     0.2151       0.2496     0.2283     0.2421     0.2537     0.2977     0.2502
* The maximum value of the C-RIV for each dataset.
Table 12. The decision matrix for the synthetic-1 dataset (×10³).

Alternatives   AIC      AWE      BIC      CLC      KIC
2              0.1460   0.1373   0.1449   0.1408   0.1457
3              0.1461   0.1254   0.1443   0.1301   0.1456
4              0.1457   0.1188   0.1434   0.1245   0.1452
5              0.1458   0.1118   0.1428   0.1182   0.1451

Table 13. The C-RIV for the synthetic-1 dataset.

Alternatives   RIV AIC   RIV AWE   RIV BIC   RIV CLC   RIV KIC   C-RIV
2              0.2502    0.2782    0.2518    0.2741    0.2505    0.2561 *
3              0.2503    0.2543    0.2508    0.2533    0.2504    0.2512
4              0.2497    0.2409    0.2492    0.2424    0.2496    0.2479
5              0.2498    0.2266    0.2482    0.2302    0.2495    0.2449
RIV criteria   0.2323    0.1129    0.2525    0.0922    0.3101
* The maximum value of the C-RIV.
Table 14. The decision matrix for the synthetic-2 dataset (×10³).

Alternatives   AIC      AWE      BIC      CLC      KIC
2              0.1275   0.1234   0.1266   0.1263   0.1273
3              0.1316   0.1226   0.1302   0.1270   0.1313
4              0.1316   0.1154   0.1296   0.1208   0.1311
5              0.1319   0.1170   0.1295   0.1241   0.1313

Table 15. The C-RIV for the synthetic-2 dataset.

Alternatives   RIV AIC   RIV AWE   RIV BIC   RIV CLC   RIV KIC   C-RIV
2              0.2440    0.2579    0.2455    0.2535    0.2443    0.2469
3              0.2519    0.2562    0.2524    0.2550    0.2520    0.2528 *
4              0.2518    0.2412    0.2513    0.2424    0.2517    0.2496
5              0.2524    0.2446    0.2509    0.2491    0.2521    0.2507
RIV criteria   0.2323    0.1129    0.2525    0.0922    0.3101
* The maximum value of the C-RIV.
Table 16. The decision matrix for the synthetic-3 dataset (×10³).

Alternatives   AIC      AWE      BIC      CLC      KIC
2              0.1204   0.1146   0.1196   0.1171   0.1202
3              0.1223   0.1106   0.1211   0.1142   0.1220
4              0.1244   0.1114   0.1227   0.1164   0.1240
5              0.1245   0.1080   0.1223   0.1140   0.1240

Table 17. The C-RIV for the synthetic-3 dataset.

Alternatives   RIV AIC   RIV AWE   RIV BIC   RIV CLC   RIV KIC   C-RIV
2              0.2449    0.2578    0.2463    0.2536    0.2452    0.2476
3              0.2488    0.2487    0.2493    0.2473    0.2489    0.2488
4              0.2531    0.2506    0.2527    0.2522    0.2530    0.2526 *
5              0.2532    0.2429    0.2518    0.2469    0.2529    0.2510
RIV criteria   0.2323    0.1129    0.2525    0.0922    0.3101
* The maximum value of the C-RIV.
Table 18. Results summary.

Datasets         #Cluster   AIC   AWE   BIC   CLC   KIC   Proposed Approach
Crab             2          5     2     2     5     5     2
Liver            2          4     2     2     5     4     2
Ionosphere       2          4     2     2     4     2     2
Diabetes         3          5     2     3     5     3     3
Iris             3          4     2     2     4     3     3
Wine             3          4     2     3     5     3     3
Ruspini          4          5     2     4     5     5     4
E.coli           4          4     3     3     5     4     4
Vehicle          4          4     2     4     4     4     4
Synthetic-1      2          3     2     2     2     2     2
Synthetic-2      3          5     2     3     3     5     3
Synthetic-3      4          5     2     4     2     4     4
Correct number              2     4     10    3     8     12
