An Approach for Determining the Number of Clusters in a Model-Based Cluster Analysis

Abstract: Cluster analysis is used across a broad range of applied sciences, such as physics, chemistry, biology, engineering, and economics, and many methods have been proposed in the literature to determine the number of clusters. The aim of this paper is to determine the number of clusters of a dataset in model-based clustering by using the Analytic Hierarchy Process (AHP). In this study, the AHP model has been created by using the information criteria Akaike's Information Criterion (AIC), Approximate Weight of Evidence (AWE), Bayesian Information Criterion (BIC), Classification Likelihood Criterion (CLC), and Kullback Information Criterion (KIC). The performance of the proposed approach has been tested on common real and synthetic datasets. The proposed approach, which is based on these information criteria, has produced accurate results, and these results have been seen to be more accurate than those of the individual information criteria.


Introduction
Many clustering algorithms have been proposed in the literature. They can be categorized into centroid-based clustering, connectivity-based clustering, model-based clustering, and so on [1]. Each of these algorithms is of great importance in its own application area. Because model-based clustering has a very large application field, the present study focuses on model-based clustering in combination with the Analytic Hierarchy Process (AHP). Since the AHP is one of the most important multi-criteria decision making (MCDM) methods [2] and determining the number of clusters can be modeled as an MCDM problem [3,4], this work uses the AHP together with model-based clustering to decide the number of clusters of a dataset.
Pearson [5] first introduced the idea of the mixture distribution model by studying a mixture of two univariate normal distributions with different means and variances. Later on, many related works [6-8] were carried out. Model-based clustering based on a mixture of distributions has commonly been used in the clustering of datasets. Among those who used the mixture of multivariate normal distributions in cluster analysis are Wolfe [9,10], Day [11], and Binder [12]. To estimate the parameters in the mixture distribution model, the Expectation-Maximization (EM) algorithm suggested by Dempster et al. [13] has been widely used [14,15].
Model-based clustering based on finite normal mixture models is the most commonly used approach [16-31]. In estimating the number of clusters in model-based clustering, information criteria have widely been used [32-42]. Some of the common criteria in the literature are Akaike's Information Criterion (AIC) [32], Approximate Weight of Evidence (AWE) [37], Bayesian Information Criterion (BIC) [33], Classification Likelihood Criterion (CLC) [39], and Kullback Information Criterion (KIC) [40]. These information criteria may give different results in estimating the number of clusters of a dataset. For example, while the number of clusters in the Iris dataset [43] is 3, the numbers of clusters estimated by AIC, AWE, BIC, CLC, and KIC are 4, 2, 2, 4, and 3, respectively (Table 7). The Iris dataset is a multivariate dataset introduced by Ronald Fisher [43], and it consists of 50 samples from each of three species: setosa, virginica, and versicolor. Thus, the criteria may produce different numbers of clusters for the same dataset. To overcome this problem, model-based clustering and the AHP have been combined. The AHP model has been created by using the information criteria AIC, AWE, BIC, CLC, and KIC. To the best of the authors' knowledge, this is the first time that the number of clusters of a dataset in model-based clustering has been determined by using the AHP. In this way, the joint influence of those information criteria has been exploited. Satisfactory results have been obtained with the suggested model.
The rest of the paper is organized as follows. Section 2 describes model-based clustering, the AHP model, and the proposed approach. Section 3 presents the details of the experimental study and analyses the results. Finally, Section 4 presents our conclusions and recommendations.

The Model-Based Clustering
Model-based clustering assumes that a dataset to be clustered consists of various clusters with different distributions. The entire dataset is modeled by a mixture of these distributions. The clustering assumes a set of $n$ $p$-dimensional vectors $y_1, \ldots, y_n$ of observations from a multivariate mixture of a finite number of $g$ components or clusters, with unknown mixing proportions or weights $\pi_1, \ldots, \pi_g$ [44]. The probability density function (PDF) of finite mixture distribution models can be given by
$$f(y_j; \Psi) = \sum_{i=1}^{g} \pi_i f_i(y_j; \theta_i), \qquad (1)$$
where $f_i(y_j; \theta_i)$ is the PDF of the $i$-th component. Here $0 \le \pi_i \le 1$ and $\sum_{i=1}^{g} \pi_i = 1$ ($i = 1, \ldots, g$ and $j = 1, 2, \ldots, n$). The parameter vector $\Psi = (\pi, \theta)$ contains all of the parameters of the mixture model. Here $\theta = (\theta_1, \theta_2, \ldots, \theta_g)$, where $\theta_i$ denotes the unknown parameters of the PDF of the $i$-th component in the mixture model [17].
In model-based clustering, cluster analysis based on the mixture of multivariate normal distributions is the most commonly used. In this case, in Equation (1), the $f_i(y_j; \theta_i)$ are assumed to be multivariate normal density functions of the form
$$f_i(y_j; \mu_i, \Sigma_i) = (2\pi)^{-p/2} |\Sigma_i|^{-1/2} \exp\left\{ -\tfrac{1}{2} (y_j - \mu_i)^T \Sigma_i^{-1} (y_j - \mu_i) \right\}, \qquad (2)$$
where $\mu_i$ and $\Sigma_i$ stand for the mean vector and the covariance matrix, respectively ($i = 1, 2, \ldots, g$).
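As a minimal sketch of the normal mixture density described above, the following Python code evaluates a multivariate normal component density for the bivariate case ($p = 2$) and the corresponding weighted mixture density. The function names are hypothetical, and the identity covariance matrices in the usage lines are placeholder assumptions, not values taken from this study.

```python
import math

def bivariate_normal_pdf(y, mu, sigma):
    """Density of a 2-D normal N(mu, sigma) at point y (multivariate normal, p = 2)."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = y[0] - mu[0], y[1] - mu[1]
    # Quadratic form (y - mu)^T sigma^{-1} (y - mu), using the closed-form 2x2 inverse.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def mixture_pdf(y, weights, means, covs):
    """Finite mixture density: weighted sum of the component densities."""
    return sum(w * bivariate_normal_pdf(y, m, s)
               for w, m, s in zip(weights, means, covs))

# Illustrative two-component mixture; the covariances are assumed to be identity.
identity = ((1.0, 0.0), (0.0, 1.0))
density = mixture_pdf((2.0, 4.0), [0.5, 0.5],
                      [(2.0, 4.0), (5.0, 6.0)], [identity, identity])
print(density)
```

The closed-form 2x2 inverse keeps the sketch free of external dependencies; for general $p$ a linear algebra library would be the more natural choice.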
Here, $\theta_i$ consists of the elements of the component mean vectors $\mu = (\mu_1, \mu_2, \ldots, \mu_g)$ and the component covariance matrices $\Sigma = (\Sigma_1, \Sigma_2, \ldots, \Sigma_g)$ of the component PDFs in the mixture distribution model [17]. The mixture likelihood approach has been used for estimating the parameters in the mixture models. This approach assumes that the PDF is the sum of the weighted component densities. If the mixture likelihood approach is used for clustering, the clustering problem becomes a problem of estimating the parameters of the assumed mixture distribution model. The likelihood function to be maximized can then be given as follows [45]:
$$L(\Psi) = \prod_{j=1}^{n} \sum_{i=1}^{g} \pi_i f_i(y_j; \mu_i, \Sigma_i). \qquad (3)$$
The most widely used approach for parameter estimation is the EM algorithm [17].
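To illustrate how the EM algorithm maximizes the mixture likelihood, the sketch below fits a normal mixture in the simplified univariate case. It is an illustration under stated assumptions rather than the implementation used in this study: the function names, the initialization from evenly spaced order statistics, the fixed number of iterations, and the toy data are all hypothetical choices.

```python
import math
import random

def normal_pdf(y, mu, var):
    """Univariate normal density."""
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_normal_mixture(ys, g, n_iter=100):
    """EM for a g-component univariate normal mixture (illustrative sketch)."""
    n = len(ys)
    ordered = sorted(ys)
    # Assumed initialization: evenly spaced order statistics as means,
    # the pooled variance for every component, and equal weights.
    mus = [ordered[(2 * i + 1) * n // (2 * g)] for i in range(g)]
    mean = sum(ys) / n
    pooled = sum((y - mean) ** 2 for y in ys) / n
    variances = [pooled] * g
    pis = [1.0 / g] * g
    for _ in range(n_iter):
        # E-step: posterior probability tau_ij that y_j belongs to component i.
        taus = []
        for y in ys:
            w = [pis[i] * normal_pdf(y, mus[i], variances[i]) for i in range(g)]
            total = sum(w)
            taus.append([wi / total for wi in w])
        # M-step: update weights, means, and variances from the posteriors.
        for i in range(g):
            n_i = sum(t[i] for t in taus)
            pis[i] = n_i / n
            mus[i] = sum(t[i] * y for t, y in zip(taus, ys)) / n_i
            var_i = sum(t[i] * (y - mus[i]) ** 2 for t, y in zip(taus, ys)) / n_i
            variances[i] = max(var_i, 1e-6)  # guard against a collapsing component
    # Logarithm of the mixture likelihood at the final estimates.
    loglik = sum(math.log(sum(pis[i] * normal_pdf(y, mus[i], variances[i])
                              for i in range(g))) for y in ys)
    return pis, mus, variances, loglik

# Toy data: two well-separated groups (hypothetical values).
rng = random.Random(0)
data = ([rng.gauss(0.0, 1.0) for _ in range(200)]
        + [rng.gauss(10.0, 1.0) for _ in range(200)])
pis, mus, variances, loglik = em_normal_mixture(data, g=2)
print(pis, mus, variances, loglik)
```

Running the function for several values of `g` gives the maximized log-likelihoods that the information criteria are computed from.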
Determining the number of clusters is one of the most important problems in cluster analysis. Information criteria have often been used for this purpose in model-based clustering. The criteria used in this study are given in Table 1.
Notes to Table 1: n is the number of observations and d is the number of parameters in the model; the KIC entry follows Cavanaugh [40].
For each of the criteria AIC, AWE, BIC, CLC, and KIC, the model that gives the minimum value of the criterion is selected as the best model. In Table 1, the log-likelihood function for the complete data is $\log L_c(\Psi) = \log L(\Psi) - EN(\tau)$. Here, $EN(\tau) = -\sum_{i=1}^{g} \sum_{j=1}^{n} \tau_{ij} \log \tau_{ij}$ is the entropy of the related classification matrix [46].
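The following sketch shows how the five criteria could be computed from a fitted model. Because Table 1 is not reproduced in this text, the formulas below are the forms commonly quoted for these criteria and should be checked against Table 1 and the cited references. The complete-data log-likelihood is taken as the log-likelihood minus the entropy $EN(\tau)$, which is the standard decomposition for mixture models. The function names and the input values in the usage lines are hypothetical.

```python
import math

def classification_entropy(taus):
    """EN(tau) = -sum_i sum_j tau_ij * log(tau_ij), skipping zero entries."""
    return -sum(t * math.log(t) for row in taus for t in row if t > 0.0)

def information_criteria(loglik, entropy, n, d):
    """Commonly quoted forms of the five criteria (assumed, not copied from Table 1).

    loglik  : maximized log-likelihood log L
    entropy : classification entropy EN(tau)
    n       : number of observations
    d       : number of parameters in the model
    """
    complete_loglik = loglik - entropy  # log L_c = log L - EN(tau)
    return {
        "AIC": -2.0 * loglik + 2.0 * d,
        "AWE": -2.0 * complete_loglik + 2.0 * d * (1.5 + math.log(n)),
        "BIC": -2.0 * loglik + d * math.log(n),
        "CLC": -2.0 * loglik + 2.0 * entropy,
        "KIC": -2.0 * loglik + 3.0 * (d + 1),
    }

# Hypothetical inputs, for illustration only.
crit = information_criteria(loglik=-100.0, entropy=10.0, n=150, d=5)
for name, value in crit.items():
    print(name, round(value, 3))
```

Smaller values indicate a better model under each criterion, so the candidate number of clusters with the minimum value is the one that criterion selects.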

The Analytic Hierarchy Process (AHP)
The AHP was developed by Saaty [2] and is one of the most widely used multiple criteria decision-making tools. It is a method for structuring, measurement, and synthesis [47], built on a hierarchical structure consisting of a goal, criteria, and alternatives [48]. The AHP chooses the best of the alternatives, taking into account the goal and the criteria [49]. In addition, the AHP is a mathematical approach that evaluates qualitative and quantitative variables together. According to the literature [50], the AHP has been applied in various fields of science for tasks such as selecting the best alternative, planning, and optimization. To solve decision-making problems using the AHP, the following steps are applied [48-52]:
Structuring: Initially, the goal, criteria, and alternatives are determined. Then, the hierarchy model is constructed at different levels according to the structure of the problem. A three-level hierarchy with $k$ criteria and $m$ alternatives is given in Figure 1.
Measurement: Firstly, a decision matrix is formed. The decision matrix contains the assessments of each alternative with respect to the decision criteria and is given in Table 2. Here, element $d_{ij}$ indicates the importance level of the $i$-th alternative with respect to the $j$-th criterion ($i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, k$). Secondly, the pairwise comparison matrix of the criteria and the pairwise comparison matrices of the alternatives for each criterion are produced. In general, the pairwise comparison matrix is constructed as in Table 3. Here, $a_{ij}$ stands for the degree of preference of the $i$-th criterion/alternative over the $j$-th criterion/alternative ($a_{ii} = 1$; $a_{ij} = 1/a_{ji}$), and $Sum_t$ is the sum of the $t$-th column of the pairwise comparison matrix.
Synthesis: The maximum eigenvalue ($\lambda_{\max}$), the consistency index (CI), the consistency ratio (CR), and the normalized eigenvector of each pairwise comparison matrix are calculated. Note that CR = CI/RI is calculated for all pairwise comparison matrices. Here, CI = $(\lambda_{\max} - r)/(r - 1)$ is the consistency index and RI is the random consistency index. As is well known from the literature [2,49,51,52], the average RI depends on the dimension $r$ of the matrix. If the CR value is less than 0.10, the matrix is considered to be consistent. As given in reference [49], if $\lambda_{\max} = r$ (so that CI = 0), the pairwise comparison matrix is considered to be consistent.
After the consistency test, the following calculations are made. Firstly, the relative importance vector (RIV) of the criteria is determined using the pairwise comparison matrix. To obtain the normalized matrix, each element of the pairwise comparison matrix is divided by the sum of its column; the normalized matrix is given in Table 4. The RIV consists of the row averages of the normalized matrix, RIV = $[Avg_1, Avg_2, \ldots]^T$. The RIV of the alternatives for each criterion and the RIV of the criteria are calculated separately using the corresponding normalized matrices. Finally, to calculate the composite relative importance vector (C-RIV), the matrix formed by the RIVs of the alternatives for each criterion is multiplied by the RIV of the criteria. Thus, the C-RIV determines the overall ranking of the alternatives.
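The synthesis calculations described above can be sketched in Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, the random consistency index values are the ones commonly tabulated in the AHP literature, and $\lambda_{\max}$ is approximated by averaging the ratios $(Aw)_i / w_i$ with $w$ the RIV, rather than by an exact eigenvalue computation.

```python
# Commonly tabulated random consistency index values by matrix dimension r (assumed).
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45, 10: 1.49}

def relative_importance_vector(matrix):
    """Row averages of the column-normalized pairwise comparison matrix."""
    r = len(matrix)
    col_sums = [sum(matrix[i][j] for i in range(r)) for j in range(r)]
    normalized = [[matrix[i][j] / col_sums[j] for j in range(r)] for i in range(r)]
    return [sum(row) / r for row in normalized]

def consistency(matrix):
    """Return (lambda_max, CI, CR); lambda_max is approximated from the RIV."""
    r = len(matrix)
    riv = relative_importance_vector(matrix)
    weighted = [sum(matrix[i][j] * riv[j] for j in range(r)) for i in range(r)]
    lambda_max = sum(weighted[i] / riv[i] for i in range(r)) / r
    ci = (lambda_max - r) / (r - 1) if r > 1 else 0.0
    ri = RANDOM_INDEX[r]
    cr = ci / ri if ri > 0 else 0.0
    return lambda_max, ci, cr

def composite_riv(alternative_rivs, criteria_riv):
    """C-RIV: weight each criterion's alternative RIV by the criterion's RIV.

    alternative_rivs[c][a] is the RIV of alternative a under criterion c.
    """
    m = len(alternative_rivs[0])
    return [sum(criteria_riv[c] * alternative_rivs[c][a]
                for c in range(len(criteria_riv))) for a in range(m)]

# A perfectly consistent 3x3 example (hypothetical judgments).
example = [[1.0, 2.0, 4.0],
           [0.5, 1.0, 2.0],
           [0.25, 0.5, 1.0]]
riv = relative_importance_vector(example)
lambda_max, ci, cr = consistency(example)
print(riv, lambda_max, ci, cr)
```

For a perfectly consistent matrix such as the example, every normalized column is identical, so the RIV equals any normalized column and the approximation of $\lambda_{\max}$ returns the dimension of the matrix.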

The Proposed Approach for the AHP Model and the Pairwise Comparison Matrix
Model-based clustering is currently a very popular statistical model. Information criteria have commonly been used for determining the number of clusters in model-based clustering [42].
A number of criteria have been proposed to determine the number of clusters in a dataset. The current study proposes an approach for determining the number of clusters by combining model-based clustering with the AHP. The AHP model has been created by using the information criteria AIC, AWE, BIC, CLC, and KIC. Figure 2 describes the proposed approach for determining the number of clusters of a dataset in model-based clustering. The proposed approach is summarized in the following steps:

• Step 1. The hierarchical structure of the AHP is shown in Figure 3. In the figure, determining the number of clusters is the goal; AIC, AWE, BIC, CLC, and KIC are the criteria; and 2, 3, 4, and 5 clusters are the alternatives.

• Step 2. The dataset is modeled as a mixture of multivariate normal distributions for each candidate number of clusters in model-based clustering. The mean vectors, the covariance matrices, the mixing proportions, and the likelihood function are estimated by the EM algorithm.

• Step 3. For each number of clusters, the values of the information criteria are calculated, and the decision matrix is constructed using those values. In model-based clustering, the model that gives the lowest value of an information criterion is selected as the best model, whereas in the AHP the preferred alternative is the one with the highest value of the C-RIV. Therefore, the reciprocal of each information criterion value is used in the decision matrix; for example, the AIC is replaced by 1/AIC.

• Step 4. The pairwise comparison matrices are obtained by using the decision matrix.

• Step 5. For each alternative, the C-RIV is calculated.

• Step 6. The alternative having the highest C-RIV value is the optimal number of clusters for the dataset.

To form the pairwise comparison matrix of the criteria, the study of Akogul and Erisoglu [53] has been used. In that study [53], the efficiency of the information criteria was examined on real datasets that are commonly used in cluster analysis. Those datasets have different characteristics such as sample size (i.e., 75, 150, 178, 345, 846, 2310 and 6435), number of clusters (i.e., 2, 3, 4, 6 and 7), and number of variables (i.e., 2, 4, 6, 13, 18, 19 and 36). Synthetic datasets were generated from the multivariate normal distribution by using the mean vectors and covariance matrices of each dataset. Then, the number of clusters of the synthetic datasets was estimated by using the information criteria. This process was repeated 1000 times. Thus, the success of finding the right number of clusters in the dataset was computed for each information criterion. Over all the synthetic datasets in that study, the average successes of the information criteria were given [53] as 43.6, 21.2, 47.4, 17.3, and 58.2 for AIC, AWE, BIC, CLC, and KIC, respectively.
In the work of Akogul and Erisoglu [53], the effectiveness of the information criteria was determined according to their success in finding the right number of clusters. In the current study, those successes have been used to determine the importance level of the criteria in the AHP model. To produce the pairwise comparison matrix of the criteria, the ratio of the average successes of two criteria is taken as the degree of preference of one criterion over the other. The proposed pairwise comparison matrix of the criteria and the RIV of the criteria are given in Table 5. For example, in Table 5, the value 2.0566 can be interpreted as the degree of preference of the AIC over the AWE: the average success of the AIC is 43.6, while that of the AWE is 21.2, so the AIC is about two times more successful than the AWE (43.6/21.2 ≈ 2.0566).
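The construction of the pairwise comparison matrix of the criteria can be sketched as follows, using the average successes reported in [53]. The variable names are hypothetical. Because every entry is a ratio of two successes, the matrix is perfectly consistent and the RIV of each criterion reduces to its success divided by the sum of all successes.

```python
# Average successes of the criteria as reported in [53].
successes = {"AIC": 43.6, "AWE": 21.2, "BIC": 47.4, "CLC": 17.3, "KIC": 58.2}
names = list(successes)

# Degree of preference of criterion a over criterion b: ratio of their successes.
pairwise = [[successes[a] / successes[b] for b in names] for a in names]

# RIV: row averages of the column-normalized matrix.
r = len(names)
col_sums = [sum(pairwise[i][j] for i in range(r)) for j in range(r)]
criteria_riv = [sum(pairwise[i][j] / col_sums[j] for j in range(r)) / r
                for i in range(r)]

print(round(pairwise[0][1], 4))  # preference of the AIC over the AWE
for name, weight in zip(names, criteria_riv):
    print(name, round(weight, 4))
```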

Testing of the Proposed Approach for the Real Datasets
The performance of the proposed approach has been tested on common real datasets, namely, Chemical Diabetes [54], Crab [55], Liver Disorders [56], Ionosphere [57], Iris [43], Wine [58], Ruspini [59], E. coli [60], and Vehicle Silhouettes [61]. They have been obtained from the UCI machine learning repository [62] and GitHub [63]. Their characteristics are given in Table 6. In this section, all calculations for determining the number of clusters in the Iris dataset are presented step by step. For the other datasets, only the final results are presented, and the decision matrices are presented in the Appendix. Their pairwise comparison matrices can be obtained by using the decision matrices.
The results of the information criteria in determining the number of clusters for the Iris dataset are presented in Table 7. According to AIC, AWE, BIC, CLC, and KIC, the number of clusters in the Iris dataset has been estimated to be 4, 2, 2, 4, and 3, respectively. To form the decision matrix, the reciprocals of the information criteria values have been used (for example, the AIC is replaced by 1/AIC). The decision matrix of the Iris dataset is given in Table 8. The pairwise comparison matrix and the RIV of each criterion, which are obtained by using the decision matrix, are shown in Table 9. The C-RIV is presented in Table 10. In the table, the alternative with three clusters is the best alternative because it has the maximum C-RIV value of 0.2628. Thus, the number of clusters for the Iris dataset has been determined correctly. The C-RIV for the real datasets is also presented in Table 11. As the table shows, the number of clusters for the real datasets has been estimated correctly by using the proposed approach.
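The chain from the information criteria values to the C-RIV can be sketched as follows. The criterion values below are hypothetical placeholders, not the Iris values of Table 7, and the sketch assumes that each pairwise comparison entry for the alternatives is the ratio of the corresponding reciprocal criterion values, which the text does not state explicitly. Under that assumption, the RIV of the alternatives for a criterion is simply the normalized vector of reciprocals. The sketch also assumes that all criterion values are positive, since taking reciprocals would otherwise not preserve the intended ordering.

```python
ALTERNATIVES = [2, 3, 4, 5]  # candidate numbers of clusters

# Criteria weights: average successes from [53], normalized to sum to one.
successes = {"AIC": 43.6, "AWE": 21.2, "BIC": 47.4, "CLC": 17.3, "KIC": 58.2}
total_success = sum(successes.values())
criteria_riv = {name: s / total_success for name, s in successes.items()}

# Hypothetical criterion values for 2, 3, 4, and 5 clusters (illustration only).
criteria_values = {
    "AIC": [520.0, 500.0, 495.0, 505.0],
    "AWE": [600.0, 640.0, 700.0, 760.0],
    "BIC": [550.0, 545.0, 560.0, 580.0],
    "CLC": [510.0, 490.0, 485.0, 495.0],
    "KIC": [530.0, 508.0, 510.0, 520.0],
}

def composite_riv(values_by_criterion, weights):
    """C-RIV of the alternatives from reciprocal criterion values."""
    m = len(next(iter(values_by_criterion.values())))
    c_riv = [0.0] * m
    for name, values in values_by_criterion.items():
        reciprocals = [1.0 / v for v in values]    # decision matrix column
        total = sum(reciprocals)
        alt_riv = [x / total for x in reciprocals]  # RIV of a ratio-based matrix
        for a in range(m):
            c_riv[a] += weights[name] * alt_riv[a]
    return c_riv

c_riv = composite_riv(criteria_values, criteria_riv)
best = ALTERNATIVES[c_riv.index(max(c_riv))]
print([round(x, 4) for x in c_riv], best)
```

Because the reciprocals of values of similar magnitude are close to one another, the resulting C-RIV entries lie near 1/m, so the decision rests on small differences between the alternatives.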

Testing of the Proposed Approach for the Synthetic Datasets
For the synthetic-1 dataset, we generate 1000 samples from a two-component bivariate normal mixture with the mixing proportions $\pi_1 = \pi_2 = 1/2$, the mean vectors $\mu_1 = [2, 4]^T$ and $\mu_2 = [5, 6]^T$, and the covariance matrices $\Sigma_1$ and $\Sigma_2$. Figure 4 shows the scatter plot and the PDF of the mixture model of the synthetic-1 dataset. The decision matrix of the synthetic-1 dataset is presented in Table 12. The RIV of the alternatives for each criterion and the RIV of the criteria are given in Table 13. To identify the best alternative, the C-RIV has been calculated using the corresponding values. In the table, the alternative with two clusters is the best alternative because it has the maximum C-RIV value, 0.2561. That is, the number of clusters has been determined correctly for the synthetic-1 dataset.
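A dataset of this kind can be generated with the following sketch. The mixing proportions and mean vectors are those stated for the synthetic-1 dataset, but the covariance matrices are not available in this text, so identity matrices are used as a labeled placeholder assumption. The function names and the seed are hypothetical.

```python
import math
import random

def sample_bivariate_normal(rng, mu, sigma):
    """Draw one point from N(mu, sigma) using the Cholesky factor of a 2x2 sigma."""
    (a, b), (_, d) = sigma
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(d - l21 * l21)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2)

def sample_mixture(n, weights, means, covs, seed=0):
    """Draw n points from a bivariate normal mixture; also return component labels."""
    rng = random.Random(seed)
    points, labels = [], []
    for _ in range(n):
        i = rng.choices(range(len(weights)), weights=weights)[0]
        points.append(sample_bivariate_normal(rng, means[i], covs[i]))
        labels.append(i)
    return points, labels

# Mixing proportions and means as stated in the text; covariances are ASSUMED.
identity = ((1.0, 0.0), (0.0, 1.0))
points, labels = sample_mixture(
    1000, [0.5, 0.5], [(2.0, 4.0), (5.0, 6.0)], [identity, identity])
print(len(points), labels.count(0), labels.count(1))
```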
The decision matrix and the C-RIV of the synthetic-2 dataset are given in Tables 14 and 15. In Table 15, the alternative with three clusters is the best alternative; that is, the number of clusters for the synthetic-2 dataset has been determined correctly. As in the previous calculations, this operation has been repeated 1000 times, and the success of finding the right number of clusters in the synthetic-2 dataset has been computed for each information criterion. The proposed approach (93%) has been seen to be better than the AIC (74%), the AWE (10%), the BIC (92%), the CLC (31%), and the KIC (86%).
For the synthetic-3 dataset, we again generate 1000 samples from a four-component bivariate normal mixture with the mixing proportions $\pi_1 = \pi_2 = \pi_3 = \pi_4 = 0.25$, the mean vectors $\mu_1, \ldots, \mu_4$, and the corresponding covariance matrices. The decision matrix and the C-RIV of the synthetic-3 dataset are shown in Tables 16 and 17. In Table 17, the alternative with four clusters is the best alternative because it has the maximum C-RIV value, 0.2526. The number of clusters for the synthetic-3 dataset has thus been determined correctly. Similarly, this operation has been repeated 1000 times, and the success of finding the right number of clusters in the synthetic-3 dataset has been computed for each information criterion. The proposed approach (92%) has been seen to be better than the AIC (80%), the AWE (4%), the BIC (89%), the CLC (65%), and the KIC (89%).
Table 18 summarizes the estimates of the number of clusters for all datasets produced by the information criteria and the proposed approach. The bottommost row gives the number of correctly determined cluster counts for the information criteria and the proposed approach.

Conclusions and Recommendation
This paper has proposed combining the AHP with the information criteria AIC, AWE, BIC, CLC, and KIC to determine the number of clusters of a dataset in model-based clustering. The proposed approach has been seen to be more accurate than the individual information criteria. The approach can therefore be applied to a wide range of clustering algorithms. To carry out this study, the decision matrix has been created by using the information criteria values for each case. To increase the success of the information criteria, a pairwise comparison matrix has been suggested in this study. The proposed method is expected to be effective in analyzing data arising in various fields of science, such as economics, biology, and engineering. In further studies, researchers may construct different decision matrices and pairwise comparison matrices to deal with their own problems.

Figure 2. The proposed approach for determining the number of clusters.

Figure 3. The hierarchical structure of the AHP for the proposed approach.

Figure 4. Synthetic-1 dataset: (a) the scatter plot; (b) the PDF of the mixture model.

Figure 5. Synthetic-2 dataset: (a) the scatter plot; (b) the PDF of the mixture model.

Figure 6. Synthetic-3 dataset: (a) the scatter plot; (b) the PDF of the mixture model.

Table 1. The information criteria for the model selection.

Table 2. The decision matrix.

Table 3. The pairwise comparison matrix.

Table 4. The normalized matrix.

Table 5. The proposed pairwise comparison matrix of the criteria and the RIV of the criteria.
* The relative importance vector (RIV) of the criteria.

Table 6. Descriptions of the real datasets.

Table 7. The results in determining the number of clusters for the Iris dataset.
* The minimum value of the information criteria.

Table 9. The pairwise comparison matrix and the RIV of each criterion for the Iris dataset.

Table 10. The C-RIV for the Iris dataset.

Column headers: RIV_AIC, RIV_AWE, RIV_BIC, RIV_CLC, RIV_KIC.
* The maximum value of the C-RIV.

Table 11. The C-RIV of the real datasets for the proposed approach.
* The maximum value of the C-RIV for each dataset.

Table 13. The C-RIV for the synthetic-1 dataset.

Column headers: RIV_AIC, RIV_AWE, RIV_BIC, RIV_CLC, RIV_KIC, C-RIV.
* The maximum value of the C-RIV.

Table 15. The C-RIV for the synthetic-2 dataset.

Column headers: RIV_AIC, RIV_AWE, RIV_BIC, RIV_CLC, RIV_KIC.
* The maximum value of the C-RIV.

Table 17. The C-RIV for the synthetic-3 dataset.

Column headers: RIV_AIC, RIV_AWE, RIV_BIC, RIV_CLC, RIV_KIC.
* The maximum value of the C-RIV.