
Entropy 2019, 21(11), 1063; https://doi.org/10.3390/e21111063

Article
An Integrated Approach for Making Inference on the Number of Clusters in a Mixture Model
Erlandson Ferreira Saraiva 1,*, Adriano Kamimura Suzuki 2, Luis Aparecido Milan 3 and Carlos Alberto de Bragança Pereira 1,4
1
Instituto de Matemática, Universidade Federal de Mato Grosso do Sul, Campo Grande 79070-900, Brazil
2
Departamento de Matemática Aplicada e Estatística, Universidade de São Paulo, São Carlos 13566-590, Brazil
3
Departamento de Estatística, Universidade Federal de São Carlos, São Carlos 13565-905, Brazil
4
Instituto de Matemática e Estatística, Universidade de São Paulo, São Paulo 05508-090, Brazil
*
Author to whom correspondence should be addressed.
Received: 23 September 2019 / Accepted: 26 October 2019 / Published: 30 October 2019

Abstract

This paper presents an integrated approach for estimating the parameters of a mixture model in the context of data clustering. The method is designed to estimate the unknown number of clusters from the observed data. To this end, we marginalize out the weights to obtain allocation probabilities that depend on the number of clusters but not on the number of components of the mixture model. As an alternative to the stochastic expectation maximization (SEM) algorithm, we propose the integrated stochastic expectation maximization (ISEM) algorithm, which, in contrast to SEM, does not require the number of components of the mixture to be specified a priori. With this algorithm, the parameters of the clusters with at least two observations are estimated via local maximization of the likelihood function. In addition, at each iteration of the algorithm, there is a positive probability of a new cluster being created by a single observation. Using simulated datasets, we compare the performance of the ISEM algorithm against both the SEM and reversible jump (RJ) algorithms. The results show that ISEM outperforms the SEM and RJ algorithms. We also report the performance of the three algorithms on two real datasets.
Keywords:
model-based clustering; mixture model; EM algorithm; integrated approach

1. Introduction

Recently, there has been increasing interest in modeling using mixture models. This is mainly due to the flexibility for treating heterogeneous populations. Under a data-clustering framework, this model has the advantage of being probabilistic, and then the obtained clusters can have a better interpretation from a statistical point of view [1]. This contrasts with usual methods, such as k-means or hierarchical clustering, in which clusters are not statistically based, as discussed by [2].
From a frequentist viewpoint, the standard method for obtaining the maximum likelihood estimates of the parameters of a mixture model is the Expectation Maximization (EM) algorithm [3]. However, to use this algorithm, the number of components k of the mixture model must be known a priori. As the resulting model is highly dependent on this value, the main question is how to set k. A poorly chosen k may cause the algorithm to fail to converge and/or to explore the possible clusterings poorly. In addition, depending on the chosen k, some components may be empty, in which case no maximum likelihood estimates exist for them.
An approach frequently used to determine the best k among a fixed set of values is to run the stochastic version of the EM algorithm (SEM) together with a model selection criterion, such as the Akaike information criterion (AIC) [4,5] or the Bayesian information criterion (BIC) [6]. In this approach, models are fitted for a set of predefined k values, and the best model is the one with the smallest AIC or BIC value.
However, as discussed by [7], fitting several models for a predefined set of values of the number of clusters and comparing them with a model selection criterion is neither practical nor efficient. It is therefore desirable to have an efficient algorithm that determines the optimal number of clusters jointly with the estimation of the parameters of each mixture component. In this scenario, the Bayesian approach has been successfully applied through the Markov chain Monte Carlo (MCMC) algorithm with reversible jumps, described by [8] in the context of univariate normal mixture models. However, a difficulty often encountered in implementing a reversible jump (RJ) algorithm is the construction of efficient transition proposals that lead to a reasonable acceptance rate.
Also among MCMC algorithms, [9] proposes a split–merge MCMC procedure for the conjugate Dirichlet process mixture model that uses a restricted Gibbs sampling scan to determine a split proposal, where the number of scans (a tuning parameter) must be fixed in advance by the user, and [10] extend this method to a nonconjugate Dirichlet process mixture model. [11] proposes a data-driven split-and-merge approach, in which the number of clusters is updated by creating a new component based on a single observation and by a split–merge strategy built on the Kullback–Leibler divergence. A difficulty in implementing this algorithm is obtaining the mathematical expression for the Kullback–Leibler divergence, which does not always have a known analytical form. In addition, the sequential allocation used in the split–merge strategy of these three works may make the algorithm slow when the sample size is large, and the computational implementation of these methods is not simple.
The present work proposes an integrated approach that, in a joint way, selects the number of clusters and estimates the parameters of interest. With this approach, the mixture weights are integrated out to obtain allocation probabilities that depend on the number of clusters (nonempty components) but do not depend on the number of components k. In addition, considering k tending to infinity, this procedure introduces a positive probability of a new cluster being created by a single observation. When a new cluster is created, the parameters associated with it are generated from its posterior distribution. We then developed the ISEM (integrated stochastic expectation maximization) algorithm to estimate the parameters of interest. This algorithm configures a setting for latent allocation variables c according to allocation probabilities, and then the cluster parameters are updated conditionally on c as follows: for clusters with at least two observations, the parameter values are the maximum likelihood estimates; for the clusters with only one observation, the parameter values are generated from their posterior distribution.
In order to illustrate the computational implementation of the method and verify its performance, we consider a specific model in which data are generated from mixtures of univariate normal distributions. This model allows us to avoid the label switching problem by labeling the components in increasing order of the component means, as done by [8,11,12,13], among others. We emphasize, however, that our algorithm is not restricted to this particular model. For instance, in the multivariate case, we may label the components according to the eigenvalues of the current covariance matrix, as done by [14]. A detailed discussion of the multivariate case is left for a future paper.
We also compare the performance of the ISEM with both SEM and RJ algorithms. The criteria used to compare the methods are the estimated probability of the number of clusters, convergence of the sampled values, mixing, autocorrelation, and computation time. We also applied the three algorithms to two real datasets. The first is the well-known Galaxy data, and the second is a dataset on Acidity.
The remainder of the paper is organized as follows. Section 2 describes the mixture model and the estimation process based on the SEM algorithm. Section 3 develops the integrated approach and describes the ISEM algorithm. Section 4 shows how we applied the algorithm to simulated datasets in order to assess its performance. Section 5 describes the application of the three algorithms to two real datasets. Section 6 presents our final remarks. Additional details are given in the Supplementary Material, referred to as “SM” in this paper. Table 1 presents the main notation used throughout the article.

2. Mixture Model and SEM Algorithm

Let $\mathbf{y} = (y_1, \ldots, y_n)$ be a vector of independent observations from a mixture model with $k$ components, i.e.,
$$f(y_i \mid \mathbf{w}, \boldsymbol{\theta}_k, k) = \sum_{j=1}^{k} w_j f(y_i \mid \theta_j), \tag{1}$$
where $f(y_i \mid \theta_j)$ is the density of a family of parametric distributions with parameters $\theta_j$ (scalar or vector), $\boldsymbol{\theta}_k = (\theta_1, \ldots, \theta_k)$ are the parameters of the components, and $\mathbf{w} = (w_1, \ldots, w_k)$, with $w_j > 0$ and $\sum_{j=1}^{k} w_j = 1$, are the component weights.
The log-likelihood function for $(\boldsymbol{\theta}_k, \mathbf{w})$ is given by
$$l(\boldsymbol{\theta}_k, \mathbf{w} \mid \mathbf{y}, k) = \log \prod_{i=1}^{n} \sum_{j=1}^{k} w_j f(y_i \mid \theta_j) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} w_j f(y_i \mid \theta_j).$$
The notation $l(\boldsymbol{\theta}_k, \mathbf{w} \mid \mathbf{y}, k)$ follows the book of Casella and Berger (2002).
The usual procedure to obtain the maximum likelihood estimators consists of taking the partial derivatives of $l(\boldsymbol{\theta}_k, \mathbf{w} \mid \mathbf{y})$ with respect to $\theta_j$ and setting them to zero, i.e.,
$$\frac{d\, l(\boldsymbol{\theta}_k, \mathbf{w} \mid \mathbf{y})}{d \theta_j} = \sum_{i=1}^{n} \frac{w_j f(y_i \mid \theta_j)}{\sum_{j=1}^{k} w_j f(y_i \mid \theta_j)} \, \frac{d \log f(y_i \mid \theta_j)}{d \theta_j} = 0, \tag{2}$$
for $j = 1, \ldots, k$.
Note, however, that the maximization in (2) is a weighted maximization of the log-likelihood function, in which each observation $y_i$ has a weight associated with component $j$ given by
$$w^{*}_{ij} = \frac{w_j f(y_i \mid \theta_j)}{\sum_{j=1}^{k} w_j f(y_i \mid \theta_j)}, \tag{3}$$
for $i = 1, \ldots, n$ and $j = 1, \ldots, k$. These weights depend on the parameters that we are trying to estimate, so we cannot obtain a closed-form expression that allows the direct maximization of the log-likelihood function. For this reason, the mixture problem is reformulated as a complete-data problem [12,15].
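As a minimal sketch of Equation (3), assuming normal component densities (the model used later in the paper), the weights $w^{*}_{ij}$ could be computed as follows. The function names `normal_pdf` and `responsibilities` and the numerical values are illustrative choices, not part of the original article.

```python
import math

def normal_pdf(y, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at y."""
    return math.exp(-(y - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def responsibilities(y, weights, mus, sigma2s):
    """Return the n x k matrix of weights w*_ij from Equation (3)."""
    out = []
    for y_i in y:
        # Numerators w_j f(y_i | theta_j), one per component.
        num = [w * normal_pdf(y_i, m, s2) for w, m, s2 in zip(weights, mus, sigma2s)]
        total = sum(num)  # denominator: the mixture density at y_i
        out.append([v / total for v in num])
    return out

# Hypothetical two-component example with means 0 and 4 and unit variances.
w_star = responsibilities([-1.0, 0.0, 4.0], [0.5, 0.5], [0.0, 4.0], [1.0, 1.0])
```

Each row of `w_star` sums to one, and an observation close to one component mean receives almost all of its weight from that component.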

Complete-Data Formulation

Associate with each observation $y_i$ an unobserved latent indicator variable $c_i$, so that if $c_i = j$, then $y_i$ is from component $j$, for $i = 1, \ldots, n$ and $j = 1, \ldots, k$. The probability of $c_i = j$ is $w_j$, i.e., $P(c_i = j \mid \mathbf{w}, k) = w_j$, for $i = 1, \ldots, n$ and $j = 1, \ldots, k$. Letting $n_j$ be the number of observations from component $j$ (i.e., the number of $c_i$'s equal to $j$), the joint probability for $\mathbf{c} = (c_1, \ldots, c_n)$ given $\mathbf{w}$ and $k$ is
$$\pi(\mathbf{c} \mid \mathbf{w}, k) = \prod_{j=1}^{k} w_j^{n_j}. \tag{4}$$
The distribution of the number of observations assigned to each component, $n_1, \ldots, n_k$, called the occupation numbers, is multinomial, $(n_1, \ldots, n_k \mid n, \mathbf{w}) \sim \text{Multinomial}(n, \mathbf{w})$, where $n = n_1 + \cdots + n_k$.
Thus, under this augmented framework, we have that
(1)
the probability of $c_i = j$, conditional on observation $y_i$ and on component parameters $\boldsymbol{\theta}_k$, is $w^{*}_{ij}$, i.e., $P(c_i = j \mid y_i, \boldsymbol{\theta}_k, k) = w^{*}_{ij}$, for $w^{*}_{ij}$ given in Equation (3), for $i = 1, \ldots, n$ and $j = 1, \ldots, k$. That is, although the indicator variables are not observable, they are implicitly present in the estimation procedure given in Equation (2).
(2)
the log-likelihood function for $(\boldsymbol{\theta}_k, \mathbf{w})$, conditional on the complete data $(\mathbf{y}, \mathbf{c})$, is given by
$$l(\boldsymbol{\theta}_k, \mathbf{w} \mid \mathbf{y}, \mathbf{c}) = \log \prod_{j=1}^{k} w_j^{n_j} L(\theta_j \mid \mathbf{y}) = \sum_{j=1}^{k} \left[ n_j \log(w_j) + l(\theta_j \mid \mathbf{y}) \right],$$
where $l(\theta_j \mid \mathbf{y}) = \log L(\theta_j \mid \mathbf{y})$ is the log-likelihood function for component $j$, for $j = 1, \ldots, k$. Thus, the estimation of the $k$ component parameters reduces to $k$ independent estimation problems. For example, for a normal mixture model, the maximum likelihood estimates of the component parameters $\theta_j = (\mu_j, \sigma_j^2)$ are $\hat{\theta}_j = (\hat{\mu}_j, \hat{\sigma}_j^2) = (\bar{y}_j, s_j^2)$, where $\bar{y}_j$ and $s_j^2$ are, respectively, the average and variance of the observations allocated to component $j$, for $j = 1, \ldots, k$.
From this complete-data formulation, the estimation procedure is given by an iterative process with two steps. In the first one, the allocation indicator variables are updated conditional on component parameters, and in the subsequent step, the component parameters are updated conditional on configuration of the allocation indicator variables.
The usual algorithm used to implement these two steps is the EM algorithm [3]. The stochastic version of the EM algorithm (SEM) can be implemented according to Algorithm 1.
Algorithm 1 SEM Algorithm
1: Initialize the algorithm with a configuration $\mathbf{c}^{(0)} = (c_1^{(0)}, \ldots, c_n^{(0)})$ for the allocation indicator variables.
2: procedure For the s-th iteration of the algorithm, $s = 1, 2, \ldots$
3:     get the maximum likelihood estimates $\hat{\boldsymbol{\theta}}_k^{(s)} = (\hat{\theta}_1^{(s)}, \ldots, \hat{\theta}_k^{(s)})$ and $\hat{\mathbf{w}}^{(s)} = (\hat{w}_1^{(s)}, \ldots, \hat{w}_k^{(s)})$ conditional on configuration $\mathbf{c}^{(s-1)}$;
4:     if $\left| l(\hat{\boldsymbol{\theta}}_k^{(s)}, \hat{\mathbf{w}}^{(s)} \mid \mathbf{y}) - l(\hat{\boldsymbol{\theta}}_k^{(s-1)}, \hat{\mathbf{w}}^{(s-1)} \mid \mathbf{y}) \right| \, / \, \left| l(\hat{\boldsymbol{\theta}}_k^{(s-1)}, \hat{\mathbf{w}}^{(s-1)} \mid \mathbf{y}) \right| < \epsilon$, where $\epsilon$ is a previously fixed threshold value, then stop the algorithm. Otherwise, go to step 5;
5:     conditional on $\hat{\boldsymbol{\theta}}_k^{(s)}$ and $\hat{\mathbf{w}}^{(s)}$, update $\mathbf{c} = (c_1, \ldots, c_n)$ as follows. For $i = 1, \ldots, n$ and $j = 1, \ldots, k$ do the following:
6:     Let $z_i = (z_{i1}, \ldots, z_{ik})$ be an indicator vector, so that $z_{ij} = 0$ or $z_{ij} = 1$;
7:     Generate $z_i \sim \text{Multinomial}(1, \mathbf{w}^{*}_i)$, where $\mathbf{w}^{*}_i = (w^{*}_{i1}, \ldots, w^{*}_{ik})$ and $w^{*}_{ij}$ is obtained from Equation (3) by setting $\theta_j = \hat{\theta}_j$ and $w_j = \hat{w}_j$. If $z_{ij} = 1$, then set $c_i = j$;
8:     Set $s = s + 1$ and return to step 3.
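The steps of Algorithm 1 can be sketched in a few lines for a univariate normal mixture. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names, the iteration cap `max_iter`, the use of the divisor $n_j$ in the variance, the choice to raise an error when a component has fewer than two observations, and the toy data are all assumptions made here.

```python
import math
import random

def normal_pdf(y, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at y."""
    return math.exp(-(y - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def log_likelihood(y, w, mu, s2):
    """Observed-data log-likelihood of the mixture."""
    return sum(
        math.log(sum(wj * normal_pdf(yi, mj, vj) for wj, mj, vj in zip(w, mu, s2)))
        for yi in y
    )

def sem(y, c, k, eps=1e-3, max_iter=500, rng=None):
    """Stochastic EM for a k-component univariate normal mixture.

    c holds the initial allocations as labels 0..k-1. A ValueError is raised
    when a component has fewer than two observations, since its maximum
    likelihood estimates are then undefined.
    """
    rng = rng or random.Random(0)
    n = len(y)
    prev = None
    for _ in range(max_iter):
        # Step 3: complete-data maximum likelihood estimates given c.
        w, mu, s2 = [], [], []
        for j in range(k):
            yj = [yi for yi, ci in zip(y, c) if ci == j]
            if len(yj) < 2:
                raise ValueError(f"component {j} has fewer than two observations")
            mean_j = sum(yj) / len(yj)
            w.append(len(yj) / n)
            mu.append(mean_j)
            s2.append(sum((yi - mean_j) ** 2 for yi in yj) / len(yj))
        # Step 4: stop when the relative change in log-likelihood is below eps.
        cur = log_likelihood(y, w, mu, s2)
        if prev is not None and abs(cur - prev) / abs(prev) < eps:
            break
        prev = cur
        # Steps 5-7: draw each c_i from Multinomial(1, w*_i); random.choices
        # normalizes the unnormalized weights w_j f(y_i | theta_j) itself.
        c = [
            rng.choices(
                range(k),
                weights=[wj * normal_pdf(yi, mj, vj) for wj, mj, vj in zip(w, mu, s2)],
            )[0]
            for yi in y
        ]
    return w, mu, s2, c

# Illustrative run on hypothetical data: two well-separated normal clusters.
data_rng = random.Random(1)
y = [data_rng.gauss(-5.0, 1.0) for _ in range(50)] + [data_rng.gauss(5.0, 1.0) for _ in range(50)]
c0 = [0 if yi < 0 else 1 for yi in y]
w_hat, mu_hat, s2_hat, c_hat = sem(y, c0, k=2)
```

The explicit error for small components makes visible the situation discussed in the text, where an empty component leaves the maximum likelihood estimates undefined.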
Although it is simple to implement computationally, the SEM algorithm may present some practical problems. As discussed by [16], the algorithm may converge slowly. For this reason, some authors, such as [17,18], discuss how to choose the starting values in order to speed up convergence. In addition, [15] discusses the non-existence of the global maximum estimator.
Moreover, in this algorithm, the value of $k$ must be known in advance. When $k$ is unknown, the best value is chosen by fitting a set of models associated with a set of predefined $k$ values and comparing them according to the AIC [4,5] or BIC [6] criteria. Furthermore, given a sample of size $n$ and a fixed $k$, there exists a positive probability, given by $(1 - w_j)^n > 0$, of the j-th component having no observations allocated to it in an iteration of the algorithm. In this case, we have an empty component, and the maximum likelihood estimates cannot be calculated for it. Thus, in order to avoid the practical problems presented by the EM algorithm, we propose an integrated approach.
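To give a sense of scale for the probability $(1 - w_j)^n$ of an empty component, the short sketch below evaluates it for a hypothetical small weight. The function name `prob_empty` and the values $w_j = 0.01$ and $n = 200$ are illustrative assumptions, not values taken from the paper's tables.

```python
def prob_empty(w_j, n):
    """Probability that none of n independent allocations falls in component j."""
    return (1.0 - w_j) ** n

# Hypothetical case: a component with weight 0.01 and a sample of size 200.
p = prob_empty(0.01, 200)
```

With these values the probability is roughly 0.13, so empty components are far from a rare event when some weights are small.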

3. Integrated Approach

We start our integrated approach linking data clustering to a mixture model. For this, consider a sampling process from a heterogeneous population that is subdivided into k sub-populations. Thus, it is natural to assume that the sampling process consists of the realization of the following steps:
(i)
choose a sub-population $j$ with probability $w_j$, where $w_j$ is the relative frequency of the j-th sub-population in relation to the overall population;
(ii)
sample a value $Y_i$ from this sub-population,
for $i = 1, \ldots, n$ and $j = 1, \ldots, k$, where $n$ is the sample size.
Let $(Y_i, c_i)$ be a sample unit, where $c_i$ is an allocation indicator variable that takes a value in the set $\{1, \ldots, k\}$ with probabilities $\{w_1, \ldots, w_k\}$, respectively. Thus, assuming that sub-population $j$ is modeled by a probability distribution $F(\theta_j)$ indexed by parameter $\theta_j$ (scalar or vector), we have that
$$(Y_i \mid c_i = j, \theta_j) \sim F(\theta_j) \quad \text{and} \quad P(c_i = j \mid \mathbf{w}) = w_j,$$
for $i = 1, \ldots, n$ and $j = 1, \ldots, k$.
However, in clustering problems, the values of the $c_i$'s are not observable. Thus, the probability of $c_i = j$ is $w_j$, and the marginal probability density function for $Y_i = y_i$ is given by Equation (1).
In addition, as the model in Equation (1) is a population model, there exists a non-null probability $(1 - w_j)^n$ that the j-th component is empty. Thus, the number of clusters (i.e., non-empty components) may be smaller than the number of components $k$. As seen in the description of the EM algorithm, the number of clusters is defined by the configuration of the latent allocation variables $\mathbf{c}$; hereafter, we denote the number of clusters by $k_{\mathbf{c}}$, with $k_{\mathbf{c}} \leq k$.
Since the interest lies in the configuration of $\mathbf{c}$, let us marginalize out the weights of the mixture model. Integrating density (4) with respect to the prior distribution of the weights, $(w_1, \ldots, w_k) \mid k, \gamma \sim \text{Dirichlet}\left(\frac{\gamma}{k}, \ldots, \frac{\gamma}{k}\right)$, the joint probability for $\mathbf{c}$ is given by (see Appendix 3 of the SM)
$$\pi(\mathbf{c} \mid \gamma, k) = \frac{\Gamma(\gamma)}{\Gamma(n + \gamma)} \prod_{j=1}^{k} \frac{\Gamma\left(n_j + \frac{\gamma}{k}\right)}{\Gamma\left(\frac{\gamma}{k}\right)}. \tag{5}$$
Similarly, the conditional probability for $c_i = j$ given $\mathbf{c}_{-i} = (c_1, \ldots, c_{i-1}, c_{i+1}, \ldots, c_n)$ is given by
$$\pi(c_i = j \mid \mathbf{c}_{-i}, \gamma, k) = \frac{n_{j,-i} + \frac{\gamma}{k}}{n + \gamma - 1}, \tag{6}$$
where $n_{j,-i}$ is the number of observations allocated to the j-th component, excluding the i-th observation, for $i = 1, \ldots, n$ and $j = 1, \ldots, k$.
As the main interest is in $k_{\mathbf{c}}$ and not $k$, we remove $k$ from Equation (6) by letting $k$ tend to infinity. Under this assumption, the probability reaches the following limit:
$$\pi(c_i = j \mid \mathbf{c}_{-i}, \gamma) = \frac{n_{j,-i}}{n + \gamma - 1}, \tag{7}$$
when $n_{j,-i} > 0$, for $i = 1, \ldots, n$ and $j = 1, \ldots, k_{\mathbf{c}}$, where $k_{\mathbf{c}}$ is the number of clusters defined by configuration $\mathbf{c}$. In addition, we now have a probability of the i-th observation being allocated to one of the infinitely many other components, which is given by
$$\pi(c_i = j^{*} \mid \mathbf{c}_{-i}, \gamma) = \frac{\gamma}{n + \gamma - 1}, \tag{8}$$
for $j^{*} \notin \{1, \ldots, k_{\mathbf{c}}\}$. This is the probability of the observation $y_i$ creating a new cluster, for $i = 1, \ldots, n$. The probabilities in (7) and (8) are equivalent to the update probabilities of a Dirichlet process mixture model. See, for example, [19,20,21].
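The two limits can be checked directly from Equation (6). The short derivation below is a reading aid under the assumption that $k_{\mathbf{c}_{-i}}$ denotes the number of components that remain non-empty once observation $i$ is removed. For an occupied component, the term $\gamma/k$ vanishes:

$$\lim_{k \to \infty} \frac{n_{j,-i} + \gamma/k}{n + \gamma - 1} = \frac{n_{j,-i}}{n + \gamma - 1}.$$

Each of the $k - k_{\mathbf{c}_{-i}}$ empty components has probability $(\gamma/k)/(n + \gamma - 1)$, so their total mass is

$$\lim_{k \to \infty} \frac{(k - k_{\mathbf{c}_{-i}})\, \gamma / k}{n + \gamma - 1} = \frac{\gamma}{n + \gamma - 1}.$$

Since $\sum_{j} n_{j,-i} = n - 1$, the limiting probabilities add up to one:

$$\frac{n - 1}{n + \gamma - 1} + \frac{\gamma}{n + \gamma - 1} = 1.$$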
Given $y_i$, the conditional probability for $c_i = j$, such that $n_{j,-i} > 0$, is
$$\pi_{ij} = \pi(c_i = j \mid y_i, \theta_j, \mathbf{c}_{-i}, \gamma) = \frac{n_{j,-i}}{n + \gamma - 1} f(y_i \mid \theta_j), \tag{9}$$
for $i = 1, \ldots, n$ and $j = 1, \ldots, k_{\mathbf{c}_{-i}}$, where $k_{\mathbf{c}_{-i}}$ is the number of clusters excluding the i-th observation. At this point, it is important to note that if an observation $y_i$ is allocated to a component $j$, $c_i = j$, and $n_j > 1$, then $n_{j,-i} \geq 1$ and $k_{\mathbf{c}_{-i}} = k_{\mathbf{c}}$. But if $c_i = j$ and $n_j = 1$, then $n_{j,-i} = 0$ and $k_{\mathbf{c}_{-i}} = k_{\mathbf{c}} - 1$.
In order to define the conditional probability of the i-th observation creating a new cluster $j^{*}$, we integrate the parameters out for this case, for $j^{*} = k_{\mathbf{c}_{-i}} + 1$. This is done because that probability does not depend on the parameter value $\theta_{j^{*}}$. Thus, the conditional posterior probability for $c_i = j^{*}$ is
$$\pi_{ij^{*}} = \pi(c_i = j^{*} \mid y_i, \mathbf{c}_{-i}, \gamma) = \frac{\gamma}{n + \gamma - 1} I(y_i), \tag{10}$$
where $I(y_i) = \int f(y_i \mid \theta_{j^{*}}) \pi(\theta_{j^{*}}) \, d\theta_{j^{*}}$ and $\pi(\theta_{j^{*}})$ is the density of the prior distribution for $\theta_{j^{*}}$, for $i = 1, \ldots, n$.
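The probabilities in Equations (9) and (10) can be assembled into a single vector for one observation, as in the hedged sketch below. It assumes normal component densities, and it normalizes the vector so that it sums to one, which is an assumption made here so that the vector can serve as multinomial probabilities. The names `allocation_probs` and `marginal` are hypothetical, and the constant used for $I(y_i)$ in the example is only a placeholder, not the paper's expression.

```python
import math

def normal_pdf(y, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at y."""
    return math.exp(-(y - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def allocation_probs(y_i, counts_minus_i, mus, sigma2s, gamma, n, marginal):
    """Normalized vector (pi_i1, ..., pi_ik, pi_ij*) for one observation.

    counts_minus_i[j] is n_{j,-i} for the clusters that stay non-empty once
    observation i is removed; `marginal` is a callable returning I(y_i).
    """
    denom = n + gamma - 1.0
    # Equation (9): existing clusters.
    probs = [
        cnt / denom * normal_pdf(y_i, m, s2)
        for cnt, m, s2 in zip(counts_minus_i, mus, sigma2s)
    ]
    # Equation (10): a new cluster created by y_i.
    probs.append(gamma / denom * marginal(y_i))
    total = sum(probs)
    return [p / total for p in probs]

# Hypothetical state: n = 10 observations, two clusters after removing y_i.
P = allocation_probs(0.2, [5, 4], [0.0, 10.0], [1.0, 1.0],
                     gamma=0.1, n=10, marginal=lambda y: 0.01)
```

The last entry of `P` is the probability that the observation opens a new cluster; it grows with $\gamma$ and with the marginal density of the observation under the prior.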
As is known from the literature, the likelihood function for a mixture model is non-identifiable, i.e., any permutation of the components' labels leads to the same likelihood function (see, for example, [8,11,22,23,24]). Thus, in order to achieve identifiability, we assume that $\mu_1, \ldots, \mu_{k_{\mathbf{c}}}$ are the component means of the clusters and that $\mu_1 < \cdots < \mu_{k_{\mathbf{c}}}$. However, this does not prevent the algorithm described in the next section from being applied with another labeling criterion. Additional discussion about label switching can be found in [22,23].

Integrated SEM Algorithm

Using probabilities given in Equations (9) and (10), we update the allocation indicator variables according to Algorithm 2.
Conditional on a configuration $\mathbf{c}$, we have $k_{\mathbf{c}}$ clusters. We therefore update the parameters of interest according to Algorithm 3. We then join Algorithms 2 and 3 to obtain Algorithm 4.
After the $S$ iterations, we discard the first $B$ iterations as a burn-in. We also consider “jumps” of size $h$, i.e., only one draw in every $h$ is extracted from the original sequence, which yields a sub-sequence of size $H = (S - B)/h$ on which inferences are based. Denote this sub-sequence by $S(H)$.
Let $N_{k_{\mathbf{c}}}(j)$ be the number of times that $k_{\mathbf{c}} = j$ in $S(H)$, for $j \in \{1, \ldots, K_{max}\}$, where $K_{max}$ is the maximum $k_{\mathbf{c}}$ value sampled in the course of the iterations. Thus, $\tilde{P}(k_{\mathbf{c}} = j) = N_{k_{\mathbf{c}}}(j)/H$ is the estimated probability of $k_{\mathbf{c}} = j$. We then take
$$\tilde{k}_{\mathbf{c}} = \arg\max_{1 \leq j \leq K_{max}} \tilde{P}(k_{\mathbf{c}} = j)$$
as the estimate of the number of clusters $k_{\mathbf{c}}$.
Appendix 1 of the SM presents the mathematical expression used to determine a configuration for $\mathbf{c}$ and estimates for the parameters of the clusters, conditional on the estimate $\tilde{k}_{\mathbf{c}}$.
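The burn-in, thinning, and arg-max steps described above can be sketched as follows. The function name `estimate_k`, the convention of keeping the first draw of every block of $h$ after the burn-in, and the toy chain are assumptions made for illustration.

```python
from collections import Counter

def estimate_k(k_chain, burn_in, thin):
    """Discard the burn-in, keep one draw in every `thin`, and return the
    most frequent number of clusters with the estimated probabilities."""
    kept = k_chain[burn_in::thin]
    H = len(kept)
    counts = Counter(kept)
    probs = {k: cnt / H for k, cnt in sorted(counts.items())}
    k_hat = max(probs, key=probs.get)
    return k_hat, probs

# Toy chain of sampled k_c values: S = 17 iterations, B = 5, h = 2.
chain = [1, 1, 1, 1, 1, 2, 3, 3, 2, 3, 3, 4, 3, 3, 2, 3, 3]
k_hat, probs = estimate_k(chain, burn_in=5, thin=2)
```

Here the retained sub-sequence has size $H = (17 - 5)/2 = 6$, and the value $k_{\mathbf{c}} = 3$ receives the highest estimated probability, $4/6$.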
Algorithm 2 Updating c
1: Let $\mathbf{c} = (c_1, \ldots, c_n)$ be the current configuration of the latent allocation variables. Then, update $\mathbf{c}$ as follows.
2: procedure For $i = 1, \ldots, n$:
3:     Remove $c_i$ from the current state $\mathbf{c}$, obtaining $\mathbf{c}_{-i}$ and $k_{\mathbf{c}_{-i}}$;
4:     Generate a variable $Z_i = (Z_{i1}, \ldots, Z_{ik_{\mathbf{c}}}) \sim \text{Multinomial}(1, P_i)$, where $P_i = (\pi_{i1}, \ldots, \pi_{ik_{\mathbf{c}_{-i}}}, \pi_{ij^{*}})$ for $\pi_{ij}$ given in (9) and $\pi_{ij^{*}}$ given in (10), for $j = 1, \ldots, k_{\mathbf{c}_{-i}}$ and $j^{*} = k_{\mathbf{c}_{-i}} + 1$;
5:     If $Z_{ij} = 1$, for $j \in \{1, \ldots, k_{\mathbf{c}_{-i}}\}$, set $c_i = j$ and $n_j = n_{j,-i} + 1$;
6:     If $Z_{ij^{*}} = 1$, set $n_{j^{*}} = 1$ and $k_{\mathbf{c}} = k_{\mathbf{c}_{-i}} + 1$. As this new cluster has just one observation allocated, set the component parameter $\theta_{j^{*}} = \theta_j^{g}$, where $\theta_j^{g}$ is a value generated from the posterior distribution $\pi(\theta_{j^{*}} \mid y_i) \propto f(y_i \mid \theta_{j^{*}}) \pi(\theta_{j^{*}})$, where $f(y_i \mid \theta_{j^{*}})$ is the probability density function for $y_i$ conditional on $\theta_{j^{*}}$ and $\pi(\theta_{j^{*}})$ is the density of the prior distribution for $\theta_{j^{*}}$. Relabel the $k_{\mathbf{c}}$ clusters in order to maintain the adjacency condition. If the component mean $\mu_{j^{*}}$ of the new cluster is such that:
7:      $\mu_{j^{*}} = \min_{1 \leq j \leq k_{\mathbf{c}}} \mu_j$, then set $j^{*} = 1$ and relabel every other cluster $j$ as $j + 1$;
8:      $\mu_{j^{*}} = \max_{1 \leq j \leq k_{\mathbf{c}}} \mu_j$, then set $j^{*} = k_{\mathbf{c}}$ and keep the labels of all other clusters;
9:      $\mu_j < \mu_{j^{*}} < \mu_{j+1}$, for $j \in \{1, \ldots, k_{\mathbf{c}}\}$, then set $j^{*} = j + 1$ and relabel every other cluster whose label is at least $j + 1$ by adding one to its label.
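The three relabeling cases at the end of Algorithm 2 amount to inserting the new cluster into a list of means that is kept sorted. The sketch below illustrates this with the standard `bisect` module; the function name `insert_cluster`, the 0-based labels, and the use of `None` for the observation being reallocated are assumptions made for the example.

```python
import bisect

def insert_cluster(mus, sigma2s, counts, labels, i, mu_new, sigma2_new):
    """Insert a singleton cluster for observation i, keeping the means sorted.

    labels are 0-based and labels[i] is ignored (the observation is being
    reallocated). Clusters at or above the insertion position shift up by one.
    """
    pos = bisect.bisect_left(mus, mu_new)  # covers the min, max and interior cases
    mus.insert(pos, mu_new)
    sigma2s.insert(pos, sigma2_new)
    counts.insert(pos, 1)
    for idx, lab in enumerate(labels):
        if idx != i and lab >= pos:
            labels[idx] = lab + 1
    labels[i] = pos
    return pos

# Hypothetical state: two clusters with means 0 and 10; observation 5 opens
# a new cluster whose generated mean is 4.
mus, sigma2s, counts = [0.0, 10.0], [1.0, 1.0], [3, 2]
labels = [0, 0, 0, 1, 1, None]
pos = insert_cluster(mus, sigma2s, counts, labels, i=5, mu_new=4.0, sigma2_new=2.0)
```

After the call, the new cluster sits between the two existing ones, and the cluster that previously had label 1 now has label 2.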
Algorithm 3 Updating cluster parameters
1: Let $\boldsymbol{\theta}_{k_{\mathbf{c}}} = (\theta_1, \ldots, \theta_{k_{\mathbf{c}}})$ be the current parameter values of the clusters. Conditional on configuration $\mathbf{c}$, get $\boldsymbol{\theta}_{k_{\mathbf{c}}}^{updated} = (\theta_1^{updated}, \ldots, \theta_{k_{\mathbf{c}}}^{updated})$ as follows:
2: if cluster $j$ is such that $n_j > 1$, then set $\theta_j^{updated} = \hat{\theta}_j$, where $\hat{\theta}_j$ are the maximum likelihood estimates for the j-th cluster;
3: if cluster $j$ is such that $n_j = 1$, then generate $\theta_j^{g}$ from the conditional posterior distribution $\pi(\theta_j \mid y_i)$ and set $\theta_j^{updated} = \theta_j^{g}$;
4: Set $\boldsymbol{\theta}_{k_{\mathbf{c}}} = \boldsymbol{\theta}_{k_{\mathbf{c}}}^{updated}$ only if the adjacency condition $\mu_1^{updated} < \cdots < \mu_{k_{\mathbf{c}}}^{updated}$ is met. Otherwise, keep $\boldsymbol{\theta}_{k_{\mathbf{c}}}$ as the current value.
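A compact sketch of Algorithm 3 for the univariate normal case is given below. It is illustrative only: `sample_singleton` is a hypothetical user-supplied callable standing in for a draw from the conditional posterior given one observation, every cluster is assumed to be non-empty, and the variance uses the divisor $n_j$.

```python
def update_parameters(y, labels, mus, sigma2s, sample_singleton):
    """One pass of the parameter update, given 0-based allocations `labels`.

    `sample_singleton(y_i)` returns a draw (mu, sigma2) from the posterior
    given a single observation. The update is accepted only if the new
    means are strictly increasing; otherwise the current values are kept.
    """
    new_mus, new_s2 = [], []
    for j in range(len(mus)):
        yj = [yi for yi, lab in zip(y, labels) if lab == j]
        if len(yj) > 1:
            # Cluster with at least two observations: maximum likelihood estimates.
            mean_j = sum(yj) / len(yj)
            new_mus.append(mean_j)
            new_s2.append(sum((yi - mean_j) ** 2 for yi in yj) / len(yj))
        else:
            # Singleton cluster: parameters generated from the posterior.
            m, v = sample_singleton(yj[0])
            new_mus.append(m)
            new_s2.append(v)
    ordered = all(a < b for a, b in zip(new_mus, new_mus[1:]))
    return (new_mus, new_s2) if ordered else (list(mus), list(sigma2s))

# Hypothetical data: three observations in cluster 0 and a singleton cluster 1.
y = [0.0, 1.0, 2.0, 9.0]
labels = [0, 0, 0, 1]
accepted = update_parameters(y, labels, [0.5, 8.0], [1.0, 1.0], lambda yi: (yi, 1.0))
rejected = update_parameters(y, labels, [0.5, 8.0], [1.0, 1.0], lambda yi: (-5.0, 1.0))
```

In the second call the generated mean for the singleton falls below the mean of the first cluster, so the ordering condition fails and the current parameter values are returned unchanged.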
Algorithm 4 ISEM Algorithm
1: Initialize the algorithm with a configuration $\mathbf{c}^{(0)} = (c_1^{(0)}, \ldots, c_n^{(0)})$ for the allocation indicator variables.
2: procedure For the s-th iteration of the algorithm, $s = 1, \ldots, S$, do the following.
3:     Conditional on $\mathbf{c}^{(s-1)}$, update the parameters of the clusters according to Algorithm 3;
4:     Obtain a new configuration $\mathbf{c}^{(s)}$ for the allocation indicator variables using Algorithm 2.

4. Simulation Study

In this section, we describe the results from a simulation study carried out to verify the performance of the proposed algorithm. To generate the artificial datasets, we considered univariate normal mixture models. We set the number of clusters and the parameter values according to the values specified in Table 2. We also fixed the sample size at $n = 200$.
The procedure for generating the datasets is given by the following steps:
(i)
For $i = 1, \ldots, n$, generate $U_i \sim U(0, 1)$; if $\sum_{l=0}^{j-1} w_l < u_i \leq \sum_{l=0}^{j} w_l$, generate $Y_i \sim N(\mu_j, \sigma_j^2)$, with parameter values fixed according to Table 2, for $w_0 = 0$ and $j = 1, \ldots, k_{\mathbf{c}}$.
(ii)
In order to record from which component each observation is generated, we define $G = (G_1, \ldots, G_n)$ such that $G_i = j$ if $Y_i \sim N(\mu_j, \sigma_j^2)$, for $i = 1, \ldots, n$ and $j = 1, \ldots, k_{\mathbf{c}}$.
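The two generation steps can be sketched as below. The weights, means, and variances used in the example are hypothetical stand-ins, since the values of Table 2 are not reproduced here, and the function name `generate_mixture` is an illustrative choice.

```python
import math
import random

def generate_mixture(n, weights, mus, sigma2s, rng):
    """Draw n values from a univariate normal mixture.

    Returns the sample y and the vector g recording the (0-based) component
    from which each observation was generated.
    """
    y, g = [], []
    for _ in range(n):
        u = rng.random()
        cum = 0.0
        j = len(weights) - 1  # fallback guards against rounding in the cumulative sum
        for idx, w in enumerate(weights):
            cum += w
            if u <= cum:
                j = idx
                break
        y.append(rng.gauss(mus[j], math.sqrt(sigma2s[j])))
        g.append(j)
    return y, g

# Hypothetical two-component setting with n = 200.
rng = random.Random(1)
y, g = generate_mixture(200, [0.5, 0.5], [0.0, 6.0], [1.0, 1.0], rng)
```

Keeping `g` alongside `y` makes it possible to compare the clusters recovered by an algorithm with the true generating components.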
Having generated the datasets, we need to define the probability of creating a new cluster and the posterior distribution for $\theta_{j^{*}} = (\mu_{j^{*}}, \sigma_{j^{*}}^2)$ given $y_i$, for $i = 1, \ldots, n$. For this, consider the following conjugate prior distributions for the component parameters $\theta_j = (\mu_j, \sigma_j^2)$:
$$\mu_j \mid \sigma_j^2, \mu_0, \lambda \sim N\left(\mu_0, \frac{\sigma_j^2}{\lambda}\right) \quad \text{and} \quad \sigma_j^{-2} \mid \alpha, \beta \sim \Gamma(\alpha, \beta),$$
where $\mu_0$, $\lambda$, $\alpha$, and $\beta$ are hyperparameters. The parametrization of the gamma distribution is such that the mean is $\alpha/\beta$ and the variance is $\alpha/\beta^2$.
Following [11,24], we use the following procedure to define the values of the hyperparameters. Let $R$ be the length of the observed range of the data and $\varepsilon$ its midpoint. Then, we set $\mu_0 = \varepsilon$ and $E(\sigma_j^{-2}) = R^{-2}$. Thus, we obtain $\beta = \alpha R^2$, and we fix $\alpha = 1$. In addition, to obtain a prior distribution with a large variance, we fix $\lambda = 10^{-2}$, and for the hyperparameter $\gamma$, we use the value $\gamma = 0.1$.
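Thus, the probability of creating a new cluster is given by Equation (10), in which
$$I(y_i) = \left[\frac{\lambda}{2 \beta \pi (1 + \lambda)}\right]^{\frac{1}{2}} \frac{\Gamma(\alpha + 1)}{\Gamma(\alpha)} \left[1 + \frac{y_i^2 + \lambda \mu_0^2}{2\beta} - \frac{(y_i + \lambda \mu_0)^2}{2\beta(1 + \lambda)}\right]^{-\left(\alpha + \frac{1}{2}\right)},$$
and $j^{*} = k_{\mathbf{c}} + 1$, for $i = 1, \ldots, n$.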
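The bracketed term in $I(y_i)$ can be written more compactly. The identity below is elementary algebra, offered only as a reading aid and not as part of the original derivation; it reads the bracketed term as $1$ plus $(y_i^2 + \lambda\mu_0^2)/(2\beta)$ minus $(y_i + \lambda\mu_0)^2/(2\beta(1+\lambda))$. Expanding over the common denominator $1 + \lambda$,

$$y_i^2 + \lambda \mu_0^2 - \frac{(y_i + \lambda \mu_0)^2}{1 + \lambda} = \frac{(1 + \lambda)(y_i^2 + \lambda \mu_0^2) - (y_i + \lambda \mu_0)^2}{1 + \lambda} = \frac{\lambda (y_i - \mu_0)^2}{1 + \lambda},$$

so that

$$1 + \frac{y_i^2 + \lambda \mu_0^2}{2\beta} - \frac{(y_i + \lambda \mu_0)^2}{2\beta(1 + \lambda)} = 1 + \frac{\lambda (y_i - \mu_0)^2}{2\beta(1 + \lambda)}.$$

In this form, $I(y_i)$ depends on $y_i$ only through its squared distance from the prior mean $\mu_0$, so observations far from $\mu_0$ have a smaller marginal density.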
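When a new cluster is created, the new parameter values $\theta_{j^{*}} = (\mu_{j^{*}}, \sigma_{j^{*}}^2)$ are generated from the following conditional posterior distributions,
$$\mu_{j^{*}} \mid \sigma_{j^{*}}^2, y_i, \mathbf{c}, \mu_0, \lambda \sim N\left(\frac{y_i + \lambda \mu_0}{1 + \lambda}, \frac{\sigma_{j^{*}}^2}{1 + \lambda}\right)$$
and
$$\sigma_{j^{*}}^{-2} \mid y_i, \mathbf{c}, \alpha, \beta \sim \Gamma\left(\alpha + 1, \; \beta + \frac{y_i^2 + \lambda \mu_0^2}{2} - \frac{(y_i + \lambda \mu_0)^2}{2(1 + \lambda)}\right),$$
for $j^{*} = k_{\mathbf{c}_{-i}} + 1$.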
We ran the ISEM algorithm with $S = 55{,}000$, $B = 5000$, and $h = 10$. With these values, we obtained a sub-sequence $S(H)$ of size 5000 on which to base inferences. The algorithm was initialized with $k_{\mathbf{c}} = 1$ and parameter values $\mu_1 = \bar{y}$ and $\sigma_1^2 = s^2$, the sample mean and variance of the generated dataset, respectively.
We also applied to the generated datasets the SEM algorithm, as described in Section 2, and the RJ algorithm as proposed by [8]. In order to choose the number of clusters using the SEM algorithm, we used the AIC and BIC model selection criteria. In addition, the algorithm was initialized using a configuration $\mathbf{c}^{(0)}$ obtained via the k-means algorithm [25]. As the stopping criterion, we set the threshold $\epsilon = 0.001$. For the RJ algorithm, we used the same number of iterations, burn-in, and thinning interval as in the ISEM algorithm.
In order to compare the three algorithms in terms of the estimation of the number of clusters, we consider $M = 500$ simulated datasets. Table 3 shows the proportion of times that the ISEM and RJ algorithms put the highest estimated probability on each of the $k_{\mathbf{c}}$ values presented. This table also shows the proportion of times that the AIC and BIC indicated each $k_{\mathbf{c}}$ value as the best among the tested values. The values highlighted in bold are the proportions for the true value of $k_{\mathbf{c}}$. As one can note, the ISEM shows a better performance, i.e., a higher proportion for the true value of $k_{\mathbf{c}}$ than the other two algorithms, especially in relation to the SEM algorithm with the selection of $k_{\mathbf{c}}$ via the AIC and BIC. The results also show that the AIC and BIC model selection criteria have a low success rate, with a proportion for the true value of $k_{\mathbf{c}}$ smaller than 0.50.

4.1. Results from a Single Simulated Data Set

We also analyze the results from a single dataset selected at random from the $M = 500$ generated datasets in each of the situations $A_1$ to $A_4$. We then discuss the convergence of the ISEM and RJ algorithms based on the sample generated across iterations, using graphical tools. In general, the graphical tools show whether the simulated chain stabilizes in some sense and provide useful feedback about the convergence [26].
Table 4 shows the estimated probabilities of $k_{\mathbf{c}}$ obtained with ISEM and RJ and the AIC and BIC values from the SEM algorithm for the selected dataset. In this table, the values highlighted in bold are the highest estimated probabilities and the smallest AIC and BIC values. As we can note, the ISEM algorithm assigns the maximum probability to the true value of $k_{\mathbf{c}}$ in the four simulated cases.
The RJ algorithm puts the highest probability on the true value of $k_{\mathbf{c}}$ for datasets $A_1$ and $A_2$. However, the probability on the true value of $k_{\mathbf{c}}$ is smaller than that estimated by ISEM. This indicates a higher precision for the ISEM algorithm. For datasets $A_3$ and $A_4$, the RJ attributes the maximum probability to the wrong values, $k_{\mathbf{c}} = 5$ and $k_{\mathbf{c}} = 6$, respectively. Moreover, the probabilities estimated by RJ do not single out one value of $k_{\mathbf{c}}$ as the best, since different values of $k_{\mathbf{c}}$ have similar probabilities. For example, for dataset $A_2$, the maximum is at $k_{\mathbf{c}} = 3$ with $P(k_{\mathbf{c}} = 3 \mid \cdot) = 0.3836$, but one can argue that the estimated probabilities favor $k_{\mathbf{c}} = 3$ or $k_{\mathbf{c}} = 4$. For dataset $A_3$, there is similar support for $k_{\mathbf{c}}$ between 4 and 7, and for $A_4$ between 5 and 7.
As with ISEM and RJ, the AIC and BIC model selection criteria indicate the true value of $k_{\mathbf{c}}$ as the best value for datasets $A_1$ and $A_2$. For dataset $A_3$, similarly to the RJ, the AIC indicates the wrong value $k_{\mathbf{c}} = 5$ as the best value, while the BIC indicates the true value of $k_{\mathbf{c}}$. For dataset $A_4$, the AIC and BIC indicate the wrong value $k_{\mathbf{c}} = 6$ as the best model.

4.2. An Empirical Check of the Convergence

We now empirically check the convergence of the sequence of the probability for k c across iterations, the capacity to move for different values of k c in the course of the iterations, and the estimated autocorrelation function the (acf) for the ISEM and RJ algorithms.
Figure 1a,d,g,j presents the graphics of the probability for k c in the course of the iterations, for the four simulated datasets. To maintain a better visualization, we plot in these graphics only the three higher P ( k c | · ) estimates. Observing at these figures, it can be seen that the L iterations and the burn-in value B used were adequate to achieve stability for P ( k c | · ) . In addition, Figure 1b,e,h,k shows that the ISEM algorithm mixes well over k c , i.e., “visits” mixture models with different values of k c across iterations. As shown by Figure 1c,f,i,l, the sampled k c values also do not have significant autocorrelation function (ACF). Thus, based on these graphical tools, there is no evidence against the convergence of the generated values by the ISEM algorithm.
Figure 2 shows the performance of the RJ algorithm. The probabilities of k c present a satisfactory stability. The sampled k c values have a satisfactory mix, and the estimated autocorrelation is non-significant. In addition, as can be noted in Figure 2, probabilities for the number of clusters do not differentiate a value of k c in order to be chosen as the better value, as done by ISEM. This may happen due the fact that the performance of the RJ depends on the choice of the transition functions to do “good” jumping, meaning that a transition function that is adequate for one dataset may be not for another one. As the ISEM algorithm does not need the specification of transition functions to propose a change of the k c value, these results shows us that ISEM may be an effective alternative in relation to RJ and SEM algorithms for the joint estimation of k c and the cluster parameters of a mixture model.
Figure 1 in Appendix 2 of the SM shows the generated values for datasets A 1 to A 4 . This Figure also shows the clusters identified by the ISEM algorithm. As can be seen, clusters are satisfactorily identified by the proposed algorithm.
We also compare the ISEM and RJ algorithms in terms of CPU computation time. The simulations were carried out on a MacBook Pro, 2.5 GHz Intel Core i5 dual core, 4 Gb MHz DDR3. Table 5 summarizes the iteration times for the ISEM and RJ algorithms. The column denoted by s.d. gives the standard deviations. For dataset $A_1$, the average time that RJ takes to run one iteration is 1.9083 times the average time that ISEM takes (0.0208 s versus 0.0109 s in Table 5). For datasets $A_2$, $A_3$, and $A_4$, the corresponding ratios are 1.8175, 2.3239, and 1.8932, respectively. These results show a better performance of the ISEM algorithm. The higher iteration times of the RJ algorithm are mainly due to the split–merge step used to improve the mixing of the Markov chain with respect to the number of clusters.
The results from these simulated datasets show that the ISEM algorithm may be an effective alternative to the RJ and SEM algorithms for data clustering in situations where the number of clusters is an unknown quantity.
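A timing comparison of this kind could be set up as in the sketch below. It is only an illustration under stated assumptions: `time_iterations`, `summarize`, and `mean_ratio` are hypothetical helper names, `step` stands for one iteration of whichever algorithm is being timed, and the quartile convention (`method="inclusive"`) is an assumption that may differ from the one used for Table 5.

```python
import statistics
import time

def time_iterations(step, n_iter):
    """Run `step` n_iter times and return the wall-clock time of each call."""
    times = []
    for _ in range(n_iter):
        start = time.perf_counter()
        step()
        times.append(time.perf_counter() - start)
    return times

def summarize(times):
    """Min, quartiles, mean, max, and sample standard deviation of the times."""
    q1, med, q3 = statistics.quantiles(times, n=4, method="inclusive")
    return {"min": min(times), "q1": q1, "median": med,
            "mean": statistics.fmean(times), "q3": q3,
            "max": max(times), "sd": statistics.stdev(times)}

def mean_ratio(times_a, times_b):
    """Ratio of the mean iteration time of algorithm A to that of B."""
    return statistics.fmean(times_a) / statistics.fmean(times_b)

# Ratio of the mean times reported in Table 5 for dataset A2 (RJ / ISEM).
ratio_a2 = 0.0249 / 0.0137  # about 1.8175
```

The ratio is computed on the mean column only; the large maxima in Table 5 suggest that the median may also be worth comparing when a few slow iterations dominate.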

5. Application

The three algorithms are now applied to two real datasets. The first refers to the velocities, in km/s, of n = 82 galaxies from 6 well-separated conic sections of an unfilled survey of the Corona Borealis region. This dataset is known in the literature as the Galaxy data and has already been analyzed by [8,13,22,27], among others. It is available in the R software. The second refers to an acidity index measured in a sample of n = 155 lakes in central-north Wisconsin. This dataset was downloaded from the website https://people.maths.bris.ac.uk/~mapjg/mixdata.
For the application of the ISEM and RJ algorithms, we use the same values $L = 5500$, $B = 5000$, and $h = 10$. Table 6 shows the estimated probabilities for $k_c$ obtained with ISEM and RJ, and the AIC and BIC values from the EM algorithm, for each dataset. The maximum probability from ISEM and RJ and the minimum AIC and BIC values are highlighted in bold.
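As a consistency check on the AIC and BIC columns, recall the standard definitions $\mathrm{AIC} = -2\log\hat{L} + 2p$ and $\mathrm{BIC} = -2\log\hat{L} + p\ln n$, so that $\mathrm{BIC} - \mathrm{AIC} = p(\ln n - 2)$. The sketch below solves this for $p$ using Galaxy values from Table 6. The count $p = 3k - 1$ for a univariate Gaussian mixture with $k$ components ($k$ means, $k$ variances, $k - 1$ free weights) is an assumption here, and the function name is hypothetical.

```python
import math

def implied_num_params(aic, bic, n):
    """Solve BIC - AIC = p * (ln n - 2) for the number of free parameters p."""
    return (bic - aic) / (math.log(n) - 2.0)

n_galaxy = 82  # sample size of the Galaxy data

# (AIC, BIC) pairs transcribed from Table 6 for k = 1, 2, 5.
table6_galaxy = {
    1: (484.6819, 489.4954),
    2: (451.0018, 463.0354),
    5: (410.3666, 444.0607),
}

implied = {k: implied_num_params(a, b, n_galaxy)
           for k, (a, b) in table6_galaxy.items()}

for k, p in implied.items():
    print(k, round(p, 3), 3 * k - 1)
```

For these rows the implied values agree with $3k - 1$ to within rounding, which supports reading the two columns as computed from the same fitted log-likelihood.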
For the Galaxy dataset, the ISEM and RJ algorithms put the highest probability on $k_c = 3$ and $k_c = 5$, respectively. However, as in the simulation study, the probabilities estimated by RJ do not single out one value of $k_c$ as the best value; for this dataset, they indicate a $k_c$ value between 3 and 7. The AIC and BIC also indicate $k_c = 5$ as the best value. For the Acidity dataset, ISEM, AIC, and BIC indicate $k_c = 2$ as the best value, whereas the probabilities estimated by RJ are similar for $k_c = 3$ and $k_c = 4$.
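The selection rules used in this comparison (largest estimated probability for ISEM and RJ, smallest value for AIC and BIC) can be sketched with the Acidity values from Table 6. The helper names are hypothetical, and the $\geq 9$ row is left out because it has no AIC or BIC entry. The margin between the two largest probabilities is one simple way to express how clearly a method separates its preferred $k_c$.

```python
# Values transcribed from Table 6 (Acidity data), for k_c = 1, ..., 8.
isem = [0.0000, 0.7194, 0.2638, 0.0152, 0.0016, 0.0000, 0.0000, 0.0000]
rj = [0.0000, 0.0502, 0.3164, 0.3040, 0.1724, 0.0832, 0.0452, 0.0186]
aic = [455.5740, 380.3449, 382.7395, 382.3660,
       391.7630, 386.1420, 388.1296, 395.3957]
bic = [461.6608, 395.5620, 407.0869, 415.8437,
       434.3709, 437.8802, 448.9981, 465.3945]

def best_by_probability(probs):
    """Return (k_c, margin): the modal k_c and its lead over the runner-up."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[0] + 1, probs[order[0]] - probs[order[1]]

def best_by_criterion(values):
    """Return the k_c that minimizes an information criterion."""
    return min(range(len(values)), key=lambda i: values[i]) + 1

k_isem, margin_isem = best_by_probability(isem)  # k_c = 2, margin about 0.4556
k_rj, margin_rj = best_by_probability(rj)        # k_c = 3, margin about 0.0124
k_aic, k_bic = best_by_criterion(aic), best_by_criterion(bic)  # both 2
```

The small RJ margin corresponds to the statement that RJ assigns similar probabilities to $k_c = 3$ and $k_c = 4$.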
Figure 3 and Figure 4 show the performance of the ISEM and RJ algorithms across iterations for the Galaxy and Acidity datasets. The values sampled by the ISEM algorithm give stable estimated probabilities across iterations, mix well among different $k_c$ values, and show no significant autocorrelation. That is, we have no evidence against the convergence of the chain generated by the ISEM algorithm. For RJ, the sampled values also mix well and show no significant autocorrelation. However, although the values sampled by RJ are stable for $P(k_c)$, the estimated probabilities do not single out one value of $k_c$ as the best value, as ISEM does. This suggests that RJ would need to be run for a greater number of iterations. Hence, for both real datasets, ISEM converges faster than the RJ algorithm.
Table 7 shows a summary of the iteration times for the ISEM and RJ algorithms. For the Galaxy data, the average time that ISEM takes to run an iteration is 0.0053 s, while the average time for RJ is 0.0098 s. That is, the average time that RJ takes to run one iteration is 1.8491 times the average time that ISEM takes. For the Acidity data, the average times that the ISEM and RJ algorithms take to run an iteration are 0.0085 and 0.0188 s, respectively, so the average time per iteration for RJ is 2.2118 times that for ISEM. As in the simulation study, ISEM performs better, i.e., it takes less time to run the iterations.

6. Final Remarks

This article discusses how to estimate the parameters of a mixture model in the context of data clustering. We propose an algorithm, called ISEM, as an alternative to the EM algorithm. This algorithm was developed through an integrated approach in order to allow $k_c$ to be estimated jointly with the other parameters of interest. In the ISEM algorithm, the allocation probabilities depend on the number of clusters $k_c$ and are independent of the number of components $k$ of the mixture model.
In addition, there is a positive probability of a new cluster being created by a single observation. This is an advantage of the algorithm because it creates a new cluster without the need to specify transition functions. The cluster parameters are then updated according to the number of allocated observations. For clusters with at least two observations, the parameter values are set to the maximum likelihood estimates. For a cluster with just one observation, the parameter values are generated from the posterior distribution.
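The update rule summarized above can be sketched for a single Gaussian cluster as follows. This is a hedged illustration rather than the authors' implementation: the normal-inverse-gamma prior and its hyperparameters `m0`, `lam`, `a`, and `b` are assumptions introduced only to make the singleton draw concrete, and `update_cluster` is a hypothetical name.

```python
import random

def update_cluster(ys, rng, m0=0.0, lam=0.01, a=2.0, b=1.0):
    """Return (mu, sigma2) for one Gaussian cluster.

    Two or more observations: maximum likelihood estimates.
    One observation: a draw from the posterior under an ASSUMED prior
    mu | sigma2 ~ N(m0, sigma2 / lam), sigma2 ~ Inverse-Gamma(a, b).
    """
    n = len(ys)
    if n == 0:
        raise ValueError("an empty cluster has no parameters to update")
    if n >= 2:
        mu = sum(ys) / n
        sigma2 = sum((y - mu) ** 2 for y in ys) / n  # MLE uses divisor n
        return mu, sigma2
    # Singleton cluster: conjugate posterior update with one observation.
    y = ys[0]
    lam_n = lam + 1.0
    m_n = (lam * m0 + y) / lam_n
    a_n = a + 0.5
    b_n = b + lam * (y - m0) ** 2 / (2.0 * lam_n)
    # gammavariate takes (shape, scale); the rate b_n becomes scale 1 / b_n.
    sigma2 = 1.0 / rng.gammavariate(a_n, 1.0 / b_n)
    mu = rng.gauss(m_n, (sigma2 / lam_n) ** 0.5)
    return mu, sigma2

rng = random.Random(1)
print(update_cluster([1.0, 3.0], rng))  # (2.0, 1.0): the MLE branch
print(update_cluster([2.5], rng))       # a random draw for a singleton
```

Drawing from the posterior for a singleton avoids the degenerate zero-variance estimate that maximum likelihood would give with one observation.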
In order to illustrate the performance of the ISEM algorithm, we carried out a simulation study with four scenarios of artificial data generated from Gaussian mixture models. Each of the four scenarios was replicated $M = 500$ times, and the proportion of times that ISEM put the highest probability on the true value of $k_c$ was recorded. We applied this same procedure to the EM algorithm, choosing the number of clusters $k_c$ via the AIC and BIC, and to the RJ algorithm. The three algorithms were then compared in terms of the proportion of times that the true value of $k_c$ was selected as the best value. The results show that the ISEM algorithm outperforms the RJ and SEM algorithms. Moreover, the results also indicate that the AIC and BIC model selection criteria are unreliable for determining the number of clusters in a mixture model, given their low success rates.
We also compared the performance of ISEM and RJ in terms of the empirical convergence of the sequence of generated values, using graphical tools. For this, we selected an artificial dataset at random from each scenario and plotted the probability estimates for $k_c$ across iterations, the generated $k_c$ values, and the estimated autocorrelation of the sampled values (see Figure 1 and Figure 2). Again, the results show a better performance for the ISEM algorithm. While ISEM gives stable probabilities for $k_c$ and singles out the true $k_c$ as the best value, the probabilities estimated by RJ do not single out one value of $k_c$ as the best value.
In order to illustrate the practical use of the proposed algorithm and compare its performance with the SEM and RJ algorithms, we applied the three algorithms to two real datasets: the Galaxy and Acidity datasets. For the Galaxy dataset, ISEM indicates $k_c = 3$ with probability $P(k_c = 3 \mid \cdot) = 0.7024$, while the RJ algorithm, the AIC, and the BIC indicate $k_c = 5$. However, as shown in Figure 3d, the RJ algorithm again does not single out one value of $k_c$, while ISEM singles out $k_c = 3$, and the values generated across iterations are satisfactorily stable. For the Acidity dataset, ISEM, AIC, and BIC indicate $k_c = 2$ as the best value, while RJ attributes similar probabilities to $k_c = 3$ and $k_c = 4$.
As mentioned in the Introduction, the generalization of the proposed algorithm to the multivariate case is the next step of our research. The simulation study and the application were carried out in R, and the computational codes can be obtained by emailing the authors.
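The bookkeeping behind this comparison can be sketched as below: draw one artificial dataset from a Gaussian mixture, then tabulate how often each value of $k_c$ was selected across replicates. The mixture parameters are those listed for $A_1$ in Table 2; the sample size of 200, the seed, the function names, and the short `picked` list are assumptions used only for illustration, not results from the paper.

```python
import random
from collections import Counter

def simulate_mixture(n, means, variances, weights, rng):
    """Draw n values from a univariate Gaussian mixture; return (values, labels)."""
    ys, labels = [], []
    for _ in range(n):
        j = rng.choices(range(len(weights)), weights=weights)[0]
        labels.append(j)
        ys.append(rng.gauss(means[j], variances[j] ** 0.5))
    return ys, labels

def selection_proportions(selected, k_max):
    """Proportion of replicates in which each k_c in 1..k_max was selected."""
    counts = Counter(selected)
    m = len(selected)
    return {k: counts.get(k, 0) / m for k in range(1, k_max + 1)}

rng = random.Random(2019)
# Scenario A1 of Table 2; n = 200 is an assumed sample size.
ys, labels = simulate_mixture(200, [0.0, 3.0], [1.0, 1.0], [0.80, 0.20], rng)

# Hypothetical k_c values selected in 10 replicates (illustrative only).
picked = [2, 2, 3, 2, 1, 2, 2, 2, 2, 2]
props = selection_proportions(picked, k_max=6)
print(props[2])  # 0.8
```

In the study itself, `picked` would hold the value chosen by a given method in each of the $M = 500$ replicates, and the entry of `props` at the true $k_c$ is the success rate reported in Table 3.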

Supplementary Materials

The following are available online at https://www.mdpi.com/1099-4300/21/11/1063/s1.

Author Contributions

E.F.S. and C.A.d.B.P. developed the whole theoretical part of the research. E.F.S., A.K.S. and L.A.M. developed the simulation studies and real data application.

Funding

This research was funded by Conselho Nacional de Desenvolvimento Científico e Tecnológico, CNPq, grant number 308776/2014-3.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bouveyron, C.; Brunet, C. Model-based clustering of high-dimensional data: A review. Comput. Stat. Data Anal. 2014, 71, 52–78.
  2. Oh, M.S.; Raftery, A.E. Model-based clustering with dissimilarities: A Bayesian approach. J. Comput. Graph. Stat. 2007, 16, 559–585.
  3. Dempster, A.; Laird, N.; Rubin, D. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B 1977, 39, 1–38.
  4. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723.
  5. Bozdogan, H. Model selection and Akaike’s information criterion (AIC): The general theory and its analytical extensions. Psychometrika 1987, 52, 345–370.
  6. Schwarz, G.E. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
  7. Saraiva, E.F.; Suzuki, A.K.; Milan, L.A. A Bayesian sparse finite mixture model for clustering data from a heterogeneous population. Braz. J. Probab. Stat. 2019.
  8. Richardson, S.; Green, P.J. On Bayesian analysis of mixtures with an unknown number of components. J. R. Stat. Soc. Ser. B 1997, 59, 731–792; errata: J. R. Stat. Soc. Ser. B 1998, 60, U3.
  9. Jain, S.; Neal, R. A split–merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. J. Comput. Graph. Stat. 2004, 13, 158–182.
  10. Jain, S.; Neal, R. Splitting and merging components of a nonconjugate Dirichlet process mixture model. Bayesian Anal. 2007, 2, 445–472.
  11. Saraiva, E.F.; Pereira, C.A.B.; Suzuki, A.K. A data-driven selection of the number of clusters in the Dirichlet allocation model via Bayesian mixture modelling. J. Stat. Comput. Simul. 2019, 89, 2848–2870.
  12. Diebolt, J.; Robert, C. Estimation of finite mixture distributions by Bayesian sampling. J. R. Stat. Soc. B 1994, 56, 363–375.
  13. Escobar, M.D.; West, M. Bayesian density estimation and inference using mixtures. J. Am. Stat. Assoc. 1995, 90, 577–588.
  14. Dellaportas, P.; Papageorgiou, I. Multivariate mixtures of normals with unknown number of components. Stat. Comput. 2006, 16, 57–68.
  15. McLachlan, G.; Peel, D. Finite Mixture Models; Wiley Interscience: New York, NY, USA, 2000.
  16. Finch, S.; Mendell, N.; Thode, H. Probabilistic measures of adequacy of a numerical search for a global maximum. J. Am. Stat. Assoc. 1989, 84, 1020–1023.
  17. Karlis, D.; Xekalaki, E. Choosing initial values for the EM algorithm for finite mixtures. Comput. Stat. Data Anal. 2003, 41, 577–590.
  18. Biernacki, C.; Celeux, G.; Govaert, G. Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Comput. Stat. Data Anal. 2003, 41, 561–575.
  19. Ferguson, T.S. A Bayesian analysis of some nonparametric problems. Ann. Stat. 1973, 1, 209–230.
  20. Blackwell, D.; MacQueen, J.B. Ferguson distributions via Pólya urn schemes. Ann. Stat. 1973, 1, 353–355.
  21. Antoniak, C.E. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann. Stat. 1974, 2, 1152–1174.
  22. Stephens, M. Dealing with label switching in mixture models. J. R. Stat. Soc. B 2000, 62, 795–809.
  23. Jasra, A.; Holmes, C.C.; Stephens, D.A. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Stat. Sci. 2005, 20, 50–67.
  24. Saraiva, E.F.; Suzuki, A.K.; Louzada, F.; Milan, L.A. Partitioning gene expression data by data-driven Markov chain Monte Carlo. J. Appl. Stat. 2016, 43, 1155–1173.
  25. MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics; University of California Press: Berkeley, CA, USA, 1967; pp. 281–297.
  26. Sinharay, S. Assessing convergence of the Markov chain Monte Carlo algorithms: A review. ETS Res. Rep. Ser. 2003, 2003, i–52. Available online: http://www.ets.org/Media/Research/pdf/RR-03-07-Sinharay.pdf (accessed on 19 August 2019).
  27. Roeder, K.; Wasserman, L. Practical Bayesian density estimation using mixtures of normals. J. Am. Stat. Assoc. 1997, 92, 894–902.
Figure 1. Performance of the ISEM algorithm across iterations.
Figure 2. Performance of the RJ algorithm across iterations.
Figure 3. Performance of the ISEM and RJ algorithms for the Galaxy data.
Figure 4. Performance of the ISEM and RJ algorithms for the Acidity data.
Table 1. Main mathematical notation used throughout the paper.

| Notation | Description |
|---|---|
| $k$ | Number of components |
| $k_c$ | Number of clusters |
| $\theta_j$ | Parameter of the $j$-th component, for $j = 1, \ldots, k$ |
| $\boldsymbol{\theta}_k = (\theta_1, \ldots, \theta_k)$ | The whole vector of parameters |
| $w_j$ | Weight of the $j$-th component, for $j = 1, \ldots, k$ |
| $Y_i$ | The $i$-th sampled value, for $i = 1, \ldots, n$ |
| $c_i$ | The $i$-th indicator variable, for $i = 1, \ldots, n$ |
| $\mathbf{y} = (y_1, \ldots, y_n)$ | The vector of independent observations |
| $\mathbf{c} = (c_1, \ldots, c_n)$ | The vector of latent indicator variables |
| $k_c^{-i}$ | Number of clusters excluding the $i$-th observation |
| $n_{j,-i}$ | Number of observations assigned to the $j$-th component, excluding the $i$-th observation |
Table 2. Number of clusters and parameter values used for simulating the datasets.

| Artificial Dataset | Number of Clusters | Parameter Values |
|---|---|---|
| $A_1$ | $k_c = 2$ | $\mu_1 = 0$, $\mu_2 = 3$; $\sigma_1^2 = 1$, $\sigma_2^2 = 1$; $w_1 = 0.80$, $w_2 = 0.20$ |
| $A_2$ | $k_c = 3$ | $\mu_1 = 6$, $\mu_2 = 0$, $\mu_3 = 4$; $\sigma_1^2 = 3$, $\sigma_2^2 = 2$, $\sigma_3^2 = 1$; $w_1 = 0.50$, $w_2 = 0.30$, $w_3 = 0.20$ |
| $A_3$ | $k_c = 4$ | $\mu_1 = 6$, $\mu_2 = 0$, $\mu_3 = 7$, $\mu_4 = 14$; $\sigma_1^2 = 1$, $\sigma_2^2 = 2$, $\sigma_3^2 = 2$, $\sigma_4^2 = 1$; $w_1 = 0.10$, $w_2 = 0.40$, $w_3 = 0.40$, $w_4 = 0.10$ |
| $A_4$ | $k_c = 5$ | $\mu_1 = 13$, $\mu_2 = 7$, $\mu_3 = 0$, $\mu_4 = 6$, $\mu_5 = 11$; $\sigma_1 = 1$, $\sigma_2 = 2$, $\sigma_3 = 3$, $\sigma_4 = 2$, $\sigma_5 = 1$; $w_1 = 0.15$, $w_2 = 0.20$, $w_3 = 0.30$, $w_4 = 0.20$, $w_5 = 0.15$ |
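Table 3. Proportion of times the algorithms chose the $k_c$ values as the number of clusters.

| Data Set | True $k_c$ | $k_c$ | $\tilde{P}(k_c = j \mid \cdot)$, ISEM | $\tilde{P}(k_c = j \mid \cdot)$, RJ | AIC | BIC |
|---|---|---|---|---|---|---|
| $A_1$ | 2 | 1 | 0.014 | 0.002 | 0.050 | 0.210 |
| $A_1$ | 2 | 2 | 0.976 | 0.972 | 0.294 | 0.448 |
| $A_1$ | 2 | 3 | 0.010 | 0.026 | 0.238 | 0.224 |
| $A_1$ | 2 | 4 | 0.000 | 0.000 | 0.152 | 0.082 |
| $A_1$ | 2 | 5 | 0.000 | 0.000 | 0.148 | 0.028 |
| $A_1$ | 2 | 6 | 0.000 | 0.000 | 0.118 | 0.008 |
| $A_2$ | 3 | 1 | 0.000 | 0.000 | 0.000 | 0.004 |
| $A_2$ | 3 | 2 | 0.276 | 0.094 | 0.104 | 0.438 |
| $A_2$ | 3 | 3 | 0.720 | 0.672 | 0.304 | 0.384 |
| $A_2$ | 3 | 4 | 0.004 | 0.232 | 0.262 | 0.138 |
| $A_2$ | 3 | 5 | 0.000 | 0.002 | 0.184 | 0.028 |
| $A_2$ | 3 | 6 | 0.000 | 0.000 | 0.146 | 0.008 |
| $A_3$ | 4 | 1 | 0.000 | 0.000 | 0.000 | 0.000 |
| $A_3$ | 4 | 2 | 0.000 | 0.004 | 0.000 | 0.000 |
| $A_3$ | 4 | 3 | 0.000 | 0.000 | 0.010 | 0.066 |
| $A_3$ | 4 | 4 | 0.956 | 0.476 | 0.226 | 0.450 |
| $A_3$ | 4 | 5 | 0.044 | 0.474 | 0.252 | 0.296 |
| $A_3$ | 4 | 6 | 0.000 | 0.044 | 0.214 | 0.122 |
| $A_3$ | 4 | 7 | 0.000 | 0.000 | 0.184 | 0.056 |
| $A_3$ | 4 | 8 | 0.000 | 0.002 | 0.114 | 0.010 |
| $A_4$ | 5 | 1 | 0.000 | 0.000 | 0.000 | 0.000 |
| $A_4$ | 5 | 2 | 0.006 | 0.000 | 0.000 | 0.006 |
| $A_4$ | 5 | 3 | 0.006 | 0.000 | 0.000 | 0.018 |
| $A_4$ | 5 | 4 | 0.218 | 0.010 | 0.038 | 0.210 |
| $A_4$ | 5 | 5 | 0.682 | 0.509 | 0.322 | 0.446 |
| $A_4$ | 5 | 6 | 0.028 | 0.442 | 0.246 | 0.222 |
| $A_4$ | 5 | 7 | 0.000 | 0.039 | 0.210 | 0.072 |
| $A_4$ | 5 | 8 | 0.000 | 0.000 | 0.184 | 0.026 |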
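Table 4. Estimated probability for $k_c$.

| Data Set | True $k_c$ | $k_c$ | $\tilde{P}(k_c = j \mid \cdot)$, ISEM | $\tilde{P}(k_c = j \mid \cdot)$, RJ | AIC | BIC |
|---|---|---|---|---|---|---|
| $A_1$ | 2 | 1 | 0.0000 | 0.0000 | 786.7166 | 793.3133 |
| $A_1$ | 2 | 2 | 0.9006 | 0.5252 | 762.5204 | 779.0120 |
| $A_1$ | 2 | 3 | 0.0962 | 0.2862 | 764.1440 | 790.5305 |
| $A_1$ | 2 | 4 | 0.0032 | 0.1138 | 769.2648 | 805.5463 |
| $A_1$ | 2 | 5 | 0.0000 | 0.0466 | 768.0492 | 814.2256 |
| $A_1$ | 2 | 6 | 0.0000 | 0.0160 | 775.1082 | 831.1796 |
| $A_1$ | 2 | ≥7 | 0.0000 | 0.0122 | - | - |
| $A_2$ | 3 | 1 | 0.0000 | 0.0004 | 1160.758 | 1167.355 |
| $A_2$ | 3 | 2 | 0.0122 | 0.0136 | 1129.981 | 1146.472 |
| $A_2$ | 3 | 3 | 0.8694 | 0.3836 | 1114.024 | 1140.411 |
| $A_2$ | 3 | 4 | 0.1124 | 0.3140 | 1118.789 | 1155.070 |
| $A_2$ | 3 | 5 | 0.0058 | 0.1716 | 1120.108 | 1166.284 |
| $A_2$ | 3 | 6 | 0.0002 | 0.0744 | 1130.558 | 1186.630 |
| $A_2$ | 3 | ≥7 | 0.0000 | 0.0424 | - | - |
| $A_3$ | 4 | 1 | 0.0000 | 0.0000 | 1273.886 | 1280.482 |
| $A_3$ | 4 | 2 | 0.0000 | 0.0000 | 1276.281 | 1292.773 |
| $A_3$ | 4 | 3 | 0.0000 | 0.0002 | 1251.357 | 1277.743 |
| $A_3$ | 4 | 4 | 0.8412 | 0.1696 | 1188.470 | 1224.751 |
| $A_3$ | 4 | 5 | 0.1500 | 0.3014 | 1186.075 | 1232.252 |
| $A_3$ | 4 | 6 | 0.0088 | 0.2400 | 1191.747 | 1247.818 |
| $A_3$ | 4 | 7 | 0.0000 | 0.1632 | 1197.028 | 1262.995 |
| $A_3$ | 4 | 8 | 0.0000 | 0.0816 | 1200.337 | 1276.199 |
| $A_3$ | 4 | ≥9 | 0.0000 | 0.0440 | - | - |
| $A_4$ | 5 | 1 | 0.0000 | 0.0002 | 1416.124 | 1422.721 |
| $A_4$ | 5 | 2 | 0.0000 | 0.0004 | 1388.738 | 1405.230 |
| $A_4$ | 5 | 3 | 0.0000 | 0.0028 | 1358.474 | 1384.861 |
| $A_4$ | 5 | 4 | 0.0014 | 0.0114 | 1357.037 | 1393.318 |
| $A_4$ | 5 | 5 | 0.8340 | 0.2788 | 1355.922 | 1402.098 |
| $A_4$ | 5 | 6 | 0.1520 | 0.3004 | 1325.927 | 1381.998 |
| $A_4$ | 5 | 7 | 0.0124 | 0.2224 | 1331.940 | 1397.907 |
| $A_4$ | 5 | 8 | 0.0002 | 0.0186 | 1331.352 | 1407.213 |
| $A_4$ | 5 | ≥9 | 0.0000 | 0.0750 | - | - |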
Table 5. Times of the iterations, in seconds.

| Artificial Dataset | Algorithm | Min. | 1st Q. | Med. | Mean | 3rd Q. | Max. | s.d. |
|---|---|---|---|---|---|---|---|---|
| $A_1$ | ISEM | 0.0064 | 0.0082 | 0.0091 | 0.0109 | 0.0105 | 0.4987 | 0.0107 |
| $A_1$ | RJ | 0.0032 | 0.0137 | 0.0158 | 0.0208 | 0.0202 | 0.3855 | 0.0174 |
| $A_2$ | ISEM | 0.0055 | 0.0100 | 0.0114 | 0.0137 | 0.0146 | 0.3806 | 0.0108 |
| $A_2$ | RJ | 0.0032 | 0.0169 | 0.0196 | 0.0249 | 0.0243 | 0.7709 | 0.0181 |
| $A_3$ | ISEM | 0.0059 | 0.0112 | 0.0123 | 0.0142 | 0.0139 | 0.4951 | 0.0100 |
| $A_3$ | RJ | 0.0020 | 0.0218 | 0.0255 | 0.0330 | 0.0320 | 0.4785 | 0.0239 |
| $A_4$ | ISEM | 0.0059 | 0.0130 | 0.0146 | 0.0179 | 0.0187 | 0.5149 | 0.0108 |
| $A_4$ | RJ | 0.0026 | 0.0232 | 0.0266 | 0.0339 | 0.0323 | 0.5490 | 0.0231 |
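Table 6. Estimated probabilities for $k_c$, real datasets.

| Data Set | $k_c$ | $\tilde{P}(k_c = j \mid \cdot)$, ISEM | $\tilde{P}(k_c = j \mid \cdot)$, RJ | AIC | BIC |
|---|---|---|---|---|---|
| Galaxy | 1 | 0.0000 | 0.0000 | 484.6819 | 489.4954 |
| Galaxy | 2 | 0.0000 | 0.0008 | 451.0018 | 463.0354 |
| Galaxy | 3 | 0.7024 | 0.1200 | 426.7421 | 445.09959 |
| Galaxy | 4 | 0.2748 | 0.2530 | 427.4915 | 453.9654 |
| Galaxy | 5 | 0.0222 | 0.2592 | 410.3666 | 444.0607 |
| Galaxy | 6 | 0.0006 | 0.1848 | 413.7755 | 454.6897 |
| Galaxy | 7 | 0.0000 | 0.1084 | 422.1793 | 470.3137 |
| Galaxy | 8 | 0.0000 | 0.0472 | 423.5542 | 478.9088 |
| Galaxy | ≥9 | 0.0000 | 0.0226 | - | - |
| Acidity | 1 | 0.0000 | 0.0000 | 455.5740 | 461.6608 |
| Acidity | 2 | 0.7194 | 0.0502 | 380.3449 | 395.5620 |
| Acidity | 3 | 0.2638 | 0.3164 | 382.7395 | 407.0869 |
| Acidity | 4 | 0.0152 | 0.3040 | 382.3660 | 415.8437 |
| Acidity | 5 | 0.0016 | 0.1724 | 391.7630 | 434.3709 |
| Acidity | 6 | 0.0000 | 0.0832 | 386.1420 | 437.8802 |
| Acidity | 7 | 0.0000 | 0.0452 | 388.1296 | 448.9981 |
| Acidity | 8 | 0.0000 | 0.0186 | 395.3957 | 465.3945 |
| Acidity | ≥9 | 0.0000 | 0.0010 | - | - |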
Table 7. Iteration times in seconds.

| Dataset | Algorithm | Min. | 1st Q. | Med. | Mean | 3rd Q. | Max. | s.d. |
|---|---|---|---|---|---|---|---|---|
| Galaxy | ISEM | 0.0023 | 0.0038 | 0.0045 | 0.0053 | 0.0054 | 0.2468 | 0.0062 |
| Galaxy | RJ | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Acidity | ISEM | 0.0055 | 0.0100 | 0.0114 | 0.0137 | 0.0146 | 0.3806 | 0.0108 |
| Acidity | RJ | 0.0046 | 0.0128 | 0.0149 | 0.0188 | 0.0180 | 0.4588 | 0.0160 |