Mixture Complexity and Its Application to Gradual Clustering Change Detection

We consider measuring the number of clusters (cluster size) in finite mixture models in order to interpret their structures. Many existing information criteria have been applied to this issue by regarding it as the same as the number of mixture components (mixture size); however, this may not be valid in the presence of overlaps or weight biases. In this study, we argue that the cluster size should be measured as a continuous value and propose a new criterion called mixture complexity (MC) to formulate it. MC is formally defined from the viewpoint of information theory and can be seen as a natural extension of the cluster size that accounts for overlap and weight bias. Subsequently, we apply MC to the issue of gradual clustering change detection. Conventionally, clustering changes have been regarded as abrupt, induced by changes in the mixture size or cluster size. In contrast, we consider clustering changes to be gradual in terms of MC; this has the benefits of detecting changes earlier and of discerning significant from insignificant changes. We further demonstrate that MC can be decomposed according to the hierarchical structure of the mixture model; this helps us analyze the details of substructures.


Motivation
Finite mixture models are widely used for model-based clustering (for overviews and references see McLachlan and Peel [1] and Fraley and Raftery [2]). In this field, determining the number of components is a typical issue. This number has two aspects: the number of elements used to represent the density distribution and the number of clusters used to group the data (referred to as the mixture size and the cluster size, respectively). In this study, we consider the problem of interpreting the cluster size when the mixture size is given. Many existing information criteria have been applied to this issue by regarding the cluster size as the same as the mixture size; however, this may not be valid when the components overlap or have biased weights. Therefore, we need to reconsider the definition and meaning of the cluster size.
For instance, let us observe three cases of the Gaussian mixture model, as shown in Figure 1. Although the mixture size is two in every case, the situations differ. In case (a), the two components are distinct from each other and their weights are not biased; therefore, it is sound to believe that the cluster size is two as well. Meanwhile, in case (b), although their weights are not biased, the two components are very close to each other; then, as proposed in the work of Hennig [3], we may need to regard them as one cluster by merging them. In case (c), although the two components are distinct from each other, their weights are biased; as proposed by Jiang et al. [4] and He et al. [5], we may need to regard the small component as outliers rather than as a cluster. Overall, in cases (b) and (c), it is harder to say that the cluster size is exactly two than in case (a). This observation gives rise to the problem of formally defining a complexity of clustering structures that reflects overlaps and weight biases. This paper introduces the novel concept of mixture complexity (MC) to resolve this problem. It is related to the logarithm of the cluster size: for example, the exponentials of the MC are 2.00, 1.39, and 1.21 for cases (a), (b), and (c), respectively. In other words, given the mixture size, MC estimates the cluster size continuously rather than discretely. There are two reasons for the need for MC. First, it theoretically evaluates the cluster size in a finite mixture model, accounting for the overlap and imbalance between the components. Although the impacts of overlap and imbalance on the cluster size have previously been discussed independently, we present a unified framework that interprets the cluster size with a single continuous index. This presents a new perspective on model-based clustering and can be practically applied to cluster merging or clustering-based outlier detection. Second, MC can be applied to the issue of gradual clustering change detection.
Conventionally, clustering changes have been considered to be abrupt, induced by changes in the mixture size or cluster size. In reality, however, there are cases where mechanisms for generating data change gradually (or incrementally in the context of concept drifts [6]). We thereby present a new methodology for tracking such changes by observing MC's changes.
We further show that MC can be used to quantify the cluster size in hierarchical mixture models. We demonstrate that the MC of a hierarchical mixture model can be decomposed into the sum of MCs for local mixture models. It enables us to evaluate the complexity of the substructures as well as the entire structure.
The concept of MC has been applied to the cluster merging problem in [7]. This study further investigates the theoretical properties of MC and proposes its new application to the issue of gradual clustering change detection.

Significance and Novelty
The significance and novelty of this paper are summarized below.

Mixture Complexity for Finite Mixture Models
We introduce the novel concept of MC to continuously measure the cluster size in a mixture model. It is formally defined from the viewpoint of information theory and can be interpreted as a natural extension of the cluster size that accounts for the overlaps and weight biases among the components. We further demonstrate that MC can be decomposed into a sum of MCs according to the mixture hierarchies; this helps us analyze MC in a decomposed manner.

Applications of MC to Gradual Clustering Change Detection
We apply MC to the issue of monitoring gradual changes in clustering structures. We propose methods to monitor changes in MC instead of the mixture size or cluster size. Because MC takes a real value, it is more suitable for observing gradual changes. We empirically demonstrate that MC elucidates the clustering structures and their changes more effectively than the mixture size or cluster size.
The remainder of this paper is organized as follows. Section 2 discusses related work. In Section 3, we introduce the concept of MC and present some examples. Theoretical properties of MC are shown in Section 4. Section 5 discusses the application of MC to clustering change detection problems, and Section 6 describes the experimental results. Finally, Section 7 concludes this paper. Proofs of the propositions and theorems are given in the Appendices. Programs for the experiments are available at https://github.com/ShunkiKyoya/MixtureComplexity, accessed on 17 August 2022.

Related Work
The issue of determining the best mixture size or cluster size (often referred to as model selection) has extensively been studied. For example, AIC [8], BIC [9], and MDL [10] have been used to select the mixture size; ICL [11] and MDL-based clustering criteria [12,13] have been invented to select the cluster size. These methods have conventionally considered the cluster size as the same as the mixture size by regarding one mixture component as one independent cluster. See also a recent review by McLachlan and Rathnayake [14] focusing on the number of components in a Gaussian mixture model.
Differences between the mixture size and cluster size have also been widely discussed. For example, McLachlan and Peel [1] pointed out that there are cases in which a Gaussian mixture with more than one component is needed to describe a single skewed cluster; Biernacki et al. [11] argued that in many situations, the mixture size estimated by BIC is too large to be regarded as the cluster size. The problem of estimating the cluster size under a given mixture size has also been investigated by Hennig [3]; he proposed methods to identify the cluster structure by merging heavily overlapping mixture components. MC differs from his approach in that it interprets the clustering structure by measuring the overlap rate only, rather than deciding whether to merge based on a certain threshold.
The degree of overlap or closeness between components has been evaluated using various measures, such as the classification error rate or the Bhattacharyya distance [15]. Wang and Sun [16] and Sun and Wang [17] formulated the overlap rate of Gaussian distributions from their geometric properties. All of the works above are limited to the case of two components. In contrast, MC considers the overlap between any number of components.
Deciding whether a small component is a cluster or a set of outliers is also a significant matter. For example, clustering algorithms such as DBSCAN [18] and constrained k-means [19] avoid generating small components to obtain a better clustering structure. Jiang et al. [4] and He et al. [5] associated small components with outlier detection problems. MC evaluates small components by continuously measuring their impact on the cluster size. Some other notions have been proposed to quantify the clustering structure. Fuzzy clustering [20] also estimates clustering structures with cluster overlap; however, MC is more suitable for consistent estimation in that it assumes an underlying mixture distribution. Rusch et al. [21] evaluated the crowdedness of data under the concept of "clusteredness"; however, its relation to the cluster size is indirect. Recently, descriptive dimensionality (Ddim) [22] was proposed to define model dimensionality continuously. It can be used to estimate the clustering structure under the assumption of model fusion, that is, that models with different numbers of components are probabilistically mixed. MC differs from Ddim in that it evaluates the overlap and weight bias in a single model without model fusion.
Clustering over data streams has been discussed with various objectives [23][24][25]. We consider the problem of detecting changes in the cluster structure; dynamic model selection (DMS) [26][27][28] addressed this problem by observing changes in the models (corresponding to the mixture size or cluster size in this paper). Because the models are valued discretely, the detected changes have been considered abrupt. Refer also to the notions of tracking best experts [29], evolution graphs [30], and switching distributions [31], which are similar to DMS.
Furthermore, the issues of gradual changes have been discussed to investigate the transition periods for absolute changes. The MDL change statistics [32] and differential MDL change statistics [33] were proposed to measure the degree of gradual changes. The notions of structural entropy [34] and graph entropy [35] were proposed to measure the degree of model uncertainty in the changes. This study quantifies the degree of gradual changes using the fluctuations in MC and presents a new methodology to detect them.
MC is based on the mutual information between the observed and latent variables, which has been considered in the clustering field. For example, Still et al. [36] regarded clustering as data compression and used mutual information to measure its degree. In this paper, we present a novel interpretation of mutual information as a continuous number of clusters, as well as its novel applications to interpreting clusterings and detecting clustering changes.

Mixture Complexity
In this section, we formally introduce the mixture complexity and describe its properties using some examples and theories.

Definitions
Given the data {x_n}_{n=1}^N and the finite mixture model f that has generated them, we consider interpreting the cluster size of f. The distribution f is written as

f(x) = ∑_{k=1}^K ρ_k g_k(x),

where K denotes the mixture size, {ρ_k}_{k=1}^K denote the proportions of the components (summing to one), and {g_k}_{k=1}^K denote the component probability distributions. The random variable X following the distribution f is called an observed variable because it can be observed as a datum. We also define the latent variable Z ∈ {1, …, K} as the index of the component from which the observed variable X originated. The pair (X, Z) is called a complete variable. The distribution of the latent variable P(Z) and the conditional distribution of the observed variable P(X|Z) are given by

P(Z = k) = ρ_k,  P(X = x | Z = k) = g_k(x).

To investigate the clustering structures in f, we consider the following quantity:

I(Z; X) = H(Z) − H(Z|X),

where H(Z) and H(Z|X) denote the entropy and conditional entropy, respectively, of the latent variable Z, defined as

H(Z) = −∑_{k=1}^K ρ_k log ρ_k,  H(Z|X) = −E_X[∑_{k=1}^K γ_k(X) log γ_k(X)],

where γ_k(X) := P(Z = k | X).
The quantity I(Z; X) is well known as the mutual information between the observed and latent variables; it is also known as the (generalized) Jensen-Shannon divergence [37]. We can interpret I(Z; X) as the volume of cluster structures as follows. Because I(Z; X) is the difference between the latent variable's entropy with and without knowledge of the observed variable, it represents the amount of information about the latent variable possessed by the observed data. Thus, its exponential exp(I(Z; X)) denotes the effective number of latent states distinguished by the observed variable; it can be interpreted as a continuous extension of the cluster size. For more information about entropy and mutual information, see the book by Cover and Thomas [38]. However, I(Z; X) cannot be calculated analytically even if f is known. Thus, noting that ρ_k = E_X[γ_k(X)], we approximate I(Z; X) using the data {x_n}_{n=1}^N as follows:

MC := −∑_{k=1}^K ρ̂_k log ρ̂_k + (1/N) ∑_{n=1}^N ∑_{k=1}^K γ_k(x_n) log γ_k(x_n),  where ρ̂_k := (1/N) ∑_{n=1}^N γ_k(x_n).

We call this the MC of the mixture model f.

Definition 1.
Given the posterior probabilities {γ_k(x_n)}_{k,n}, we define the mixture complexity (MC) as

MC({γ_k(x_n)}_{k,n}) := −∑_{k=1}^K ρ̂_k log ρ̂_k + (1/N) ∑_{n=1}^N ∑_{k=1}^K γ_k(x_n) log γ_k(x_n),  ρ̂_k := (1/N) ∑_{n=1}^N γ_k(x_n).

If the data have weights {w_n}_n, we define the MC as

MC({γ_k(x_n)}_{k,n}; {w_n}_n) := −∑_{k=1}^K ρ̂_k log ρ̂_k + (1/W) ∑_{n=1}^N w_n ∑_{k=1}^K γ_k(x_n) log γ_k(x_n),  where W := ∑_{n=1}^N w_n and ρ̂_k := (1/W) ∑_{n=1}^N w_n γ_k(x_n).

The weighted version of MC is defined for later use. Note that there are other ways to approximate I(Z; X); we adopt the form of Definition 1 because it has the decomposition property shown in Section 4.2. See also the methods used to approximate the entropy of the mixture model [39,40], which can also be applied to approximate I(Z; X).
In practice, only the data {x_n}_{n=1}^N can be obtained, without the underlying distribution f. In that case, we estimate the posterior probabilities {γ̂_k(x_n)}_{k,n} from the data {x_n}_{n=1}^N and further estimate the MC as MC({γ̂_k(x_n)}_{k,n}). This can be calculated even if the model f itself cannot be estimated.
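As a concrete illustration, the estimator of Definition 1 can be computed directly from a posterior matrix. The following is a minimal sketch (function and variable names are ours, not from the paper) of the weighted form; with one-hot posteriors it returns the entropy of the proportions, and with identical posterior rows it returns 0.

```python
import numpy as np

def mixture_complexity(gamma, weights=None, eps=1e-12):
    """Estimate MC from posterior probabilities (sketch of Definition 1).

    gamma   : (N, K) array with gamma[n, k] = posterior P(Z = k | x_n).
    weights : optional (N,) array of nonnegative data weights w_n.
    Returns H(rho_hat) minus the weighted average of H(gamma_n).
    """
    gamma = np.asarray(gamma, dtype=float)
    n, _ = gamma.shape
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()                                   # normalized weights w_n / W
    rho = w @ gamma                                   # estimated proportions rho_hat_k
    h_latent = -np.sum(rho * np.log(rho + eps))       # entropy of the proportions
    h_cond = -np.sum(w * np.sum(gamma * np.log(gamma + eps), axis=1))  # conditional entropy
    return h_latent - h_cond
```

For entirely separate, balanced posteriors the value is log K, and for entirely overlapping posteriors it is 0, in line with the propositions of Section 4.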

Examples
In this subsection, we present some examples of MC to illustrate its behavior.

MC with Different Overlaps
First, we set N = 600 and generated the data x_1, …, x_600 ∈ R² as follows:

x_n ~ (1/2) N(x | (0, 0)ᵀ, I_2) + (1/2) N(x | (α, 0)ᵀ, I_2),

where N(x|µ, Σ) denotes a multivariate normal distribution with mean µ and covariance Σ, I_d denotes the d-dimensional identity matrix, and α ∈ R is a parameter that determines the degree of overlap between the two components.
By varying the value of α over 0, 0.6, …, 6.0, we generated the data and measured the MC by setting ρ_1 = ρ_2 = 1/2 and g_1, g_2 as the actual distributions. The exponential of the MC for each α is plotted in Figure 2a. It is evident from the figure that the MC smoothly increases from 1.0 to 2.0 as the two components become separated.
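This experiment can be reproduced in miniature. Under the true model, the posteriors γ_k(x) are available in closed form, so exp(MC) can be computed directly; the sample size, random seed, and the mean placement (0, 0)ᵀ versus (α, 0)ᵀ below are our own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def exp_mc_two_gaussians(alpha, n=600):
    """exp(MC) for an equal-weight 2-D Gaussian pair with mean gap alpha
    (a sketch of the overlap experiment; seed and sizes are ours)."""
    mu = np.array([[0.0, 0.0], [alpha, 0.0]])
    z = rng.integers(0, 2, size=n)                  # latent assignments
    x = rng.normal(size=(n, 2)) + mu[z]             # identity-covariance samples
    # Exact posteriors under the true model (equal weights, unit covariance).
    d = np.stack([((x - m) ** 2).sum(axis=1) for m in mu], axis=1)
    gamma = np.exp(-d / 2)
    gamma /= gamma.sum(axis=1, keepdims=True)
    w = np.full(n, 1.0 / n)
    rho = w @ gamma
    mc = (-(rho * np.log(rho + 1e-12)).sum()
          + (w * (gamma * np.log(gamma + 1e-12)).sum(axis=1)).sum())
    return np.exp(mc)
```

At α = 0 the posteriors are uniform and exp(MC) is 1; as α grows toward 6, exp(MC) approaches 2, matching Figure 2a.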

MC with Different Mixture Biases
Next, we set N = 600 and generated the data x_1, …, x_600 ∈ R² as follows: (300 + α) points were drawn from the first Gaussian component g_1 and (300 − α) points from the second, well-separated component g_2, where α ∈ {0, …, 300} is a parameter that determines the degree of bias between the proportions of the two components. By varying α over 0, 30, …, 300, we generated the data and measured the MC by setting ρ_1 = (300 + α)/600, ρ_2 = (300 − α)/600 and g_1, g_2 as the actual distributions. The exponential of the MC for each α is plotted in Figure 2b. It is evident from the figure that the MC smoothly decreases from 2.0 to 1.0 as the balance becomes biased.
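The bias experiment can be sketched in the same way. Here the separation between the two components (set to 6 below) and the seed are our own assumptions; only the proportions (300 ± α)/600 follow the setup above.

```python
import numpy as np

rng = np.random.default_rng(1)

def exp_mc_biased(alpha, sep=6.0):
    """exp(MC) when (300 + alpha) and (300 - alpha) points come from two
    well-separated 2-D Gaussians (sketch; `sep` is our choice)."""
    n1, n2 = 300 + alpha, 300 - alpha
    x = np.concatenate([rng.normal(size=(n1, 2)),
                        rng.normal(size=(n2, 2)) + np.array([sep, 0.0])])
    rho_true = np.array([n1, n2]) / 600.0
    mu = np.array([[0.0, 0.0], [sep, 0.0]])
    d = np.stack([((x - m) ** 2).sum(axis=1) for m in mu], axis=1)
    gamma = rho_true * np.exp(-d / 2)               # unnormalized posteriors
    gamma /= gamma.sum(axis=1, keepdims=True)
    rho = gamma.mean(axis=0)
    mc = (-(rho * np.log(rho + 1e-12)).sum()
          + (gamma * np.log(gamma + 1e-12)).sum() / 600.0)
    return np.exp(mc)
```

At α = 0 the clusters are balanced and exp(MC) is close to 2; at α = 300 only one component remains and exp(MC) is 1, matching Figure 2b.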

Theoretical Properties
In this subsection, we discuss the theoretical properties of MC.

Basic Properties
We discuss the basic properties of MC. The proofs are described in Appendix A. First, we discuss the minimum and maximum of MC. We show that MC attains its minimum when the components entirely overlap and its maximum when they are entirely separate.

Proposition 1. If the components entirely overlap, i.e., there exist γ_1, …, γ_K such that γ_k(x_n) = γ_k for all k and n, then MC({γ_k(x_n)}_{k,n}; {w_n}_n) = 0.

Proposition 2.
If the components are entirely separate, i.e., for every x_n there is a unique index k_n that satisfies γ_{k_n}(x_n) = 1, then MC({γ_k(x_n)}_{k,n}; {w_n}_n) = −∑_{k=1}^K ρ̂_k log ρ̂_k. In particular, if the components are entirely balanced, i.e., ρ̂_1 = ⋯ = ρ̂_K = 1/K, then MC({γ_k(x_n)}_{k,n}; {w_n}_n) = log K. Moreover, MC equals 0 only if the components entirely overlap, as stated in Proposition 1, and equals log K only if the components are entirely separate and balanced, as stated in Proposition 2.
Next, we show that the value of MC is invariant to the representation of the mixture distribution. For example, consider the following three mixture distributions:

f_1(x) = (1/2) g_1(x) + (1/2) g_2(x),
f_2(x) = (1/2) g_1(x) + (1/4) g_2(x) + (1/4) g_2(x),
f_3(x) = (1/2) g_1(x) + (1/2) g_2(x) + 0 · g_3(x).

All three represent the same distribution, but in f_2 and f_3 we need to manually remove the redundant components and regard the mixture size as two [1]. On the other hand, the following property indicates that the MCs for f_1, f_2, and f_3 are the same; thus, we need not care about such differences in representation when evaluating MC.

Decomposition Property
In this section, we discuss a method to decompose MC along the hierarchies in mixture models; this can help us in analyzing the structures in more detail.
Consider that the mixture distribution f has a two-stage hierarchy, as shown in Figure 3. It has K components {g_k}_{k=1}^K on the lower side and L components {h_l}_{l=1}^L on the upper side, where {g_k}_{k=1}^K denote the probability distributions and {h_l}_{l=1}^L denote mixtures of them. We construct the hierarchy as follows. First, we estimate the distribution f = ∑_{k=1}^K ρ_k g_k. Then, we obtain {h_l}_{l=1}^L by partitioning (or clustering) the lower components into L groups. Formally, we denote by Q_k^{(l)} ∈ R_{≥0} the proportion of the lower component k ∈ {1, …, K} that belongs to the upper component l, which satisfies ∑_{l=1}^L Q_k^{(l)} = 1 for every k. According to this hierarchy, we can decompose the MC.

Theorem 1.
We can decompose the MC as follows:

MC({γ_k(x_n)}_{k,n}; {w_n}_n) = MC({γ̃_l(x_n)}_{l,n}; {w_n}_n) + ∑_{l=1}^L ρ̃_l · MC({γ_k^{(l)}(x_n)}_{k,n}; {w_n γ̃_l(x_n)}_n),

where γ̃_l(x_n) := ∑_{k=1}^K Q_k^{(l)} γ_k(x_n) denotes the posterior of the upper component l, ρ̃_l denotes its estimated proportion, and γ_k^{(l)}(x_n) := Q_k^{(l)} γ_k(x_n) / γ̃_l(x_n) denotes the posterior of the lower component k within the upper component l.

The proof is described in Appendix B. For notational simplicity, we write MC(total) for the left-hand side, MC(interaction) for the first term on the right-hand side, and Contribution(component l) := ρ̃_l · MC({γ_k^{(l)}(x_n)}_{k,n}; {w_n γ̃_l(x_n)}_n). Then, we can rewrite Theorem 1 as

MC(total) = MC(interaction) + ∑_{l=1}^L Contribution(component l).

In Theorem 1, the MC of the entire structure (MC(total)) is decomposed into the sum of the MC among the upper components (MC(interaction)) and their respective contributions. An example of the decomposition is illustrated in Figure 4 and Table 1. In this example, there are K = 4 lower components generated from a Gaussian mixture model; additionally, there are L = 2 upper components on the left and right sides. By decomposing MC(total), we can evaluate the complexities of the local structures as well as that of the entire structure.
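The decomposition of Theorem 1 can be checked numerically. The sketch below (our notation: `gamma` for lower posteriors, `Q` for a hard partition of the four lower components into two upper ones) computes both sides of the decomposition for random posteriors and verifies that they agree.

```python
import numpy as np

def mc(gamma, w):
    """Weighted MC from posteriors gamma (N, K) and weights w (N,)."""
    w = w / w.sum()
    rho = w @ gamma
    return (-(rho * np.log(rho + 1e-12)).sum()
            + (w * (gamma * np.log(gamma + 1e-12)).sum(axis=1)).sum())

rng = np.random.default_rng(0)
N, K, L = 200, 4, 2
gamma = rng.dirichlet(np.ones(K), size=N)          # random lower posteriors
w = np.full(N, 1.0 / N)
Q = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])  # hard partition K -> L

gamma_up = gamma @ Q                                # upper posteriors
rho_up = w @ gamma_up                               # upper proportions

total = mc(gamma, w)                                # MC(total)
interaction = mc(gamma_up, w)                       # MC(interaction)
contrib = 0.0                                       # sum of contributions
for l in range(L):
    idx = Q[:, l] > 0
    g_l = gamma[:, idx] / gamma_up[:, [l]]          # posteriors within group l
    contrib += rho_up[l] * mc(g_l, w * gamma_up[:, l])

assert np.isclose(total, interaction + contrib)     # Theorem 1 holds
```

The identity is the chain rule of entropy applied pointwise, so it holds exactly (up to the small smoothing constant in the logarithms).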

Consistency
In this subsection, we discuss the consistency of MC: as the estimated distribution approaches the true distribution, the estimated MC converges to the true value. Formally, we define the set of K-component mixture models as

F_K := { f(x) = ∑_{k=1}^K ρ_k g(x | θ_k) },

where g(· | θ) is a component distribution with parameter θ ∈ Θ. We assume that the space F_K is weakly identifiable, that is, two members of F_K are equal as distributions if and only if their mixing measures ∑_k ρ_k δ_{θ_k} coincide, where δ_θ denotes the point mass (delta function) at θ on Θ. This condition states that identical distributions must have identical mixtures of parameters. See Teicher [41] and Yakowitz and Spragins [42] for sufficient conditions for this kind of identifiability; in their work, it was shown to be satisfied by Gaussian or gamma mixtures. We also assume that some true mixture distribution

f*(x) = ∑_{k=1}^{K*} ρ*_k g(x | θ*_k)

generates the data x^N. We consider estimating the true mixture complexity MC({γ*_k(x_n)}_{k,n}) by substituting an estimated distribution f̂ ∈ F_K for f*. We restrict our analysis to the case K ≥ K*, so that F_K contains distributions equivalent to f*. Then, we show that MC({γ̂_k(x_n)}_{k,n}) converges to MC({γ*_k(x_n)}_{k,n}) as f̂ and f* become closer.
To analyze the convergence, we re-parametrize the estimated parameters using the method proposed by Liu and Shao [43]. They note that if f = f*, there exist integers 0 = i_0 < ⋯ < i_{K*} ≤ K such that, under some permutation of the components, the components i_{k−1}+1, …, i_k of f coincide with the k-th true component (their parameters equal θ*_k and their proportions sum to ρ*_k) for k = 1, …, K*, while the proportions of the remaining components i_{K*}+1, …, K are zero. Based on this, they re-parametrize f with two kinds of parameters: one that measures the distance of f from the set of representations of f*, and another, ψ, that has nothing to do with the equivalence; f = f* is then equivalent to the former being zero. This parametrization captures the two types of convergence in mixture models: one overlaps the grouped components onto the true components, and the other shrinks the weights of the redundant components to zero. We use the following conditions for our proof: (C1) the component density is differentiable with respect to its parameters, and there exists ε > 0 such that the derivatives are suitably bounded in an ε-neighborhood of the true parameters; (C2) the estimated parameters are consistent; and (C3) the approximations of the mixture proportions, defined from the estimated posteriors, converge to the true proportions as N → ∞. Condition (C1) is a usual differentiability condition, and (C2) and (C3) require consistency of the parameters. It is known that consistent estimation is possible by penalized maximum likelihood estimation [44,45] or Bayesian estimation [46], for example. Then, the consistency of the MC is shown in the following theorem.

Theorem 2.
Under assumptions (C1), (C2), and (C3), the following holds as N → ∞:

MC({γ̂_k(x_n)}_{k,n}) − MC({γ*_k(x_n)}_{k,n}) → 0 in probability.

The proof is described in Appendix C. Theorem 2 also characterizes the convergence rate of the estimation error of the MC. It is interesting that this holds even when K > K*. Therefore, MC can be regarded as a fundamental quantity representing the cluster structure in mixture models, in that it overcomes differences in the mixture size.
We outline the proof below. First, applying Theorem 1 repeatedly, we decompose the entire MC into four terms, among which (a) is the interaction between ∑_{l=1}^{K*} r_l h_l and the redundant part ∑_{k=i_{K*}+1}^{K} ρ_k g_k. The procedure of the decomposition is also illustrated in Figure 5. Then, we show that the terms involving the redundant components tend to 0 as their weights shrink, that the main term tends to the true MC as h_1, …, h_{K*} tend to g*_1, …, g*_{K*}, and that the last term (d) tends to 0 because, for each l, all components in h_l tend to g*_l.
The proofs are mainly based on the mean-value theorem. However, the derivative of log f with respect to ρ_k may be infinite for the redundant components; additional treatment is needed to avoid this.

Applications
In this section, we propose methods to apply the MC to clustering change detection problems. Formally speaking, given the dataset X := {{x n,t } N n=1 | t ∈ 1, . . . , T}, where t denotes the time and {x n,t } N n=1 denote the data generated at each t, we consider the problem of monitoring the changes in the clustering structures over t = 1, . . . , T.
First, we briefly summarize the method named sequential dynamic model selection (SDMS) [28], which addresses this problem. Then, we introduce our ideas and discuss the differences from SDMS.
Hereafter, we assume that the data points x n,t are d-dimensional vectors and consider a Gaussian mixture model for each t.

Sequential Dynamic Model Selection
SDMS is an algorithm used to sequentially estimate models and find changes. In clustering change detection problems, it sequentially estimates the mixture sizes K̂_t and parameters η̂_{K̂_t} := {ρ̂_{k,t}, μ̂_{k,t}, Σ̂_{k,t}}_{k=1}^{K̂_t} and finds model changes as changes in K̂_t. The estimation procedure is as follows. First, depending on the mixture size K̂_{t−1} estimated at the previous time point, we set the candidates for K_t. Then, for each candidate K_t, we estimate the parameters θ_{K_t} from the data {x_{n,t}}_{n=1}^N and calculate a cost function L_SDMS({x_{n,t}}_{n=1}^N; K_t, θ_{K_t}, K̂_{t−1}). Finally, we select K̂_t as the mixture size that minimizes the cost. The candidates for K_t are taken from a neighborhood of K̂_{t−1} within {1, …, K_max}, where K_max is a pre-defined parameter. The cost function is the sum of the code lengths of the model and of the model change:

L_SDMS({x_{n,t}}_{n=1}^N; K_t, θ_{K_t}, K̂_{t−1}) = L_model({x_{n,t}}_{n=1}^N; K_t, η_{K_t}) + L_change(K_t | K̂_{t−1}).

Code Length of the Model. The score L_model({x_n}_{n=1}^N; K, η_K) is a sum of the negative log-likelihood and a penalty term corresponding to the complexity of the model. In this study, we consider two likelihood functions and four penalty terms. For the (logarithm of the) likelihood functions, we consider the observed likelihood L({x_n}_{n=1}^N; θ_K) and the complete likelihood L({x_n, z_n}_{n=1}^N; θ_K), given by

L({x_n}_{n=1}^N; θ_K) = ∑_{n=1}^N log f(x_n),  L({x_n, z_n}_{n=1}^N; θ_K) = ∑_{n=1}^N log(ρ_{z_n} g_{z_n}(x_n)),

where {z_n}_{n=1}^N are the latent variables for the data estimated by z_n := argmax_{z ∈ {1,…,K}} P(Z = z | X = x_n).
They correspond to the likelihoods of the observed data and the complete data, respectively; the former is used to determine the mixture size, and the latter is used to determine the cluster size under the assumption that it equals the mixture size. For the penalty terms, we consider AIC [8], BIC [9], NML [13], and DNML [47,48]. By combining the log-likelihoods and the penalty terms, we consider the following six scores:

• AIC with observed likelihood (AIC): −L({x_n}_{n=1}^N; θ̂_K) + D,
• AIC with complete likelihood (AIC+comp): −L({x_n, z_n}_{n=1}^N; θ̂_K) + D,
• BIC with observed likelihood (BIC): −L({x_n}_{n=1}^N; θ̂_K) + (D/2) log N,
• BIC with complete likelihood (BIC+comp): −L({x_n, z_n}_{n=1}^N; θ̂_K) + (D/2) log N,
• NML: −L({x_n, z_n}_{n=1}^N; θ̂_K) + PC_NML(N, K),
• DNML: −L({x_n, z_n}_{n=1}^N; θ̂_K) + PC_DNML(N, {z_n}_{n=1}^N, K),

where D := (K − 1) + K d(d + 3)/2 denotes the number of free parameters required to represent a K-component, d-dimensional Gaussian mixture model, and PC_NML(N, K) and PC_DNML(N, {z_n}_{n=1}^N, K) denote the parametric complexities. In our experiments, we estimated the parameters η_K by running the EM algorithm [49] implemented in the scikit-learn package [50] ten times and selected the parameter that minimized each score. Note that for NML and DNML, we considered only the complete likelihood functions, because methods to calculate their parametric complexities are known only for that case.
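To make the scores concrete, the free-parameter count, the BIC code length, and the candidate set can be sketched as follows. The function names are ours, and `sdms_candidates` reflects our reading of the SDMS neighborhood search (the exact candidate set may differ from the original algorithm).

```python
import numpy as np

def n_free_params(K, d):
    """Free parameters of a K-component, d-dim full-covariance Gaussian
    mixture: (K - 1) weights plus K * d(d+3)/2 mean/covariance entries."""
    return (K - 1) + K * d * (d + 3) // 2

def bic_code_length(loglik, K, d, N):
    """BIC score in code-length form: -loglik + (D / 2) log N."""
    return -loglik + 0.5 * n_free_params(K, d) * np.log(N)

def sdms_candidates(K_prev, K_max):
    """Candidate mixture sizes around the previous estimate (a sketch
    of the neighborhood search; K_max caps the search range)."""
    return [K for K in (K_prev - 1, K_prev, K_prev + 1) if 1 <= K <= K_max]
```

At each time step, one would fit a mixture for every candidate K, evaluate the chosen score, and keep the minimizer as K̂_t.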

Track MC
In SDMS, clustering changes are detected as changes in the mixture size or cluster size K; because K is discrete, the detected changes are abrupt. We therefore propose to track MC instead of K while estimating the parameters using SDMS. Because MC takes a real value, monitoring it is more suitable for observing gradual changes than monitoring K. The procedure for tracking MC is described in Algorithm 1.

Algorithm 1 Tracking MC
Require: A dataset X = {{x_{n,t}}_{n=1}^N | t = 1, …, T}.
1: for t = 1, …, T do
2: Estimate K̂_t and {ĝ_{k,t}}_{k=1}^{K̂_t} from the data {x_{n,t}}_{n=1}^N using SDMS.
3: Compute the posterior probabilities {γ̂_{k,t}(x_{n,t})}_{k,n} and output MC_t := MC({γ̂_{k,t}(x_{n,t})}_{k,n}).
4: end for

Track MC with Its Decomposition
In addition to monitoring the MC of the entire structure, we also propose an algorithm to track its decomposition. To accomplish this, we must estimate the upper L components and their corresponding partitions Q (l) k,t for each t.
Here, we assume that the upper L components are common across all t and estimate the partition Q_{k,t}^{(l)} after estimating the lower components at each time. Specifically, we regard each estimated mean μ̂_{k,t} as a point with weight ρ̂_{k,t} for each k and t and cluster these weighted points. As the clustering algorithm, we modified fuzzy c-means [20] to handle weighted points. Formally, we estimated the centers of the upper L components μ̄_l and the corresponding partitions Q_{k,t}^{(l)} by minimizing the objective

J = ∑_{k,t} ρ̂_{k,t} ∑_{l=1}^L (Q_{k,t}^{(l)})^m ‖μ̂_{k,t} − μ̄_l‖²,

where m > 1 is a parameter that determines the fuzziness of the partition. We minimized J by alternately updating one of μ̄_l and Q_{k,t}^{(l)} while fixing the other:

μ̄_l = ∑_{k,t} ρ̂_{k,t} (Q_{k,t}^{(l)})^m μ̂_{k,t} / ∑_{k,t} ρ̂_{k,t} (Q_{k,t}^{(l)})^m,
Q_{k,t}^{(l)} = ‖μ̂_{k,t} − μ̄_l‖^{−2/(m−1)} / ∑_{l'=1}^L ‖μ̂_{k,t} − μ̄_{l'}‖^{−2/(m−1)}.

Finally, we present an algorithm to track the MC and its decomposition in Algorithm 2. We can analyze structural changes in more detail by evaluating the decomposed values.
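A minimal sketch of the weighted fuzzy c-means step is given below. The implementation and the farthest-point initialization are our own choices; the updates follow standard fuzzy c-means with per-point weights.

```python
import numpy as np

def weighted_fuzzy_cmeans(points, weights, L, m=2.0, n_iter=100):
    """Weighted fuzzy c-means over component means (a sketch).

    points  : (K, d) lower-component means.
    weights : (K,) mixture proportions used as point weights.
    Returns (centers (L, d), memberships Q (K, L))."""
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Farthest-point initialization (our choice; any sensible init works).
    centers = np.empty((L, points.shape[1]))
    centers[0] = points[0]
    for l in range(1, L):
        d2min = ((points[:, None] - centers[None, :l]) ** 2).sum(-1).min(axis=1)
        centers[l] = points[np.argmax(d2min)]
    for _ in range(n_iter):
        d2 = ((points[:, None] - centers[None]) ** 2).sum(-1) + 1e-12
        q = d2 ** (-1.0 / (m - 1.0))
        q /= q.sum(axis=1, keepdims=True)                     # membership update
        wq = weights[:, None] * q ** m
        centers = (wq.T @ points) / wq.sum(axis=0)[:, None]   # center update
    return centers, q
```

The returned memberships play the role of the partition Q_k^{(l)}, which feeds directly into the MC decomposition.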

Algorithm 2 Tracking MC with its decomposition
Require: A dataset X = {{x_{n,t}}_{n=1}^N | t = 1, …, T}, parameters m and L.
1: for t = 1, …, T do
2: Estimate K̂_t and {ĝ_{k,t}}_{k=1}^{K̂_t} from the data {x_{n,t}}_{n=1}^N using SDMS.
3: end for
4: Estimate the upper centers {μ̄_l}_{l=1}^L and the partitions {Q_{k,t}^{(l)}} using the weighted fuzzy c-means described above.
5: for t = 1, …, T do
6: Compute MC_t(total), MC_t(interaction), and Contribution_t(component l) for l = 1, …, L using Theorem 1.
7: end for

Experimental Results
In this section, we present the experimental results that demonstrate the MC's ability to monitor the clustering changes. We compare our methods to the monitoring of K.

Analysis of Artificial Data
To reveal the behavior of MC, we conducted experiments with two artificial datasets called the move Gaussian dataset and the imbalance Gaussian dataset. Their experimental designs are described below. First, we generated artificial datasets X = {{x_{n,t}}_{n=1}^N | t = 1, …, T} by setting T = 150 and N = 1000. The datasets have one transition period, t = 51, …, 100, in which the data change their clustering structure gradually. Then, we estimated the MC and K using the methods in Sections 5.1 and 5.2, setting K_max = 10. To compare them, we first created a simple algorithm to detect changes from a sequence of MC or K. Then, we compared the speed and accuracy of this algorithm in detecting the change points. Moreover, to evaluate its ability to find changes in the opposite direction, we performed experiments on the same datasets in reverse order.
Given a sequence of the MC or K written as y_1, …, y_150, we constructed an algorithm to detect the change points as follows. For t = 10, …, 150, we raised a change alert if |median(y_{t−9}, …, y_{t−5}) − median(y_{t−4}, …, y_t)| > ε in the case of MC, and if median(y_{t−9}, …, y_{t−5}) ≠ median(y_{t−4}, …, y_t) in the case of K, where ε is the threshold for raising an alert in MC. It should be somewhat large to avoid too many false alerts, yet smaller than 1 so that changes are found earlier than by monitoring K. In this section, we set ε = 0.01 so as not to raise alerts from t = 1 to 10, assuming we know that there are no changes in this period. We calculated medians instead of means of the subsequences for robustness. Moreover, to avoid redundant alerts, we ignored an alert when the difference between t and the latest alert was less than 5, even if the conditions were satisfied.
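The alert rule above can be sketched as follows. The function name and the default threshold are ours (the experiments use ε = 0.01); the two five-point median windows and the five-step suppression follow the description above.

```python
import numpy as np

def mc_change_alerts(y, eps=0.01, min_gap=5):
    """Alert times from a (zero-indexed) MC sequence via the
    two-window median rule (a sketch of the detector above)."""
    alerts = []
    for t in range(9, len(y)):
        left = np.median(y[t - 9:t - 4])      # median(y_{t-9}, ..., y_{t-5})
        right = np.median(y[t - 4:t + 1])     # median(y_{t-4}, ..., y_t)
        if abs(left - right) > eps:
            # Suppress alerts within min_gap steps of the latest raised alert.
            if not alerts or t - alerts[-1] >= min_gap:
                alerts.append(t)
    return alerts
```

For the K sequence, the same loop applies with the condition `left != right` instead of the ε threshold.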
To evaluate the quality of the algorithm, we calculated the Delay and the false alarm rate (FAR). Delay := min(t* − 51, 50), where t* denotes the first time point in the transition period at which the algorithm raised an alert. FAR is defined as the proportion of alerts raised outside ACCEPT, where ACCEPT denotes the set of time points at which alerts can be accepted, defined as {t | {t − 9, …, t} ∩ [51, 100] ≠ ∅} = [51, 109], and ALERT denotes the set of time points at which the algorithm raises alerts.

Move Gaussian Dataset
The move Gaussian dataset is a set of three-dimensional Gaussian mixtures whose means move gradually during the transition period. Formally, for each t, we generated the data {x_{n,t}}_{n=1}^{1000} from a mixture of three-dimensional Gaussian components whose means gradually separate over t = 51, …, 100 and remain fixed for 101 ≤ t ≤ 150. The first and second dimensions of some data are visualized in Figure 6. In the direction t = 1 → 150, the number of clusters increases from two to three as two clusters separate; in the direction t = 150 → 1, it decreases from three to two as the two clusters merge. The experiments were performed ten times by randomly generating the datasets; accordingly, the average performance scores were calculated. The differences in the scores between MC and K for each criterion are presented in Table 2; the estimated MC and K in one trial are plotted in Figure 7. This figure illustrates the result of BIC as an example.
Table 2. Difference in the average performance score between MC and K for the move Gaussian dataset.
With respect to the speed of finding changes, under every criterion, MC performed as well as K in the direction t = 1 → 150; however, it performed significantly better than K in the direction t = 150 → 1. The reason for the differing performance is discussed below. In the direction t = 1 → 150, the model selection algorithms underestimated the number of components at the beginning of the transition period. At such time points, they ignored the overlap of the two components and treated them as one cluster. Thus, MC, being based on such model selection methods, was unable to find the changes earlier than K. However, in the direction t = 150 → 1, the overlap between the components was correctly estimated at some time points before K changed. In this case, MC changed smoothly according to the overlap and found the changes earlier than K.
With respect to the accuracy of finding changes, MC performed as well as K in terms of FAR. Additionally, it is evident from Figure 7 that MC stably estimated the clustering structures.

Imbalance Gaussian Dataset
The imbalance Gaussian dataset is a set of three-dimensional Gaussian mixture distributions whose component balance changes gradually during the transition period. Formally, for each t, we generated the data {x_{n,t}}_{n=1}^{1000} from a mixture whose component proportions gradually become biased over t = 51, …, 100 and remain fixed for 101 ≤ t ≤ 150.
The first and second dimensions of some data are visualized in Figure 8. In the direction t = 1 → 150, the number of clusters decreases from four to three as the edge cluster disappears. In the direction t = 150 → 1, it increases from three to four as the edge cluster emerges. The experiments were performed ten times by randomly generating datasets; accordingly, the average performance scores were calculated. The differences in the scores between MC and K for each criterion are listed in Table 3. The estimated MC and K in one trial are plotted in Figure 9. This figure illustrates the result of BIC as an example.
Table 3. Differences in the average performance score between MC and K for the imbalance Gaussian dataset.
In terms of the speed of finding changes, under every model selection method, MC performed significantly better than K in the direction t = 1 → 150; however, MC performed as well as K in the direction t = 150 → 1. The reason for the differing performance is discussed below. In the transition period, all model selection methods counted the minor component as an independent cluster. Then, in the direction t = 1 → 150, MC changed smoothly according to the imbalance and detected the changes earlier than K. In the direction t = 150 → 1, K increased significantly early in the transition period. Then, MC increased along with K and detected the changes at the same time.
In terms of the accuracy of finding changes, MC performed as well as K in terms of FAR. Additionally, it is evident from Figure 9 that MC stably estimated the clustering structures.

We also evaluated the computational efficiency of MC. We generated N points from a mixture density f; then, we recorded the time to calculate {γ_{k,n}} from f and the time to calculate the MC from {γ_{k,n}}. We repeatedly measured the computation times while increasing N and d. For each pair of N and d, we measured the times ten times and took their averages.
The increase in the computation times is illustrated in Figure 10. In (a), although both computation times increased linearly as N grew, calculating MC was faster than calculating {γ_{k,n}}. In (b), the time to calculate {γ_{k,n}} increased as d grew, whereas the computation time for MC was almost constant because K and N were unchanged. Overall, the cost of computing MC is much smaller than that of computing or estimating {γ_{k,n}}.
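The scaling above can be reproduced in outline. The sketch below (our own illustration; the exact experimental setup may differ) computes the responsibilities {γ_{k,n}} of a spherical Gaussian mixture via log-sum-exp, then computes MC from the resulting N × K table; the second step touches only that table, so its cost does not depend on d:

```python
import time

import numpy as np

rng = np.random.default_rng(1)

K, d = 5, 10
means = rng.normal(size=(K, d)) * 3.0
pi = np.full(K, 1.0 / K)

for N in (10_000, 50_000):
    z = rng.integers(K, size=N)
    x = means[z] + rng.normal(size=(N, d))  # spherical unit-variance components

    t0 = time.perf_counter()
    # responsibilities gamma_{k,n} via log-sum-exp (cost grows with N, K, and d)
    logp = -0.5 * ((x[:, None, :] - means[None]) ** 2).sum(axis=-1) + np.log(pi)
    logp -= logp.max(axis=1, keepdims=True)
    gamma = np.exp(logp)
    gamma /= gamma.sum(axis=1, keepdims=True)
    t1 = time.perf_counter()

    # MC from gamma: a single pass over the N x K table (independent of d)
    w = np.full(N, 1.0 / N)
    marg = w @ gamma
    row_ent = -np.where(gamma > 0, gamma * np.log(np.where(gamma > 0, gamma, 1.0)), 0.0).sum(axis=1)
    mc_val = -(marg * np.log(marg)).sum() - (w * row_ent).sum()
    t2 = time.perf_counter()
    print(f"N={N}: gamma {t1 - t0:.4f}s, MC {t2 - t1:.4f}s")
```

On typical hardware, the γ step dominates, consistent with the observation that MC is cheap once the posteriors are available.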

Analysis of Real Data
We analyzed two types of real data, named the beer dataset and the house dataset, which are summarized in Table 4. In the following subsections, we discuss the details of the datasets and the results of the experiments.

Beer Dataset
We discuss the results of the beer dataset, obtained from Hakuhodo, Inc. and M-CUBE, Inc. This dataset has also been analyzed in [28,34]. It comprises the records of customers' beer purchases from November 1st, 2010 to January 31st, 2011. The dataset X is constructed as follows. The time unit is a day. For each day t ∈ {τ, . . . , T}, x_{n,t} ∈ R^d denotes the n-th customer's consumption of beer from time t − τ + 1 to t, where we set τ = 14. The dimension d of the vector is 16, which corresponds to the consumption of the following drinks:

First, we compare the plots of the estimated MC and K in Figure 11. The results of BIC and NML are illustrated as examples. Note that we omit the results of AIC because it chose K_max for K_t at many t. In every method, the score was high at the end and beginning of the year, reflecting the increased activities in transactions. However, because the critical changes in the clustering structure and the changes due to ineffective components were mixed, the sequence of K had many change points; as a result, it was difficult to interpret their meanings. On the other hand, MC identified the clustering structure by discounting the effects of the ineffective components. As a result, the sequence of MC highlighted the significant changes at the end and beginning of the year. It is also worth noting that the differences in the scores between the model selection methods were much smaller in MC than in K; this indicates that both BIC and NML estimate similar clustering structures under the concept of MC even though the numbers of components differ significantly.

The decomposed values of MC with BIC and NML are listed in Tables 5 and 6, respectively, and the plots of each decomposed value are illustrated in Figures 12 and 13, respectively. The indices of the upper components are manually rearranged so that they correspond with each other; it can then be observed that the results were similar to each other.
The structures can be evaluated in more detail by analyzing the decomposed values. For instance, let us analyze the decomposed values at the end and beginning of the year. As evident from the tables, they had different characteristics. It can be observed from the figures that the contributions increased in all components, indicating that they were all related to the increase in MC(total). The weight decreased in component 1 and increased in components 2 and 3, indicating that customers moved from component 1 to components 2 and 3. Additionally, MC(component l) increased in all components, indicating that the complexity, or diversity, increased within them.

Figure 13. Plots of the decomposition of MC with NML in the beer dataset.

House Dataset
We discuss the results of the house dataset, obtained from the UCI Machine Learning Repository [51]. The dataset comprises the records of electricity consumption in a house every five minutes from 16 December 2006 to 26 November 2010. The dataset X is constructed as follows. The time unit is 15 min, from 00:00-00:15 to 23:45-24:00. For each t, the data {x_{n,t}}_{n=1}^{N} denote the set of the records on the various days included in the t-th time unit. The dimension d of the vector is 3, which corresponds to the meterings at the following three points:

First, we compare the plots of the estimated K and the corresponding MC in Figure 14. The results of BIC and NML are illustrated as examples. Note that we omit the results of AIC because it chose K_max for K_t at many t. It can be observed from the figure that MC smoothly connected the discrete changes in K; therefore, MC expressed the gradual changes in the dataset more effectively than K. Additionally, the MCs in BIC and NML were more similar to each other than the corresponding Ks, as in the beer dataset. The values of MC started increasing around 7:00; after slight fluctuations, they reached their peak around 21:00. Therefore, MC seemed to represent the amount of activity in this house.

The decomposed values of MC with BIC and NML are listed in Tables 7 and 8, respectively, and the plots of each decomposed value are illustrated in Figures 15 and 16, respectively. The indices of the upper components are manually rearranged so that they correspond with each other; it can then be observed that the results were similar to each other. The structures can be evaluated in more detail by analyzing the decomposed values. For instance, let us analyze the decomposed values in component 3. It can be observed from the tables that the value of metering(C) was specifically high in this component. Looking at the contribution of component 3, there were two peaks, around 9:00 and 21:00, representing the increased activities in this component.
However, the proportions of the weight and the MC were different. W(component 3) was specifically high at 9:00, indicating that the first peak was due to the increase in the weight of the component; in contrast, MC(component 3) was specifically high at 21:00, indicating that the second peak was due to the increase in the complexity within the component.

Conclusions
We proposed the concept of MC to measure the cluster size continuously in mixture models. We first pointed out that the cluster size might not be equal to the mixture size when the mixture model has overlap or weight bias; we then introduced MC as an extension of the cluster size that accounts for their effects. We also presented methods to decompose MC according to the mixture hierarchies, which helped us analyze the substructures in detail. Subsequently, we applied MC and its decomposition to the problem of gradual clustering change detection. We conducted experiments to verify that MC effectively elucidates clustering changes. In the artificial data experiments, MC found the clustering changes significantly earlier in the cases where the overlap or weight bias was correctly estimated. In the real data experiments, MC expressed the gradual changes better than K because it discerned the significant and insignificant changes and smoothly connected the discrete changes in K. We also found that MC took similar values for each model selection method; this indicates that the estimated clustering structures are alike under the concept of MC. Moreover, its decomposition enabled us to evaluate the contents of the changes.
Issues of MC will be tackled in future work. For example, it does not capture the clustering structure well when the number of components is underestimated; thus, we need to explore model selection methods that are more compatible with MC. We also need to further study its theoretical aspects, such as convergence and methods for approximating the mutual information. Furthermore, we need to consider extending the concept of MC to other clustering approaches, e.g., co-clustering, by relating the non-diagonal blocks in co-clustering to the cluster overlaps in finite mixture models.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Proof of the Basic Properties
We present proofs of Propositions 1-4. We can directly calculate as follows:

MC({γ_k(x_n)}_{k,n}, {w_n}_n) = H(∑_n w_n γ(x_n)) − ∑_n w_n H(γ(x_n)) ≤(a) H(∑_n w_n γ(x_n)) ≤(b) log K,

where H denotes the Shannon entropy. The equality in (a) holds only if the components are entirely separate, and the equality in (b) holds only if the components are balanced. Thus, MC equals log K only if the components are entirely separate and balanced.
Also, by applying Jensen's inequality to the concave function x ↦ −x log x, we obtain

H(∑_n w_n γ(x_n)) ≥ ∑_n w_n H(γ(x_n)),

which is equivalent to MC({γ_k(x_n)}_{k,n}, {w_n}_n) ≥ 0. The equality holds only if the components entirely overlap.
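These bounds can be spot-checked numerically. Assuming the entropy-difference form of MC over posterior memberships (our own notation below), the sketch verifies 0 ≤ MC ≤ log K, with MC = log K for entirely separate and balanced components and MC = 0 for entirely overlapping ones:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc(gamma, w):
    """MC = H(sum_n w_n gamma_n) - sum_n w_n H(gamma_n), in nats."""
    H = lambda p: -np.sum(np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0))
    return H(w @ gamma) - np.sum(w * np.array([H(g) for g in gamma]))

N, K = 500, 4
w = np.full(N, 1.0 / N)

# random posteriors: 0 <= MC <= log K
g = rng.dirichlet(np.ones(K), size=N)
assert 0.0 <= mc(g, w) <= np.log(K)

# entirely separate and balanced components: MC = log K
sep = np.eye(K)[np.arange(N) % K]
assert np.isclose(mc(sep, w), np.log(K))

# entirely overlapping components: MC = 0
ovl = np.full((N, K), 1.0 / K)
assert np.isclose(mc(ovl, w), 0.0)
```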
Appendix A.4. Proof of Proposition 4

By applying Theorem 1 to the partition of I_1 ∪ · · · ∪ I_L into the individual sets, we can calculate that

Appendix B. Proof of the Decomposition Property
We present a proof of Theorem 1. Let

Then, we can calculate as

Step 1. First, we show that

Using Proposition 3, it is easily shown as follows:

Step 2. Second, we show that

It is also evident from Proposition 3:

Step 3. Third, we show that

To this end, we further decompose the left-hand side as

; the two terms correspond to the unconditional and conditional entropies of the latent variables, respectively. On the other hand, the true MC is defined as

Then, it is sufficient to show that

First, we show that

Indeed, by the mean-value theorem, there exist r^m_1, . . . , r^m_K between r̂_1, . . . , r̂_K and ρ_1, . . . , ρ_K, and ρ^m_∞ between 0 and ρ̂_∞, such that

(1 + log r^m_l)(r̂_l − ρ_l) (l = 1, . . . , K).

Also, from assumption (C3), if N is sufficiently large, log r^m_l and log(1 − ρ^m_∞) are finite because r^m_l and ρ^m_∞ become arbitrarily close to ρ_l (> 0) and 0, respectively. Similarly, there exist ρ^m_1, . . . , ρ^m_K between ρ̂_1, . . . , ρ̂_K and ρ_1, . . . , ρ_K such that

Also, from the central limit theorem, ρ̂_l converges to ρ_l at the speed of O_P(1/√N). Using (A2)-(A4), we can calculate as

Next, we show that

We first define the following functions for l = 1, . . . , K:

F_l(φ, ψ, x) := (r_l h_l(x) / ∑_{l'=1}^{K} r_{l'} h_{l'}(x)) log (r_l h_l(x) / ∑_{l'=1}^{K} r_{l'} h_{l'}(x)).
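The decomposition in Theorem 1 mirrors the chain rule of mutual information, and can be verified numerically. In the sketch below (our own notation; groups, weights, and posteriors are arbitrary illustrative values), the total MC over K components equals the MC among the L merged groups plus the group-weighted within-group MCs, computed with reweighted data weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def mc(gamma, w):
    """MC = H(sum_n w_n gamma_n) - sum_n w_n H(gamma_n)."""
    return entropy(w @ gamma) - np.sum(w * np.array([entropy(g) for g in gamma]))

N, K = 200, 6
gamma = rng.dirichlet(np.ones(K), size=N)  # random posterior memberships
w = np.full(N, 1.0 / N)

groups = [[0, 1], [2, 3, 4], [5]]          # partition I_1 ∪ I_2 ∪ I_3

# Upper level: merge each group's posteriors.
Gamma = np.stack([gamma[:, g].sum(axis=1) for g in groups], axis=1)
total = mc(gamma, w)
upper = mc(Gamma, w)

# Lower level: within-group MC with reweighted data weights.
within = 0.0
for l, g in enumerate(groups):
    rho = w @ Gamma[:, l]                  # group weight rho_l
    w_l = w * Gamma[:, l] / rho            # reweighted w_n given group l
    g_l = gamma[:, g] / Gamma[:, [l]]      # renormalized posteriors within l
    within += rho * mc(g_l, w_l)

assert np.isclose(total, upper + within)   # MC(total) = MC(upper) + sum_l rho_l MC(component l)
```

This is the identity exploited in the real-data analyses, where MC(total) is split into an upper-level term and per-component terms.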
They are all finite because r^m_l becomes arbitrarily close to ρ_l and Θ becomes arbitrarily small as N → ∞; condition (C1) is also employed.
To this end, we write the left-hand side as G({θ_k}_{k∈I_l}) and consider it as a function of {θ_k}_{k∈I_l}. Then, for all other parameters, G({θ_k}_{k∈I_l}) = 0. Also, the derivative of G with respect to θ_l is O_P(1) as N → ∞. Indeed, it can be rewritten as

Also, we define the posterior probabilities within h_l as γ^{(l)}_k(x) := s_k g_k(x) / h_l(x) (k ∈ I_l).
Then, the derivatives are bounded as

where it is assumed that {θ_k}_{k∈I_l} are sufficiently close to the true parameters, which holds if N is sufficiently large because of condition (C2). Therefore, by the mean-value theorem, there exist {θ^m_k}_{k∈I_l} such that

which concludes the proof.