A Unified Formulation of k-Means, Fuzzy c-Means and Gaussian Mixture Model by the Kolmogorov–Nagumo Average

Clustering is a major unsupervised learning technique and is widely applied in data mining and statistical data analysis. Typical examples include k-means, fuzzy c-means, and Gaussian mixture models, which are categorized as hard, soft, and model-based clustering, respectively. We propose a new clustering method, called Pareto clustering, based on the Kolmogorov–Nagumo average defined by a survival function of the Pareto distribution. The proposed algorithm incorporates all the aforementioned clusterings as well as maximum-entropy clustering. We introduce a probabilistic framework for the proposed method, in which we discuss the underlying distribution that guarantees consistency. We build a minorize-maximization algorithm to estimate the parameters in Pareto clustering. We compare its performance with existing methods in simulation studies and benchmark dataset analyses to demonstrate its practical utility.


Introduction
In data analysis and data mining, there are two fundamental types of methodologies: clustering and classification [1]. Clustering, which is categorized as an exploratory paradigm, detects the underlying structure behind the data and provides a rough picture before proceeding to more intensive and comprehensive data analysis [2,3]. On the other hand, classification predicts the unknown class labels of test data based on models constructed from training data with known class labels. In pattern recognition, the former is called unsupervised learning, while the latter is called supervised learning [4].
Clustering algorithms fall roughly into three categories: hierarchical, partitioning, and mixture-model-based algorithms [5]. In hierarchical clustering, each observation is initially considered as its own cluster; clusters are then merged recursively based on a similarity matrix defined beforehand, and the resulting clusters are expressed as a dendrogram. Partitioning algorithms start with a fixed number of clusters and search for the cluster centers that minimize an objective function, such as the sum of squared distances between centers and observations; all centers are found simultaneously. Model-based algorithms assume a mixture of probability distributions that generates the observations and assign each observation to one of the mixture components. A Gaussian-mixture-based approach is widely used in this context.
In this paper, we propose a new clustering method, called Pareto clustering, in the framework of quasilinear modeling [6][7][8]. It combines the cluster components in a flexible way through the Kolmogorov-Nagumo average [9]. We consider a generalized energy function as the objective function for estimating cluster parameters, which extends the energy function proposed by [10]. The objective function is built from a survival function of the Pareto distribution, which is widely used in extreme value theory [11]. We investigate the consistency of the parameters, which leads to the underlying probability distribution of the generalized energy function. We find that k-means [12,13] and fuzzy c-means [14] have underlying probability distributions with singular points at the cluster centers. This fact shows a clear difference from model-based clustering such as Gaussian-mixture modeling. Moreover, we show that quasilinear modeling based on the Kolmogorov-Nagumo average connects k-means, fuzzy c-means, and Gaussian-mixture modeling through the hyperparameters of the generalized energy function. See [15,16] for a discussion of the relation between k-means and fuzzy c-means.
The paper is organized as follows. In Section 2, we introduce the generalized energy function as the objective function of Pareto clustering and discuss the consistency of the parameters. Moreover, we show that k-means, fuzzy c-means, and the Gaussian mixture are all derived from the generalized energy function as special cases. As a result, the parameters can be estimated in a unified manner by the minorize-maximization (MM) algorithm [17], where the monotone decrease of the generalized energy function is guaranteed. In Section 3, we demonstrate the performance of Pareto clustering in simulation studies and on benchmark datasets and show its practical utility. Finally, we summarize the results of Pareto clustering and discuss extensions and applications in various scientific fields.

Generalized Energy Function
Let T be a non-negative random variable with a probability density function f(t). The survival function of T is defined as

S(t) = P(T > t) = ∫_t^∞ f(u) du. (1)

Then, for d-dimensional observations x_1, …, x_n, we define a generalized energy function to be minimized with respect to a parameter µ for clustering,

L_S(µ) = Σ_{i=1}^n S^(−1)( (1/K) Σ_{k=1}^K S(τ ||x_i − µ_k||²) ), (2)

where µ = (µ_1, …, µ_K) is the set of cluster centers and τ > 0 is the shape parameter. If we take S(t) = exp(−t), the function corresponds to the energy function proposed by [10], where 1/τ can be interpreted as the temperature in physics. The formulation in (2) is called the Kolmogorov-Nagumo average [9,18] and is widely applied in bioinformatics, ecology, fisheries, and other fields [6,8,19]. In Equation (2), the inner average (1/K) Σ_{k=1}^K S(τ ||x_i − µ_k||²) expresses an average of the probabilities that x_i belongs to the kth cluster over all K clusters; applying S^(−1) then yields the Kolmogorov-Nagumo average of the energy of x_i with a probabilistic meaning. In effect, we take the summation of the Kolmogorov-Nagumo average over the observations {x_1, …, x_n}.

Remark 1. The generalized energy function (2) is related to the Archimedean copula defined by

C(u_1, …, u_K) = S( S^(−1)(u_1) + ⋯ + S^(−1)(u_K) )

for {u_k}_{k=1}^K in (0, 1); cf. [20] for an introductory discussion. In principle, the generalized energy function maps a vector of K cluster energy functions to an integrated energy function, whereas the Archimedean copula maps K marginal cumulative distribution functions to a joint cumulative distribution function. In this way, the generalized energy function expresses an interactive relation among cluster energy functions, analogous to the way the Archimedean copula expresses the correlation among variables.
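To make the construction concrete, the sketch below evaluates a Kolmogorov-Nagumo-average energy for a generic survival function. The per-cluster energy τ||x_i − µ_k||² and the overall scaling are assumptions for illustration; with S(t) = exp(−t) the construction becomes a log-sum-exp, i.e., a maximum-entropy (Rose-type) energy.

```python
import numpy as np

def kn_energy(X, mu, S, S_inv, tau=0.5):
    """Kolmogorov-Nagumo-average energy: sum_i S^{-1}((1/K) sum_k S(e_ik)),
    with assumed per-cluster energy e_ik = tau * ||x_i - mu_k||^2."""
    X, mu = np.asarray(X, float), np.asarray(mu, float)
    e = tau * ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # (n, K)
    return float(S_inv(S(e).mean(axis=1)).sum())

# Exponential survival function: S(t) = exp(-t), S^{-1}(u) = -log(u).
S = lambda t: np.exp(-t)
S_inv = lambda u: -np.log(u)

X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
good = kn_energy(X, [[0.0, 0.0], [5.0, 5.0]], S, S_inv)    # centers near the data
bad = kn_energy(X, [[10.0, 10.0], [20.0, 20.0]], S, S_inv)  # centers far away
```

Centers close to the data yield a smaller energy, which is what the minimization in (2) exploits.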
We consider an estimator of the generalized energy function as

L̂_S(µ) = (1/n) Σ_{i=1}^n E_µ(x_i),

where, for a set of cluster centers µ, we define

E_µ(x) = S^(−1)( (1/K) Σ_{k=1}^K S(τ ||x − µ_k||²) ).

If we assume that x_i (i = 1, …, n) is distributed according to a probability density function p(x, µ*), the expected generalized energy function is given by

L̄_S(µ) = ∫ E_µ(x) p(x, µ*) dx,

so that E[L̂_S(µ)] = L̄_S(µ). This property is a key idea in the following discussion.

Lemma 1. Assume that the survival function S(t) in (1) is convex in t. We define a function G of (µ*, µ) as

G(µ*, µ) = L̄_S(µ) − L̄_S(µ*).
Then G(µ*, µ) ≥ 0 for any µ, with equality if and only if µ = µ*, which follows from the convexity of S^(−1).

Theorem 1. Suppose that p(x, µ*) has the form

p(x, µ*) = S(E_{µ*}(x)) / Z(µ*), (17)

where Z(µ*) > 0 is a normalizing constant. Then we have

argmin_µ L̄_S(µ) = µ*. (18)

Proof. By Lemma 1, L̄_S(µ) ≥ L̄_S(µ*) for any µ, with equality only at µ = µ*, which concludes (18).
We note that the estimator μ̂ is asymptotically consistent for the true parameter µ* if the probability density function has the form in (17).

Pareto Distribution
Let us consider a generalized Pareto distribution, whose survival function and its inverse are defined by

S(t) = (1 + βt)^(−1/β), S^(−1)(u) = (u^(−β) − 1) / β, t ≥ 0,

where β > 0 denotes the shape parameter. Then the generalized energy function is

L_{τ,β}(µ) = Σ_{i=1}^n S^(−1)( (1/K) Σ_{k=1}^K (1 + βτ ||x_i − µ_k||²)^(−1/β) ).

If we consider β → 0, then S(t) → exp(−t), and the generalized energy function is reduced to the energy function proposed by [10]. Hence, we can understand that Rose's clustering (maximum-entropy clustering) is generated by the survival function of an exponential distribution. The gradient with respect to µ_k leads exactly to the estimating equations of fuzzy c-means if we take β = m − 1 [14]. Furthermore, we have

lim_{τ→∞, β→0} L_{τ,β}(µ) / τ = Σ_{i=1}^n min_k ||x_i − µ_k||²,

which is the loss function of k-means; the corresponding survival function is lim_{β→0} (1 + βt)^(−1/β) = exp(−t). Note that the loss function is directly derived from (2) in this limit. In addition, the generalized energy function is non-negative because S(0) = 1 and S is decreasing.
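The generalized-Pareto survival function, its inverse, and the β → 0 exponential limit used for the maximum-entropy reduction can be checked numerically:

```python
import numpy as np

def pareto_survival(t, beta):
    """Generalized-Pareto survival function S(t) = (1 + beta*t)^(-1/beta)."""
    if beta == 0.0:
        return np.exp(-t)          # exponential limit as beta -> 0
    return (1.0 + beta * t) ** (-1.0 / beta)

def pareto_survival_inv(u, beta):
    """Inverse survival function S^{-1}(u) = (u^(-beta) - 1) / beta."""
    if beta == 0.0:
        return -np.log(u)
    return (u ** (-beta) - 1.0) / beta
```

For small β the survival function is already numerically close to exp(−t), which is the sense in which Rose's clustering appears as a limiting case.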

Fréchet Distribution
Next, we consider the Fréchet distribution with the survival function defined as

S(t) = 1 − exp(−t^γ),

where γ < 0 is the shape parameter. The generalized energy function is obtained by substituting this survival function into (2). This energy function is also reduced to that of the k-means algorithm, as shown in the Pareto distribution case; we confirm this reduction as τ goes to 0. When we assume unbiasedness for the estimating function in (30), the underlying distribution has a density function proportional to the survival function evaluated at the generalized energy. Then we consider the limit of τ to ∞, which recovers the maximum-entropy energy in (22). This also leads to fuzzy c-means if we take γ = 1/(1 − m) [14].
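A quick numerical check of the Fréchet-type survival function; the parameterization S(t) = 1 − exp(−t^γ) with γ < 0 follows the definition above, and the sketch only verifies that it decreases monotonically from 1 toward 0, as a survival function must.

```python
import numpy as np

def frechet_survival(t, gamma=-1.0):
    """Frechet-type survival function S(t) = 1 - exp(-t**gamma), gamma < 0."""
    return 1.0 - np.exp(-(t ** gamma))

t = np.linspace(0.01, 50.0, 500)
s = frechet_survival(t)
```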

Estimation of Variances and Mixing Proportions in Clusters
Instead of the Euclidean distance, Bezdek et al. [14] considered a common variance structure, that is, a Mahalanobis-type distance with a single norm-inducing matrix shared by all clusters. In contrast, we estimate a distinct Σ_k for each µ_k.
For this purpose, we modify the generalized energy function in (2) to allow for variances Σ_1, …, Σ_K and mixing proportions π_1, …, π_K (Σ_{k=1}^K π_k = 1 and π_k ≥ 0 for k = 1, …, K) as

L_S(θ) = Σ_{i=1}^n S^(−1)( Σ_{k=1}^K π_k |Σ_k|^(−1/2) S( τ (x_i − µ_k)ᵀ Σ_k^(−1) (x_i − µ_k) ) ),

where θ = (µ_k, Σ_k, π_k)_{k=1}^K. We assume that S(t) is convex so that the domain of S^(−1) can be extended from [0, 1] to [0, ∞) to allow for the factors |Σ_k|^(−1/2). The estimator L̂_S(θ) of this modified generalized energy function is given by the corresponding sample average, and the expected generalized energy function is given by

L̄_S(θ) = ∫ E_θ(x) p(x, θ*) dx,

where p(x, θ*) is the underlying probability density function. For a cumulative distribution function F(t) = 1 − S(t), the modified energy function defined through F generally differs from L_S(θ); on the other hand, it always holds that L_S(µ) = L_F(µ) for the original generalized energy function in (2).
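The modified energy can be sketched as follows. The placement of τ in front of the Mahalanobis distance and the exact overall scaling are assumptions for illustration; they are chosen so that, for β = 0 and τ = 1/2, the value coincides with the Gaussian-mixture negative log-likelihood apart from the (2π)^(−d/2) constant, matching the normal-mixture reduction discussed later.

```python
import numpy as np

def modified_energy(X, mu, Sigma, pi, tau=0.5, beta=0.0):
    """Generalized energy with covariances and mixing proportions (assumed form):
    L = sum_i S^{-1}( sum_k pi_k |Sigma_k|^{-1/2} S(tau * d_M(x_i, mu_k)^2) ),
    using the Pareto survival function (beta = 0 gives the exponential limit)."""
    def S(t):
        return np.exp(-t) if beta == 0.0 else (1.0 + beta * t) ** (-1.0 / beta)
    def S_inv(u):
        return -np.log(u) if beta == 0.0 else (u ** -beta - 1.0) / beta
    X, mu = np.asarray(X, float), np.asarray(mu, float)
    total = 0.0
    for x in X:
        inner = 0.0
        for k in range(len(pi)):
            diff = x - mu[k]
            d2 = diff @ np.linalg.solve(Sigma[k], diff)  # Mahalanobis distance^2
            inner += pi[k] * np.linalg.det(Sigma[k]) ** -0.5 * S(tau * d2)
        total += S_inv(inner)
    return total

X = np.array([[0.0, 0.0]])
mu = np.array([[0.0, 0.0], [1.0, 0.0]])
Sigma = np.array([np.eye(2), np.eye(2)])
pi = np.array([0.5, 0.5])
val_gauss = modified_energy(X, mu, Sigma, pi, tau=0.5, beta=0.0)
val_pareto = modified_energy(X, mu, Sigma, pi, tau=0.5, beta=1.0)
```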
Similarly to (6), we define

E_θ(x) = S^(−1)( Σ_{k=1}^K π_k |Σ_k|^(−1/2) S( τ (x − µ_k)ᵀ Σ_k^(−1) (x − µ_k) ) ),

and we notice that ∫ S(E_θ(x)) dx is also independent of θ.

Lemma 2. Assume that the survival function S(t) in (1) is convex in t. We define a function G of (θ*, θ) as

G(θ*, θ) = L̄_S(θ) − L̄_S(θ*).

Then G(θ*, θ) ≥ 0 for any θ and θ*, with equality if and only if θ = θ*.
Proof. It is obvious from Lemma 1 and the fact that ∫ S(E_θ(x)) dx is independent of θ.
From Lemma 2, we can easily show the following theorem regarding L̄_S(θ).

Theorem 2.
If p(x, θ*) has the form

p(x, θ*) = S(E_{θ*}(x)) / Z(θ*),

where Z(θ*) > 0 is a normalizing constant, then we have

argmin_θ L̄_S(θ) = θ*.

For the Pareto distribution, we have from (37) the underlying probability density function

p_{τ,β}(x, θ*) = (1/Z_{τ,β}(θ*)) Σ_{k=1}^K π_k* |Σ_k*|^(−1/2) (1 + βτ (x − µ_k*)ᵀ (Σ_k*)^(−1) (x − µ_k*))^(−1/β),

where Z_{τ,β}(θ*) is a normalizing constant. When β → 0, the generalized energy function becomes

L_{τ,0}(θ) = −Σ_{i=1}^n log Σ_{k=1}^K π_k |Σ_k|^(−1/2) exp(−τ (x_i − µ_k)ᵀ Σ_k^(−1) (x_i − µ_k)),

which is the negative log-likelihood of a normal mixture distribution apart from the constant term (2π)^(−d/2) when τ = 1/2. Similarly, we obtain the estimating equations of fuzzy c-means allowing for the Mahalanobis distance when τ → ∞. Moreover, we have k-means with the Mahalanobis distance when τ → ∞ and β → 0. For the other extreme cases, we observe that both lim_{β→∞} L_{τ,β}(θ) and lim_{τ→0} L_{τ,β}(θ) diverge or converge to 0 depending on the values of π_k and Σ_k (k = 1, …, K). Hence, we choose large values for τ and small values for β in the subsequent data analysis.

Estimating Algorithm
The direct optimization of L_{τ,β}(θ) in (46) is difficult due to the mixture structure. Thus, we employ the ideas of the expectation-maximization (EM) algorithm [21] and the minorize-maximization (MM) algorithm [17], similar to [19]. Our proposed clustering method (Pareto clustering) is summarized in Algorithm 1.
Repeat the following steps for t = 0, …, T − 1. The initial values of (µ_k) are determined by hierarchical clustering, in a similar way to the algorithm of [22]. The derivation of the estimating algorithm is as follows. First, applying Jensen's inequality to the convex function φ(t) defined in (49) yields a minorization of the objective, where q_k(x_i) is a positive weight with Σ_{k=1}^K q_k(x_i) = 1. The equality holds if and only if

q_k(x_i) = π_k |Σ_k|^(−1/2) S(τ d_k(x_i)²) / Σ_{j=1}^K π_j |Σ_j|^(−1/2) S(τ d_j(x_i)²),

where d_k(x_i)² = (x_i − µ_k)ᵀ Σ_k^(−1) (x_i − µ_k). Based on q_k^{(t)}, maximizing the minorized objective with respect to µ_k and Σ_k gives weighted-average updates of the centers and covariances. Next, we maximize with respect to the mixing proportions under the constraint Σ_{k=1}^K π_k = 1 with a Lagrange multiplier λ, which yields

π_k = (1/n) Σ_{i=1}^n q_k^{(t)}(x_i).

Remark 2. The generalized energy function in (46) is monotonically decreasing in the estimating algorithm. See Appendix B for more details.
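One sweep of the estimating algorithm can be sketched in the exponential limit β → 0, where the membership weight reduces to q_k(x_i) ∝ π_k |Σ_k|^(−1/2) exp(−τ d_k(x_i)²) and all updates are EM-like weighted averages; for general β the center and covariance updates reweight by the derivative of S, which is omitted here.

```python
import numpy as np

def mm_step(X, mu, Sigma, pi, tau=0.5):
    """One MM sweep (beta -> 0 limit): membership weights, then weighted updates."""
    n, K = X.shape[0], len(pi)
    q = np.empty((n, K))
    for k in range(K):
        diff = X - mu[k]
        d2 = np.einsum('ij,ij->i', diff @ np.linalg.inv(Sigma[k]), diff)
        q[:, k] = pi[k] * np.linalg.det(Sigma[k]) ** -0.5 * np.exp(-tau * d2)
    q /= q.sum(axis=1, keepdims=True)                 # normalize memberships
    pi_new = q.mean(axis=0)                           # mixing proportions
    mu_new = (q.T @ X) / q.sum(axis=0)[:, None]       # weighted centers
    Sigma_new = np.empty((K, X.shape[1], X.shape[1]))
    for k in range(K):
        diff = X - mu_new[k]
        Sigma_new[k] = (q[:, k, None] * diff).T @ diff / q[:, k].sum()
    return mu_new, Sigma_new, pi_new, q
```

For τ = 1/2 this sweep coincides with the EM update of a Gaussian mixture, which is the β → 0, τ = 1/2 special case noted in the text.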

Remark 3.
The estimating algorithm of fuzzy c-means by [14] is given as

u_ik^{(t)} = ||x_i − µ_k^{(t)}||^(−2/(m−1)) / Σ_{j=1}^K ||x_i − µ_j^{(t)}||^(−2/(m−1)),
µ_k^{(t+1)} = Σ_{i=1}^n (u_ik^{(t)})^m x_i / Σ_{i=1}^n (u_ik^{(t)})^m,

where u_ik^{(t)} is called the membership function of x_i in cluster k at iteration step t. These are special cases of (52) and (53) with τ → ∞, Σ_k = I, π_k = 1/K, and β = m − 1. Hence, we observe that the original algorithm of fuzzy c-means can be interpreted as the EM algorithm.
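The fuzzy c-means sweep in this remark can be sketched directly (standard Bezdek updates; a small floor on the squared distances guards the case x_i = µ_k):

```python
import numpy as np

def fcm_step(X, mu, m=2.0):
    """One fuzzy c-means sweep: membership update, then m-th-power weighted centers."""
    X, mu = np.asarray(X, float), np.asarray(mu, float)
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)   # (n, K)
    d2 = np.maximum(d2, 1e-12)
    # u_ik = 1 / sum_j (d_ik^2 / d_ij^2)^(1/(m-1))
    u = 1.0 / ((d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1))).sum(axis=2)
    um = u ** m
    mu_new = (um.T @ X) / um.sum(axis=0)[:, None]
    return mu_new, u
```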

Remark 4.
In analogy with the membership function of fuzzy c-means by [14], we define q_k^{(t)}(x_i) in (52) as the membership function of x_i in cluster k at iteration step t in Pareto clustering. Hence, we estimate cluster C_k as

C_k = { x_i : q_k^{(T)}(x_i) ≥ q_j^{(T)}(x_i) for all j = 1, …, K },

where ∪_{k=1}^K C_k = {x_1, …, x_n}.

Evaluation of Clustering Methods
We compare the performances of k-means, fuzzy c-means, Gaussian mixture modeling (Gaussian), partitioning around medoids (PAM), and Pareto clustering. To implement these methods, we use the kmeans function in the stats package [24], the cmeans function in the e1071 package [14], the Mclust function in the mclust package [22], and the pam function in the cluster package [25] in the statistical software R, with the default settings of each function. In Pareto clustering, τ = 0.5 and β = 1 are used as the default settings. We assume that the number of clusters K is known and compare the performances as in [26].

Metrics
A cluster C_k (k = 1, …, K) estimated by a clustering method is evaluated against a predefined set of reference classes D_ℓ (ℓ = 1, …, L) using

Precision(C_k, D_ℓ) = |C_k ∩ D_ℓ| / |C_k|, Recall(C_k, D_ℓ) = |C_k ∩ D_ℓ| / |D_ℓ|,

where Recall(C_k, D_ℓ) = Precision(D_ℓ, C_k). Precision(C_k, D_ℓ) counts the proportion of data points in cluster C_k belonging to class ℓ. Hence, max_ℓ Precision(C_k, D_ℓ) represents the purity of cluster C_k regarding the classes. By taking the weighted average, we have

Purity = Σ_{k=1}^K (|C_k| / n) max_ℓ Precision(C_k, D_ℓ),

where n is the sample size. Recall(C_k, D_ℓ) counts the proportion of data points in the class set D_ℓ estimated to be in cluster C_k. Precision and recall correspond to the positive predictive value and sensitivity, respectively [27]. A metric combining precision and recall is proposed by [28] as

F(C_k, D_ℓ) = 2 Precision(C_k, D_ℓ) Recall(C_k, D_ℓ) / ( Precision(C_k, D_ℓ) + Recall(C_k, D_ℓ) ),

which is the harmonic mean of precision and recall and is called the F-measure [29]. The cluster-level similarity between the estimated centers μ̂ = (μ̂_1, …, μ̂_K) and the reference (ground-truth) centers µ* = (µ_1*, …, µ_K*) is the centroid index (CI) proposed by [26]. For each μ̂_ℓ, let q_ℓ be the index of the element of the reference centers µ* that is nearest to μ̂_ℓ; the function orphan(µ_k*) indicates whether µ_k* is an isolated element (orphan), that is, not nearest to any element of μ̂. The one-sided index CI_1(μ̂, µ*) = Σ_{k=1}^K orphan(µ_k*) indicates the dissimilarity between μ̂ and µ*. Due to the asymmetry of CI_1(μ̂, µ*) with respect to μ̂ and µ*, we take the maximum, CI(μ̂, µ*) = max{ CI_1(μ̂, µ*), CI_1(µ*, μ̂) }. Hence, CI(μ̂, µ*) measures how many clusters are differently located between μ̂ and µ*.
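Purity and the class-weighted F-measure can be sketched as follows; the weighting conventions here are one standard choice and may differ in detail from the cited references.

```python
import numpy as np

def purity_and_f(labels_pred, labels_true):
    """Cluster purity and class-weighted F-measure (one standard convention)."""
    labels_pred = np.asarray(labels_pred)
    labels_true = np.asarray(labels_true)
    n = labels_pred.size
    purity = 0.0
    for k in np.unique(labels_pred):
        # best per-cluster precision, weighted by cluster size
        purity += max(((labels_pred == k) & (labels_true == l)).sum()
                      for l in np.unique(labels_true))
    purity /= n
    f_total = 0.0
    for l in np.unique(labels_true):
        nl = (labels_true == l).sum()
        best = 0.0
        for k in np.unique(labels_pred):
            inter = ((labels_pred == k) & (labels_true == l)).sum()
            if inter:
                p = inter / (labels_pred == k).sum()   # precision
                r = inter / nl                         # recall
                best = max(best, 2 * p * r / (p + r))  # harmonic mean
        f_total += (nl / n) * best                     # weight by class size
    return float(purity), float(f_total)
```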
Another metric measuring the similarity between μ̂ and µ* is the mean squared error (MSE) between matched centers averaged over the number of clusters K,

MSE(μ̂, µ*) = (1/K) Σ_{k=1}^K ||μ̂_k − µ_{q_k}*||²,

where q_k is the index of the reference center nearest to μ̂_k. Unlike Purity and F-value, MSE can be calculated from the estimated and reference centers μ̂ and µ* alone. This property is useful in situations where the reference class sets D_1, …, D_L are difficult to determine but µ* is easily identified. We use MSE in the simulation studies to evaluate the accuracy of μ̂, and Purity and F-value in the analysis of benchmark datasets.
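The centroid index can be sketched as a symmetric orphan count: map each center in one set to its nearest center in the other, count the reference centers that are never selected, and take the larger of the two one-sided counts. Tie-handling is an assumption here.

```python
import numpy as np

def centroid_index(mu_hat, mu_star):
    """Centroid index CI: max of the two one-sided orphan counts."""
    mu_hat, mu_star = np.asarray(mu_hat, float), np.asarray(mu_star, float)
    def one_sided(a, b):
        # set of b-indices that are nearest to some element of a
        nearest = {int(np.argmin(((b - x) ** 2).sum(axis=1))) for x in a}
        return len(b) - len(nearest)        # centers of b never selected
    return max(one_sided(mu_hat, mu_star), one_sided(mu_star, mu_hat))
```

CI = 0 means the two center sets are in one-to-one correspondence at the cluster level.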

Simulation Studies
We generate samples according to the density function p_{τ,β}(θ*) in (50) using the Metropolis-Hastings algorithm [30,31]. Figure 1 illustrates the perspective plots and contour plots for (τ, β) = (0.5, 1), (0.5, 0), (10, 1). The shape of p_{τ,β}(θ*) varies according to the values of τ and β. The Gaussian mixture distribution corresponds to τ = 0.5 and β = 0 in panel (b). When β = 1, the variance of each component increases and the contours connect with each other. On the other hand, for a large value of τ = 10, the distribution shows high peaks around the centers. This indicates that p_{τ,β}(θ*), which includes fuzzy c-means when τ → ∞, has a quite different shape from the Gaussian mixture distribution. Other versions of the shapes are illustrated in Appendix C. The performance of each method is evaluated by MSE based on 100 simulated samples with sample size n = 3000.
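Sampling from p_{τ,β}(θ*) can be sketched with a random-walk Metropolis-Hastings kernel, since the normalizing constant cancels in the acceptance ratio. The one-dimensional, two-center target below (τ = 0.5, β = 1, unit scales, equal proportions) is a hypothetical example for illustration, not the paper's two-dimensional simulation setting.

```python
import numpy as np

def mh_sample(log_p, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings for an unnormalized log-density."""
    rng = np.random.default_rng(seed)
    x, lp = float(x0), log_p(float(x0))
    out = np.empty(n_samples)
    for i in range(n_samples):
        prop = x + step * rng.standard_normal()
        lp_prop = log_p(prop)
        if np.log(rng.random()) < lp_prop - lp:   # accept with prob min(1, ratio)
            x, lp = prop, lp_prop
        out[i] = x
    return out

def log_p(x, tau=0.5, beta=1.0, centers=(-2.0, 2.0)):
    # log of sum_k pi_k * (1 + beta*tau*(x - mu_k)^2)^(-1/beta), pi_k = 1/2
    return float(np.log(sum(0.5 * (1.0 + beta * tau * (x - c) ** 2) ** (-1.0 / beta)
                            for c in centers)))

draws = mh_sample(log_p, x0=0.0, n_samples=20000)
```

With β = 1 the components have heavy, Cauchy-like tails, so the chain moves between the two centers easily.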

Benchmark Data Analysis
The performance of our proposed method is evaluated using the benchmark datasets prepared by [32]. They include a variety of datasets with low and high cluster overlap, various sample sizes, low and high dimensionality, and unbalanced cluster sizes. Hence, these datasets are suitable for clarifying the statistical performance of the clustering methods. In this setting, we compare the performance of k-means, fuzzy c-means, Gaussian, PAM, and Pareto clustering, as well as variants of Pareto clustering with several values of (τ, β), as explained in Table S1. The characteristics of the benchmark datasets, such as sample size, number of clusters, and dimensionality, are summarized in Table S2. Figure 2 illustrates the results of MSE in the simulation studies. Pareto clustering provides the best performance in panel (a), where the samples are generated by the underlying distribution p_{0.5,1}(θ*) of Pareto clustering. The shape of the distribution is similar to a Gaussian mixture; however, the variance of each component becomes larger and the contour lines are connected to each other, as in panel (a) of Figure 1. On the other hand, in panel (c), the variance of each component becomes smaller and the contour lines are completely separated. In both cases, the performance of the Gaussian mixture is clearly degraded. In the case of panel (b), in which the data are generated from the Gaussian mixture, the performances are comparable to each other, suggesting that k-means, fuzzy c-means, PAM, and Pareto clustering are robust to the underlying distributions to some extent.

Results
In the benchmark data analysis, the metrics Purity, F-value, and CI are evaluated in Tables S3-S5, where the variance Σ_k and mixing proportion π_k (k = 1, …, K) in Pareto clustering are estimated. For the two-dimensional shape datasets such as Flame, Compound, D31, Aggregation, Jain, Pathbased, and Spiral in the upper rows of Table S3, existing methods such as k-means, fuzzy c-means, PAM, and the Gaussian mixture outperform our proposed methods. In high-dimensional data with d = 1024 (Dim1024), k-means and the Gaussian mixture do not work well, whereas the other methods achieve the maximum value (1) of Purity. For datasets with a large number of clusters, D31 (K = 31) and A3 (K = 50), PAM performs best. For datasets with a large sample size of n ≥ 5000 and a moderate number of clusters (K = 8, 15), our proposed method performs best. As for the effect of τ, it barely affects the performance of our proposed method. On the other hand, β slightly affects the performance, so that the intermediates among the Gaussian mixture, Pareto clustering, k-means, and fuzzy c-means, such as GP, GPKF_1, GPKF_10, and GPKF_100, show relatively good performance as a whole. We observe similar tendencies for the F-value (Table S4). As for the CI, the values are relatively small for all methods, suggesting that cluster locations are properly estimated. However, some methods do not work for some datasets: the Gaussian mixture for A3 (CI = 18), Birch1 (CI = 34), and Birch2 (CI = 49); k-means for D31 (CI = 7), Dim1024 (CI = 4), A3 (CI = 6), Birch1 (CI = 12), and Birch2 (CI = 23); and fuzzy c-means for D31 (CI = 5), A3 (CI = 7), Birch1 (CI = 18), and Birch2 (CI = 25). On the other hand, PAM and our proposed methods show stable results. The results with Σ_k = I and π_k = 1/K fixed (not estimated) in Pareto clustering are shown in Tables S6-S8.

Discussion
We propose a new clustering method based on the generalized energy function derived from the Kolmogorov-Nagumo average. The survival function used in the generalized energy function plays an important role in ensuring the consistency of the parameters, which is shown in Lemma 1 using the property of the divergence G(µ*, µ). We consider two examples of the survival function, based on the Pareto and Fréchet distributions, and show a connection among k-means, fuzzy c-means, and the Gaussian mixture, leading to new methods that are intermediates among them. For the underlying distribution of our method in (50), we observe that k-means and fuzzy c-means do not have probabilistic interpretations because the corresponding underlying distributions become singular. We also propose an estimating algorithm for cluster locations, variances, and mixing proportions using the MM algorithm.
Simulation studies and benchmark data analysis show that intermediates among k-means, fuzzy c-means, and the Gaussian mixture perform well. This observation suggests that our proposed method has a wide range of applications in which k-means, fuzzy c-means, and the Gaussian mixture are used. For example, simultaneous deep learning and clustering [33], in which a deep neural network and k-means are jointly used, image segmentation using fuzzy c-means in a deep neural network [34], an application of fuzzy c-means to classification problems [35], and parallel computation for large datasets by fuzzy c-means [36] can all be investigated in the framework of the generalized energy function based on the Pareto distribution.
As for the tuning parameters τ and β, we consider an approach using leave-one-out cross-validation in the Supplementary Materials in order to improve the clustering performance. The objective function in the leave-one-out cross-validation is derived from an anchor loss as in [37] to estimate the optimal values of τ and β properly. The benchmark data analysis suggests that the performance is insensitive to the value of τ but sensitive to the value of β. Hence, this approach should be useful for determining the optimal value of β.
Banerjee et al. [38] proposed a clustering method based on Bregman divergences and clarified the relationship between exponential families and the corresponding Bregman divergences. They consider hard and soft clustering separately; the former corresponds to k-means-style clustering and the latter to mixture-model clustering. In our proposed model, the tuning parameters τ and β bridge the gap between them, and the performances are investigated in simulation studies and benchmark datasets. Extending our method by replacing the squared distance with a Bregman divergence should improve its practical flexibility and utility. When the β- or γ-divergence is used, the clustering method should be robust to contamination in the observations, as suggested by [39,40].
It is well known that the MM and EM algorithms converge to a local optimum and that the resulting clusters are sensitive to initial values [41]. One way to circumvent this difficulty is to prepare several sets of initial values and select the best one among them, as in the global k-means algorithm [42]. Another approach is to combine the MM algorithm with a genetic algorithm (GA) to explore the search space for the optimal solution thoroughly [41,43]. Both approaches can be incorporated into Pareto clustering to make it robust to initial values and to escape from local optima.

Supplementary Materials: The following are available at https://www.mdpi.com/article/10.3390/e23050518/s1. A: Notations of methods and characteristics of benchmark datasets, Figure S1: Summary of clustering methods, Table S2: Sample sizes, numbers of clusters and dimensions of benchmark datasets. B: Results of benchmark data with Σ_k and π_k estimated in Pareto clustering, Table S3: The result of Purity, Table S4: The result of F-values, Table S5: The result of Centroid index. C: Results of benchmark data with Σ_k = I in Pareto clustering, Table S6: The result of Purity (Σ_k = I), Table S7: The result of F-values (Σ_k = I), Table S8: The result of Centroid index (Σ_k = I).

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Derivation of the Volume Constant v_d
We define

v_d = ∫_{R^d} S(||z||²) dz.

Then, from the polar coordinate system, we have

v_d = (2π^{d/2} / Γ(d/2)) ∫_0^∞ S(r²) r^{d−1} dr, (A3)

where the last equality holds under the condition that lim_{t→∞} t^{d/2} S(t) = 0. Hence, v_d is finite and does not depend on the cluster parameters.