Variational Information Bottleneck for Unsupervised Clustering: Deep Gaussian Mixture Embedding

In this paper, we develop an unsupervised generative clustering framework that combines the variational information bottleneck and the Gaussian mixture model. Specifically, in our approach, we use the variational information bottleneck method and model the latent space as a mixture of Gaussians. We derive a bound on the cost function of our model that generalizes the Evidence Lower Bound (ELBO) and provide a variational inference type algorithm that allows computing it. In the algorithm, the coders’ mappings are parametrized using neural networks, and the bound is approximated by Markov sampling and optimized with stochastic gradient descent. Numerical results on real datasets are provided to support the efficiency of our method.


Introduction
Clustering consists in partitioning a given data set into various groups (clusters) based on some similarity metric, such as Euclidean distance, L 1 norm, L 2 norm, L ∞ norm, the popular logarithmic loss measure or others.The principle is that each cluster should contain elements of the data that are closer to each other than to any other element outside that cluster, in the sense of the defined similarity measure.If the joint distribution of the clusters and data is not known, one should operate blindly in doing so, i.e., using only the data elements at hand; and the approach is called unsupervised clustering.Unsupervised clustering is perhaps one of the most important tasks of unsupervised machine learning algorithms nowadays, due to a variety of application needs and connections with other problems.
Examples of unsupervised clustering algorithms include the so-popular K-means [1] and expectation maximization (EM) [2].The K-means algorithm partitions the data in a manner that the Euclidean distance among the members of each cluster is minimized.With the EM algorithm, the underlying assumption is that the data comprises a mixture of Gaussian samples, namely a Gaussian Mixture Model (GMM); and one estimates the parameters of each component of the GMM while simultaneously associating each data sample to one of those components.Although they offer some advantages in the context of clustering, these algorithms suffer from some strong limitations.For example, it is well known that the K-means is highly sensitive to both the order of the data and scaling; and the obtained accuracy depends strongly on the initial seeds (in addition to that it does not predict the number of clusters or K-value).The EM algorithm suffers mainly from low convergence, especially for high dimensional data.
Recently, a new approach has emerged that seeks to perform inference on a transformed domain (generally referred to as latent space), not the data itself.The rationale is that because the latent space often has fewer dimensions it is more convenient computationally to perform inference (clustering) on it rather than on the high dimensional data directly.A key aspect then is how to design a latent space that is amenable to accurate low-complex unsupervised clustering, i.e., one that preserves only those features of the observed high dimensional data that are useful for clustering while removing out all redundant or non-relevant information.Along this line of work, we can mention [3] which utilizes Principal Component Analysis (PCA) [4] for dimensionality reduction followed by K-means for clustering the obtained reduced dimension data; or [5] which uses a combination of PCA and the EM algorithm.Other works that use alternatives for the linear PCA include Kernel PCA [6], which employs PCA in a non-linear fashion to maximize variance in the data.
The usage of deep neural networks (DNN) for unsupervised clustering of high dimensional data on a lower dimensional latent space has attracted considerable attention, especially with the advent of autoencoder (AE) learning and the development of powerful tools to train them using standard backpropagation techniques [7,8].Advanced forms include Variational autoencoders (VAE) [7,8] which are generative variants of AE that regularize the structure of the latent space and the more general Variational Information Bottleneck (VIB) of [9] which is a technique that is based on the Information Bottleneck method [10] and seeks a better trade-off between accuracy and regularization than VAE via the introduction of a Lagrange-type parameter s which controls that trade-off and whose optimization is similar to deterministic annealing or stochastic relaxation.
In this paper, we develop an unsupervised generative clustering framework that combines VIB and the Gaussian Mixture Model.Specifically, in our approach we use the variational information bottleneck method and model the latent space as a mixture of Gaussians.We derive a bound on the cost function of our model that generalizes the evidence lower bound (ELBO); and provide a variational inference type algorithm that allows to compute it.In the algorithm, the coders' mappings are parametrized using neural networks (NN) and the bound is approximated by Markov sampling and optimized with stochastic gradient descent.Also, we show how tuning the hyper-parameter s appropriately by gradually increasing its value with iterations (number of epochs) results in a better accuracy.Furthermore, the application of our algorithm to the unsupervised clustering of various datasets, including the MNIST [11], REUTERS [12] and STL-10 [13], allows a better clustering accuracy than previous state of the art algorithms.For instance, we show that our algorithm performs better than the variational deep embedding (VaDE) algorithm of [14] which is based on VAE and performs clustering by maximizes the ELBO and can be seen as a specific case of our algorithm obtained by setting s = 1.Our algorithm also generalizes the VIB of [9] which models the latent space as an isotropic Gaussian which is generally not expressive enough for the purpose of unsupervised clustering.Other related works, but which are of lesser relevance to the contribution of this paper, are the deep embedded clustering (DEC) of [15], the improved deep embedded clustering (IDEC) of [16] and [17].For a detailed survey of clustering with deep learning, the readers may refer to [18]. Encoder

Problem Definition and Model
Consider a dataset that is composed of N samples {x i } N i=1 which we wish to partition into |C| ≥ 1 clusters.Let C = {1, . . ., |C|} be the set of all possible clusters; and C designate a categorical random variable that lies in C and stands for the index of the actual cluster.If X is a random variable that models elements of the dataset, given X = x i induces a probability distribution on C which the learner should learn.Thus, mathematically the problem is that of estimating the values of the unknown conditional probability P C|X (•|x i ) for all elements x i of the dataset.The estimates are sometimes referred to as the assignment probabilities.
As mentioned in the introduction section, we use the VIB framework and model the latent space as a GMM.The resulting model is depicted in Figure 1, where the parameters π c , µ c , Σ c , for all values of c ∈ C, are to be optimized jointly with those of the employed NNs as instantiation of the coders.Also, the assignment probabilities are estimated based on the values of latent space vector instead of the observation themselves, i.e., P C|U = Q C|U .In the rest of this section, we elaborate on the inference and generative network models for our method.

Inference network model
We assume that an observed data x i is generated from a GMM with |C| components.Then, the latent variable u i is inferred according the following procedure: 1.One of the components of the GMM is chosen according to a categorical variable C.

2.
The data x i is generated from the c-th multivariate Gaussian, i.e., P X|C ∼ N (x; μc , Σc ).

The representation
For the inference network, the following Markov chain holds

Generative network model
Since encoder extracts useful representations of the dataset and we assume that the dataset is generated from a GMM, we model our latent space also with a mixture of Gaussians.To do so, the categorical variable C is embedded with the latent representation U.The reconstruction of the dataset is generated according to the following procedure: 1.One of the components of the GMM is chosen according to a categorical variable C, with a prior distribution Q C .
2. The latent representation is generated from the c-th component, i.e., Q U|C ∼ N (u; µ c , Σ c ).
3. The decoder maps the latent representation u i to the estimation of the source xi by using the mapping Q U|X .
For the generative network, the following Markov chain holds In this section we present our clustering method.First, we provide a general cost function for the problem of the unsupervised clustering that we study here based on the variational IB framework; and we show that it generalizes the ELBO bound developed in [14].We then parametrize our model using NNs whose parameters are optimized jointly with those of the GMM.Furthermore, we discuss the influence of the hyper-parameter s that controls optimal trade-offs between accuracy and regularization.

Brief review of variational Information Bottleneck for unsupervised learning
As described in Section 2, the stochastic encoder P U|X assigns a representation u i to every element x i of the dataset.Similarly, the stochastic decoder Q X|U assigns an estimate xi of x i based on the vector u i .As per the IB method [10] a suitable representation U should strike the right balance between capturing all information about the categorical variable C that is contained in the observation X and using the most concise representation for it.This leads to maximizing the following Lagrange problem L s (P) = I(C; U) − sI(X; U) , (1) where P denotes the conditional distribution P U|X and s ≥ 0 designates the Lagrange multiplier.Instead of (1) which is not always computable in our unsupervised clustering setting, we find it convenient to maximize an upper bound of L s (P) given by Ls (P) := I(X; U) − sI(X; U).
(Using the Markov chain C − − X − − U, it is easy to see that Ls (P) ≥ L s (P) for all values of P).
Maximizing Ls (P) over P is equivalent to maximizing For a variational distribution Q U on U (instead of the unknown P U ) and a variational stochastic decoder Q X|U (instead of the unknown optimal decoder P X|U ), let Lemma 1.For given P, we have , for all Q.In addition, there exists a unique Q that achieves the maximum max Q L VB s (P, Q) = L s (P), and is given by Using Lemma 1, it is easy to see that Remark 1.As we already mentioned in the introduction section, the related work [14] performs unsupervised clustering by combining VAE with GMM.Specifically, it maximizes the following ELBO bound Let, for an arbitrary non-negative parameter s, L VaDE s be a generalization of the ELBO bound (7) of [14] given by Investigating the RHS of (8), we get where the equality holds since where (a) can be obtained by expanding and re-arranging terms under the Markov chain C − −X− −U.
Thus, by the non-negativity of relative entropy it is clear that L VaDE s is a lower bound on L VB s (P, Q).Also, if variational distribution Q is such that the conditional marginal Q C|U is equal to P C|X the bound is tight since the relative entropy term is zero in this case.

Proposed algorithm: VIB-GMM
In order to compute (6), we parametrize the distributions P U|X , Q X|U and Q U using NNs.For instance, we let the stochastic encoder P U|X be a NN f θ .Also, we set the stochastic decoder Q X|U be a NN g φ .That is where θ and φ are the weight and bias parameters of the NN layers.Furthermore, the latent space is modeled as a GMM with |C| components with parameters ψ := {π c , µ c , Σ c } |C| c=1 .Thus, we re-write our parameterizations as Using the above, the optimization problem of ( 6) can be re-written equivalently as The right hand side (RHS) of ( 12) can be approximated numerically as where L NN s,i (θ, φ, ψ) is the empirical cost for the i-th sample and given by The first term of the RHS of ( 14) can be computed using Monte Carlo sampling and the reparametrization trick [7] as where M is the number of samples for the Monte Carlo sampling step.The second term of the RHS of ( 14) is the KL divergence between a single component multivariate Gaussian and a Gaussian Mixture Model with |C| components.An exact closed-form solution for the calculation of this term does not exist.However, a variational lower bound [19] of it can be obtained as In particular, in the specific case in which the covariance matrices are diagonal, i.e., Σi := diag({σ 2 i,j } J j=1 ) and Σ c := diag({σ 2 c,j } J j=1 ), with J denoting the latent space dimension, the RHS of ( 15) can be computed as follows where μi,j and σ2 i,j are the mean and variance of the i-th representation in the j-th dimension of the latent space.Furthermore, µ c,j and σ 2 c,j represent the mean and variance of the c-th component of the GMM in the j-th dimension of the latent space.
Finally, we train NNs to maximize the cost function (13) over the parameters θ, φ, as well as those ψ of the GMM.For the training step, we use the ADAM optimization tool [20].
Once our model is trained, we assign the given dataset into the clusters.As mentioned in Section 2, we do the assignment from the latent representations, i.e., Q C|U = P C|X .Hence, the probability that the observed data belongs to the c-th cluster is computed as follows where * indicates optimal values of the parameters as found at the end of the training phase.Finally, the right cluster is picked based on the largest assignment probability value.

Effect of the hyper-parameter
As we already mentioned, the hyper-parameter s controls the trade-off between the relevance of the representation U and its complexity.As it can be seen from ( 12) for small values of s, it is the cross-entropy term that dominates, i.e., the algorithm trains the parameters so as to reproduce X as accurate as possible.For large values of s, however, it is most important for the NN to produce an encoded version of X whose distribution matches the prior distribution of the latent space, i.e., the term D KL (P θ (U|X) Q ψ (U)) is nearly zero.
In the beginning of the training process, the GMM components are randomly selected; and so starting with a large value of the hyper-parameter s is likely to steer the solution towards an irrelevant prior.Hence, for the tunning of the hyper-parameter s in practice it is more efficient to start with a small value of s and gradually increase it with the number of epochs.This has the advantage to avoid possible local minimas, an aspect that is reminiscent of deterministic annealing, where s plays the role of the temperature parameter.The experiments that will be reported in the next section show that proceeding in the above described manner for the selection of the parameter s helps getting better accuracy results and better robustness to the initialization (i.e., no need for a strong pretraining).

Description of used datasets
In our empirical experiments, we apply our algorithm to the unsupervised clustering of the following datasets.
MNIST: A dataset of gray-scale images of 70000 handwritten digits from 0 to 9 of dimensions 28 × 28 pixel each.

STL-10:
A dataset of color images collected from 10 categories.Each category consists of 1300 images of size of 96 × 96 (pixels) ×3 (rgb code).Hence, the original input dimension n x is 27648.For this dataset, we use a pretrained convolutional NN model, i.e., ResNet-50 [21] to reduce the dimensionality of the input.This preprocessing reduces the input dimension to 2048.Then, our algorithm and other baselines are used for clustering.
REUTERS10K: A dataset that is composed of 810000 English stories labeled with a category tree.As in [15], 4 root categories (corporate/industrial, government/social, markets, economics) are selected as labels and all documents with multiple labels are discarded.Then, tf-idf features are computed on the 2000 most frequently occurring words.Finally, 10000 samples are taken randomly, which are referred to as REUTERS10K dataset.

Network settings and other parameters
We use the following network architecture: the encoder is modeled with NNs with 3 hidden layers with dimensions n x − 500 − 500 − 2000 − J, where n x is the input dimension and J is the dimension of the latent space.The decoder consists of NNs with dimensions J − 2000 − 500 − 500 − n x .All layers are fully connected.For comparison purposes, we chose the architecture of the hidden layers as well as the dimension of the latent space J = 10 to coincide with those made for the DEC algorithm of [15] and the VaDE algorithm of [14].All except the last layers of the encoder and decoder are activated with ReLU function.For the last (i.e., latent) layer of the encoder we use a linear activation; and for the last (i.e., output) layer of the decoder we use sigmoid function for MNIST and linear activation for the remaining datasets.The batch size is 100 and the variational bound ( 13) is maximized by the Adam optimizer of [20].The learning rate is initialized with 0.002 and decreased gradually every 20 epochs with a decay rate of 0.9 until it reaches a small value (0.0005 is our experiments).The reconstruction loss is calculated by using the cross-entropy criterion for MNIST and mean squared error function for the other datasets.

Clustering accuracy
We evaluate the performance of our algorithm in terms of the so-called unsupervised clustering accuracy (ACC), which is a widely used metric in the context of unsupervised learning [18].For comparison purposes, we also present those of algorithms from previous art.
For each of the aforementioned datasets, we run our VIB-GMM algorithm for various values of the hyper-parameter s inside an interval [s 1 , s 2 ], starting from the smaller valuer s 1 and gradually increasing the value of s every n epoch epochs.For the MNIST dataset, we set (s 1 , s 2 , n epoch ) = (1,5,500); and for the STL-10 dataset and the REUTERS10K datset we choose these parameters to be (1, 20, 500) and (1, 5, 100), respectively.The obtained ACC accuracy results are reported in the Table 1 from which it can be seen that our algorithm outperforms significantly the DEC algorithm of [15] as well as the VaDE algorithm of [14] and GMM on the same datsets.Important to note, for the MNIST dataset the reported ACC accuracy of 96.2% using our VIB-GMM algorithm is obtained as the best case run out of ten times run all with random initializations.For instance, we do not use any pretrained values for the initialization of our algorithm in sharp contrast with the VaDE of [14] and the DEC of [15].For the STL-10 dataset, none of the compared algorithms use a pretrained network except the intimal ResNet-50 for dimensionality reduction.For REUTERS10K, we used the same pretrain parameters as DEC and VaDE. Figure 2 depicts the evolution of the ACC accuracy with iterations (number of epochs) for the four compared algorithms.
Figure 3 shows the evolution of the reconstruction loss of our VIB-GMM algorithm for the STL-10 dataset, as a function of simultaneously varying values of the hyper-parameter s and the number of epochs (recall that, as per-the described methodology, we start with s = s 1 and we increase its value gradually every n epoch = 500 epochs).As it can be seen from the figure, the few first epochs are spent almost entirely on reducing the reconstruction loss (i.e., a fitting phase) and most of the remaining epochs are spent in making the found representation more concise (i.e., smaller KL-divergence).This is reminiscent of the two-phase (fitting v.s.compression) that was observed for supervised learning using VIB in [22].

Visualization on the latent space
In this section, we investigate the evolution of the unsupervised clustering of the STL-10 dataset on the latent space using our VIB-GMM algorithm.For this purpose, we find it convenient to visualize the latent space through application of the t-SNE algorithm of [23] in order to generate meaningful representations in a two-dimensional space.Figure 4 shows 4000 randomly chosen latent representations before the start of the training process and respectively after 1, 5 and 500 epochs.The shown points (with a • marker in the figure) represent latent representations of data samples whose labels are identical.Colors are used to distinguish between clusters.Crosses (with an x marker in the figure) correspond to the centroids of the clusters.More specifically, Figure 4-(a) shows the initial latent space before the training process.If the clustering is performed on the initial representations it allows ACC accuracy of as small as 10%, i.e., as bad as a random assignment.Figure 4-(b) shows the latent space after one epoch, from which a partition of some of the points starts to be already visible.With five epochs, that partitioning is significantly sharper and the associated clusters can be recognized easily.Observe, however, that the cluster centers seem still not to have converged.With 500 epochs, the ACC accuracy of our algorithm reaches %91.6 and the clusters and their centroids are neater as visible from Figure 4-(d).

Figure 2 :
Figure 2: Accuracy vs. number of epochs for the STL-10 dataset.

Figure 4 :
Figure 4: Visualization of the latent space before training; and after 1, 5 and 500 epochs.

Table 1 :
Comparison of clustering accuracy of various algorithms.