Image Clustering Algorithm Based on Predefined Evenly-Distributed Class Centroids and Composite Cosine Distance

The clustering algorithms based on deep neural network perform clustering by obtaining the optimal feature representation. However, in the face of complex natural images, the cluster accuracy of existing clustering algorithms is still relatively low. This paper presents an image clustering algorithm based on predefined evenly-distributed class centroids (PEDCC) and composite cosine distance. Compared with the current popular auto-encoder structure, we design an encoder-only network structure with normalized latent features, and two effective loss functions in latent feature space by replacing the Euclidean distance with a composite cosine distance. We find that (1) contrastive learning plays a key role in the clustering algorithm and greatly improves the quality of learning latent features; (2) compared with the Euclidean distance, the composite cosine distance can be more suitable for the normalized latent features and PEDCC-based Maximum Mean Discrepancy (MMD) loss function; and (3) for complex natural images, a self-supervised pretrained model can be used to effectively improve clustering performance. Several experiments have been carried out on six common data sets, MNIST, Fashion-MNIST, COIL20, CIFAR-10, STL-10 and ImageNet-10. Experimental results show that our method achieves the best clustering effect compared with other latest clustering algorithms.


Introduction
Clustering is the process of dividing a collection of physical or abstract objects into classes composed of similar objects. The clusters generated by a clustering algorithm are sets of samples: samples in the same cluster are similar to each other, but dissimilar to samples in other clusters.
In this paper, an efficient image clustering algorithm based on predefined evenly-distributed class centroids and composite cosine distance (ICBPC) is proposed. In this algorithm, PEDCC [1] is used as the clustering centers to ensure the maximum inter-class distance of latent features. PEDCC has been applied in several of our previous studies, such as classification [2] and out-of-distribution detection [3]. In [2], our contribution mainly focuses on classification tasks with supervised learning. In [3], our contribution mainly focuses on out-of-distribution detection, which detects test samples whose labels do not overlap with those of the training data. Both algorithms are supervised and require labels for the training data. In this paper, PEDCC is applied to achieve better clustering performance. Clustering is an unsupervised learning method that does not require labels of the training data, while classification and out-of-distribution detection are both supervised learning methods. Data distribution constraints and contrastive constraints between samples and their augmented versions are applied to improve clustering performance: the samples and their augmented versions are input into the encoder, and the resulting latent features are constrained by the loss functions described in Section 3. In this paper, instead of the Euclidean distance, a new composite cosine distance is proposed to better fit the PEDCC clustering model; this distance has not been proposed before and can be widely used for various image clustering tasks. At the same time, we apply a contrastive loss function to the clustering algorithm and achieve good results; contrastive learning has previously been used mainly in self-supervised learning. Finally, we find that, for complex natural images, a self-supervised pretrained model can effectively improve clustering performance.
The paper is arranged as follows: Section 2 summarizes the related work, and our methods are introduced in detail in Section 3. Then, in Section 4, we give the experimental settings and results. Finally, Section 5 summarizes the whole paper. The code can be downloaded at https://github.com/LihengHu/ICBPC (accessed on 29 August 2022).

Clustering and Deep Learning Based Clustering Method
Clustering is one of the most important unsupervised learning tasks. The purpose of clustering is to group similar data into clusters based on some similarity measure. Traditional clustering methods include partition-based methods [6] and hierarchical methods [7]. Their disadvantages are that the similarity measures they use are inefficient, that their performance is poor on high-dimensional data, and that their computational complexity is high on large-scale datasets. The solution is to reduce and transform the features, mapping the original data into a new feature space in which the data are more easily separated by existing classifiers.
Hierarchical clustering algorithms start with many small clusters and gradually merge them into larger ones. Partition clustering methods minimize the sum of the squared errors between the data points and their nearest cluster centers. Among them, the k-means [6] algorithm has attracted the most attention. The k-means algorithm takes k as a parameter and divides n objects into k clusters, so that the similarity within clusters is high while the similarity between clusters is low.
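As a reference point for the deep methods discussed below, the classic k-means loop can be sketched in a few lines of NumPy. This is an illustrative toy implementation, using farthest-point initialization rather than the textbook random seeding:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: alternate nearest-center assignment and mean update."""
    rng = np.random.default_rng(seed)
    # farthest-point initialization: start from a random point, then
    # repeatedly add the point farthest from the centers chosen so far
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # assign each point to its nearest center (squared Euclidean distance)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # recompute each center as the mean of its assigned points
        new = np.array([X[labels == j].mean(0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```

On two well-separated groups of points, the loop recovers the grouping in a couple of iterations.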
In the last few years, deep neural networks have had great success. The success of deep learning often depends on large amounts of data, and supervised learning on large amounts of data is mature, such as [8,9]. However, labeling massive data takes a great deal of time and resources. Unsupervised learning does not rely on data labels and can automatically discover the latent structure in the data, saving considerable time and hardware resources.
Auto-encoder (AE) [10,11] is one of the most important algorithms in unsupervised representation learning. Since the dimension of the latent layer is generally smaller than that of the data layer, it helps extract the most salient features of the data. AE is mainly used to find better initializations for parameters in supervised learning and can also be combined with unsupervised clustering. An AE consists of two parts: an encoder that maps the raw data X to a representation H, and a decoder that generates the reconstruction.
Deep embedding for clustering (DEC) [12] uses an auto-encoder as the network architecture. First, the auto-encoder is trained with a reconstruction loss and the decoder part is discarded; the features extracted by the encoder network are used as the input of the clustering module. After that, a cluster-assignment loss is used to fine-tune the network, and the clustering is iteratively improved by minimizing the KL divergence between the distribution of the soft assignments and an auxiliary target distribution. Discriminatively boosted image clustering (DBC) [13] has almost the same architecture as DEC, with the only improvement being the use of a convolutional auto-encoder; its performance on image datasets is superior to DEC due to the use of convolutional networks.
Pseudo-supervised deep subspace clustering (PSSC) [14], based on an auto-encoder, uses a pairwise similarity measure in the reconstruction loss to obtain local structural information, while similarity is learned through a self-expression layer. Pseudo-graphs and pseudo-labels, which benefit from the uncertain knowledge gained during online training, are further used to supervise similarity learning. Image clustering with deep semantic embedding (DSEC) [15] first extracts overall semantic (attribute) features from the dataset and then employs a deep semantic embedding auto-encoder to refine the lower-dimensional multi-feature representation; the final clustering is implemented by iteratively optimizing a KL-divergence-based clustering objective. Representation learning based on an auto-encoder and deep adaptive clustering for image clustering (RLBAD) [16] presents a novel representation learning method for the image clustering problem: it borrows the deep adaptive image clustering (DAC) [17] algorithm and incorporates it to train a fully convolutional auto-encoder.
The DAC algorithm combines feature learning and clustering. It transforms the clustering problem into a binary pairwise classification framework to judge whether image pairs belong to the same cluster. In DAC, similarity is calculated as the cosine distance between the image label features generated by deep convolutional networks; our algorithm instead employs composite cosine distances to fit the PEDCC model. Associative Deep Clustering [18] is a direct clustering algorithm for deep neural networks whose central idea is to jointly train centroid variables with the network's weights using a clustering cost function; in our algorithm, predefined evenly-distributed class centroids are used as the clustering centers to ensure the maximum inter-class distance of latent features. DeepCluster [19] is a clustering method that jointly learns neural network parameters and the cluster assignments of the resulting features: it uses k-means to iteratively group the features and uses the subsequent assignments as supervision to update the weights of the network.
An image clustering auto-encoder (ICAE) [20] combines predefined clustering centers with auto-encoders to obtain better results. ICAE differs from our algorithm mainly in the structure, the design of the loss function and the distance measure. Although an autoencoder can achieve good results, it is complex in structure and requires long training time. The algorithm that we proposed simplifies the structure by using only the encoder and discarding the decoder. At the same time, the performance of our algorithm exceeds that of the algorithm using an auto-encoder.
We compare the experimental results of these algorithms in Section 4.6.

PEDCC
Zhu and Zhang proposed the classification supervised auto-encoder (CSAE) [1] to implement the classification function with a unified auto-encoder network structure using predefined evenly-distributed class centers, and to generate samples of different classes according to the class label. PEDCCs are class center points evenly distributed on the unit hypersphere of the latent feature space, which are used as the training target of the classification network to maximize the inter-class distance. Figure 2 shows visual instances of PEDCC. As mentioned above, PEDCCs are evenly-distributed points on the hypersphere, whose distribution can be regarded as the sum of a set of Dirac functions. In CSAE, the samples were labeled. In contrast, we use PEDCC for clustering: we learn a mapping function that maps the different classes of samples to these predefined class centers, so that different classes can be distinguished through the strong fitting ability and effectiveness of deep learning.
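For illustration, one simple way to obtain evenly-distributed unit class centers, exact when the number of classes C is at most the feature dimension plus one, is to take the vertices of a regular simplex. This is a hedged sketch; the actual PEDCC generation in [1] uses a different, more general procedure:

```python
import numpy as np

def simplex_centroids(C):
    """C unit vectors with equal pairwise cosine similarity -1/(C-1):
    the vertices of a regular simplex, centered at the origin and
    projected onto the unit hypersphere."""
    E = np.eye(C)                                   # standard basis vectors
    V = E - E.mean(0)                               # center so vertices sum to zero
    V /= np.linalg.norm(V, axis=1, keepdims=True)   # normalize to the unit sphere
    return V
```

For C = 5, every pair of the resulting centers has cosine similarity exactly -1/4, i.e., the centers are maximally and evenly spread.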

Methods
In this section, we will introduce the implementation process of the ICBPC algorithm and loss function. Section 3.1 introduces the algorithm process and Sections 3.2-3.4 introduce the design of the loss function.

ICBPC
The implementation process of the ICBPC algorithm is shown as Algorithm 1. First, we perform data augmentation on each unlabeled image X to obtain X̄. Then, both the original image and the augmented image are input into the encoder to obtain their latent features Z and Z̄. The distance between the two features is reduced by the contrastive loss (loss_2), and MMD [5] (loss_1) is used to make the feature distribution close to the PEDCC distribution (maximizing the distribution similarity between the latent features and the Dirac distribution within classes). In both loss functions, we replace the Euclidean distance with a composite cosine distance to fit the model.

Composite Cosine Distance for Normalized Features and PEDCC
Euclidean distance is generally used to measure distances in the different loss functions. To better fit our PEDCC-based clustering model, we normalize the latent features and then replace the Euclidean distance with a composite cosine distance. For the squared Euclidean distance d^2 between two normalized features x_1 and x_2, we have

$$d^2 = \|x_1 - x_2\|^2 = 2 - 2\cos\theta$$

where θ is the angle between x_1 and x_2. In this paper, we use d_θ = 1 − cos θ as a new distance metric for all loss functions; that is, the original squared Euclidean distance d^2 equals 2·d_θ.
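The identity d^2 = 2(1 − cos θ) for normalized features can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(2, 8))
x1 /= np.linalg.norm(x1)   # normalize both features
x2 /= np.linalg.norm(x2)   # onto the unit hypersphere

d2 = np.sum((x1 - x2) ** 2)        # squared Euclidean distance
d_theta = 1.0 - x1 @ x2            # cosine distance 1 - cos(theta)
assert np.isclose(d2, 2 * d_theta) # d^2 = 2 * d_theta for unit vectors
```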
The cosine distance does not satisfy the triangle inequality required of a conventional distance metric; that is, the sum of two shorter "sides" may be less than the longest one. However, in the training process of our loss function, this property can be beneficial: in the gradual iteration from the initial value to the training target, the sum of the cosine distances of the individual steps can be shorter than the cosine distance taken in one step, which speeds up convergence, as also confirmed by the later experiments.
The derivatives of d^2 and d_θ^2 over the range of 0° to 180° are shown in Figure 3. It can be seen from the figure that when θ is greater than 90°, d_θ^2 has a larger gradient and training converges more easily.
To improve the derivative of the cosine distance at small angles, we can use √d_θ, which enhances the network's ability to update its parameters in the later training period. The derivative of √d_θ over the range of 0° to 180° is also shown in Figure 3. As the angle θ decreases, the gradient gradually increases, which aids network updates in the later stage of training and avoids the problem that the gradient of d_θ^2 gradually tends towards zero. In our two loss functions, the Euclidean distance d^2 is therefore replaced by the composite cosine distance d_c^2, which combines d_θ^2 and √d_θ. It can be seen from the figure that when θ is greater than 90°, the new distance has a larger gradient and training converges more easily, and when θ is small, the gradient is still greater than zero, strengthening training at small angles. Experiments show that this distance obtains a better clustering effect than the Euclidean distance.
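A sketch of the composite distance is below. Note that the exact way d_θ^2 and √d_θ are combined is our assumption here (a plain sum); the text describes only the gradient behavior of the two components:

```python
import numpy as np

def d_theta(x1, x2):
    """Cosine distance 1 - cos(theta) for unit-norm feature vectors."""
    return 1.0 - x1 @ x2

def composite_distance(x1, x2):
    """One plausible form of the composite cosine distance (assumed sum):
    d_theta**2 contributes a large gradient for theta > 90 degrees,
    sqrt(d_theta) keeps the gradient away from zero at small angles."""
    dt = d_theta(x1, x2)
    return dt ** 2 + np.sqrt(dt)
```

The distance is zero for identical features and grows monotonically with the angle between them.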

Clustering Loss Function
The loss function based on PEDCC follows the CSAE network in setting PEDCCs as the clustering centers of the classes; these clustering centers are evenly distributed on the hypersphere of the feature space, maximizing the inter-class distance, while the features within each class obey a Dirac distribution. Our algorithm uses MMD to measure the distance between the samples' distribution and the PEDCC distribution. The basic principle of MMD is to find a function under which two different distributions have different expectations: evaluated on empirical samples, this function indicates whether the samples come from different distributions. Our loss_1 exploits the distribution difference between the samples' latent features and the PEDCC distribution, so that the features extracted by the encoder fit the PEDCC distribution. The MMD loss used to train the network is

$$loss_1 = \frac{1}{M^2}\sum_{i=1}^{M}\sum_{j=1}^{M} k(l_i, l_j) - \frac{2}{MC}\sum_{i=1}^{M}\sum_{j=1}^{C} k(l_i, u_j) + \frac{1}{C^2}\sum_{i=1}^{C}\sum_{j=1}^{C} k(u_i, u_j)$$

where Z is the intermediate latent features, Z̄ the latent features of the augmented data, l_i ∈ [Z, Z̄] the latent features of the images and their augmentations, M their number, u_i the PEDCC class centers, C their number, and k(x, y) the kernel function. By iteratively minimizing loss_1, the probability distribution of the latent features approaches that of PEDCC, and the latent features move toward these points on the hypersphere.
The kernel function k(x_1, x_2) is usually expressed as a radial basis function, whose value is inversely related to the squared distance between x_1 and x_2:

$$k(x_1, x_2) = \exp\left(-\frac{d_c^2(x_1, x_2)}{2\sigma^2}\right)$$

where the composite cosine distance d_c^2 replaces the Euclidean distance d^2, and σ is the kernel bandwidth.
Loss 1 uses the MMD algorithm based on a radial basis to make the latent feature distribution the same as the predefined PEDCC, achieving the best clustering. In loss 1 , cosine distance is used to better measure the distance between two features, which makes the radial basis-based MMD algorithm easier to converge.
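The MMD matching of latent features to the PEDCC centers can be sketched in NumPy. This is an illustrative stand-in: the RBF kernel here is applied to the plain cosine distance rather than the full composite cosine distance, and a single bandwidth is used instead of the multi-kernel setup from the experiments:

```python
import numpy as np

def rbf_kernel(d2, sigma=1.0):
    """Radial basis kernel applied to a matrix of (squared) distances."""
    return np.exp(-d2 / (2 * sigma ** 2))

def cosine_d2(A, B):
    """Pairwise cosine distance 1 - cos(theta) for row-normalized matrices
    (a stand-in for the paper's composite cosine distance)."""
    return 1.0 - A @ B.T

def mmd2(L, U, sigma=1.0):
    """Squared MMD between latent features L (M x d) and PEDCC centers U (C x d)."""
    M, C = len(L), len(U)
    k_ll = rbf_kernel(cosine_d2(L, L), sigma).sum() / M ** 2
    k_lu = rbf_kernel(cosine_d2(L, U), sigma).sum() / (M * C)
    k_uu = rbf_kernel(cosine_d2(U, U), sigma).sum() / C ** 2
    return k_ll - 2 * k_lu + k_uu
```

Minimizing this quantity drives the empirical feature distribution toward the centers: it is zero when the two sets coincide and positive when they differ.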

Data Augmentation Loss Function
The main purpose of data augmentation is to reduce the overfitting of the network and help the network extract more discriminative features. By transforming the training images, a network with a stronger generalization ability can be obtained, which can better adapt to the application scenarios.
We use some common data augmentation. One type of augmentation involves spatial and geometric transformation of data, such as cropping, resizing (with horizontal flipping) and rotation [21]. The other type of augmentation involves appearance transformation, such as color distortion (including color dropping, brightness, contrast, saturation) [22], Gaussian blur, and Sobel filtering.
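A toy NumPy version of such a pipeline for square grayscale images is sketched below. The paper uses torchvision-style transforms; the specific operations and parameter ranges here are illustrative:

```python
import numpy as np

def augment(img, rng):
    """Apply random geometric and appearance transforms to a square
    H x H image with values in [0, 1] (illustrative pipeline)."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                        # horizontal flip
    img = np.rot90(img, k=rng.integers(0, 4))     # random 90-degree rotation
    img = np.clip(img * rng.uniform(0.8, 1.2),    # brightness jitter
                  0.0, 1.0)
    return img
```

Each call produces a different view of the same image while preserving its shape and value range, which is what the contrastive loss in the next subsection relies on.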
For different datasets, we adopt different data augmentation methods to obtain a better clustering effect. For example, for the color image datasets we mostly adopt color conversion, brightness adjustment and similar methods, as shown in Figure 5, whereas geometric processing such as cropping and rotation, as shown in Figure 6, achieves a better clustering effect for MNIST. The unlabeled samples X and their augmentations X̄ are input into the encoder to obtain the features Z and Z̄, which are used to achieve better clustering.
The contrastive loss function is used to constrain the features of the augmented samples and the features of the original samples.
Contrastive loss is mainly used for dimensionality reduction, that is, after dimensionality reduction (feature extraction) of the originally similar samples, the two samples are still similar in the feature space. However, after dimensionality reduction for the originally dissimilar samples, the two samples are still dissimilar in the feature space. Similarly, the loss function can well express the matching degree of the samples.
The contrastive loss function has the following expression:

$$loss_2 = \frac{1}{2N}\sum_{n=1}^{N}\left[y\, d^2 + (1-y)\,\max(margin - d,\; 0)^2\right]$$

where d represents the distance between the features of the two samples, x_1 represents the original sample, x_2 represents the augmented sample or a random negative sample, y is the label indicating whether the two samples match (y = 1 for similar or matched samples, y = 0 for a mismatch), margin is a set threshold (usually 0.3), and N is the number of sample pairs. As mentioned above, d^2 is also replaced by d_c^2 in Equation (4):

$$loss_2 = \frac{1}{2N}\sum_{n=1}^{N}\left[y\, d_c^2 + (1-y)\,\max(margin - d_c,\; 0)^2\right]$$

When x_2 is the augmented sample, y = 1 (that is, the samples are similar); if the distance in the feature space is large, the current model is not good, so the loss increases. When x_2 is a random negative sample, y = 0 (the samples are not similar); if the distance is nevertheless small, the loss value increases.
Loss 2 expects that the cosine distance of the augmented samples in the latent feature space is the minimum to achieve correct clustering. In loss 2 , the cosine distance also replaces Euclidean distance, so that the original and augmented samples have the same direction, rather than the same value.
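The contrastive loss described above can be sketched as follows, taking the pairwise distances d as input; in our setting the distance itself would be the composite cosine distance:

```python
import numpy as np

def contrastive_loss(d, y, margin=0.3):
    """Contrastive loss over a batch of pairwise distances d and labels y
    (y=1: original/augmented pair, y=0: random negative pair)."""
    pos = y * d ** 2                                 # pull matched pairs together
    neg = (1 - y) * np.maximum(margin - d, 0) ** 2   # push negatives past the margin
    return (pos + neg).mean() / 2
```

A matched pair at zero distance incurs no loss, a distant matched pair is penalized, and a negative pair contributes only while it remains inside the margin.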

Loss Function
The loss function of the whole algorithm combines the above two loss functions:

$$loss = loss_1 + \lambda \cdot loss_2$$

where λ is the weight of loss_2. For different datasets, the weights of the two loss functions are adjusted, and different weights lead to different results; the weights are shown in Table 1. For the kernel function of the MMD loss, the bandwidth parameter is set to 2.0 and the number of kernels is set to 5 in our experiments.

Using Self-Supervised Pretrained Model
A self-supervised pretrained model is a network trained on a large amount of data by self-supervised learning. Since the pretrained model extracts more effective image features, implementing the clustering algorithm on top of it allows the algorithm to obtain more discriminative features and achieve better clustering performance, especially for complex natural images such as CIFAR-10, STL-10 and ImageNet-10. In the experiments, we use the typical Barlow Twins [23] self-supervised learning algorithm to pretrain the ResNet model on ImageNet.

Datasets
We used six datasets to verify the performance of our algorithm: MNIST, COIL20, Fashion-MNIST, CIFAR-10, STL-10, and ImageNet-10, as shown in Table 2. We randomly chose 10 subjects from the ImageNet dataset to construct the ImageNet-10 dataset for our experiments. All datasets are normalized to [−1, 1] before being input into the network.
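The [−1, 1] normalization is a one-line mapping from 8-bit pixel values:

```python
import numpy as np

def to_minus_one_one(img_uint8):
    """Map 8-bit pixel values [0, 255] to [-1, 1] before feeding the network."""
    return img_uint8.astype(np.float32) / 127.5 - 1.0
```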

Experimental Setup
Before starting the experiments, we set the number of classes and the dimension of the middle-layer features. The initial learning rate is set to 0.001 and the Adam optimizer is used. The batch size is set to 100 and the number of training epochs to 400. The network structure remains unchanged during training. The hyper-parameter settings are shown in Table 1; the values in Table 1 are those for which the clustering results are best. The value of λ is set differently for the six datasets: λ = 8 achieves the best clustering results for MNIST, Fashion-MNIST, STL-10 and ImageNet-10, while λ = 9 obtains the best results for COIL20 and CIFAR-10. All our experimental results are averaged over four training runs.

Evaluation Metrics
We use the following two indicators to validate our algorithm: Cluster Accuracy (ACC) [24] and Normalized Mutual Information (NMI) [24].
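For a small number of clusters, ACC can be computed by brute-force matching of cluster ids to labels, as sketched below (the standard implementation uses the Hungarian algorithm instead; NMI is available in scikit-learn as `normalized_mutual_info_score`):

```python
import numpy as np
from itertools import permutations

def cluster_acc(y_true, y_pred):
    """Clustering accuracy: accuracy under the best one-to-one mapping
    between predicted cluster ids and ground-truth labels. Brute force
    over permutations, so only practical for a handful of clusters."""
    labels = np.unique(y_true)
    best = 0.0
    for perm in permutations(labels):
        mapped = np.array([perm[c] for c in y_pred])
        best = max(best, float((mapped == y_true).mean()))
    return best
```

Because the mapping is optimized, a perfect clustering scores 1.0 even when the cluster ids are permuted relative to the labels.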

Encoder Architecture
ResNet [25] can solve the problem of deep neural network degradation. So, our algorithm uses the residual network structure ResNet-18 as the encoder, and the specific network structure of the encoder is shown in Table 3. For CIFAR-10, STL-10, and ImageNet-10, we adopt a self-supervised pretrained ResNet model trained on the ImageNet dataset. The network only trains the last two blocks, and the parameters of the other parts are frozen.
The dimension of the latent features in the middle layer equals the dimension of the predefined class centers. This dimension differs across datasets and is determined experimentally. Taking MNIST as an example, model performance in different dimensions is shown in Table 4. The other datasets also obtain their best latent feature dimensions through experiments; the best dimension for each dataset, which yields the best model performance, is shown in Table 5. Through training, the distribution of the latent feature Z approaches the PEDCC distribution.

Analysis on Computational Time and Clustering
We used the PyTorch deep learning framework and ran all training on an Intel(R) i7-6700K CPU, 32 GB RAM, and an Nvidia GTX 1080 Ti GPU. There are two loss functions in total, and convergence is fast. Taking the COIL20 dataset as an example, only 14 s are needed per epoch, and the highest accuracy is reached within 400 epochs. Only 4 s are required to obtain ACC and NMI in network testing. The proposed composite cosine distance significantly improves the convergence speed. The change of the loss value with the epoch is shown in Figure 7, which shows that our algorithm converges faster than the ICAE algorithm.
To demonstrate the clustering effectiveness of our model, we select four classes of MNIST and set the feature dimension to 3 for training. As shown in Figure 8, we visualize the resulting features in 3D coordinates. It can be seen from the figure that the distances between the categories are sufficiently large.

Ablation Experiment
We tested the effectiveness of each loss function with ablation experiments. The experimental results are shown in Table 6, which shows that the best clustering effect is obtained by using the two loss functions together with the composite cosine distance.

Effectiveness of Self-Supervised Pretrained Model
For CIFAR-10, STL-10, and ImageNet-10, we adopt a self-supervised ResNet model pretrained on ImageNet. We resize STL-10 to 224 × 224 × 3 to fit the pretrained model. The network only trains the last two blocks, and the parameters of the other parts are frozen. As shown in Table 7, a self-supervised pretrained model can effectively improve the clustering performance for complex natural images. The clustering performance on Fashion-MNIST is not improved by the pretrained model, which indicates that the pretrained model is more effective for complex natural images.

Compared with Auto-Encoder
The algorithm that we proposed simplifies the algorithm structure by using only the encoder and discarding the decoder. At the same time, the performance of our algorithm exceeds that of the algorithm using the auto-encoder. We compared the two structures, and the results are shown in Table 8. The encoder-only model has shorter training time and higher accuracy.

Compared with the Latest Clustering Algorithm
We compared the ICBPC clustering algorithm with the latest clustering algorithms, and our algorithm achieved excellent results on all the datasets, as shown in Table 9.
In Table 9, all the results are reported by running the published code or are taken from the corresponding paper. The mark "-" means that the result is not available from the paper or code. Bold values in Table 9 indicate the best results.
Compared with deep clustering algorithms using auto-encoders such as DCN and DEN, our model is simpler in structure, faster in training, and can achieve good clustering performance by PEDCC. Compared with other algorithms that learn feature representations for clustering such as JULE, our algorithm uses PEDCC to make the inter-class distances large enough for better clustering performance.

Statistical Analysis of Experimental Data
All our experimental results are averaged over four training runs. We calculate the standard deviation of the experimental data to verify the stability of the algorithm. As shown in Table 10, the standard deviations of the experimental results are low, which demonstrates the stability of our algorithm.

Conclusions
This paper presents an image clustering algorithm based on predefined evenly-distributed class centroids and composite cosine distance. In this algorithm, an encoder-only network structure is adopted and PEDCC is used as the clustering centers to ensure the maximum inter-class distance of the latent features. Data distribution constraints and contrastive constraints between samples and augmented samples are applied to improve the clustering performance. We use the composite cosine distance instead of the Euclidean distance to better fit the PEDCC model. This algorithm achieves better performance than existing clustering algorithms on MNIST, COIL20, Fashion-MNIST, CIFAR-10, STL-10 and ImageNet-10. For complex natural images, a self-supervised pretrained model is used to achieve better clustering performance. In the future, we will continue to use the characteristics of PEDCC for feature representation learning, to obtain better clustering and recognition results.