Self-Supervised Clustering Models Based on BYOL Network Structure



Introduction
As an effective machine learning technique, clustering plays an important role in data mining [1][2][3], statistical analysis [4][5][6], and pattern recognition [7][8][9]. It aims to partition the data into different clusters according to the similarity between data samples [10]. Therefore, various clustering methods have been developed over the past decades to extract the inherent features and structures of the data [11,12]. In the current era of big data, increasingly high-dimensional data pose huge challenges to traditional clustering due to insufficient representability. For this reason, dimensionality reduction [13] and representation transformation [14] techniques have been widely studied to map the original data into a new feature space, where the data representation is easier to separate by existing classifiers. Nevertheless, limited by their high computational complexity, traditional data transformation methods [15][16][17] fail to process large-scale and high-dimensional data. Although some random feature [18] and random projection [19] methods can yield a low-dimensional representation and a better approximation of a user-specified kernel, the representation ability of features learned from these shallow models is generally limited.
In recent decades, deep learning [20] based on neural networks has been widely studied to discover good representations of the data. Meanwhile, the optimization of deep neural networks along with unsupervised clustering has exhibited great promise and excellent clustering performance, which is referred to as deep clustering [21]. Most deep clustering methods can be categorized as either generative models [22] or discriminative models [23]. Generative models aim to learn the embedding representation or distribution of the original data through the generative process. The clustering then processes the learned distribution or representation in a simultaneous (end-to-end) or asynchronous fashion. Some of the prominent techniques with significant impact are deep clustering methods based on the autoencoder (AE) [24], the variational autoencoder (VAE) [25], and the generative adversarial network (GAN) [26]. However, these clustering methods, which rely on generative models, necessitate complex data generation procedures, which can be computationally expensive and may not be necessary for either clustering or representation learning purposes.
Different from generative models, discriminative models, such as contrastive learning-based methods, remove the costly generation step and directly discriminate the representation by learning the decision boundary. As the most representative contrastive learning method, the Simple Framework for Contrastive Learning of Representations (SimCLR) [27] exploits the representation between different views of samples, wherein the similarities between different views of one sample (positive pairs) are maximized and those between different samples (negative pairs) are minimized. Based on this idea, some two-step clustering models have been designed. Semantic Clustering by Adopting Nearest neighbors (SCAN) [28] mines the nearest neighbors of each image as prior guidance to optimize the cluster network, while Semantic Pseudo-labeling for Image Clustering (SPICE) [29] and Robust learning for Unsupervised Clustering (RUC) [30] generate pseudo-labels via self-learning methods to guide the clustering. These methods employ a two-stage operation in which the clustering and the representation learning are decoupled. They focus more on the optimization of the neural networks to achieve more discriminative representations but suffer from a lack of clustering-oriented guidance, which results in suboptimal clustering performance.
Recently, more contrastive learning-based models have been constructed to extract representations and perform clustering in an end-to-end fashion. Among these methods, Contrastive Clustering (CC) [31] performs both instance-level contrastive learning to exploit discriminative representations and cluster-level contrastive learning to separate different clusters. Following this idea, Graph Contrastive Clustering (GCC) [32] proposes a graph Laplacian-based contrastive loss to enhance the discriminative and clustering-specific characteristics of features. To further improve the quality of learned representations, Cross-instance guided Contrastive Clustering (C3) [33] takes into account cross-sample relationships, thereby increasing the number of positive pairs and reducing the impact of false negatives. Even though the contrastive models above yield excellent clustering results, they usually rely on a large number of negative pairs to capture uniform representations, which requires a large batch size and high computational complexity. Moreover, different instances from the same cluster are regarded as negative pairs and wrongly pushed away, which may inevitably lead to the cluster collision issue.
Different from these traditional contrastive learning-based models, some self-supervised methods, such as Bootstrap Your Own Latent (BYOL) [34], perform non-contrastive learning to capture discriminative representations with only positive pairs. However, the absence of negative pairs hinders the ability of such self-supervised representation learning methods to achieve uniform representations across clusters, which may lead to the collapse of clustering [35], i.e., assigning all data samples to fewer clusters than desired. Therefore, it is crucial to introduce an effective clustering enhancement method to improve the quality of the clustering assignment.
To solve these issues, a novel end-to-end Self-supervised Clustering model based on the BYOL network structure with Instance-level and Cluster-level discriminations (BSC-IC) is proposed in this paper to perform clustering and representation learning simultaneously with only positive pairs. Taking inspiration from the concept of "cluster assignments as representations" [36], we enhance the original BYOL network by incorporating a Softmax layer to convert representations into cluster assignments. We then integrate adversarial learning [37] into the cluster assignments, not only to improve discrimination among clusters but also to mitigate the issue of collapsed clusters. To reduce the high interdependence between the target and online networks in BYOL, we propose a novel self-improvement loss, which evaluates the similarity of cluster assignments among positive pairs within a mini-batch across the online network itself. To further enhance the clustering-oriented guidance, a new cluster-level discrimination is integrated into the discriminative network to promote clustering performance by measuring the self-correlation between the learned cluster assignments.
The rest of this paper is organized as follows. The related work is presented in Section 2. The contrastive clustering model with instance-level and cluster-level discrimination is designed in Section 3. Experiments are performed in Section 4. The ablation study and its analysis are provided in Section 5. Conclusions are given in Section 6.

Related Work

Contrastive Clustering
CC [31] is a contrastive learning-based clustering method that aims to discover meaningful groups or patterns in a given dataset by emphasizing the dissimilarity or contrast between data points. In CC, instance-level and cluster-level contrastive learning are respectively conducted in the row and column spaces by maximizing the similarities of positive pairs while minimizing those of negative ones. However, this method usually relies on a large number of negative pairs to capture uniform representations, which requires a large batch size and high computational complexity.
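The instance-level objective in this family of methods is a variant of the NT-Xent loss from SimCLR. The following numpy sketch (illustrative only, not the authors' implementation; function and variable names are our own) shows how each positive pair competes against the remaining 2N − 2 augmented samples in the batch, which is why a large batch size matters:

```python
import numpy as np

def nt_xent(za, zb, temperature=0.5):
    """Instance-level contrastive (NT-Xent) loss over a mini-batch.

    za, zb: (N, d) arrays holding representations of two augmented
    views; row i of za and row i of zb form the only positive pair,
    while all other 2N - 2 rows act as negatives.
    """
    n = za.shape[0]
    z = np.concatenate([za, zb], axis=0)               # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / temperature                        # (2N, 2N)
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    # index of the positive partner for each row
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logits = sim - sim.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

The loss decreases when the two views of each sample agree and all other samples are pushed apart; with few samples per batch the negative set shrinks, which is the dependence on batch size noted above.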

Bootstrap Your Own Latent
BYOL [34] is a self-supervised deep learning method used for representation learning.It is designed to learn meaningful representations from unlabeled data, allowing the model to capture useful patterns and information without the need for negative samples.BYOL consists of two identical neural networks called the online network and the target network.From an augmented view of a data sample, BYOL trains the online network to predict the representation of the target network from a different augmented view of the same data sample.
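BYOL's prediction objective can be written as the mean squared error between L2-normalized vectors, which equals 2 − 2·cos(p, z). A minimal sketch of this loss (illustrative; the original implementation differs in framework and detail):

```python
import numpy as np

def byol_loss(p_online, z_target):
    """BYOL regression objective: mean squared error between the
    L2-normalised online prediction and the (stop-gradient) target
    projection, which equals 2 - 2 * cosine_similarity per sample."""
    p = p_online / np.linalg.norm(p_online, axis=1, keepdims=True)
    z = z_target / np.linalg.norm(z_target, axis=1, keepdims=True)
    return (2 - 2 * (p * z).sum(axis=1)).mean()
```

Because only positive pairs appear, the loss is zero when the two views' representations align perfectly; the asymmetric predictor and the slowly moving target network are what prevent the trivial constant solution in practice.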

BSC with Instance-Level and Cluster-Level Discriminations
Contrastive clustering models usually rely on a large number of negative pairs to capture uniform representations, which requires a large batch size and high computational complexity. In contrast, some self-supervised methods perform non-contrastive learning to capture discriminative representations with only positive pairs but suffer from the collapse of clustering. To solve these issues, a novel end-to-end Self-supervised Clustering model based on the BYOL network structure with Instance-level and Cluster-level discriminations (BSC-IC) is designed in this section. Figure 1 illustrates the framework of the BSC-IC model, which consists of three jointly learned components: the self-supervised learning network, the instance-level discriminative network, and the cluster-level discriminative network. The self-supervised learning network adopts a structure similar to BYOL to capture good cluster assignments of the data with only positive pairs, and includes an online network and a target network. Slightly differently from BYOL, a Softmax layer is added to convert the representation into the cluster assignment. The novel instance-level and cluster-level discriminative networks are designed to provide clustering-oriented guidance for the self-supervised learning.
Figure 1. The framework of the proposed BSC-IC model.

The Self-Supervised Learning Network for Representation Capturing
The self-supervised learning network in BSC-IC is designed for representation learning and contains an online network and a target network. The online network with parameters ξ is defined by an encoder f_ξ, which extracts the representation features, followed by a Softmax layer S_ξ, which converts the representation into the cluster assignment of the input data. The target network has the same architecture as the online network but adopts a different set of parameters, θ.
In detail, given a set of data X = {x_i | 1 ≤ i ≤ N} ∈ R^{N×D} in a mini-batch, where N is the batch size and D is the dimension of the data, data augmentations are first conducted to obtain two augmented views of the original data as positive pairs. The first augmented view X^a is then fed into the online network to output the cluster assignment Z^a = {z^a_i | 1 ≤ i ≤ N} ∈ R^{N×K}. Simultaneously, the second augmented view X^b is fed into the target network to generate the cluster assignment Ẑ^b = {ẑ^b_i | 1 ≤ i ≤ N} ∈ R^{N×K}, where K is the number of clusters. Self-supervised learning is then performed to maximize the similarity of positive pairs and realize the mutual optimization between the target and online networks. Unlike the cosine distance metric used in BYOL, the similarity of cluster assignments for positive pairs is measured using the Kullback-Leibler (KL) divergence, which is more suitable for capturing the difference between probability distributions. The loss for the mutual improvement in the self-supervised learning network is defined as (1).
In order to calculate the overall mutual-improvement loss of BSC-IC, we symmetrize the loss L_mi by separately inputting X^a into the target network and X^b into the online network to compute L̃_mi = KL(Z^b, Ẑ^a). Finally, the overall mutual-improvement loss of BSC-IC is denoted as (2).
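Assuming the assignments are row-stochastic (Softmax outputs), the symmetrized mutual-improvement loss of Equations (1) and (2) can be sketched as follows; combining the two KL terms by simple summation is an assumption, since the equations themselves are not reproduced in this excerpt:

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, as produced by the model's Softmax layer."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def kl(p, q, eps=1e-12):
    """Row-wise KL(p || q) averaged over the mini-batch."""
    return (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=1).mean()

def mutual_improvement_loss(za, zb_hat, zb, za_hat):
    """Symmetrised mutual-improvement loss: KL between the online
    assignments of one view and the target assignments of the other,
    i.e. KL(Z^a, Zhat^b) + KL(Z^b, Zhat^a)."""
    return kl(za, zb_hat) + kl(zb, za_hat)
```

In training, the gradient flows only through the online-network arguments (Z^a, Z^b); the target assignments are treated as constants via the stop-gradient operation.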
The self-supervised learning network above is made up of two highly interdependent networks, in which the poor optimization of either network can deteriorate the whole structure. In particular, the subsequent clustering may corrupt the quality of the representation space and destroy the preservation of the local structure. Therefore, to break the high mutual interdependence between the online and target networks, we define a novel loss, named the self-improvement loss, as (3), to evaluate the similarity of the cluster assignments between positive pairs within the online network itself.
where Z^a and Z^b denote the cluster assignments obtained by the online network itself from the two augmented views, respectively.

The Instance-Level Discriminative Network for Data Clustering
To alleviate the collapse of clustering, the instance-level discriminative network D(•) with parameters η in BSC-IC is constructed to provide clustering-oriented guidance for the self-supervised learning network.
Given the data X in a mini-batch, we input the two augmented views X^a and X^b into the online network and obtain the corresponding cluster assignments Z^a and Z^b. Then, a one-hot-style prior distribution P ∼ Cat(K, p = 1/K) is imposed on the learned cluster assignments Z (either Z^a or Z^b), and adversarial learning between Z and P is conducted to push Z closer to a one-hot form, so as to enhance the discrimination of clusters and alleviate the collapse problem. Following the WGAN-GP method [38], the adversarial losses of the instance-level discriminative network for the generator, L^BSC-IC_Adv-G, and the discriminator, L^BSC-IC_Adv-D, are defined as (4) and (5), respectively, where r = εp + (1 − ε)z with ε ∼ U[0, 1] is a representation sampled uniformly along straight lines between the prior distribution P and the soft assignments Z, (‖∇_r D(r)‖_2 − 1)^2 is the one-centered gradient penalty that constrains the gradient of the instance-level discriminative network to be around 1, and δ is the gradient penalty coefficient.
Here, the adversarial loss for the generator, L^BSC-IC_Adv-G, is designed to minimize the Wasserstein distance between the generated assignments and the one-hot distribution, which encourages the generator network to produce sharper cluster assignments. In contrast, the adversarial loss for the discriminator, L^BSC-IC_Adv-D, is formulated to maximize the Wasserstein distance between the generated assignments and the one-hot distribution. Both adversarial losses train the model through the competitive process between the generator and the discriminator.
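Equations (4) and (5) are not reproduced in this excerpt; a plausible WGAN-GP-style reconstruction consistent with the surrounding description (the exact form in the paper may differ) is:

```latex
\mathcal{L}_{\text{Adv-G}}^{\text{BSC-IC}}
  = -\,\mathbb{E}_{z \sim Z}\!\left[D(z)\right],
\qquad
\mathcal{L}_{\text{Adv-D}}^{\text{BSC-IC}}
  = \mathbb{E}_{z \sim Z}\!\left[D(z)\right]
  - \mathbb{E}_{p \sim P}\!\left[D(p)\right]
  + \delta\,\mathbb{E}_{r}\!\left[\bigl(\lVert \nabla_{r} D(r) \rVert_{2} - 1\bigr)^{2}\right],
```

with r = εp + (1 − ε)z and ε ∼ U[0, 1]. Minimizing the first loss drives the critic's score of the generated assignments up (sharper, more one-hot-like Z), while minimizing the second widens the critic's margin between prior samples and generated assignments under the gradient-penalty constraint.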

The Cluster-Level Discriminative Network for Cluster Enhancement
To further benefit from the strength of capturing clustering-oriented information, a new cluster-level discrimination is integrated into the discriminative network to promote clustering performance by measuring the self-correlation between the learned cluster assignments.
Specifically, given a set of data X = {x_i | 1 ≤ i ≤ N} in a mini-batch, the online network takes two augmented views, denoted as X^a and X^b, as input. Subsequently, the cluster assignments Z^a ∈ R^{N×K} and Z^b ∈ R^{N×K} are obtained, where N is the batch size and K is the number of clusters. Each column of the cluster assignments can be regarded as the representation of one cluster. Let y^a_i and y^b_i be the i-th columns of Z^a and Z^b for 1 ≤ i ≤ K. We combine y^a_i with y^b_i to form the same-cluster pair (y^a_i, y^b_i) and leave the other K − 1 pairs (y^a_i, y^b_j) for j ≠ i as different-cluster pairs. A cluster-level similarity matrix of size K × K is defined in the column space of the cluster assignments, where each element c^clu_ij is measured by the cosine distance as (6). Then, the cluster-level discriminative loss L^BSC-IC_clu is defined as (7).
where the diagonal elements c^clu_ii are pushed toward 1 to maximize the similarity between the same clusters, the off-diagonal elements c^clu_ij for i ≠ j are pushed toward 0 to minimize the similarity between different clusters, and λ_clu is a positive constant to trade off the two terms.
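A sketch of this cluster-level loss under one plausible reading of Equations (6) and (7) (the squared-error penalties and the placement of the λ_clu weight are assumptions, since the equations are not reproduced here):

```python
import numpy as np

def cluster_level_loss(za, zb, lam=0.01):
    """Cluster-level discriminative loss (reconstruction of Eqs. (6)-(7)).

    za, zb: (N, K) soft cluster assignments of two augmented views.
    Columns are treated as cluster representations; C[i, j] is the
    cosine similarity between cluster i of view a and cluster j of
    view b. Diagonal entries are pushed toward 1, off-diagonals
    toward 0.
    """
    a = za / np.linalg.norm(za, axis=0, keepdims=True)  # normalise columns
    b = zb / np.linalg.norm(zb, axis=0, keepdims=True)
    c = a.T @ b                                         # (K, K) similarity
    on_diag = ((np.diag(c) - 1) ** 2).sum()
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()
    return on_diag + lam * off_diag
```

This is the same self-correlation idea used in redundancy-reduction methods such as Barlow Twins, applied here in the column (cluster) space of the assignment matrices rather than the feature space.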

Training of the BSC-IC
Integrating the self-supervised learning network with the instance-level and cluster-level discriminative networks, the final loss function of BSC-IC is defined as (8).
The parameters α_clu, α_si, and α_mi are used to balance the significance of the different loss terms. We use adaptive moment estimation (Adam) to optimize the parameters of both the self-supervised learning network and the discriminative networks. Notably, the self-supervised learning network is optimized to minimize L^BSC-IC with respect to the online network only, while the target network is kept unchanged, as indicated by the stop-gradient operation in Figure 1. Consequently, Equation (9) is only used to update the parameters of the online network ξ.
where α is the learning rate. Drawing inspiration from BYOL, the target network's parameters θ are updated as a weighted moving average of the online parameters ξ, which is performed using Equation (10).
where τ ∈ [0, 1] represents the target decay rate that controls how quickly the parameters move. Similar to the online network, Equation (11) is employed to update the parameters η of the instance-level discriminative network. The overall algorithm of BSC-IC is presented in Algorithm 1.
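The moving-average update of Equation (10), θ ← τθ + (1 − τ)ξ, is straightforward; a sketch with numpy parameter arrays (illustrative only):

```python
import numpy as np

def ema_update(target_params, online_params, tau=0.99):
    """Target-network update of Eq. (10): theta <- tau*theta + (1-tau)*xi.

    target_params, online_params: lists of parameter arrays with
    matching shapes; returns the updated target parameters.
    """
    return [tau * t + (1 - tau) * o
            for t, o in zip(target_params, online_params)]
```

With τ = 0.99 as in the implementation details below, the target network changes slowly, providing a stable regression target for the online network.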

Experiments
In this section, we perform experiments on six well-known real-world datasets to verify the effectiveness of the presented model. The datasets, comparison methods, evaluation metrics, implementation details, and experimental results are elaborated below.
Three metrics, i.e., the clustering accuracy (ACC), the normalized mutual information (NMI), and the adjusted rand index (ARI), are utilized to evaluate the clustering performance of different algorithms.For all metrics, a higher value is better.All clustering algorithms are conducted on a computer with two Nvidia TITAN RTX 24G GPUs.
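For reference, ACC is computed by finding the best one-to-one mapping between predicted cluster indices and ground-truth labels. A brute-force sketch for small K is given below (our illustration; practical implementations use the Hungarian algorithm, e.g. scipy's linear_sum_assignment, which scales to large K):

```python
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Unsupervised clustering accuracy (ACC): the best one-to-one
    mapping from predicted cluster ids to ground-truth labels.
    Brute-force over label permutations, so only suitable for small K.
    """
    labels = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    best = 0
    for perm in permutations(labels):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == t for c, t in zip(y_pred, y_true))
        best = max(best, hits)
    return best / len(y_true)
```

Because cluster indices are arbitrary, a clustering that labels the groups in reverse order still achieves ACC = 1 under the optimal mapping.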

Implementation Details
Similar image augmentations to DCCS [51] and CC [31] are conducted first to obtain the augmented samples. For low-detail grayscale image datasets, cropping and horizontal flipping are employed as the augmentation strategies. For high-detail color image datasets, color distortion and grayscale conversion are additionally incorporated. Specifically, the color distortion method alters various attributes of the image, including contrast, saturation, brightness, and hue, while the grayscale conversion step transforms the color image into a grayscale format.
ResNet-18 is employed to extract the representation for the self-supervised learning network of BSC-IC. A Softmax layer is used to convert the representation into the cluster assignment of the data, with a dimension equal to the cluster number K. A three-layer fully connected network is utilized as the instance-level discriminative network, with the layer dimensions set to K-1024-512-1.
The Adam optimizer with a learning rate of 0.0003 is adopted to simultaneously optimize the self-supervised learning network and the discriminative network.The moving average parameter τ in the self-supervised learning network is set to 0.99, the discriminative network's gradient penalty coefficient δ is set to 10, and the default batch size N is set to 64.The BSC-IC model involves three control parameters, which are utilized to trade off the effects of different terms in the total loss function.The recommended values of various parameters on different datasets are listed in Table 2.
Table 3 lists the number of hyperparameters of the different models. It can be seen that the proposed BSC-IC model has fewer hyperparameters than the other models, which indicates a simpler model architecture and easier parameter tuning for BSC-IC.

Experimental Results
The clustering results of the tested algorithms on the six datasets in terms of ACC, NMI, and ARI are listed in Table 4, Table 5, and Table 6, respectively, and reveal some interesting observations. The best results are shown in bold. First and foremost, compared with traditional distance-based clustering methods, such as K-means, AC, NMF, and SC, all the deep clustering methods show obvious advantages. This emphasizes that deep clustering is able to enhance clustering performance by capturing the semantic information of samples through deep neural networks.
Secondly, BSC-IC significantly outperforms most deep clustering methods on all six datasets. This demonstrates the effectiveness of self-supervised representation learning with only positive pairs in our model, which helps to extract the similarities and dissimilarities between different views of samples and capture important clustering-oriented information. It is worth noting that GCC achieves the best performance on the CIFAR-10 and CIFAR-100 datasets. However, GCC relies on a large number of negative pairs to capture uniform representations, which requires a large batch size, such as 256, and high computational complexity. In contrast, our model achieves good clustering performance with a smaller batch size, such as 64, and only positive pairs. Figure 2 shows the ACC curves obtained by CC, GCC, and our model with different batch sizes on the CIFAR-10 and CIFAR-100 datasets. It can be seen that the ACCs of CC and GCC drop sharply as the batch size decreases. Specifically, when the batch size changes from 256 to 64, the ACC of GCC drops by approximately 18 percentage points on CIFAR-10 and 9 percentage points on CIFAR-100. Similarly, the ACC of CC drops by about 20 percentage points on CIFAR-10 and 6 percentage points on CIFAR-100. In contrast, our model yields an ACC that remains stable regardless of the batch size.

Ablation Study and Analysis
The ablation study and analysis are carried out in this section to further understand the effect of each term in the loss function, including the self-improvement term (denoted as SI), the mutual-improvement term (denoted as MI), the instance-level discriminative term (denoted as IL), and the cluster-level discriminative term (denoted as CL). The ablation study of BSC-IC on the MNIST and ImageNet-10 datasets is presented in Table 7. The check mark and the cross mark respectively represent the inclusion and exclusion of each term. In the discriminative network, the instance-level discriminative term focuses on optimizing the assignment of instances within clusters, while the cluster-level discriminative term aims to optimize the relationships between clusters. Together, they provide effective clustering guidance for self-supervised learning. From cases 1 and 2 in Table 7, it can be seen that the absence of either of them leads to a suboptimal solution for the cluster assignments. Most critically, the absence of both leads to a collapse of the clustering, as shown in case 3 of Table 7.
In the self-supervised learning network, the self-improvement term aims to ensure the stability of the network structure, while the mutual-improvement term provides the alignment between positive pairs for the capture of uniform representations. Together, they provide effective optimization of the online and target networks for the capture of discriminative representations. From cases 4 and 5 in Table 7, it can be seen that the absence of either of them leads to a decrease in clustering accuracy. Moreover, the absence of both terms (case 6) disrupts the optimization of the online and target networks and prevents our method from performing clustering.

Conclusions
This paper develops a novel end-to-end self-supervised clustering model based on the BYOL network structure to jointly seek high-quality representations and perform clustering. The basic self-supervised learning network is first modified by incorporating a Softmax layer to capture the cluster assignments as the data representation. The mutual-improvement loss and the self-improvement loss together provide effective optimization of the online and target networks in BYOL for the capture of discriminative representations. Then, adversarial learning and self-correlation measuring are performed on the learned cluster assignments to promote clustering. The instance-level discriminative loss and the cluster-level discriminative loss together provide effective clustering guidance for self-supervised learning. Experimental results on real-world datasets show the effectiveness of the proposed model.
Algorithm 1. The training procedure of BSC-IC.
Input: the data X, the batch size N, the number of clusters K, the maximum iterations MaxIter, and the hyperparameters α_mi, α_si, and α_clu.
for epoch ∈ {0, 1, ..., MaxIter} do
    for each batch do
        Calculate the mutual-improvement loss L^BSC-IC_mi by (2), the self-improvement loss L^BSC-IC_si by (3), the instance-level discriminative losses L^BSC-IC_Adv-G by (4) and L^BSC-IC_Adv-D by (5), and the cluster-level discriminative loss L^BSC-IC_clu by (7);
        Calculate the overall loss L^BSC-IC by (8);
        Update the parameters of the online network ξ by (9);
        Update the parameters of the target network θ by (10);
        Update the parameters of the discriminative network η by (11);
    end for
end for
Output: The online network as the clustering network.

Figure 2. The impact of batch size on accuracy on the CIFAR-10 and CIFAR-100 datasets.

Table 1. Brief description of the datasets used in our experiments.

Table 2. The recommended values of the parameters on different datasets.

Table 3. The number of hyperparameters of different methods.

Table 4. Clustering results of the tested algorithms in terms of ACC on six datasets.

Table 5. Clustering results of the tested algorithms in terms of NMI on six datasets.

Table 6. Clustering results of the tested algorithms in terms of ARI on six datasets.

Table 7. The results of the ablation study.