Graph Clustering with High-Order Contrastive Learning

Graph clustering is a fundamental and challenging task in unsupervised learning. It has achieved great progress due to contrastive learning. However, we find that there are two problems that need to be addressed: (1) The augmentations in most graph contrastive clustering methods are manual, which can result in semantic drift. (2) Contrastive learning is usually implemented on the feature level, ignoring the structure level, which can lead to sub-optimal performance. In this work, we propose a method termed Graph Clustering with High-Order Contrastive Learning (GCHCL) to solve these problems. First, we construct two views by Laplacian smoothing raw features with different normalizations and design a structure alignment loss to force these two views to be mapped into the same space. Second, we build a contrastive similarity matrix with two structure-based similarity matrices and force it to align with an identity matrix. In this way, our designed contrastive learning encompasses a larger neighborhood, enabling our model to learn clustering-friendly embeddings without the need for an extra clustering module. In addition, our model can be trained on a large dataset. Extensive experiments on five datasets validate the effectiveness of our model. For example, compared to the second-best baselines on four small and medium datasets, our model achieved an average improvement of 3% in accuracy. For the largest dataset, our model achieved an accuracy score of 81.92%, whereas the compared baselines encountered out-of-memory issues.


Introduction
As a powerful tool, the Graph Neural Network (GNN) has been designed to deal with graph data such as social networks, knowledge graphs, and citation networks. The invention of the GNN has greatly facilitated graph-related tasks such as graph classification [1][2][3], neural machine translation [4,5], relation extraction [6,7], relational reasoning [8,9], and graph clustering [10][11][12]. Unlike traditional clustering methods such as K-means, GNN-based graph clustering models use deep neural networks for representation learning before clustering. Adaptive graph convolution (AGC) [11] is a method that can adaptively choose its neighborhood over various graphs. A deep attentional embedded graph clustering model (DAEGC) [13] can learn to aggregate neighbors by calculating their importance. The adversarially regularized graph autoencoder (ARGV) [14] introduces adversarial regularization to improve the robustness of the learned representations. The work on attributed graph embedding (AGE) [15] proposed a Laplacian filtering mechanism that can effectively denoise features. The deep fusion clustering network (DFCN) [16] is a hybrid method that integrates embeddings from autoencoder (AE) [17] and graph autoencoder (GAE) [18] modules for representation learning.
Recently, there has been growing interest in contrastive learning, and applying it to deep graph clustering has become increasingly common. The principle of contrastive learning is to pull similar (positive) sample pairs closer and push dissimilar (negative) sample pairs further apart. The contrastive multi-view representation learning method (MVGRL) [19] achieves its best performance by contrasting the embeddings of nodes and sampled sub-graphs. Specifically, it constructs an extra diffusion graph for contrastive learning: the node embeddings from one view are contrasted with the sub-graph embeddings from the other, and the method determines which node and sub-graph pairs are positive and which are negative. The self-consistent contrastive attributed graph clustering method (SCAGC) [20] maintains consistency between the learned representation and the cluster structure by performing contrastive learning between clusters and between nodes under the guidance of clustering results. Inspired by the deep graph infomax method (DGI) [21], the community detection-oriented deep graph infomax method (CommDGI) [22] introduced a community mutual information loss to capture community structural information for nodes.
Although promising performance has been achieved, two problems remain. First, manual augmentations in existing methods, such as feature masking and edge dropping, can result in semantic drift. Second, most methods perform contrastive learning on feature-based (first-order) contrastive similarity, ignoring structure-based (second-order) contrastive similarity; both issues lead to sub-optimal performance. Figure 1 shows the difference between first-order contrastive learning and second-order contrastive learning. To solve the above-mentioned problems, we propose a contrastive graph clustering method termed Graph Clustering with High-Order Contrastive Learning. To address the first problem, we build two views by performing Laplacian smoothing with different normalizations on the same features. We then build two similarity matrices from the features, where each element denotes the similarity between two nodes. We argue that, by minimizing an alignment loss between the similarity matrices, the corresponding embeddings can be mapped into the same space. To address the second problem, we build a contrastive similarity matrix from the two similarity matrices. Inspired by [23], we perform contrastive learning by minimizing the loss between the contrastive similarity matrix and an identity matrix. In this way, our model implements contrastive learning at the structure level. Meanwhile, because the contrastive similarity matrix is built from the feature-based similarity matrices, the contrastive learning can also be regarded as operating at the feature level to some degree. Furthermore, we can learn clustering-friendly representations naturally, without the manual sampling applied in most contrastive methods and without any extra clustering algorithm during training. Moreover, our method can be trained on large datasets. The key contributions of this paper are as follows:

•
Without any manual augmentations, we use two different Laplacian smoothing methods to build two views for contrastive learning and design an alignment loss to force the learned embeddings to map into the same space.

•
We design a novel structure-based contrastive loss without a sampling phase. By contrasting two similarity matrices, our model can learn clustering-friendly representations. It is worth noting that our model can also be applied to large-scale datasets.

•
Extensive experiments on five open datasets validate the effectiveness of our model.

Related Works
In this paper, we roughly divide deep graph clustering models into two kinds, reconstructive and contrastive, and we introduce them in the following subsections. The definitions of the acronyms used here can be found in Appendix A.2.

Deep Reconstructive Graph Clustering
Reconstructing graphs or features is a basic learning paradigm in many deep graph clustering models. It can be divided into three categories: reconstruction only, adversarial regularization, and hybrid. The graph autoencoder (GAE) [18] is a basic model that is often adopted as the framework of graph clustering models. DAEGC [13] and MGAE [12] are trained by reconstructing the given structure or the raw features. ARGV and AGAE [10,14] improve the robustness of the learned representations by introducing adversarial regularization. SDCN, AGCN, and DFCN [16,24,25] are typical hybrid models. SDCN can alleviate over-smoothing by integrating the representations from the AE and GCN. Based on SDCN, AGCN includes an adaptive fusion mechanism to improve the graph representations. DFCN includes a triple loss function to improve the robustness of the graph representations. All these models need an extra clustering module to learn clustering-friendly representations, whereas our model learns such representations naturally through high-order contrastive learning.

Deep Contrastive Graph Clustering
The effectiveness of contrastive learning has been widely validated, and applying it to deep graph clustering models has recently become a trend. The aim of Sublime [26] is to improve the anchor graph by constructing a learned auxiliary graph. By contrasting the node embeddings of the anchor graph and the learned graph, Sublime can reduce the impact of noisy or missing connections. Inspired by [23], DCRN [27] performs feature decorrelation in two different ways, but it still needs a clustering module to learn clustering-friendly representations. GDCL [28] employs a debiased method to choose negative samples. Specifically, it defines nodes and their augmented counterparts as positive pairs and node pairs with different pseudo-labels as negative pairs; in this way, it alleviates the impact of false-negative samples. SAIL [29] utilizes self-distillation to maintain distribution consistency between low-layer node embeddings and high-layer node features and to alleviate the smoothness problem. The idea behind AFGRL [30] is that augmentations on graphs are difficult to design; it therefore employs an augmentation-free method, combining KNN, K-means, and the adjacency matrix to capture the local and global similarities of nodes, and the obtained guidance helps contrastive learning. AutoSSL [31] adaptively combines different pretext tasks to improve graph representation learning. These contrastive models are characterized by manual augmentation, sampling of positive and negative pairs, and first-order contrastive learning. Manual augmentation can result in semantic drift, the sampling strategy needs an extra clustering-oriented module to define the positive and negative pairs, and first-order contrastive learning can only learn clustering-friendly representations from the feature perspective, ignoring the structure perspective. Our model can effectively alleviate these issues.

Proposed Method
In this section, we present the proposed Graph Clustering with High-Order Contrastive Learning (GCHCL) model. The entire framework of our model is shown in Figure 2.

Problem Definition
In this paper, V = {v_1, v_2, . . ., v_n} is a set of n nodes, and E denotes the edge set. Given an undirected graph G = (X, A), X ∈ R^{n×d} denotes the attribute matrix, and A = (a_ij)_{n×n} denotes the given adjacency matrix, where a_ij ∈ {0, 1}. a_ij = 1 indicates an explicit connection between v_i and v_j; otherwise, there is no direct connection between them. We let D = diag(d_1, d_2, . . ., d_n) ∈ R^{n×n} be the degree matrix, where d_i is the ith diagonal element and d_i = ∑_{j=1}^{n} a_ij. The Laplacian matrix of the graph is built as L = D − A. Details about the notations used are shown in Table 1.
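As a concrete illustration of the definitions above, the degree matrix and the unnormalized Laplacian can be computed as follows (a minimal sketch; the function name and the toy graph are ours, not from the paper):

```python
import numpy as np

def degree_and_laplacian(A):
    """Given a symmetric 0/1 adjacency matrix A, return the degree
    matrix D = diag(d_1, ..., d_n) with d_i = sum_j a_ij and the
    unnormalized graph Laplacian L = D - A."""
    d = A.sum(axis=1)
    D = np.diag(d)
    return D, D - A

# Toy undirected graph: a triangle (nodes 0-2) plus a pendant node 3
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D, L = degree_and_laplacian(A)
print(np.diag(D))  # node degrees: [2. 2. 3. 1.]
```

Each row of L sums to zero, a standard sanity check for a graph Laplacian.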

Notation	Meaning
S ∈ R^{b×b}	First-order similarity matrix
S ∈ R^{b×b}	Second-order similarity matrix
S ∈ R^{b×b}	Contrastive similarity matrix
I ∈ R^{b×b}	Identity matrix

Double Laplacian Smoothing
In several works, Laplacian smoothing has been proven to be effective in alleviating the impact of high-frequency noise [15,32]. In [15], the GCN was decoupled into a graph filter and a linear transformation, and it was demonstrated that the decoupled GCN could achieve the same or even better performance in representation learning compared to the GCN. Generally, the features are convolved with a normalized Laplacian matrix to avoid gradient explosion during training. There are two types of normalization: random walk normalization and symmetric normalization. During the aggregation step, the random walk-normalized Laplacian matrix treats the neighbors equally, whereas the symmetric-normalized Laplacian matrix considers both the degree of the target node and the degrees of its neighbors: the larger the degree of a neighbor, the smaller its contribution to the aggregation. The random walk-normalized Laplacian matrix is constructed as follows:

L_rw = I − D̃^{−1}Ã,

The symmetric-normalized Laplacian matrix is constructed as follows:

L_sym = I − D̃^{−1/2} Ã D̃^{−1/2},

where Ã = A + I and D̃ is the degree matrix of Ã. With these two types of normalized Laplacian matrices, we construct two different views of the same feature matrix, as follows:

X_rw = (I − L_rw)^t X,  X_sym = (I − L_sym)^t X,

where t is the power of the filter operation.
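The double smoothing step can be sketched as follows, assuming the common renormalized filters I − L_rw = D̃^{−1}Ã and I − L_sym = D̃^{−1/2}ÃD̃^{−1/2} with Ã = A + I (the function name and toy data are ours):

```python
import numpy as np

def double_smooth(X, A, t):
    """Apply t rounds of random walk and symmetric Laplacian smoothing
    to the feature matrix X, producing the two views X_rw and X_sym."""
    n = A.shape[0]
    A_t = A + np.eye(n)                      # A~ = A + I (add self-loops)
    d = A_t.sum(axis=1)                      # degrees of A~
    F_rw = A_t / d[:, None]                  # I - L_rw  = D~^-1 A~
    s = 1.0 / np.sqrt(d)
    F_sym = s[:, None] * A_t * s[None, :]    # I - L_sym = D~^-1/2 A~ D~^-1/2
    X_rw, X_sym = X.copy(), X.copy()
    for _ in range(t):                       # (I - L)^t X
        X_rw, X_sym = F_rw @ X_rw, F_sym @ X_sym
    return X_rw, X_sym
```

Because the rows of the random walk filter sum to one, X_rw is a t-step neighborhood average of the raw features, while X_sym additionally discounts high-degree neighbors.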

Structure Alignment
After randomly sampling batches of nodes, we construct two different views for each batch of nodes without augmentation and force them to be mapped into the same space. For simplicity, we use a simple linear transformation as the encoder. First, we sample nodes with an assigned batch size:

X^b_rw, X^b_sym = Sample(X_rw, X_sym, b),

where Sample is a random sampling operation, b is the assigned batch size, and X^b_rw, X^b_sym ∈ R^{b×f}. After sampling, the batches of nodes are input to the encoder:

Z_rw = Enc(X^b_rw),  Z_sym = Enc(X^b_sym).

To force the two views of the sampled attributes to be mapped into the same embedding space, we design a structure alignment loss. Specifically, we build two similarity matrices from the output of the encoder. By minimizing the alignment loss between the two similarity matrices, we map the embeddings into the same space and maintain the consistency of their distributions. The processing is as follows:

S_rw = <Z_rw, Z_rw>,  S_sym = <Z_sym, Z_sym>, (15)

where <·,·> denotes the inner product, and Sim denotes a similarity metric function such as cosine similarity.
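A minimal sketch of the batch sampling, linear encoder, and structure alignment loss. We assume a mean squared Frobenius distance between the two inner-product similarity matrices for the alignment loss; the names and toy data are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_batch(X_rw, X_sym, b):
    """Draw the same b random node indices from both views."""
    idx = rng.choice(X_rw.shape[0], size=b, replace=False)
    return X_rw[idx], X_sym[idx]

def alignment_loss(Z_rw, Z_sym):
    """Align the two batch similarity matrices S = <Z, Z>
    (pairwise inner products) with a mean squared error."""
    S_rw, S_sym = Z_rw @ Z_rw.T, Z_sym @ Z_sym.T
    b = Z_rw.shape[0]
    return np.sum((S_rw - S_sym) ** 2) / (b * b)

# toy views (n = 10 nodes, f = 4 features) and a shared linear encoder W
X_rw = rng.normal(size=(10, 4))
X_sym = X_rw + 0.01 * rng.normal(size=(10, 4))
W = rng.normal(size=(4, 3))                    # single linear layer
Xb_rw, Xb_sym = sample_batch(X_rw, X_sym, b=4)
loss = alignment_loss(Xb_rw @ W, Xb_sym @ W)   # Z = Enc(X^b) = X^b W
```

Sharing the sampled indices across the two views keeps the rows of the two similarity matrices in correspondence, which is what makes the element-wise alignment meaningful.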

High-Order Structure Contrastive Learning
Instead of performing contrastive learning on the first-order contrastive similarity, we perform it on the second-order contrastive similarity, which provides a wider view. In the structure-based contrastive similarity matrix, S_ij denotes the structural similarity between node i and node j. Moreover, since structure-based contrastive learning is built on the similarity matrices, it also implies a similarity of features. Inspired by [23], we implement contrastive learning by minimizing the distance between the structure-based contrastive similarity matrix S and the identity matrix I.
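The second-order contrast can be sketched as follows. We assume the contrastive similarity matrix is the product of the two row-normalized structure matrices and that the loss penalizes its Frobenius distance to the identity; the exact forms in the paper may differ:

```python
import numpy as np

def row_normalize(M):
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def second_order_contrastive_loss(Z_rw, Z_sym):
    """Build the two structure (similarity) matrices, combine their
    row-normalized forms into a contrastive similarity matrix S, and
    pull S toward the identity: matched nodes should share the same
    neighborhood pattern, and unmatched nodes should not."""
    S_rw = row_normalize(Z_rw) @ row_normalize(Z_rw).T     # view-1 structure
    S_sym = row_normalize(Z_sym) @ row_normalize(Z_sym).T  # view-2 structure
    S = row_normalize(S_rw) @ row_normalize(S_sym).T       # contrastive matrix
    b = S.shape[0]
    return np.sum((S - np.eye(b)) ** 2) / b
```

Note that each entry S_ij compares the entire similarity row of node i in one view with that of node j in the other, which is why the contrast covers a larger neighborhood than feature-level (first-order) contrast.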

Joint Optimization
On the one hand, the alignment of the structure similarity matrices forces the embeddings to map into the same space. On the other hand, contrastive learning on the similarity matrices naturally benefits the clustering task. We train our model by jointly minimizing these two objective functions. The details of the training process are shown in Algorithm 1.
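Putting the two objectives together, the joint loss of one training step could look like the following sketch, where the two terms are simply summed (an unweighted sum is our assumption):

```python
import numpy as np

def joint_loss(Z_rw, Z_sym):
    """Unweighted sum of the structure alignment loss and the
    second-order contrastive loss (the weighting is our assumption)."""
    b = Z_rw.shape[0]
    S_rw, S_sym = Z_rw @ Z_rw.T, Z_sym @ Z_sym.T           # batch structures
    align = np.sum((S_rw - S_sym) ** 2) / (b * b)          # map to one space
    norm = lambda M: M / np.linalg.norm(M, axis=1, keepdims=True)
    S = norm(S_rw) @ norm(S_sym).T                         # contrastive matrix
    contrast = np.sum((S - np.eye(b)) ** 2) / b            # align with identity
    return align + contrast
```

In a full implementation this scalar would be minimized with a stochastic optimizer such as Adam, one sampled batch at a time.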

Complexity Analysis
In this paper, we denote d as the dimension of the encoder, b as the sampled batch size, and f as the dimension of the raw features. The overall computational complexity of our model is O(bfd + b²d + b³); the detailed breakdown accompanies Algorithm 1. The complexity is thus dominated by the batch size rather than the total number of nodes.

Datasets

We conducted extensive experiments on five widely used benchmark datasets: Cora, Dblp, Amap, Corafull, and Reddit. More details can be found in Table 2.

•
Cora [18] is a citation dataset. Each node denotes a machine learning paper, and each edge denotes a citation relationship between two papers. The papers are divided into seven classes: case-based, genetic algorithms, neural networks, probabilistic methods, reinforcement learning, rule learning, and theory. Each node's feature is a 0/1 vector, where each dimension corresponds to a keyword from a specific vocabulary.

•
Dblp [24] is a collaboration network. The authors are categorized into four classes: database, data mining, machine learning, and information retrieval. Each edge represents a collaborative relationship between authors. The node features are bag-of-words vectors over keywords.

•

Amap [33] is a co-purchase graph dataset. Each node denotes a type of good, and each edge connects goods that are often purchased together. The nodes are divided into eight classes according to the category of the goods.

•
Corafull [33] is similar to Cora but is larger, and the papers within it are divided into 70 classes.

•
Reddit [1] is constructed from Reddit posts made in September 2014. Each node denotes a post, and an edge connects two posts commented on by the same user. The posts are divided into 41 classes. The node features are built from the average of 300-dimensional GloVe word vectors associated with the content of the posts, including the title, comments, score, and number of comments.

Experimental Setup
All experiments were run on a computer with a GeForce GTX 1080 Ti GPU, 64 GB of RAM, and PyTorch 1.8.1. We set the maximum number of training iterations to 100 for all datasets and optimized our model using the Adam optimizer. After training, we ran the K-means clustering algorithm on the learned embeddings. To reduce the impact of randomness, we repeated each experiment 10 times and report the average results.

Parameter Setting
In our model, we used a single-layer MLP as the encoder. The dimension of the output was 100 for Reddit and 500 for the other datasets. For simplicity, we used a linear transformation with no activation function. Instead of inputting the whole feature matrix for training, we performed the training in batches with an assigned batch size b. For Amap and Reddit, we set b = 256; for Cora and Corafull, we set b = 512; and for Dblp, b = 1024. For the compared baselines, we used the settings specified in their respective papers. The details of the hyperparameters are shown in Table 3.

Ablation Study
We performed an ablation study from two perspectives: (1) to validate the effectiveness of high-order contrastive learning, we ran two experiments, one with first-order contrastive learning and one with second-order contrastive learning; (2) to assess the effectiveness of each component of our model, we conducted experiments in which the structure alignment and the contrastive learning were removed individually.
In Table 9, we can observe that contrastive learning on the first-order similarity matrix consistently underperformed contrastive learning on the second-order similarity matrix. This is because first-order contrastive learning is based on feature similarity, which may lead to representation bias, whereas second-order contrastive learning is based on neighborhood similarity, which can alleviate this bias. In addition, compared to first-order contrastive learning, second-order contrastive learning can learn clustering-oriented representations more effectively. In Table 10, we can observe that each component of our model contributed to the performance. Specifically, when we removed the contrastive part (CL), the performance decreased significantly on all datasets, because without CL, representation bias impacted the performance. When the structure alignment (SA) was omitted, the impact on the performance was minimal for the Cora, Dblp, Amap, and Corafull datasets but significant for Reddit. This is because CL carries a risk of removing useful relationships, which can harm performance, whereas SA preserves these relationships and alleviates this issue. The model conducted graph convolution five times on Reddit and no more than three times on the other datasets. By aggregating more neighbors, the number of nodes similar to the target one increases in the embedding space. When the model performed contrastive learning on the similarity matrices of the Reddit dataset, it therefore removed more useful relationships than on the other datasets, and the performance decreased more on Reddit.

Hyperparameter Analysis
In this paper, we introduced two hyperparameters, b and t. b denotes the batch size of the input features, and t controls the power of the Laplacian smoothing performed before training.
In Figure 3, we show how the performance varied with the batch size over the range {256, 512, 1024, 2048}. The performance on the Amap, Cora, and Corafull datasets was not sensitive to changes in the batch size. However, a larger batch size enhanced the clustering performance on Dblp, with the best results at a batch size of 1024, whereas on Reddit, a smaller batch size was more beneficial for representation learning. This is because Dblp aggregated the first-order neighborhood for its representation, whereas Reddit aggregated the fifth-order neighborhood: a larger batch size facilitated the reduction of redundant relationships in Dblp but increased the risk of removing useful relationships in Reddit. In Figure 4, we illustrate how the performance varied with the power of the Laplacian smoothing. The ACC stabilized once the power reached 2, except on the Reddit dataset, where the model achieved its best performance at t = 5 and remained stable within the range [3, 6]. In summary, our model demonstrated low sensitivity to these two hyperparameters, even when they varied over considerable ranges.

Visualization Analysis
To demonstrate the effectiveness of our model in the clustering task, we illustrate a series of similarity matrices in Figure 5, showing the quality of the learned representations in each cluster. The color scale ranges from 0 to 1, where brighter colors indicate higher similarity between the corresponding nodes, and each diagonal block denotes a cluster. The quality of the representations can be assessed from two perspectives: (1) whether the number of diagonal blocks equals the number of real clusters, and (2) whether the diagonal blocks can be easily recognized. In Figure 5, we can observe that our model outperformed the other methods with respect to both the number of clusters and the clarity of the clustering structure. Considering these criteria, our model learned the representations of the highest quality.
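The block-diagonal inspection used in Figure 5 can be reproduced for any embedding by sorting nodes by cluster label before computing cosine similarities (a sketch with synthetic embeddings; the names are ours):

```python
import numpy as np

def sorted_cosine_similarity(Z, labels):
    """Order nodes by cluster label and return the cosine similarity
    matrix; clear bright diagonal blocks, one per cluster, indicate
    clustering-friendly embeddings."""
    order = np.argsort(labels, kind="stable")
    Zs = Z[order]
    Zs = Zs / np.linalg.norm(Zs, axis=1, keepdims=True)
    return Zs @ Zs.T

# two well-separated synthetic clusters of 3 nodes each
rng = np.random.default_rng(0)
Z = np.vstack([rng.normal([5, 0], 0.1, size=(3, 2)),
               rng.normal([0, 5], 0.1, size=(3, 2))])
labels = np.array([0, 0, 0, 1, 1, 1])
S = sorted_cosine_similarity(Z, labels)
```

Plotting S as a heatmap (e.g., with matplotlib's `imshow`) then shows two bright diagonal blocks and dark off-diagonal regions.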

Conclusions
In this paper, we propose GCHCL, a high-order contrastive learning method for graph clustering without manual augmentation. We contrast two high-order structures, constructed using two different Laplacian smoothing methods, to reveal the nodes' similarity at the structural level, and we align the high-order structures to force the corresponding embeddings to map into the same space. After building a contrastive structure using the high-order structures, we perform contrastive learning by aligning the contrastive structure with an identity matrix. In this way, our model can naturally learn clustering-friendly representations. Extensive experiments on datasets of various scales validate the effectiveness of the proposed model.

Figure 1 .
Figure 1. First-order contrastive learning and second-order contrastive learning. Z_1 and Z_2 denote the features, and S_1 and S_2 are the similarity matrices built from Z_1 and Z_2.

Figure 2 .
Figure 2. The overall framework of the GCHCL model.

Specifically, the complexity of the encoder is O(bfd), the complexity of constructing a similarity matrix is O(b²d), and the complexity of constructing the contrastive similarity matrix is O(b³). Thus, the entire computational complexity of the proposed model is O(bfd + b²d + b³), dominated by the scale of the batch size.

Algorithm 1. Graph Clustering with High-Order Contrastive Learning
Input: Attribute matrix X, adjacency matrix A, training iterations T, identity matrix I, number of clusters K, number of nodes n, hyperparameters t, b
1: Build the two normalized Laplacian matrices using (1) and (2)
2: Build the two views of the filtered attributes using (3)-(6)
3: for i = 1 to T do
4:  for j = 1 to ⌈n/b⌉ do
5:   Randomly sample b nodes from each view using (7) and (8)
6:   Generate the embeddings Z_rw and Z_sym using (9)-(12)
7:   Build the similarity matrices using (14) and (15)
8:   Build the contrastive similarity matrix using (17)
9:   Calculate the alignment loss of the similarity matrices using (16)
10:  Calculate the contrastive loss between the contrastive similarity matrix and the identity matrix using (18)
11:  Update the whole framework by minimizing (19)
12:  end for
13: end for
14: Obtain the fused embeddings Z_f using (13)
15: Perform K-means clustering on Z_f
Output: The clustering result O

Figure 3 .
Figure 3. The sensitivity of our model to the batch size.

Figure 4 .
Figure 4. The sensitivity of our model to the power of smoothing.
of the Cora dataset demonstrated the highest quality for clustering.

• The baselines from GAE to DFCN were classical deep graph clustering models, mostly trained by reconstructing the raw features or the given graphs. GAE, VGAE, MGAE, ARGE, ARVGE, AGCN, and DAEGC were sub-optimal compared to our model because they only used a single view for embeddings, which limited the diversity of features available for representation learning. SDCN and DFCN learned the representations through a cross-module approach, enriching the information for learning. The reason our model outperformed SDCN and DFCN was that they heavily relied on the provided graph, which could not fully reveal the complete connections between nodes and may have misled representation learning. The utilization of a similarity matrix in our model can greatly alleviate this.

• The baselines from SCAGC to Sublime are graph clustering models based on contrastive learning. All of them implemented contrastive learning at the feature level, which could not effectively capture the neighborhood of each node, an important aspect for clustering tasks. Our model directly performed contrastive learning at the structural level. This allows the contrastive learning in our model to facilitate the clustering task more effectively.

• On the Reddit dataset, most of the baselines struggled with the training cost, leading to OOM (out-of-memory) issues. There are two reasons for this: (1) they usually input the whole dataset into the model during training, and (2) the entire adjacency matrix consistently participated during training. In our model, we input batches of features instead of the whole feature matrix, which greatly reduced the computations.

Table 9 .
Performance comparison of first-order contrastive learning and second-order contrastive learning.

Table 10 .
The effectiveness of each component in our model. SA denotes second-order structure alignment, and CL denotes second-order contrastive learning.

Table A1 .
Definitions of acronyms.