Auxiliary Graph for Attribute Graph Clustering

Attribute graph clustering algorithms that incorporate topological structure information into node features to build robust representations have shown promising efficacy in a variety of applications. However, the given topological structure emphasizes local links between directly connected nodes and fails to convey relationships between nodes that are not directly linked, which limits further improvement in clustering performance. To solve this issue, we propose the Auxiliary Graph for Attribute Graph Clustering technique (AGAGC). Specifically, we construct an additional graph from the node attributes, which serves as an auxiliary supervisor that complements the given graph. To generate a trustworthy auxiliary graph, we apply a noise-filtering approach. Under the supervision of both the pre-defined graph and the auxiliary graph, a more effective clustering model is trained. Additionally, the embeddings of multiple layers are merged to improve the discriminative power of the representations. We add a clustering module as a self-supervisor to make the learned representations more clustering-aware. Finally, our model is trained with a joint loss that combines the two graph reconstruction losses and the clustering loss. Experiments are conducted on four benchmark datasets, and the results demonstrate that the proposed model outperforms or is comparable to state-of-the-art graph clustering models.


Introduction
Attribute graph data is ubiquitous in the real world, for example in social networks [1], citation networks [2], and protein-protein interaction networks [3]. Because labeled data is often lacking, there is a need to divide such data into groups without supervision.
In the early days, graph clustering methods used only structure information for network embedding. Utilizing structure information, some random-walk-based methods [4,5] implement representation learning by maximizing the co-occurrence probability of node pairs. Recently, refs. [6][7][8] suggested mining meaningful features from networks with the Block Decomposition Method (BDM). For example, ref. [6] uses BDM to obtain graph motif complexity for network clustering. By removing a minimum subset of edges, refs. [7,8] obtain the desired clusters with minimum loss of information contribution, which is calculated from the algorithmic complexity obtained with BDM. Along with the development of deep models, plenty of deep clustering models have emerged [9][10][11][12][13][14]. However, conventional deep clustering models focus on Euclidean structure data, such as data of faces, animals, and vehicles. Unlike Euclidean structure data, the relationships between nodes in a graph have nothing to do with their positions in space. For this reason, traditional deep models cannot properly handle both the attributes and the structure of graph data. Recently, the question of how to sufficiently exploit both graph structure and node attributes has attracted more and more attention in clustering tasks. The Graph Convolutional Network (GCN) [2] is a powerful model that meets this need, and a great number of GCN-based graph clustering models have been developed. Inspired by the auto-encoder, the Graph Auto-Encoder (GAE) [15] implements representation learning with an encoder-decoder mechanism. Following GAE, ARGE [16] improves representation learning by introducing an adversarial training module. MGAE [17] exploits the interplay between node attributes and structure information. GAT [18] introduces an attention mechanism that assigns different weights to different neighbors. Following GAT, ref. [19] aggregates the neighbors of each node with an attention mechanism learned in an unsupervised way. SDCN [20] is a deep model that alleviates the impact of over-smoothing by fusing embeddings from different modalities. Based on SDCN, DFCN [21] improves performance by integrating global structure information into local structure information.
To some degree, these GCN-based models exploit structure information in different ways and have achieved noticeable improvements. However, we found three kinds of cases that lead to sub-optimal performance: (1) methods that ignore the global structure completely; (2) methods that take the global structure into consideration but are trained with only the guidance of the given graph structure; (3) methods that ignore the guidance of the structure. All of these methods fail to exploit the global structure appropriately and consequently reach sub-optimal performance.
To solve this issue, unlike those shallow models mentioned before [4][5][6][7][8], we propose a deep graph clustering model termed Auxiliary Graph for Attribute Graph Clustering. In particular, we construct an additional graph as a supervisor based on the similarity between nodes in their raw feature space. However, the newly constructed graph is rife with erroneous relationships due to the underlying noise in the raw data. To mitigate the impact, we employ a filtering technique to choose a certain number of nodes closest to the target nodes. We retain the relationships between each target node and a predefined number of neighbors who can be considered somewhat dependable. Assuming that the remaining relationships are untrustworthy, they are disregarded. We combine embeddings from various layers to generate representations that are highly discriminative. Finally, we have created a training technique that incorporates both reconstruction loss and clustering loss. On the one hand, we optimize our model by forcing it to reconstruct a graph that can approximate both the pre-defined graph and the auxiliary graph. On the other hand, we employ a clustering-oriented optimization whose efficacy has been thoroughly proved. In the former scenario, these two types of rebuilding are complementary. In the latter case, the clustering-friendly model enables learned representations to facilitate the clustering operation.
Our contributions are summarized as follows:
• We build an auxiliary graph to reveal the relationships that are missed by the given graph. Under the supervision of both the auxiliary graph and the given graph, the learned representations become more reliable.
• Optimizing a clustering loss on the fused embeddings from multiple layers improves both the discriminativeness and the clustering-awareness of the representations.
• Extensive experiments on four popular benchmark datasets are conducted, and the results validate the superiority of our method over state-of-the-art methods.

Related Works
Deep clustering has always attracted extensive attention. During the past few years, plenty of deep clustering models have emerged [9,11,13,22-30]. Among them, the auto-encoder is a basic DNN model that is widely used in subsequent deep clustering models. In DEC [11], a target distribution is designed to prevent large clusters from distorting the hidden feature space, which alleviates the impact of data imbalance. Inspired by [29], IDEC [9] improved DEC by introducing a reconstruction objective; training by reconstructing the input data preserves the local structure of the embeddings. DSC [22] also introduces an auto-encoder framework into the subspace clustering module, where the auto-encoder learns a non-linear mapping that facilitates subspace clustering. DMC [30] keeps the local structure by minimizing the distance between each target point and its K-nearest neighbors. At the same time, it constructs a clustering-friendly objective that improves the representations. By forcing the embeddings from a noisy encoder to approximate those from a clean encoder, DEPICT [24] improves the robustness of representations. Although effective, these deep clustering models neglect the graph structure, which contains a wealth of information that can greatly improve representation learning.
Recently, GCN-based deep clustering models have gained much attention, and an abundance of excellent models have been proposed [15,16,19-21,31-34]. Ref. [15] designed an encoder-decoder framework based on the graph convolutional network (GAE) and a variant (VGAE) based on VAE [31]. As an unsupervised graph-based representation learning method, it is popular for clustering tasks. AGC argues that each graph has its own distinct structure, so it is unreasonable to cluster different graphs by aggregating neighbors within a fixed neighborhood. Instead of keeping a fixed neighborhood for every graph, AGC proposes a measurement for choosing a proper neighborhood scale. DAEGC [19] observes that neighbors are not equally important to the target node and captures the importance of each neighbor with an attention network. Other works develop different training schemes to improve clustering performance. MGAE [17] corrupts node features with a pre-defined probability so that the interaction between node content and structure is reinforced and the representation capacity of the network is improved. ARGE [16] incorporates an adversarial training scheme into GAE to learn a robust representation. Instead of reconstructing the graph only, ref. [35] improved the performance of ARGE by reconstructing both the graph and the features. EGAE-JOCAS [32] utilizes K-means and spectral clustering jointly to guide representation learning and improve performance. Some models [20,21,34] combine deep features of multiple modalities to alleviate over-smoothing. In SDCN [20], a GCN module and an auto-encoder module are integrated. With the representations from the auto-encoder module incorporated, the GCN is capable of capturing relationships between nodes at longer distances. DFCN [21] improved SDCN by dynamically integrating features of multiple modalities and optimizing with triplet guidance, which can generate robust representations. AGCN [34] argues that, when fusing features, features from different modalities should not be considered equally important. It proposes to adaptively fuse features of different modalities at each layer, and then adaptively fuse features of different layers.
Most of the methods mentioned above achieved promising performance in clustering, but few consider that there are plenty of relations that are missed by the given graph structure.

Proposed Method
The proposed model consists of the graph encoder, graph decoder, and clustering module, which are introduced in turn below. Figure 1 shows the flow chart of our proposed method.

Figure 1. The framework of AGAGC. From top to bottom, our model consists of three components: auxiliary graph creation, graph auto-encoder, and clustering procedure. The top section represents the construction of the auxiliary graph $A_s$ and consists of two steps: build and process. The construction of the auxiliary graph is a prerequisite to training. As the backbone of the middle section, we employ a graph auto-encoder (GAE) with a GCN module as the encoder. The encoder accepts as input the feature matrix $X$ and the provided graph $A$. After encoding, we concatenate the embeddings of each GCN layer to get the output, denoted by $H_f$. As in GAE, we employ an inner product as our model's decoder. The graph decoder generates a symmetric matrix $M$ by applying the inner product to $H_f$ and then a sigmoid function. During training, $M$ is required to approximate both the pre-defined graph $A$ and the auxiliary graph $A_s$ (by minimizing $L_{ra}$ and $L_{rs}$, respectively). The bottom section is the clustering module, which takes $H_f$ as its input. The goal of introducing this module is to increase the representations' awareness of clustering. The clustering module generates $Q$ using a Student's t-distribution and derives a target distribution $P$ from $Q$ for the purpose of producing cluster-structured representations. By minimizing $L_c$ (the KL divergence) between $Q$ and $P$, the model improves the cluster-friendliness of the representations.

Problem Definition
Given an undirected graph $G = (V, E)$, $V = \{v_1, v_2, \ldots, v_n\}$ is the set of nodes with $|V| = n$, and $E$ is the edge set. $X^T = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$ denotes the feature matrix of the nodes. $A \in \mathbb{R}^{n \times n}$ denotes the symmetric adjacency matrix that indicates the connections between nodes, i.e., $A_{ij} = 1$ if node $i$ links to node $j$, and $A_{ij} = 0$ otherwise. More notations are summarized in Table 1.

Table 1. Notations.
Notation: Meaning
$P$: Target distribution

Graph Encoder
GCN is used as a powerful tool for extracting features by integrating topological information into node attributes. In our model, we use the GCN as a basic module for encoding.
In GCN, the features of nodes are filtered in the frequency domain, so that each node's features are enhanced by those of its neighbors and become more robust. After filtering, the features are transformed linearly by a weight matrix and passed through an activation function. Following [2], this process is formulated as:

$$H^{(l)} = \phi\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l-1)} W^{(l)}\right) \tag{1}$$

Here $\tilde{A} = A + I$ is the adjacency matrix with self-loops and $\tilde{D}$ is its degree matrix. $H^{(0)}$ denotes the input of the encoder, $H^{(0)} = X$. $l \in \{0, 1, 2, \ldots, L\}$ denotes the index of the layer, and $L$ denotes the index of the last layer in the encoder. $W^{(l)}$ is the parameter of the $l$-th layer. $\phi$ denotes an activation function such as Tanh or LeakyReLU.
In GCN, different layers are supposed to generate features of different scales. Embeddings obtained by fusing features from multiple layers should therefore be more discriminative than embeddings from a single layer. We apply a fusion strategy to the encoder, i.e., we simply concatenate the outputs of all encoder layers to generate robust representations:

$$H_f = \mathrm{Concat}\left(H^{(1)}, H^{(2)}, \ldots, H^{(L)}\right) \tag{2}$$

In (2), Concat denotes the concatenation function.
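As a rough illustration of the encoder with layer fusion, the following NumPy sketch stacks GCN layers and concatenates their outputs. It assumes the standard symmetric normalization with self-loops from [2] and a Tanh activation; the function names (`normalize_adjacency`, `gcn_encoder`), the toy graph, and the random weights are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    deg = A_tilde.sum(axis=1)              # always >= 1 because of self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_encoder(X, A, weights, activation=np.tanh):
    """Run a stack of GCN layers and concatenate every layer's output."""
    A_hat = normalize_adjacency(A)
    H = X
    layer_outputs = []
    for W in weights:
        H = activation(A_hat @ H @ W)      # one GCN layer
        layer_outputs.append(H)
    return np.concatenate(layer_outputs, axis=1)   # fused embedding H_f

# Toy example: a 5-node ring graph with 8-dimensional features (assumed data).
rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.random((n, d))
A = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 0, 0, 1, 0]], dtype=float)
weights = [rng.normal(scale=0.1, size=(d, 4)),
           rng.normal(scale=0.1, size=(4, 2))]
H_f = gcn_encoder(X, A, weights)
print(H_f.shape)  # (5, 6)
```

The fused width is the sum of the layer widths (4 + 2 here), so a d-256-16 network would yield 272-dimensional fused embeddings under this sketch.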

Graph Decoder
A graph decoder is usually used to reconstruct the original graph. Following the Graph Auto-Encoder [15], we use an inner-product operation as the graph decoder. The output of the decoder is a symmetric matrix constructed from the output of the encoder:

$$M = \sigma\left(H_f H_f^{T}\right) \tag{3}$$

$\sigma$ is an activation function (the sigmoid) that scales the values to the range (0, 1). $M$ is seen as the recovery of the original graph.
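The inner-product decoder can be sketched in a few lines of NumPy. This is a minimal illustration that assumes a dense embedding matrix; the function name `inner_product_decoder` and the toy embeddings are hypothetical.

```python
import numpy as np

def inner_product_decoder(H_f):
    """Reconstruct a dense graph as sigmoid(H_f H_f^T)."""
    logits = H_f @ H_f.T                    # pairwise inner products, symmetric
    return 1.0 / (1.0 + np.exp(-logits))    # squash to (0, 1)

# Toy embeddings (assumed): nodes 0 and 1 point in similar directions, node 2 does not.
H_f = np.array([[1.0, 0.0],
                [0.8, 0.1],
                [-1.0, 0.5]])
M = inner_product_decoder(H_f)
```

Nodes whose embeddings have a positive inner product receive an edge probability above 0.5, and nodes with a negative inner product receive one below 0.5.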

Optimization by Reconstructing Graphs
For the purpose of revealing the relationship between nodes thoroughly, we need to find the latent relationship between nodes that are missed by the original graph. To achieve this, we use M to reconstruct the original graph and the complementary graph simultaneously.

Optimization by Reconstructing Original Graph
After obtaining $M$ from the graph decoder, we optimize the model by minimizing the reconstruction loss $L_{ra}$ between $M$ and $A$. This process is widely used in graph auto-encoder models. Here we minimize the loss between $M$ and the original graph to keep the performance at a basic level.

Optimization by Reconstructing Complementary Graph
There are three parts to this optimization process. We describe them in the following sections: graph build, graph process, and minimization of reconstruction loss.

• Graph Build. To complement the given graph, we build a graph based on a similarity metric such as cosine similarity, which can discover the latent relationships between nodes from a global view. With cosine similarity, the complementary graph is constructed as:

$$\hat{x}_i = \frac{x_i}{\|x_i\|_2} \tag{5}$$

$$S_{ij} = \hat{x}_i^{T} \hat{x}_j \tag{6}$$

After calculating the similarity between each pair of nodes, we obtain a graph $S$ capturing the global relationships.
• Graph Process. After graph building, we obtain an initial graph $S$ that unavoidably contains noise. To obtain a relatively clean graph, we introduce a simple but effective filtering mechanism. First, we rank each row of $S$ in descending order with a sort function. After ranking, $S^{rank}_i = \{S_{ir_1}, S_{ir_2}, S_{ir_3}, \ldots, S_{ir_n}\}$ with $S_{ir_k} \geq S_{ir_{k+1}}$. Then we keep only the top-$K$ relations of highest confidence in each row and set the rest to 0 to decrease the impact of false relations:

$$(A_s)_{ij} = \begin{cases} S_{ij}, & S_{ij} \in \{S_{ir_1}, \ldots, S_{ir_K}\} \\ 0, & \text{otherwise} \end{cases} \tag{7}$$

• Minimization of reconstruction loss. After filtering, we obtain a more reliable graph $A_s$. We implement representation learning by minimizing the loss $L_{rs}$ between $M$ and $A_s$. $A$ and $A_s$ are supervisors that are complementary to each other.
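The build and process steps above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: cosine similarity is used as the metric, each node's self-similarity is counted among its top-K entries, the retained similarity values are kept as edge weights, and the result is not re-symmetrized. The function name `build_auxiliary_graph` and the toy features are hypothetical.

```python
import numpy as np

def build_auxiliary_graph(X, k):
    """Cosine-similarity graph that keeps only the top-k entries of each row."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / np.maximum(norms, 1e-12)    # guard against all-zero rows
    S = X_unit @ X_unit.T                    # dense similarity matrix (graph build)
    # Column indices of the k largest similarities in each row (graph process).
    top_idx = np.argsort(-S, axis=1)[:, :k]
    A_s = np.zeros_like(S)
    rows = np.arange(S.shape[0])[:, None]
    A_s[rows, top_idx] = S[rows, top_idx]    # everything else stays 0
    return A_s

# Toy features (assumed): nodes 0 and 1 are similar, nodes 2 and 3 are similar.
X = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0],
              [0.1, 0.9]])
A_s = build_auxiliary_graph(X, k=2)
```

Because the similarity matrix is dense, this sketch needs memory quadratic in the number of nodes; a large graph would likely require a blocked or approximate nearest-neighbor search instead.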

The Joint Reconstruction Loss
A single supervisor may lead to bias in representation learning. Instead of using a single supervisor, we minimize the reconstruction loss with respect to both supervisors $A$ and $A_s$. The objective function is formulated as follows:

$$L_{rec} = L_{ra} + \lambda L_{rs} \tag{9}$$

$\lambda$ is a hyper-parameter used to control the importance of $L_{rs}$.
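To make the weighting concrete, here is a small sketch of the joint reconstruction loss. The per-graph loss is assumed to be a mean squared error, which is only one plausible choice (binary cross-entropy, as in GAE, would be another); the function names and the toy matrices are hypothetical.

```python
import numpy as np

def reconstruction_loss(M, target):
    """Mean squared error between the reconstruction and a supervisor graph (assumed form)."""
    return float(np.mean((M - target) ** 2))

def joint_reconstruction_loss(M, A, A_s, lam):
    """Loss against the given graph plus lam times the loss against the auxiliary graph."""
    L_ra = reconstruction_loss(M, A)
    L_rs = reconstruction_loss(M, A_s)
    return L_ra + lam * L_rs

# Toy case (assumed data): M reproduces A exactly, so only the auxiliary term contributes.
A = np.array([[0.0, 1.0], [1.0, 0.0]])
A_s = np.array([[1.0, 0.5], [0.5, 1.0]])
M = A.copy()
loss = joint_reconstruction_loss(M, A, A_s, lam=0.1)
```

With these toy values the auxiliary term alone is 0.625, so a weight of 0.1 gives a joint loss of 0.0625; raising the weight shifts the optimum toward the auxiliary graph.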

Clustering Module
For unsupervised learning approaches, there are no given labels to define the target function, so we need an optimization objective that guides our model toward the clustering task. As most graph clustering models do, we introduce an alternative strategy to conduct a clustering-oriented optimization. We use Student's t-distribution as the kernel to measure the similarity between centroids and embeddings:

$$q_{ij} = \frac{\left(1 + \|h_i - \mu_j\|^2\right)^{-1}}{\sum_{k} \left(1 + \|h_i - \mu_k\|^2\right)^{-1}} \tag{10}$$

$h_i$ denotes the $i$-th row of $H_f$, and $\mu_k$ denotes the centroid of cluster $k$, which is initialized by k-means or random vectors. $q_{ij}$ denotes the probability that node $i$ belongs to cluster $j$. To improve the accuracy of the centroids, we generate a target distribution. By matching the Student's t-distribution $Q$ to the target distribution $P$, the cluster centroids and the embeddings are optimized simultaneously. The target distribution is constructed by the following equation:

$$p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{k} q_{ik}^2 / f_k} \tag{11}$$

In (11), $f_k = \sum_i q_{ik}$. The optimization process minimizes the KL divergence between $P$ and $Q$:

$$L_c = KL(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \tag{12}$$
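The clustering module described above can be sketched in NumPy as follows. The sketch assumes a Student's t kernel with one degree of freedom, as in DEC [11], and a target distribution built by squaring the soft assignments and dividing by the soft cluster frequency $f_k$; the function names and the toy embeddings and centroids are hypothetical.

```python
import numpy as np

def soft_assignment(H, centroids):
    """Student's t kernel (one degree of freedom) between embeddings and centroids."""
    sq_dist = ((H[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = 1.0 / (1.0 + sq_dist)
    return q / q.sum(axis=1, keepdims=True)      # each row sums to 1

def target_distribution(Q):
    """Sharpen Q: square entries, divide by soft cluster frequency, renormalize rows."""
    f = Q.sum(axis=0)                            # f_k = sum_i q_ik
    weight = Q ** 2 / f
    return weight / weight.sum(axis=1, keepdims=True)

def kl_clustering_loss(P, Q):
    """KL(P || Q) summed over all nodes and clusters."""
    return float(np.sum(P * np.log(P / Q)))

# Toy embeddings (assumed): two nodes near each of two centroids.
H = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0], [2.1, 2.0]])
centroids = np.array([[0.0, 0.0], [2.0, 2.0]])
Q = soft_assignment(H, centroids)
P = target_distribution(Q)
loss = kl_clustering_loss(P, Q)
```

In this toy case the target distribution is more confident than the soft assignment for every node, which is what drives the embeddings toward their nearest centroid during training.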

Joint Optimization
To train the graph encoder-decoder and clustering module jointly, we design the objective function as:

$$L = L_{rec} + L_c \tag{13}$$

$L_c$ denotes the clustering loss, and $L_{rec}$ denotes the reconstruction loss. After training, we obtain the clustering results $Y$ from $Q$, and the prediction for node $i$ is assigned by:

$$y_i = \arg\max_j q_{ij} \tag{14}$$

Specifically, $Y = [y_1, y_2, \ldots, y_n]$, where $y_i$ is the position of the maximum value in $q_i$ and also serves as the pseudo cluster label. The detailed steps are summarized in Algorithm 1.
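Reading the predictions off the soft assignments is a row-wise argmax, as the short sketch below shows. The function name `predict_clusters` and the example matrix are hypothetical.

```python
import numpy as np

def predict_clusters(Q):
    """Assign each node to the cluster with the largest soft-assignment probability."""
    return np.argmax(Q, axis=1)

# Toy soft assignments (assumed) for 3 nodes and 3 clusters.
Q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6],
              [0.2, 0.5, 0.3]])
Y = predict_clusters(Q)
print(Y.tolist())  # [0, 2, 1]
```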

Complexity Analysis
Owing to the sparsity of the adjacency matrix, the computational complexity of GCN is linear in $|E|$. Let $d$ be the maximum number of neurons in the hidden layers; then the complexity is $O(|E|d^2)$. In addition, letting $k$ be the number of clusters, the computational complexity of (10) is $O(nk + n \log n)$. Taking both the GCN and the clustering module into account, the final complexity is $O(|E|d^2 + nk + n \log n)$.

Datasets
We conduct experiments on four widely used graph datasets. More details about them are summarized in Table 2.
• Citeseer. This is a citation dataset. Papers in it are divided into six categories: Agents, Artificial Intelligence, Database, Information Retrieval, Machine Learning, and HCI. Each edge represents a citation relationship between documents. Each node denotes a paper whose feature is represented by a {0, 1} vector, in which each dimension corresponds to a keyword from a specific vocabulary.
• Dblp. This is a cooperative network. Authors in it are divided into four classes: database, data mining, machine learning, and information retrieval. An edge represents a cooperative relationship between authors. The node features are bag-of-words representations of keywords.
• Acm. This is a paper network. An edge between two nodes indicates that the two papers are written by the same author. Papers are divided into three classes: Database, Wireless Communication, and Data Mining. The features are bag-of-words representations of keywords from the corresponding areas.
• Pubmed. This is a citation dataset about diabetes. The publications in it are divided into 3 classes: Diabetes Experimental, Diabetes type 1, and Diabetes type 2. Each node is represented by a tf-idf vector of keywords.

Baselines
We compare our proposed method with 12 methods, which can be divided into 4 types: non-model based, Auto-Encoder based, Graph Auto-Encoder based, and hybrid-module based.

Parameter Settings
As most GCN-based models do, we use a 2-layer network for our model. The dimensions of the layers are d-256-16, where d is the dimension of the input. The training process is divided into two steps. In the first step, we pre-train the network without the clustering module to minimize the reconstruction losses of the similarity graph and the graph structure. In the second step, we train the whole network together with the clustering loss. After analyzing the effect of the hyperparameters, we set λ = 0.1 for Citeseer and λ = 10 for the other datasets. We also set K = 100 for the top-K similarity. We set the learning rate to 0.001 for Citeseer, Dblp, and Acm, and to 0.005 for Pubmed. We train the network for 500 epochs on Dblp and Pubmed, 100 epochs on Acm, and 400 epochs on Citeseer. For fairness, we set the network dimensions of GAE and VGAE to be the same as ours. In addition, for Pubmed, we use a sampling strategy for training with the sampling rate set to 0.25: in each epoch, we sample a subgraph that contains 25% of the nodes of the dataset. For AE, GAE, and VGAE, we use K-means to obtain the clustering results. For the clustering methods, we follow the settings of their corresponding papers. We repeat each experiment 10 times and report the average result, which is shown in Table 3. All experiments are implemented with PyTorch and run on a GPU (GeForce GTX 1080Ti).
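The per-dataset settings listed above can be collected in a small configuration object, sketched below. The class name `TrainConfig`, its field names, and the dictionary keys are hypothetical; the values are the ones stated in the text, except that a sampling rate of 1.0 (no sampling) for datasets other than Pubmed is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    lam: float                      # weight of the auxiliary-graph reconstruction loss
    lr: float                       # learning rate
    epochs: int                     # number of training epochs
    top_k: int = 100                # neighbors kept per row of the similarity matrix
    hidden_dims: tuple = (256, 16)  # layer widths after the input dimension d
    sampling_rate: float = 1.0      # fraction of nodes sampled per epoch (assumed 1.0 = no sampling)

CONFIGS = {
    "citeseer": TrainConfig(lam=0.1, lr=0.001, epochs=400),
    "dblp": TrainConfig(lam=10.0, lr=0.001, epochs=500),
    "acm": TrainConfig(lam=10.0, lr=0.001, epochs=100),
    "pubmed": TrainConfig(lam=10.0, lr=0.005, epochs=500, sampling_rate=0.25),
}
```

Keeping the settings in one frozen structure makes it harder to mix up the dataset-specific values when rerunning the experiments.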

Metrics
We use four popular metrics to evaluate the clustering performance: ACC (Accuracy), NMI (Normalized Mutual Information), ARI (Adjusted Rand Index), and F1 (macro F1-score). ACC is obtained by matching predicted clusters to ground-truth labels and calculating the ratio of correctly matched samples to the total number of samples. NMI measures the mutual information between predictions and true labels. ARI measures the agreement between the predicted partition and the true partition, corrected for chance. F1 is an overall measurement of precision and recall. Higher values denote better performance.
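Because cluster indices are arbitrary, ACC is usually computed under the best one-to-one relabeling of the predicted clusters. The sketch below illustrates this with a brute-force search over permutations using only the standard library; it assumes integer labels starting at 0, and the function name `clustering_accuracy` and the toy labels are hypothetical. The search is factorial in the number of clusters, so it is only practical for a handful of clusters; the Hungarian algorithm is the common choice beyond that.

```python
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Best accuracy over all one-to-one relabelings of the predicted clusters."""
    k = max(max(y_true), max(y_pred)) + 1
    # counts[p][t] = number of nodes with predicted cluster p and true class t
    counts = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        counts[p][t] += 1
    best = max(
        sum(counts[p][mapping[p]] for p in range(k))
        for mapping in permutations(range(k))
    )
    return best / len(y_true)

# Toy labels (assumed): cluster ids are permuted and one node is misassigned.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [1, 1, 0, 2, 2, 0]
acc = clustering_accuracy(y_true, y_pred)
print(acc)  # 5 of the 6 nodes agree under the best relabeling
```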

Analysis of Result
In our experiments, our method was compared with 12 other methods on four benchmark datasets. Tables 3-6 show the results. Bold numbers represent the best performance, the underline denotes the second best. From these tables, we have these observations:

• We can observe from these tables that the proposed method outperforms all the compared baseline methods on the four benchmark datasets on most metrics. For example, on Dblp, our model outperforms the second-best one by nearly 4 pp (pp: percentage point), 7 pp, 8 pp, and 5 pp on ACC, NMI, ARI, and F1, respectively. On Pubmed, our model outperforms the second-best one by nearly 2 pp, 3 pp, 3 pp, and 2 pp on ACC, NMI, ARI, and F1, respectively. There are three reasons for the effectiveness of our model: first, we fuse embeddings from multiple layers to generate discriminative representations; second, we construct a filtered graph from the original feature space to preserve the global relations of nodes; last, we develop a joint training strategy to learn representations that facilitate clustering and preserve both local relations and intrinsic global relations of nodes.
• AE, DEC, and IDEC only use node features to generate embeddings, which leads to sub-optimal clustering performance compared with GCN-based models. K-means clustering is performed directly in the original feature space, so it can be used to measure the quality of the features. From the K-means results, we can observe that the quality of the data in Acm is the best.
• GAE, VGAE, ARGE, and ARVGE generate embeddings from a single layer. Compared with them, besides reconstructing intrinsic relationships, our model fuses multi-scale features to strengthen the discriminativeness of the embeddings.
• DAEGC exploits the attention mechanism for aggregation. Although it considers the relations between nodes in a wider range, it implements representation learning under the supervision of the given graph structure only, and thus cannot exploit the hidden relations that are missed by the given graph. Compared with it, our model has two advantages: first, we explore relations from a global view; second, the explored relations come from the original feature space and can be considered more intrinsic.
• SDCN, AGCN, and DFCN are powerful deep clustering models that exploit multi-modality to generate discriminative embeddings. Although they alleviate the problem of over-smoothing, these models fail to explore the latent relations of nodes that cannot be observed from the given graph. In contrast, by measuring the similarity between nodes, our model successfully reveals the missing relations from the original feature space and outperforms these models.

Ablation Study
To make clear how each part contributes to the proposed model, we conduct experiments in which the parts are removed. We also conduct experiments to validate the strategy of fusing the embeddings of each layer to improve the representations. The results of these experiments are shown in Table 7 and Table 8, respectively. Table 7 illustrates how each component of the model influences its performance. No single component outperforms the other two across all datasets. In Citeseer and Pubmed, deleting the fusion component has the most significant effect on performance. We conclude that features from various scales strengthen the representations in these datasets. However, in Dblp, similarity supervision has the greatest impact, indicating that mining latent edges is effective. The combination of similarity and fusion dominates the performance on Citeseer, whereas the combination of similarity and adjacency dominates the performance on Dblp. For Acm and Pubmed, however, it appears that only the combination of all three components results in significant improvements.

The Effectiveness of Each Layer
To demonstrate the efficacy of the fusion technique, we perform the clustering task on the embeddings of each layer individually. Table 8 provides the results. $H^{(1)}$ and $H^{(2)}$ represent the embeddings from layer 1 and layer 2, respectively, whereas $H_f$ represents the combination of $H^{(1)}$ and $H^{(2)}$. We can observe that, across all datasets, single-layer representations are consistently weaker than the multi-layer representation. In addition, we found that different datasets prefer different neighborhood scales: for Citeseer and Pubmed, the embeddings of layer 1 perform better, whereas the embeddings of layer 2 perform better for Dblp and Acm. In all cases, the best performance is achieved by combining the embeddings from both layers, validating the efficacy of our fusion technique.

Analysis of Hyperparameters
In our experiments, we introduce 2 hyperparameters. K is the number of nearest neighbors kept for each target node, i.e., the number of top-K values retained in each row of the similarity matrix. λ is a hyper-parameter that adjusts the importance of reconstructing the auxiliary graph.

Analysis of λ
We empirically choose the range of λ as {100, 10, 1, 0.1, 0.01}. In Acm, the performance fluctuates only slightly, but it is clear that the best performance is achieved when λ = 10. The best value of λ for Pubmed and Dblp is also 10, as can be observed in Figure 2. However, the best performance on Citeseer is achieved when λ = 0.1. These observations validate that: (1) The auxiliary graph is helpful for clustering tasks. (2) Compared to Citeseer, the auxiliary graph plays a more important role in Pubmed, Dblp, and Acm. In terms of the degree of improvement, Dblp benefits most: it achieves an improvement of nearly 20% in ACC when λ increases from 0.01 to 10. The other datasets also improve, but not as much as Dblp. The reason may be that, compared to the given graph of Dblp, the graphs of the other datasets cover relationships more completely. We can also observe that the performance tends to decrease to different degrees for all datasets when λ varies from 10 to 100. There are two reasons for this: (1) Although filtered, the auxiliary graph still contains noise, and putting too much weight on it increases the impact of noise. (2) There exist linked pairs in the given graph that belong to the same cluster but are not linked in the auxiliary graph. Putting too much emphasis on the auxiliary graph may ignore this kind of relationship, which leads to sub-optimal performance. For these reasons, we cannot put too much weight on the auxiliary graph during training.

Analysis of K
The range of K is {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 50, 100, N}, where N denotes the number of nodes in the dataset. From Figure 3, it is not hard to observe that for all datasets the best performance is achieved when K = 100. For Acm, the performance remains stable as K varies; although the gain is small, the auxiliary graph still improves the performance. For Citeseer, Dblp, and Pubmed, the performance improves substantially when K reaches or passes a certain threshold. In our experiment, the threshold is 5 for Citeseer, 4 for Dblp, and 50 for Pubmed. In most cases, the performance increases as K increases. However, when K = N, the performance decreases to different degrees and becomes worse than with K = 100. This is because a fully connected graph built from raw features contains much more noise than a filtered graph, and this noise noticeably harms the performance.

Study on the Influence of Graph Structure and Attribute
To study how the structure and the attributes influence our method, we conduct experiments in two different ways: (1) remove the attributes from the input; (2) remove the structure information from the input. The results are shown in Figure 4. From this figure, we can observe that our method achieves better performance with the structure only than with the features only. We can infer that, for these datasets, the structure plays a more critical role than the features do in our method. We can also observe that when we integrate attributes with structure as input, we achieve the best performance, exceeding the other methods compared in our experiments.

Conclusions
In this paper, we propose a clustering model termed Auxiliary Graph for Attribute Graph Clustering. In our model, we build an auxiliary graph to reveal the latent relations of nodes in a global view. To reduce the impact of inherent noises in datasets, we disregard unreliable relations by a filter mechanism. With the help of the auxiliary graph, our model can learn a more reliable representation. With the help of the fusion strategy and clustering module, the discriminativeness and clustering-awareness of learned representations are both improved. Experiments on four benchmark datasets demonstrate that our model can outperform state-of-the-art baselines in most cases. Although achieving promising performance, our model still has room to improve. In the future, we will improve our model to fit different datasets, especially large-scale datasets.