A Scalable Deep Network for Graph Clustering via Personalized PageRank

Abstract: Recently, many models based on the combination of graph convolutional networks and deep learning have attracted extensive attention for their superior performance in graph clustering tasks. However, existing models have the following limitations: (1) they are limited by the calculation method of graph convolution, and their computational cost grows dramatically as the graph scale grows; (2) stacking too many convolutional layers causes the over-smoothing issue and neglects the local graph structure; (3) expanding the neighborhood range and the model depth together is difficult due to the orthogonal relationship between them. Inspired by personalized PageRank and the auto-encoder, we take node-wise clustering of undirected simple graphs as our research direction and propose a Scalable Deep Network (SDN) for graph clustering via personalized PageRank. Specifically, we utilize the combination of multi-layer perceptrons and a linear propagation layer based on personalized PageRank as the backbone network (i.e., the Quasi-GNN module) and employ a DNN module for the auto-encoder to learn embeddings of different dimensions. SDN then combines the two embeddings correspondingly and utilizes a dual self-supervised module to constrain the training of the embedding and clustering processes. Our proposed Quasi-GNN module reduces the computational costs of traditional GNN models in a decoupled manner and resolves the orthogonal relationship between the model depth and the neighborhood range. Meanwhile, it alleviates the degraded clustering effect caused by the over-smoothing issue. We conducted experiments on five widely used graph datasets. The experimental results demonstrate that our model achieves state-of-the-art performance.


Introduction
Graph-structured data often contain abundant node features and topological information. Benefiting from this powerful expressive ability, graph-structured data are often used to model drug discovery [1], social networks [2], and recommender systems [3].
Moreover, the graph clustering task has attracted extensive attention as an important part of unsupervised learning on graphs. There are many directions in graph clustering, such as node-wise clustering and graph-wise clustering. In recent years, Graph Neural Networks (GNNs) have become a popular field in deep learning and have effectively improved the performance of graph clustering tasks. Graph Convolutional Networks (GCNs) [4] are one of the representative methods that can utilize node features and topology information to obtain low-dimensional embeddings. On this basis, many graph clustering models combined with deep learning techniques have been proposed and achieve state-of-the-art results. Kipf et al. [5] learn embeddings through GCN layers, and a decoder then reconstructs features as similar as possible to the original ones. Ahn et al. [6] optimize the aforementioned model to solve the norm-zero tendency of isolated nodes. In addition, Pan et al. [7] combine graph convolution with an adversarial training method. However, the above models fail to account for the importance of individual nodes. Therefore, Veličković et al. [8] introduce the attention mechanism into graph convolution, which aggregates information based on node importance. Wang et al. [9] adopt an attention-weighted receptive field to encode the features and jointly optimize the clustering and embedding modules. Although models with a graph auto-encoder as the backbone network can generate embeddings effectively, they ignore the information in the data structure. Combining different orders of embeddings with structural information, Bo et al. [10] integrate structural information into deep clustering for the first time to improve the clustering effect. In addition, Li et al. [11] adopt a combination of deep clustering and the graph auto-encoder and design a triple self-supervised module to supervise the embedding and clustering modules.
However, the above models have the following limitations: (1) existing models are limited by the calculation method of graph convolution, and their computational costs grow dramatically as the scale of the graph grows; (2) stacking too many convolutional layers causes the over-smoothing issue and ignores the local graph structure; (3) the model depth is orthogonal to the neighborhood range, which makes it difficult to expand both together.
Therefore, researchers have sought more scalable models. Klicpera et al. [12] propose a simple model using the relation between graph convolution and PageRank, which enables efficient neighborhood expansion. Wu et al. [13] reduce computational costs by decoupling feature propagation from the training process entirely. Following the idea of SGC, Frasca et al. [14] consider the features of different receptive layers and concatenate them so that no information is discarded, while Zhu et al. [15] average them to generate combined features with the same dimension. Meanwhile, Zhang et al. [16] simplify the GNN from the perspective of spectral graph theory so that it can select different orders of high-order information according to different graphs. Despite this, the above models ignore differences in node importance during aggregation. Chen et al. [17] adopt a constant decay factor to address this issue, while Zhang et al. [18] aggregate neighborhood information through receptive fields weighted by an attention mechanism. However, these methods are suited to supervised or semi-supervised learning scenarios and lack a task-oriented model framework for unsupervised clustering.
In response to the above problems, we propose a network that can effectively utilize various types of information in a graph with high scalability. We adopt a dual self-supervised module to guide the training of the Quasi-GNN module and the DNN module. With this dual self-supervised module, the entire model can be trained in an end-to-end manner for graph clustering. In addition, it should be mentioned that our proposed algorithm requires a vector form of the data as input in addition to the graph.
In summary, our contributions are described as follows:

• A highly scalable deep network to process graph-structured data is proposed. This network can effectively combine topological information and node features to obtain latent embeddings for clustering tasks.

• A linear propagation scheme based on personalized PageRank is proposed, which improves the performance of the clustering task and alleviates the over-smoothing issue.

• We conduct extensive experiments on five real-world datasets and achieve superior performance with fewer iterations. The experimental results show that our model outperforms the current state-of-the-art methods.

Related Work
Graph clustering divides unlabeled nodes into different clusters under a certain metric, after which we can mine the relationships between different nodes in a graph. Early graph clustering models, such as matrix factorization [19] and DeepWalk [20], perform poorly on real-world datasets due to their shallow architectures and limited learning capabilities. In addition, Sieranoja et al. [21] propose two complementary algorithms for graph clustering, called the K-algorithm and the M-algorithm. Their combination can reach several local optima on the graph, and they can be used with different cost functions. However, the two algorithms fail to integrate graph topology information, which limits their final performance.
Recently, more effective models for unsupervised learning have been proposed, such as the auto-encoder [22] and generative adversarial networks (GANs) [23]. On this basis, many graph clustering models combined with deep learning techniques have been proposed and have achieved good performance. The Graph Auto-Encoder (GAE) [5] combines the auto-encoder with graph convolution: it first utilizes two GCN layers to capture the information in the graph topology and node features, then reconstructs an adjacency matrix as similar as possible to the original one. ARGA [7] adopts an adversarial training scheme to regularize the embedding process and obtain more robust embeddings. However, none of the above models are clustering-oriented joint optimization methods. DAEGC [9] jointly optimizes the embedding module and the clustering module, which improves both the quality of the embeddings and the clustering effect. Meanwhile, SDCN [10] integrates structural information into deep clustering by using a transfer operator to combine the auto-encoder with a GCN module; it conducts end-to-end clustering training with a dual self-supervision module. Although these methods achieve superior performance, they still use a GCN module based on the message-passing mechanism as the backbone network, which limits their scalability.
However, the above models also have several drawbacks: (1) there is no solution to the orthogonal relationship between the model depth and the neighborhood range; (2) too many smoothing iterations or stacked GCN layers lead to over-smoothing issues.
To overcome the limitations of traditional GNN models, scalable GNN models have been proposed. Early scalable GNNs simplified the model by sampling the graph: GraphSAGE [24] samples the neighbors around the target node with equal probability, while FastGCN [25] samples nodes according to their importance. Due to their node-wise or layer-wise sampling methods, they fail to learn large-scale sparse graphs effectively. On this basis, GraphSAINT [26] proposes a subgraph-wise sampling method with high scalability, which decouples sampling from GNNs and further reduces computational costs.
The other recent direction for scalable GNNs is to simplify the model structure. SGC [13] transforms the nonlinear GCN into a simple linear model by removing the nonlinear functions between GCN layers and collapsing the result into a single linear function. PPNP [12] modifies the propagation scheme by exploiting the relationship between GCN and PageRank. AGC [16] advocates high-order graph convolution to capture the global features of the graph and can adaptively select the appropriate order for different graphs, while AGE [27] optimizes the GAE model by decoupling the GCNs and revisits GNN models from the perspective of graph signal processing. S2GC [15] adopts an improved Markov diffusion kernel to derive a simpler variant of GCN that captures the global and local context of each node. In contrast, SIGN [14] points out that features from multiple layers should be considered together rather than taken from a single layer; it concatenates features with different degrees of smoothness and utilizes them for downstream tasks. Meanwhile, GBP [17] adopts a constant weighted-average decay factor to account for the difference in importance between the receptive fields of different nodes. On this basis, GAMLP [18] effectively integrates multi-scale node features with three different attention mechanisms to improve scalability and computational efficiency. However, the above highly scalable models lack a joint training framework for graph clustering tasks.
In contrast, our proposed scalable deep network can not only effectively integrate the graph structure and node features but also decouple the encoding process from the propagation process. We improve the scalability of existing models and alleviate the over-smoothing issue. Moreover, SDN resolves the orthogonal relationship between the model depth and the neighborhood range through the Quasi-GNN module.
Meanwhile, we utilize a dual self-supervised module to train the clustering task end-to-end, which enables high-confidence clustering results while obtaining high-quality embeddings.

The Proposed Method
In this section, we first introduce the definitions of the graph and the clustering task. Then we introduce our proposed Scalable Deep Network (SDN), whose overall framework is shown in Figure 1. Specifically, SDN consists of three modules: a DNN module for the auto-encoder, a Quasi-GNN module, and a dual self-supervised module. We first utilize the DNN module and the linear encoder to generate intermediate embeddings, then use the linear propagation module to obtain the final embeddings. Meanwhile, we utilize the dual self-supervised module to supervise the training of the first two modules. The specific details of our model are introduced below.
Figure 1. The overall framework of SDN. X and X̂ are the input data and the reconstructed data, respectively. E^(l) and H^(l) are the outputs of the l-th layer of the linear encoder in the DNN and Quasi-GNN modules, respectively. Layers with different colors represent the embeddings E^(l) learned by different layers of the DNN module. The green solid line indicates that the target distribution P is calculated from the distribution Q; the yellow dotted lines represent the dual self-supervision mechanism, in which the target distribution P supervises the training of the DNN module and the Quasi-GNN module simultaneously. The solid blue lines in the linear propagation layer of the Quasi-GNN module represent the propagation mode.

Problem Formalization
Graph-structured data can be defined as G = {V, E, X}, where V is the node set, E is the edge set, and X is the feature matrix (input data). The topology of graph G is described by an adjacency matrix (with self-loops) Ã, where Ã = {a_ij}: if there is an edge between v_i and v_j, then a_ij = 1; otherwise, a_ij = 0. For non-graph data, we obtain the adjacency matrix Ã by constructing a KNN graph with the dot product: we first calculate the similarity between different nodes by S_ij = x_j^T x_i, then select the K nodes with the highest similarity for each sample as its neighbors. The degree matrix is D = diag(d_1, ..., d_N), where d_i = Σ_j a_ij. Graph clustering divides the nodes into t disjoint clusters C = {c_l | l = 1, 2, 3, ..., t} according to a selected criterion, with c_i ∩ c_j = ∅ for i ≠ j. When node v_i is assigned to a cluster, this is expressed as v_i ∈ c_l.
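For non-graph data, the KNN graph construction described above can be sketched as follows (a minimal NumPy sketch; symmetrizing the neighbor relation and adding self-loops are assumptions, since the text only specifies the dot-product similarity and top-K selection):

```python
import numpy as np

def knn_graph(X, k):
    """Build a KNN adjacency matrix (with self-loops) from vector data.

    Similarity is the dot product S_ij = x_j^T x_i, as in the text;
    each node keeps its k most similar nodes as neighbors.
    """
    S = X @ X.T                      # pairwise dot-product similarity
    np.fill_diagonal(S, -np.inf)     # exclude a node from its own neighbor list
    A = np.zeros_like(S)
    idx = np.argsort(-S, axis=1)[:, :k]   # k highest-similarity nodes per row
    rows = np.repeat(np.arange(X.shape[0]), k)
    A[rows, idx.ravel()] = 1.0
    A = np.maximum(A, A.T)           # symmetrize for an undirected graph (assumption)
    np.fill_diagonal(A, 1.0)         # add self-loops, giving A~ (assumption)
    return A
```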

DNN Module for Auto-Encoder
Obtaining embeddings based only on node features and topology information is not sufficient, so we utilize an auto-encoder to obtain high-dimensional representations of node features and integrate them into the embedding learning process of the multi-layer perceptron. To accommodate different data types, we adopt the most basic auto-encoder to obtain high-dimensional node embeddings. First, the initial feature matrix (vector data) X is fed into the fully connected neural network of the DNN module to obtain the high-dimensional embedding E. The specific process is defined as follows.
E^(l) = φ(E^(l−1) W_e^(l) + b_e^(l)), where E^(l) represents the encoding result of the l-th layer; for the 0-th layer of the network, we set E^(0) = X. W_e^(l) represents the encoding weight matrix of the l-th layer, and φ is a nonlinear activation function, such as ReLU(•). After encoding through l layers, we decode the embedding using a decoder that is fully symmetric to the encoder.
D^(l) = φ(D^(l−1) W_d^(l) + b_d^(l)), where D^(l) represents the result of the l-th decoding layer; for the 0-th layer of the decoding network, D^(0) is the final encoder output, and W_d^(l) represents the decoding weight matrix of the l-th layer. After that, we set X̂ = D^(L) and take the reconstruction loss L_res = (1/2N) ||X − X̂||_F² as the objective function.
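The encoder/decoder pass and the reconstruction objective above can be sketched as follows (forward pass only; the layer shapes, the linear output layer, and the 1/2N loss scaling are illustrative assumptions):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def autoencoder_forward(X, enc_weights, dec_weights):
    """Forward pass of the basic auto-encoder sketched in the text.

    E^(l) = relu(E^(l-1) @ W_e^(l)), with E^(0) = X; the decoder is
    symmetric and produces the reconstruction X_hat.  The objective is the
    reconstruction loss ||X - X_hat||_F^2 / (2N).  Biases omitted for brevity.
    """
    E = X
    layers = []                      # keep every E^(l) for later fusion
    for W in enc_weights:
        E = relu(E @ W)
        layers.append(E)
    D = E
    for W in dec_weights[:-1]:
        D = relu(D @ W)
    X_hat = D @ dec_weights[-1]      # linear output layer (assumption)
    loss = np.sum((X - X_hat) ** 2) / (2 * X.shape[0])
    return layers, X_hat, loss
```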

Quasi-GNN Module
Although the auto-encoder can learn embeddings such as E^(1), E^(2), and E^(3) from the data themselves, it ignores the relationships between nodes. Traditional deep clustering methods therefore utilize a GCN module to capture the relationships between nodes as a supplement. The GCN module can solve this issue, but it is difficult to expand the model depth and the neighborhood range together, which limits the learning ability and architecture of such models. Therefore, we propose a Quasi-GNN module, which decouples the encoding process from the propagation process; it can not only capture the relationships between nodes but also expand the neighborhood range and the model depth together, reducing the computational cost and improving scalability.

Linear Encoder
We utilize a multi-layer perceptron (MLP) as our encoder to obtain the embeddings. The result of each layer can be defined as H^(l) = H^(l−1) W_m^(l), where H^(l) is the embedding of the l-th layer; in particular, H^(0) = X, and W_m^(l) is the weight matrix of the l-th MLP layer. To obtain a more complete and powerful embedding, we combine the high-dimensional representation E^(l) learned by the DNN module with H^(l), as follows.
H̃^(l) = (1 − σ) H^(l) + σ E^(l), where E^(l) is the output of the l-th layer of the DNN module and σ is the balance coefficient, which we set to 0.5. After that, we perform the propagation operation on the result to aggregate information from the neighborhood, and we use H̃^(l) as the input of the next MLP layer to generate the embeddings.
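The fusion of the MLP output with the DNN embedding through the balance coefficient σ can be sketched as follows (the convex-combination form (1 − σ)H + σE is an assumption consistent with the description, with σ = 0.5):

```python
import numpy as np

def linear_encoder(X, E_layers, mlp_weights, sigma=0.5):
    """Sketch of the Quasi-GNN linear encoder: each MLP layer's output
    H^(l) is fused with the corresponding DNN embedding E^(l) using the
    balance coefficient sigma before feeding the next layer.
    The convex-combination fusion is an assumption from the description."""
    H = X                                  # H^(0) = X
    for W, E in zip(mlp_weights, E_layers):
        H = H @ W                          # linear MLP layer (no nonlinearity)
        H = (1.0 - sigma) * H + sigma * E  # fuse with DNN embedding E^(l)
    return H
```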

Linear Propagation Module
We first briefly review the message-passing algorithm of traditional GCNs. A traditional two-layer GCN model can be defined as Z_GCN = softmax(Â ReLU(Â X W^(0)) W^(1)), where Â = D̃^(r−1) Ã D̃^(−r) is the normalized adjacency matrix; by setting r = 1, 0.5, or 0, we obtain different normalization methods, namely Ã D̃^(−1), D̃^(−1/2) Ã D̃^(−1/2), and D̃^(−1) Ã. Z_GCN contains the predicted labels. In a traditional two-layer GCN model, the calculation of each layer depends on the result of the previous layer. Limited by this calculation method, the computational costs of traditional GNN models grow rapidly, and it is difficult to expand the model depth and the neighborhood range together because of their orthogonal relationship. According to Xu et al. [28], the influence score of node x on node y in a GNN can be defined as the sum of the absolute entries of the Jacobian ∂h_y^(k)/∂h_x^(0). In a k-layer GNN, I(x, y) ∝ P_rw(x → y, k), where P_rw(x → y, k) is a slightly modified k-step random walk distribution. When k → ∞, if the graph is irreducible and aperiodic, this value approaches a stable distribution independent of x (i.e., the same amount of influence scaling), which indicates that the influence of x on y eventually becomes independent of the local graph structure. Denoting this stable distribution by π_lim, it satisfies π_lim = Â π_lim, i.e., it is the stationary distribution of a random walk over the graph. Obviously, the result is related to the structure of the whole graph and has no relation to the starting point of the random walk, which means that we ultimately consider the information of the whole graph and ignore the nodes themselves. The original PageRank also adopts this calculation method to capture the full graph structure.
Based on this, we adopt a variant of PageRank (i.e., personalized PageRank) to reconsider the root node. Let i_x be the indicator vector of node x; its vector representation after repeated propagation can be defined as π_ppr(i_x) = α (I_n − (1 − α) Â)^(−1) i_x, where α ∈ [0, 1] is the transmission (teleport) probability and Â is the normalized adjacency matrix.
In this way, we can obtain an approximate post-propagation matrix with respect to the entire graph through the iteration M^(k) = (1 − α) Â M^(k−1) + α T, with M^(0) = T, where M^(k) is the result of the k-th propagation and T is the embedding obtained by the linear encoder. Therefore, we can deduce the final embeddings Z by combining the intermediate embeddings H described in Section 3.3.1.
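The iterative approximation of personalized-PageRank propagation can be sketched as follows (`A_hat`, `T`, `alpha`, and the iteration count `k` mirror the symbols in the text; the stopping rule after a fixed k is an assumption):

```python
import numpy as np

def ppr_propagate(A_hat, T, alpha=0.3, k=10):
    """Personalized-PageRank-style linear propagation:
    M^(0) = T,  M^(k) = (1 - alpha) * A_hat @ M^(k-1) + alpha * T,
    where A_hat is the normalized adjacency matrix, T the encoder output,
    and alpha the transmission (teleport) probability."""
    M = T
    for _ in range(k):
        M = (1.0 - alpha) * (A_hat @ M) + alpha * T
    return M
```

Because each step always teleports back to T with probability α, the root node's own features are never washed out, which is how the module alleviates over-smoothing.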
The last layer of the linear propagation module applies the softmax function for multi-class assignment: Z = softmax(M^(K)). As a result, z_ij ∈ Z indicates the probability that node v_i belongs to cluster j, and we can regard Z as a cluster assignment distribution.
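A numerically stable row-wise softmax producing Z can be written as:

```python
import numpy as np

def softmax(M):
    """Row-wise softmax producing Z: z_ij is the probability that node i
    belongs to cluster j.  Subtracting the row max stabilizes the exponent."""
    e = np.exp(M - M.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```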

Dual Self-Supervised Module
The above two modules mechanically combine the DNN module and the Quasi-GNN module, but both are essentially designed for unsupervised or supervised learning in other scenarios and cannot be applied to our deep clustering task directly. Therefore, we need to unify the Quasi-GNN module and the DNN module under the same optimization objective. We set the goal of both modules to approximate the target distribution P, which makes their results tend toward consistency during training; because of the strong connection between the two modules, we call this a dual self-supervised module. This module does not require labels during training.
First, for the DNN module, we utilize Student's t-distribution as the kernel to measure the similarity between the node embedding e_i and the cluster center vector μ_j: q_ij = (1 + ||e_i − μ_j||²/v)^(−(v+1)/2) / Σ_j' (1 + ||e_i − μ_j'||²/v)^(−(v+1)/2), where e_i is the i-th row of the embedding E^(L), μ_j is initialized by K-means on the embeddings learned by the pre-trained auto-encoder, and v is the degree of freedom of Student's t-distribution. q_ij can be seen as the probability of assigning sample i to cluster j, which gives the cluster distribution Q over the nodes. To enable nodes to be assigned to different clusters with higher confidence, we calculate the target distribution P.
p_ij = (q_ij² / f_j) / Σ_j' (q_ij'² / f_j'), where f_j = Σ_i q_ij is the soft clustering frequency. The target distribution P normalizes the sum of squares of each assignment in the cluster distribution Q. By using the two distributions to constrain different embeddings, the embeddings obtained by the DNN module and the Quasi-GNN module can be considered simultaneously to jointly optimize the embedding and clustering quality. On this basis, we obtain the corresponding objective function L_clu = KL(P||Q) = Σ_i Σ_j p_ij log(p_ij / q_ij). This objective constrains the DNN module, and we can obtain superior embeddings for clustering by reducing the KL divergence between Q and P. In addition, we utilize the distribution P to constrain the Quasi-GNN module with L_gnn = KL(P||Z) = Σ_i Σ_j p_ij log(p_ij / z_ij). To sum up, the final loss can be defined as L = L_res + β L_clu + γ L_gnn, where β and γ are balance hyperparameters.
• DBLP: The DBLP dataset is an author network. If two authors are collaborators, there is an edge between them. Authors are labeled with their research fields according to the papers they published in international journals and conferences.
• Flickr: Flickr is an image network constructed by forming links between shared public Flickr images. Edges are formed between images from the same location, images submitted to the same gallery, group, or collection, images that share a common tag, images taken by friends, etc.
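The soft assignment Q and the sharpened target distribution P used by the dual self-supervised module can be sketched as follows (with v = 1; broadcasting over all node/center pairs):

```python
import numpy as np

def soft_assignment(E, mu, v=1.0):
    """Student's-t kernel Q: q_ij measures the similarity between
    embedding e_i and cluster center mu_j (v = degrees of freedom)."""
    d2 = ((E[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # squared distances
    q = (1.0 + d2 / v) ** (-(v + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(Q):
    """Sharpened target P: squares Q and normalizes by the soft cluster
    frequency f_j = sum_i q_ij, pushing assignments toward high confidence."""
    w = Q ** 2 / Q.sum(axis=0, keepdims=True)
    return w / w.sum(axis=1, keepdims=True)
```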

Methods
We compare SDN with various representative unsupervised models for clustering tasks. These models can be divided into three categories according to their input data: models that only use the feature matrix (vector data): K-means, AE, and Random Swap; models that only use the adjacency matrix (graph data): K-algorithm and M-algorithm; and models that use both: DEC, IDEC, GAE, VGAE, DAEGC, ARGA, SDCN, AGCN, SDN_P, SDN_E, and SDN. Specific descriptions of these models follow.
• K-means [32]: A traditional clustering method applied directly to the feature matrix (vector data). In this paper, we utilize the K-means implementation from the sklearn package; for details, please refer to https://github.com/scikit-learn/scikit-learn, accessed on 12 April 2022.
• AE [22]: This auto-encoder consists of an encoder and a decoder. It uses the encoder to encode the initial data, then utilizes the decoder to reconstruct the embeddings, with the reconstruction loss as the objective function. Finally, we employ K-means to cluster the obtained high-dimensional embeddings.
• DEC [34]: It learns embeddings with an auto-encoder and refines the cluster assignments by minimizing the KL divergence between a soft assignment distribution and a sharpened target distribution.
• IDEC [35]: Taking the preserved data structure into account, IDEC manipulates the feature space to disperse the data points. Moreover, it can jointly perform the embedding and clustering processes.
• GAE [5]: This method is an effective combination of the auto-encoder and graph convolution. First, graph convolution is used to encode the data; then, the decoder reconstructs the adjacency matrix. The loss function measures the difference between the reconstructed matrix and the original matrix.
• VGAE [5]: This model first obtains the embeddings through GCNs, then learns the distribution they satisfy. Finally, it calculates the posterior probability to obtain the latent variables used to reconstruct the adjacency matrix.
• DAEGC [9]: It adopts an attention network to learn node embeddings and employs a clustering loss to supervise the self-training clustering process.
• ARGA [7]: Using adversarial regularization to normalize the encoding process, ARGA combines an adversarial training scheme with a graph auto-encoder to obtain superior embeddings.
• SDCN [10]: To obtain more robust embeddings, SDCN fuses the calculation results of the GCN module and the DNN module. Moreover, it utilizes a dual self-supervised module to constrain the two modules and train the model end-to-end.
• AGCN [36]: Considering the importance of nodes, AGCN employs the attention mechanism to merge the embeddings learned by the same layer of the auto-encoder and the GCNs.
We adopt four widely used evaluation metrics: Accuracy (ACC), Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), and macro-F1 score (F1) [37]. For each metric, a larger value implies a better clustering result.
The specific calculation methods of the four indicators are as follows

Accuracy (ACC)
ACC compares the obtained labels with the true labels and can be calculated as ACC = (Σ_{i=1}^n δ(s_i, map(r_i))) / n, where r_i and s_i are the obtained label and the true label of sample x_i, respectively, n is the total number of samples, and δ is the indicator function δ(x, y) = 1 if x = y and 0 otherwise. The map(•) in this formula denotes the optimal reassignment of cluster labels to class labels, which ensures the correctness of the statistics.
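ACC with the best label mapping can be computed as follows (a brute-force search over label permutations stands in for the usual Hungarian-algorithm implementation of map(•); it assumes both labelings use the same number of clusters and is fine for small cluster counts):

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(true, pred):
    """ACC: find the best one-to-one mapping from predicted cluster labels
    to true labels, then count matches.  Brute-force over permutations;
    use the Hungarian algorithm instead for many clusters."""
    true, pred = np.asarray(true), np.asarray(pred)
    labels = np.unique(pred)
    best = 0.0
    for perm in permutations(np.unique(true)):
        mapping = dict(zip(labels, perm))
        acc = np.mean([mapping[p] == t for p, t in zip(pred, true)])
        best = max(best, acc)
    return best
```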

Normalized Mutual Information (NMI)
NMI is often used in clustering to measure the similarity of two clustering results. Assuming that P_A(a) and P_B(b) are the marginal probability distributions of A and B and P_AB(a, b) is their joint distribution, the mutual information is I(A, B) = Σ_{a,b} P_AB(a, b) log(P_AB(a, b) / (P_A(a) P_B(b))), and H(A) = −Σ_a P_A(a) log P_A(a) is the information entropy of A. Based on the relationship between joint entropy and individual entropies, NMI is defined as NMI(A, B) = 2 I(A, B) / (H(A) + H(B)).
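A direct implementation of NMI from the joint and marginal distributions (using the arithmetic-mean normalization 2I/(H(A)+H(B)); other normalizations such as the geometric mean are also common, and the small epsilon guarding the logarithms is an implementation detail):

```python
import numpy as np

def nmi(a, b):
    """NMI from the joint distribution P_AB and marginals P_A, P_B."""
    a, b = np.asarray(a), np.asarray(b)
    eps = 1e-12
    ua, ub = np.unique(a), np.unique(b)
    # empirical joint distribution and its marginals
    pab = np.array([[np.mean((a == x) & (b == y)) for y in ub] for x in ua])
    pa, pb = pab.sum(1), pab.sum(0)
    mi = np.sum(pab * np.log((pab + eps) / (np.outer(pa, pb) + eps)))
    ha = -np.sum(pa * np.log(pa + eps))
    hb = -np.sum(pb * np.log(pb + eps))
    return 2.0 * mi / (ha + hb)
```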

Adjusted Rand index (ARI)
ARI reflects the degree of overlap between two partitions. Suppose clustering is a series of decisions over all N(N − 1)/2 node pairs in the set, where two nodes are grouped into the same cluster only when they are similar. Let a be the number of pairs of similar nodes grouped into the same cluster and b the number of pairs of dissimilar nodes grouped into different clusters. The Rand Index (RI) is defined as RI = (a + b) / (N(N − 1)/2). However, RI fails to guarantee that randomly produced partitions obtain values close to 0. Therefore, the Adjusted Rand Index (ARI) is proposed: ARI = (RI − E[RI]) / (max(RI) − E[RI]), where E[RI] is the expected RI of random partitions.
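ARI can be computed from the contingency table of the two partitions (the pair counts a and b above are folded into the standard combinatorial form of the adjusted index):

```python
import numpy as np

def comb2(x):
    """Number of unordered pairs, C(x, 2)."""
    return x * (x - 1) / 2.0

def ari(a, b):
    """Adjusted Rand Index from the contingency table; corrects the Rand
    Index so that random partitions score close to 0."""
    a, b = np.asarray(a), np.asarray(b)
    ua, ub = np.unique(a), np.unique(b)
    n = len(a)
    ct = np.array([[np.sum((a == x) & (b == y)) for y in ub] for x in ua])
    sum_ij = comb2(ct).sum()            # agreeing pairs inside cells
    sum_a = comb2(ct.sum(1)).sum()      # pairs within clusters of a
    sum_b = comb2(ct.sum(0)).sum()      # pairs within clusters of b
    expected = sum_a * sum_b / comb2(n)
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)
```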

Macro-F1 Score (F1)
The F1 score measures the accuracy of a binary classification (or multi-task binary classification) model. It takes both the precision and the recall of the classification model into account and can be regarded as their weighted average, with F1 ∈ [0, 1].
According to Table 2, precision is the proportion of samples with a predicted value of 1 and a true value of 1 among all samples with a predicted value of 1, and recall is the proportion of samples with a predicted value of 1 and a true value of 1 among all samples with a true value of 1. Therefore, precision and recall can be defined as precision = TP / (TP + FP) and recall = TP / (TP + FN). On this basis, the F1 score is defined as the harmonic mean of precision and recall: F1 = 2 × precision × recall / (precision + recall). The macro-F1 is the average of the F1 scores over all clusters in the set.
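Macro-F1 from per-cluster precision and recall can be sketched as follows (assuming predicted labels have already been mapped onto the true label set, e.g., by the same mapping used for ACC):

```python
import numpy as np

def macro_f1(true, pred):
    """Macro-F1: harmonic mean of per-cluster precision and recall,
    averaged over all clusters (zero when a cluster has no hits)."""
    true, pred = np.asarray(true), np.asarray(pred)
    f1s = []
    for c in np.unique(true):
        tp = np.sum((pred == c) & (true == c))
        fp = np.sum((pred == c) & (true != c))
        fn = np.sum((pred != c) & (true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))
```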

Experimental Setup
To ensure the consistency of the experiments, we utilize a unified pre-trained auto-encoder to train the benchmark models involving a DNN module, such as AE + K-means, DEC, IDEC, SDCN, and AGCN. The pre-trained auto-encoder consists of a 4-layer encoder and a 4-layer decoder with dimensions 500-2000-500-10, and the two components are completely symmetric to ensure consistency of the constructed features. We train the auto-encoder with a learning rate of 10^−3 for 30 epochs and keep the best training results. In the subsequent training, we first employ the pre-trained auto-encoder to encode the data, then perform K-means and initialize our clustering layer with the obtained clustering results. During training, different learning rates and numbers of epochs are used for different datasets; Table 3 shows the detailed settings for training the Quasi-GNN module on each dataset. For the β and γ in the loss function, we set β = 10^−1 and γ = 10^−2 in the experiments. We also set α in the linear propagation layer to 0.3 and the degrees of freedom of the Student's t-distribution to 1.
For Random Swap on each dataset, we set the number of iterations to 10 and perform K-means twice per iteration; the remaining settings are the defaults (please refer to https://github.com/uef-machine-learning/RandomSwap, accessed on 12 April 2022). Moreover, to apply the K-algorithm and M-algorithm, we need to reconstruct the data according to their required input format. Therefore, we obtain each node's neighbors from the adjacency matrix and measure their similarity according to the nodes' features, used as edge weights. On this basis, we use conductance as the cost function for the K-algorithm and the M-algorithm; in particular, for the M-algorithm, we set the number of iterations to 100. Other parameters are the defaults; please refer to https://github.com/uef-machine-learning/gclu, accessed on 12 April 2022.
In summary, K-means, Random Swap, and AE perform clustering directly on the feature matrix (vector data); the K-algorithm and M-algorithm perform clustering on the data after reconstructing the input format; and the remaining methods, based on graph neural networks, utilize a combination of the feature matrix (vector data) and the adjacency matrix (graph data).

Scalability Analysis
To further illustrate its scalability, we analyze our proposed model in terms of time and space complexity. Moreover, we report the time and memory consumption of different baselines.

Complexity Analysis
In this paper, we assume that the dimension of the input data is d and that the dimensions of the layers of the pre-trained auto-encoder are d_1, d_2, ..., d_L. Assuming the number of input samples is N, the time complexity of the pre-trained auto-encoder is O(N(d d_1 + Σ_{l=2}^{L} d_{l−1} d_l)). The linear encoder in the Quasi-GNN module must use the same dimensions as the pre-trained auto-encoder, so its time complexity is of the same order. In addition, the linear propagation module in the Quasi-GNN module involves the adjacency matrix rather than learned parameters, so its time complexity depends on the output dimension of the linear encoder, the number of propagation layers L_P, and the number of nodes, giving O(L_P d_L |V|²) for a dense adjacency matrix. Moreover, supposing there are K clusters, the time complexity of the clustering distribution in Equation (17) is O(NK + N log N) according to the analysis of Xie et al. [34]. In summary, the total time complexity of our proposed model is the sum of these terms. Next, we analyze the space complexity of our proposed model. For neural networks, the space complexity is determined by the number of layers and the number of parameters. The parameters of our model appear in the DNN module and in the linear encoder of the Quasi-GNN module. For the encoder of the pre-trained auto-encoder and the linear encoder of the Quasi-GNN module, the dimensions of their weight matrices must correspond so that their embeddings can be combined; in addition, the decoder is completely symmetric to the encoder. Therefore, the weight matrices W_e, W_d, and W_m have the same sizes, and the space complexity of SDN is O(d d_1 + Σ_{l=2}^{L} d_{l−1} d_l).

Time and Memory Consumption Comparison
On the one hand, to fully demonstrate the superiority of SDN in terms of memory consumption, we conduct experiments on Flickr and compare SDN with baselines that have state-of-the-art (SOTA) performance, such as AGCN and SDCN. The statistics of the Flickr dataset are shown in Table 1. The results of the comparative experiments, the total number of parameters, and the memory consumption of AGCN, SDCN, and SDN are shown in Table 4. On the other hand, to show the superiority of SDN in terms of time consumption, we record the time consumed by AGCN, SDCN, and SDN when processing the same dataset. The specific results are shown in Figure 2.
First, according to the results in Figure 2, the time consumption of AGCN and SDCN is mostly higher than that of SDN. Second, the experimental results in Table 4 show that AGCN and SDCN cannot be applied to large-scale datasets such as Flickr. This is mainly because they adopt GNNs and their variants as the backbone network, whose computational costs increase recursively as the network deepens; this makes it difficult for such models to handle large-scale graph data or to effectively expand the neighborhood range to obtain better node embeddings and clustering results. In contrast, SDN utilizes linear encoders as the backbone network and linear propagation layers for feature propagation. Our proposed model not only effectively reduces the computational costs of processing large-scale graph-structured data but also resolves the orthogonal relationship between the neighborhood range and the model depth, which demonstrates the high scalability of SDN. In summary, these two sets of experimental results show that the existing SOTA methods require more memory and longer processing time than SDN.

Result Analysis
We compare SDN with representative benchmark models and conduct extensive experiments on five datasets: HHAR, Reuters, ACM, CiteSeer, and DBLP. The benchmark models completely adopt their original parameter settings. The specific experimental results for different metrics are shown in Tables 5-8, where bold values represent the best performance and underlined values indicate the second-best performance. Our model surpasses recent benchmark models and achieves SOTA results. Compared with SDCN, our model has the following advantages:

• We decouple the GCN module by employing the Quasi-GNN module to capture the information of graph topology and node features. This module can be combined with methods such as smoothing or label propagation, which gives our model high scalability.

• We solve the issue of the orthogonal relationship between the model depth and the neighborhood range, which enables the two to scale together and reduces the computational costs.

• We simplify the model's structure and add structural information to the Quasi-GNN module to alleviate the over-smoothing issue.

Ablation Study
To further verify the effectiveness of our proposed model, we adopt two variant methods to evaluate the contribution of each module: SDN_P employs only the Quasi-GNN module, and SDN_E utilizes only the DNN module for encoding. Finally, we perform K-means on the embeddings obtained from each variant. The experimental results show that removing either module leads to a decrease in accuracy and the other metrics, which indicates that the two components of our proposed model are inseparable.
Moreover, it is worth noticing that some results obtained by SDN_P are better than the baselines. For instance, the accuracy of SDN_P on Reuters, ACM, and CiteSeer is about 2%, 0.5%, and 1% higher than that of AGCN, respectively. In addition, the other metrics of SDN_P also surpass AGCN to varying degrees. This result indicates that our proposed decoupled method retains superior performance while offering high scalability.

Analysis of Transmission Probability α
The linear propagation layer adopts the propagation strategy defined by Equation (11). Since it has no parameters, it simply performs linear operations on the original matrix, which greatly improves scalability and reduces computational costs. We set α = 0, 0.1, 0.3, 0.5, 0.7, and 1, respectively, on each dataset and measure the final clustering effect; in this way, we obtain the most suitable setting of the transmission probability α. The experimental results are shown in Figure 3. They generally show a trend of first increasing and then decreasing as α grows. The transmission probability essentially indicates the probability that the target node learns from itself rather than from its neighbor nodes. Based on the experimental results, we can infer that learning from only one of the two is not sufficient; therefore, it is necessary to find a suitable α that integrates the information of the target node and its neighbor nodes to obtain the deep embeddings.
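The two limiting cases of α can be checked directly on a toy graph. The sketch below uses the personalized-PageRank propagation form of Equation (11); the three-node path graph and one-hot features are made-up demo inputs chosen so the mixing behavior is easy to read.

```python
import numpy as np

def propagate(H, A_norm, alpha, L_P):
    """Apply L_P personalized-PageRank propagation steps to features H."""
    Z = H.copy()
    for _ in range(L_P):
        Z = (1 - alpha) * (A_norm @ Z) + alpha * H
    return Z

# Toy 3-node path graph with self-loops and symmetric normalization.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
A_hat = A + np.eye(3)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(1)))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt

H = np.eye(3)  # one-hot "features", one per node

# alpha = 1: each node learns only from itself, so propagation is a no-op.
assert np.allclose(propagate(H, A_norm, 1.0, 5), H)

# alpha = 0: pure neighborhood smoothing; after many steps the node
# representations drift toward a common direction (over-smoothing).
Z0 = propagate(H, A_norm, 0.0, 50)
print(np.round(Z0, 3))
```

This illustrates why an intermediate α is needed: α = 1 ignores the graph entirely, while α = 0 reproduces the over-smoothing behavior of deep stacked propagation.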

Conclusions
In this paper, we propose a scalable deep network (SDN) with a Quasi-GNN module and a DNN module. First, we utilize the Quasi-GNN module to capture the information of graph topology and node features in different dimensions, and we employ the DNN module for auto-encoder to supplement the structural information. In addition, these two components can be combined with other post-processing methods so that nodes can be further assigned to clusters with higher confidence, which gives the model high scalability. Moreover, our proposed model solves the issue of the orthogonal relationship between the model depth and the neighborhood range. It reduces the computational costs of traditional GCN models and alleviates the over-smoothing issue caused by stacking multiple GCN layers. Experiments on benchmark datasets show that our model has superior performance and achieves SOTA results.
For future work, we plan to optimize the Quasi-GNN module with an attention mechanism to account for the differing importance of different nodes. On the other hand, we can add variants of GAE/VGAE to obtain more robust embeddings or propose different self-supervised modules to effectively supervise the training of the deep embeddings and the clustering.

Algorithm 1.
where β and γ are constraint coefficients, β ∈ [0, 1] and γ ∈ [0, 1]. Algorithm 1 shows the training process of our proposed model.
Algorithm 1: Training process of SDN.
Require: Initial features: X, Graph: G, Number of clusters: K, Adjacency matrix: A, Iteration number: MaxIter, Layer number of linear encoder: L_E, Layer number of linear propagation module: L_P;
Ensure: Clustering results R;
1: Initialize the parameters with the pre-train auto-encoder;
2: Initialize µ with K-means on the representations learned by the pre-train auto-encoder;
3: for ite = 1 to MaxIter do
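The self-supervised step inside the training loop can be illustrated concretely. The sketch below computes the soft assignment Q (Student's t-kernel, following Xie et al. [34]) and the sharpened target distribution P that supervises both the embedding and the clustering; the random embeddings, centers, and variable names are illustrative demo values, not the paper's exact notation.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, K = 50, 8, 3
Z = rng.normal(size=(N, d))      # deep embeddings (demo values)
mu = rng.normal(size=(K, d))     # cluster centers, e.g., from K-means init

# Soft assignment: q_ij ∝ (1 + ||z_i - mu_j||^2)^(-1), rows normalized.
dist2 = ((Z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
Q = 1.0 / (1.0 + dist2)
Q /= Q.sum(1, keepdims=True)

# Target distribution: p_ij ∝ q_ij^2 / f_j with f_j = sum_i q_ij,
# which emphasizes high-confidence assignments.
F = Q.sum(0)
P = (Q ** 2) / F
P /= P.sum(1, keepdims=True)

# KL(P || Q) is the clustering loss term the dual self-supervised
# module trains both branches against.
kl = float((P * np.log(P / Q)).sum())
print(round(kl, 4))
```

Each row of Q and P is a probability distribution over the K clusters, and the KL term is non-negative, vanishing only when the soft assignments already match the sharpened targets.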

Figure 2. Time consumption comparison of AGCN, SDCN, and SDN on six datasets. The experiments were conducted on a machine with an Intel(R) Xeon(R) Gold 6130 CPU @ 2.10 GHz and a single NVIDIA GeForce RTX 2080 Ti with 11 GB memory. The operating system of the machine is Ubuntu 18.04.5 LTS. As for software versions, we use Python 3.9.12, PyTorch 1.11.0, and CUDA 11.4.

Figure 3. Analysis of transmission probability α. From these four pictures, we can notice that when α = 0.3, our model has the best performance for the different metrics.

18: Calculate L_res, L_clu, L_mlp, respectively;
To evaluate the performance of our model, we conduct extensive experiments on five public benchmark datasets; their specific details are shown in Table 1.
• USPS [29]: The USPS is a digit dataset automatically scanned from envelopes by the U.S. Postal Service, containing a total of 9298 16 × 16-pixel grayscale samples; the images are centered, normalized, and show a broad range of font styles.
• ACM: The ACM dataset is a paper network from the ACM digital library. It contains papers published in KDD, SIGMOD, SIGCOMM, MobiCOMM, and VLDB, which can be divided into three categories (databases, wireless communication, data mining).
• CiteSeer: The CiteSeer is a citation network. Papers in this dataset are divided into Agents, AI (Artificial Intelligence), DB (Database), IR (Information Retrieval), ML (Machine Learning), and HCI, containing a total of 3312 papers. It records citation information between papers.
• DBLP:

Table 1. The statistics of the datasets.

Table 2. Confusion matrix. For binary classification issues, the rows of the matrix represent the true values and the columns represent the predicted values. TP (True Positive) is the number of positive samples predicted as positive, FN (False Negative) is the number of positive samples predicted as negative, FP (False Positive) is the number of negative samples predicted as positive, and TN (True Negative) is the number of negative samples predicted as negative.
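From the four counts defined in Table 2, the evaluation metrics follow by simple arithmetic. The sketch below uses made-up demo counts, not values from the paper's experiments.

```python
# Made-up demo counts for a binary confusion matrix (see Table 2).
TP, FN, FP, TN = 40, 10, 5, 45

# Accuracy: fraction of all samples predicted correctly.
acc = (TP + TN) / (TP + FN + FP + TN)
# Precision and recall, combined into the F1 score.
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(acc)  # 0.85
```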

Table 3. Parameter settings used for each dataset when training the Quasi-GNN module. K represents constructing a K-nearest-neighbor graph for non-graph data; if the value is none, the original data is graph data.

Table 4. Accuracy of graph clustering on Flickr. "OOM" means "out of memory".

Table 5. Accuracy (ACC) results on six datasets.

Table 6. Normalized Mutual Information (NMI) results on six datasets.

Table 7. Average Rand Index (ARI) results on six datasets.