1. Introduction
Graphs are ubiquitous in the real world and can be described in various forms; they are widely used in recommendation systems, citation networks, the construction of protein structures in biochemistry, and many other areas, and graph data serves as a carrier of information dissemination. Figure 1 introduces the Cora dataset, a citation network of machine learning papers that has become one of the most popular benchmarks for graph deep learning in recent years. The papers are divided into seven categories, and the final corpus records the citation relationships between papers; the lower part of Figure 1 shows these relationships as a graph. The corpus contains 2708 papers in total. After processing the corpus, 1433 unique words remain to represent the papers, so each paper is described by a 1433-dimensional binary feature vector in which each dimension indicates whether the corresponding word occurs in that paper. In recent years, learning from graph-structured data has received extensive attention in the fields of machine learning and deep learning.
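For readers who want to inspect these statistics directly, the following is a minimal sketch, assuming the PyTorch Geometric library and its Planetoid loader are available (the paper itself does not specify a loading tool):

```python
# Minimal sketch: load Cora and confirm the statistics quoted above.
# Assumes PyTorch Geometric is installed (pip install torch torch-geometric).
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root="data", name="Cora")
data = dataset[0]
print(data.num_nodes)       # 2708 papers (nodes)
print(data.num_features)    # 1433-dimensional binary bag-of-words features
print(dataset.num_classes)  # 7 paper categories
```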
Semi-supervised node classification is one of the most popular and important problems in graph learning, and many effective node classification methods have been proposed in recent years. In 2017, Kipf et al. proposed a new graph neural network structure, the graph convolutional network (GCN) [1]. A standard GCN layer aggregates information only from directly neighboring nodes, whereas Higher-Order Graph Convolution Architectures via Sparsified Neighborhood Mixing (MixHop) [2] learns mixed neighbor relationships by repeatedly mixing feature representations of neighbors at different distances. Despite the great success of these algorithms, such models do not take full advantage of the information carried by nodes and their neighbors.
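For reference, the layer-wise propagation rule of GCN [1] and the neighborhood-mixing layer of MixHop [2] can be written as

$$H^{(l+1)} = \sigma\!\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right), \qquad H^{(l+1)}_{\mathrm{MixHop}} = \Big\Vert_{j \in P}\, \sigma\!\left(\hat{A}^{j} H^{(l)} W^{(l)}_{j}\right),$$

where $\tilde{A} = A + I$ is the adjacency matrix with added self-loops, $\tilde{D}$ is its degree matrix, $\hat{A}$ is the normalized adjacency matrix, and $\Vert$ denotes column-wise concatenation over a set of adjacency powers $P$ (e.g., $P=\{0,1,2\}$).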
Another active line of research focuses on modifying the graph adjacency matrix. Yu Rong et al. proposed DropEdge (Towards Deep Graph Convolutional Networks on Node Classification) [3], an algorithm that randomly removes edges, and Dongsheng Luo et al. proposed PTDNet (Robust Graph Neural Network via Topological Denoising) [4], which aims to weed out task-irrelevant edges. Both algorithms have low time complexity, but neither benefits from added edges. Chen et al. [5] proposed to iteratively add (remove) edges between nodes with the same (different) predicted labels. Although this approach enables edges to be added, it depends heavily on the training set size and is prone to error propagation. Recently, Tong Zhao et al. proposed GAUG (Data Augmentation for Graph Neural Networks) [6], a data augmentation method that modifies the graph structure by means of an edge predictor: it deletes unimportant edges, adds plausible augmentation edges, and finally makes predictions on the modified graph.
Klicpera et al. [7] proposed graph diffusion networks (GDNs), which coordinate spatial message passing with generalized graph diffusion, where diffusion acts as a denoising filter that allows messages to pass through higher-order neighborhoods. According to the stage at which diffusion is applied, GDNs can be divided into early-fusion and late-fusion models. The early-fusion models proposed by Xu et al. [8] and Jiang et al. [9] use graph diffusion to determine neighbors; for example, graph diffusion convolution (GDC) replaces the adjacency matrix in graph convolution with a sparsified diffusion matrix (Klicpera et al. [7]). Late-fusion models (Tsitsulin et al. [10]; Klicpera et al. [11]) first project node features into a latent space and then diffuse the learned node features.
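The generalized graph diffusion used by GDC [7] can be summarized as

$$S = \sum_{k=0}^{\infty} \theta_k T^{k}, \qquad \theta_k^{\mathrm{PPR}} = \alpha(1-\alpha)^{k},$$

where $T$ is a transition matrix derived from the adjacency matrix and the coefficients $\theta_k$ (here, the personalized PageRank choice) determine how much weight each $k$-hop neighborhood receives; GDC then sparsifies $S$ and uses it in place of the adjacency matrix.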
For the node classification task, how to fully exploit the large amount of unlabeled data has become an important research direction. In contrastive and unsupervised graph representation learning, Deep Graph Contrastive Representation Learning (GRACE) [12] and Graph Contrastive Learning with Augmentations (GraphCL) [13] define positive and negative pairs to maximize mutual information and augment the data. In computer vision, Prototypical Contrastive Learning of Unsupervised Representations (PCL) [14], Exploring Balanced Feature Spaces for Representation Learning (BalFeat) [15], and Unsupervised Data Augmentation for Consistency Training (UDA) [16] make full use of unlabeled data through consistency-regularization training to improve the generalization performance of the model. These ideas can also be applied to improve the performance of semi-supervised node classification.
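As a minimal sketch of the consistency-regularization idea behind UDA-style training (the function below and its sharpening temperature are illustrative assumptions, not the exact loss of any cited method):

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_weak: torch.Tensor, logits_strong: torch.Tensor,
                     temperature: float = 0.4) -> torch.Tensor:
    """Pull predictions on a strongly augmented view toward a sharpened
    (temperature-scaled) pseudo-target from a weakly augmented view."""
    with torch.no_grad():
        target = F.softmax(logits_weak / temperature, dim=-1)  # sharpened target
    return F.kl_div(F.log_softmax(logits_strong, dim=-1), target,
                    reduction="batchmean")
```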
GCNs usually have only two convolutional layers; performance is higher when the network is shallow but gradually degrades as it becomes deeper. Li et al. [5] showed that the cause of this over-smoothing is that, by repeatedly applying Laplacian smoothing, a GCN may mix node features from different clusters and make them difficult to distinguish. To address this problem, Jumping Knowledge Networks (JKNet) [17] alleviate over-smoothing to some extent by combining the representations of all previous layers with the last layer. Wu et al. proposed Simplifying Graph Convolutional Networks (SGC) [18], but it is still a shallow model that remains at risk of over-smoothing. Klicpera et al. proposed the personalized PageRank-based PPNP algorithm [11], which separates feature transformation from propagation and can aggregate multi-order neighbors without increasing the number of neural network layers. Using a modified Markov diffusion kernel, Hao Zhu et al. derived a variant of GCN named Simple Spectral Graph Convolution (SSGC) [19], which balances the global and local information of each node. SSGC effectively utilizes the information of the current node and its multi-order neighbors while alleviating the over-smoothing problem.
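For comparison, APPNP [11] propagates predictions with the personalized PageRank iteration, and SSGC [19] averages over propagation depths:

$$Z^{(k+1)} = (1-\alpha)\hat{A}Z^{(k)} + \alpha H, \qquad \bar{X} = \frac{1}{K}\sum_{k=1}^{K}\left((1-\alpha)\hat{A}^{k}X + \alpha X\right),$$

where $H = f_\theta(X)$ is the output of the feature transformation, $\hat{A}$ is the symmetrically normalized adjacency matrix with self-loops, and $\alpha$ is the teleport (return) probability.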
How to effectively aggregate multi-order neighbors therefore remains one of the central research questions.
We propose the graph mixed random neural network based on PageRank (PMRGNN), a simple and effective framework that addresses the above problems in three ways: first, we design a graph data augmentation scheme for semi-supervised learning; second, we extract features by combining two feature extractors; finally, we add a graph regularization term to further improve the performance of the designed model. Experimental results show that PMRGNN effectively improves the accuracy of node classification.
The rest of the paper is organized as follows: Section 2 presents the basic concepts of graphs and semi-supervised classification and introduces the relevant models. Section 3 introduces the three modules of the proposed model and their underlying theory, and analyzes the algorithm and its time complexity. Section 4 presents the experimental design and the obtained results. The conclusions and planned future work are given in the last section.
4. Experiment and Analysis
To verify the performance of the proposed model, we conduct experiments on three standard benchmark datasets and compare it with baseline models. Section 4.1 introduces the benchmark datasets and the baseline models. Section 4.2 compares PMRGNN with the baseline models. Section 4.3 reports an ablation experiment that quantifies the contribution of each component. Section 4.4 compares the PageRank weighting scheme with other choices of propagation weights. Section 4.5 analyzes the generalization performance of the model. Section 4.6 compares PMRGNN with other GNN baselines at low labeling rates. Section 4.7 analyzes the over-smoothing experiments. Section 4.8 gives a visual analysis of the classifications produced by two GNN baselines and PMRGNN.
4.1. Datasets and Baseline Models
In this paper, we conduct experiments on three benchmark graphs [1,29]: Cora, CiteSeer, and PubMed. The statistics of the datasets are shown in Table 1.
We follow the experimental setup of Yang et al. [30]. In the citation network datasets, the nodes are documents, whose bag-of-words representations form the feature matrix $X$; the citation links are regarded as undirected edges and are used to construct the adjacency matrix $A$. The training set consists of 20 nodes per class, with a validation set of 500 nodes and a test set of 1000 nodes. The labeling rate is the ratio of labeled nodes to the total number of nodes in the dataset, i.e., $\text{labeling rate} = N_{\text{labeled}}/N$. For example, the labeling rate of Cora is $(20 \times 7)/2708 \approx 5.2\%$.
Baseline models: eight different graph neural networks are selected as baseline models: Graph Convolutional Networks (GCN) [1], Graph Attention Networks (GAT) [29], Simplifying Graph Convolutional Networks (SGC) [18], the personalized PageRank-based APPNP [11], DropEdge [3], the Adaptive Universal Generalized PageRank Graph Neural Network (GPRGNN) [31], Simple Spectral Graph Convolution (SSGC) [19], and Simple and Deep Graph Convolutional Networks (GCNII) [32]. In addition, we include a GNN-based sampling method, GraphSAGE (Inductive Representation Learning on Large Graphs) [33].
4.2. Experimental Results
The node classification accuracies are summarized in Table 2. The compared methods are the baseline models introduced in Section 4.1. The reported result of PMRGNN is the average over 50 runs with random weight initializations. Table 2 shows that the proposed PMRGNN is superior to the compared baseline models. Specifically, compared with GCN, our method improves accuracy by 3.7%, 5.0%, and 1.4% on Cora, CiteSeer, and PubMed, respectively; compared with GAT, it improves by 2.2%, 2.8%, and 1.4%. Relative to GCN, SSGC improves by 1.1%, 2.7%, and 1.0%, and GCNII improves by 3.4%, 2.6%, and 1.2%; our method is 0.3%, 2.4%, and 0.2% higher than GCNII.
4.3. Ablation Experiment
We conduct an ablation experiment to understand the contributions of the different components of PMRGNN. The following variants are considered; a code sketch of the two random augmentations appears after the list.
Unmasked features (w/o MF): remove the random mask matrix and use DropNode only.
Network structure MLP (net-MLP): replace the combination of two feature extractors with an MLP alone.
Network structure GCN (net-GCN): replace the combination of two feature extractors with a GCN alone.
No feature extractor loss (w/o KL loss): remove the KL divergence term from the final loss.
No regularization term (w/o reg): remove the graph regularization term by setting its hyperparameter to 0.
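The following is a minimal sketch of the two random augmentations being ablated above, in the style of GRAND-type random propagation; the function names and drop rates are illustrative assumptions, not the paper's exact implementation:

```python
import torch

def drop_node(x: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    # DropNode: zero out entire node feature rows with probability p,
    # rescaling the survivors to keep the expectation unchanged.
    keep = torch.bernoulli(torch.full((x.size(0), 1), 1.0 - p, device=x.device))
    return x * keep / (1.0 - p)

def mask_features(x: torch.Tensor, p: float = 0.3) -> torch.Tensor:
    # Feature masking: zero out individual feature entries with probability p.
    mask = torch.bernoulli(torch.full_like(x, 1.0 - p))
    return x * mask
```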
Table 3 summarizes the results of the ablation experiment. Removing any component of PMRGNN reduces performance compared with the complete model, indicating that every component designed in this paper helps to improve accuracy. Table 3 also shows that masking features provides effective random data augmentation, that using either feature extractor alone is less effective than using the two in combination, and that the graph regularization term further improves the performance of the model.
4.4. PageRank Weight Comparison Experiment
In this paper, we aggregate the multi-order neighborhoods of each node as $\bar{X} = \sum_{k=0}^{K} w_k \hat{A}^{k} X$, where the weights follow the personalized PageRank scheme $w_k = \alpha(1-\alpha)^{k}$. In this section, we compare this choice with the cases where $w_k$ is a learnable weight or a uniform average. The experimental results show that the PageRank-weighted aggregation of multi-order neighborhoods proposed in this paper is more effective (as shown in Table 4).
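Below is a minimal sketch of the PageRank-weighted propagation compared in Table 4, under the truncated personalized PageRank weights reconstructed above (variable names are illustrative, and a dense normalized adjacency matrix is assumed):

```python
import torch

def ppr_propagate(x: torch.Tensor, adj_norm: torch.Tensor,
                  alpha: float = 0.1, K: int = 8) -> torch.Tensor:
    # Truncated personalized PageRank aggregation:
    # X_bar = sum_{k=0}^{K} alpha * (1 - alpha)^k * A_hat^k X.
    out = alpha * x
    h = x
    for k in range(1, K + 1):
        h = adj_norm @ h                          # A_hat^k X, computed iteratively
        out = out + alpha * (1.0 - alpha) ** k * h
    return out
```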
4.5. Generalization Performance Analysis
In this section, we study the effects of random propagation and the graph regularization term on the generalization ability of the model. To verify their effects, we analyze the training loss and validation loss of the model on the Cora dataset; the smaller the gap between the two losses, the better the generalization performance. The generalization behavior of the model and of its variant without regularization (w/o reg) is shown in Figure 6. Without the graph regularization term, there is an obvious gap between the training loss and the validation loss (Figure 6a); when the graph regularization term is added, the validation loss stays close to the training loss (Figure 6b). This shows that the graph regularization term added in this paper can improve the generalization performance of PMRGNN.
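The exact regularizer is defined in Section 3; as an illustration of the kind of graph regularization term described here, a classic choice penalizes prediction differences across edges:

$$\mathcal{L}_{\mathrm{reg}} = \sum_{(i,j)\in\mathcal{E}} \left\lVert z_i - z_j \right\rVert_2^{2} = \operatorname{tr}\!\left(Z^{\top} L Z\right),$$

where $Z$ collects the model outputs, $\mathcal{E}$ is the edge set, and $L$ is the graph Laplacian. This form is an assumption for exposition, not necessarily the exact term used by PMRGNN.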
4.6. Low Label Rate Experiment
To further verify the performance of PMRGNN, we conduct experiments on the datasets with lower label rates; the results are shown in Table 5. Here the training set contains 10 nodes per class, the validation set contains 500 nodes, and the test set contains 1000 nodes. At these lower label rates, our method still outperforms the other models. Specifically, it improves on GCN by 6.4%, 5.4%, and 1.7% on the Cora, CiteSeer, and PubMed datasets, and on GAT by 2.6%, 2.3%, and 1.9%; PMRGNN is 1.2%, 0.3%, and 0.3% higher than the best competing algorithm on the respective datasets. Figure 7 compares accuracies on the Cora dataset for different training set sizes (from 10 to 100 training nodes per class). The algorithm designed in this paper is consistently higher than GCN and GAT across all partitions.
4.7. Over-Smoothing Experiment
Many GNNs suffer from over-smoothing when the feature propagation step size is enlarged: as the number of propagation steps increases, nodes with different labels become difficult to distinguish. To verify the ability of PMRGNN to alleviate over-smoothing, we conduct experiments on the Cora dataset with different numbers of propagation steps. Figure 8 shows the experimental results. In PMRGNN, the propagation depth is controlled by the hyperparameter $K$, the number of propagation steps of the adjacency matrix in the graph data augmentation of Section 3.2. For GCN and GAT, the depth is varied by stacking different numbers of hidden layers; for SGC and SSGC, the propagation depth of the data preprocessing is varied. The results show that as the propagation depth increases, the accuracy of GCN and GAT drops significantly: due to over-smoothing, GCN falls from 81.5% to 15% and GAT from 83% to 20%. PMRGNN, SGC, and SSGC alleviate the over-smoothing problem and remain stable within a certain range, but our method performs best: the accuracy of PMRGNN is always the highest, staying around 85%.
4.8. Classification Visualization
Figure 9 and Figure 10 visualize the node classification results on CiteSeer. Figure 9a shows GCN; Figure 9b shows Simple and Deep Graph Convolutional Networks (GCNII) [32]; Figure 10a shows the PMRGNN feature extractor using only the convolutional layers, without the KL divergence; and Figure 10b shows the full model proposed in this paper. The distribution in Figure 9a is diffuse and not clearly separated. Figure 9b shows that GCNII separates the classes better than the traditional GCN, but the node classes of both GCN and GCNII are still not clear enough. Figure 10 corresponds to our model: compared with GCNII, the node clusters in PMRGNN are more compact, and similar nodes tend to move in the same direction. In Figure 10a, five of the classes are clearly separated while the classification of the central nodes remains relatively fuzzy; the ablation experiment in Section 4.3 likewise showed that accuracy declines without the KL divergence. Using the two feature extractors together makes the clusters more compact and the node classification clearer.
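Such figures are commonly produced by projecting the learned node embeddings to two dimensions. The following is a minimal sketch assuming a t-SNE projection (the paper does not state the projection method, and the embeddings and labels below are random placeholders, not real model outputs):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder data; in practice these would be the trained model's hidden
# representations and predicted labels for the CiteSeer nodes (6 classes).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3327, 64))   # CiteSeer has 3327 nodes
labels = rng.integers(0, 6, size=3327)

emb_2d = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
plt.scatter(emb_2d[:, 0], emb_2d[:, 1], c=labels, s=5, cmap="tab10")
plt.show()
```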
5. Conclusions
In the research on graph semi-supervised classification, this paper proposes a new model, the Graph Mixed Random Network Based on PageRank (PMRGNN). First, we randomly mask dimensions of the node features with zeros and then aggregate the multi-order neighborhoods of the nodes to randomly generate new feature matrices. Second, we propose a method that combines two feature extractors, enabling the key information in the features to complement each other. Finally, we propose two losses, a feature extractor loss and a graph regularization loss, to improve the performance of the model. The experiments show that our model outperforms other neural networks: PMRGNN is 0.3%, 2.4%, and 0.2% higher than the best competing algorithms on the three datasets, and at lower label rates it remains 1.2%, 0.3%, and 0.3% higher. The ablation experiments show that each component of the algorithm plays its intended role; in particular, the graph regularization component keeps the validation loss close to the training loss and makes training more stable. Regarding the choice of propagation weights, the PageRank weighting is more advantageous than the other options. In the over-smoothing study, the accuracy of our model does not decline as the number of propagation steps increases and stays around 85%, which shows that the algorithm can effectively alleviate the over-smoothing problem. In the classification visualization, PMRGNN produces clearer and more intuitive class separation than the other methods.
In summary, the idea of PMRGNN is feasible, and in semi-supervised learning the model may benefit other work. In future research, we hope to strengthen PMRGNN's ability to collect effective information, improve the sampling method, and enhance the scalability of the proposed strategy.