Robust Graph Neural Networks via Ensemble Learning

Abstract: Graph neural networks (GNNs) have demonstrated a remarkable ability in the task of semi-supervised node classification. However, most existing GNNs suffer from nonrobustness issues, which pose a great challenge for applying GNNs in sensitive scenarios. Some researchers concentrate on constructing an ensemble model to mitigate these issues. Nevertheless, these methods ignore the interaction among base models, leading to similar graph representations. Moreover, due to the deterministic propagation applied in most existing GNNs, each node relies heavily on its neighbors, leaving the nodes sensitive to perturbations. Therefore, in this paper, we propose a novel framework of graph ensemble learning based on knowledge passing (called GEL) to address the above issues. To achieve interaction, we treat the predictions of prior models as knowledge to obtain more reliable predictions. Moreover, we design a multilayer DropNode propagation strategy to reduce each node's dependence on particular neighbors. This strategy also empowers each node to aggregate information from diverse neighbors, alleviating oversmoothing issues. We conduct experiments on three benchmark datasets: Cora, Citeseer, and Pubmed. GEL outperforms GCN by at least 5% in terms of accuracy on all three datasets and also performs better than other state-of-the-art baselines. Extensive experimental results also show that GEL alleviates the nonrobustness and oversmoothing issues.


Introduction
Graphs serve as structural data to describe complex relationships between entities among numerous networks, such as social networks [1,2], academic networks [3], citation networks [4,5], and traffic networks [6]. There is evidence that mining graphs can be beneficial for solving many real-world applications, such as node classification and clustering. Many studies have shown that graph neural networks (GNNs) [7][8][9][10] are a powerful approach to exploring graph data and can achieve promising results on graph-based semi-supervised learning tasks.
Despite the remarkable ability of most existing GNNs, recent studies have demonstrated that GNNs suffer from nonrobustness issues [11,12], i.e., GNNs are vulnerable to small perturbations in the graph structure. When fake edges are randomly added to the graph, the performance of GNNs can degrade dramatically, which poses a great challenge for employing GNNs in sensitive scenarios, such as finance networks [13].
To mitigate the nonrobustness issues faced by most GNNs, one strategy is to incorporate an ensemble learning mechanism into GNNs to obtain stronger robustness. Based on ensemble learning, several strategies have focused on enhancing the ability of base models and then constructing an ensemble directly [14][15][16][17][18], e.g., by averaging the predictions of the enhanced base models. Their experimental results show that the nonrobustness issues can be mitigated by applying an ensemble mechanism. However, these approaches have two major shortcomings: (1) They ignore the interactions among multiple models when constructing an ensemble framework, which may lead to similar graph representations [19]. As a result, the whole ensemble may be affected by the same perturbation due to the transferability of perturbations among models [14]. (2) Most GNN models still apply deterministic propagation for each node to extract information. This propagation rule makes each node highly dependent on its neighbors, leaving each node sensitive to perturbations in the graph structure.
To address these two challenges, in this paper, we propose a framework, i.e., graph ensemble learning based on knowledge passing (GEL), to alleviate the nonrobustness of GNNs and improve performance in the task of node classification. Specifically, we first present a knowledge-passing strategy to construct an ensemble model with interactions among base models. The motivation is that the knowledge (predictions of prior models) can be passed to the next model, so that the next model can avoid the performance degradation caused by the same perturbations as in prior models, improving the robustness of the whole framework. Second, we design a multilayer DropNode propagation strategy, which randomly drops the entire feature vector of each node during each propagation, with a different probability for each base model. By doing this, each node aggregates information from diverse subsets of its neighbors rather than from the fixed neighbors of a deterministic propagation, which reduces its dependence on and sensitivity to particular neighbors and lets it benefit from diverse neighborhoods, increasing the robustness of GEL. For instance, when some nodes are perturbed, in the deterministic propagation, the negative effect propagates to their neighbors. With our propagation rule, the effects are greatly reduced or even eliminated because a perturbed node may be excluded from the various subsets of neighbors, depending on the DropNode probability. Furthermore, this propagation rule empowers each node to incorporate broader higher-hop information, mitigating oversmoothing for GEL.
Finally, we conduct experiments on three public datasets with several popular GNN models. Experimental results demonstrate that GEL outperforms GCN in terms of accuracy. We also show that the variants without the multilayer DropNode propagation or without ensemble learning based on knowledge passing still improve upon GCN, which means that each strategy contributes to our framework. More importantly, we observe that GEL can mitigate nonrobustness and oversmoothing issues.
In summary, the contributions of this work are as follows:
• We propose a novel framework of graph ensemble learning based on knowledge passing (i.e., GEL) to address the robustness challenge of GNNs and improve the performance on semi-supervised learning tasks.
• We design a multilayer DropNode propagation strategy to reduce each node's dependence on particular neighbors, which can strengthen the robustness of GEL. Moreover, this propagation rule enables each node to extract knowledge from diverse subsets of neighbors, alleviating the oversmoothing issues.
• Experimental results on three public datasets show that our framework performs better than baseline methods in terms of classification accuracy and robustness.
The remainder of this paper is structured as follows. In Section 2, we outline the semi-supervised learning task on graphs and review related work. In Section 3, we elaborate on the proposed framework. Next, we conduct experiments to evaluate the performance of our framework in Section 4. Finally, we conclude the paper in Section 5.

Task Definition and Related Work
In this section, we describe the semi-supervised node classification task on graphs and introduce some notations in Table 1. Then, we review the related work.
Table 1. Notations.
Symbol          Description
$H^{(l)}$       Hidden node representations in the lth layer
$W^{(l)}$       Weight matrix in the lth layer
$\tilde{X}$     Perturbed matrix
$\bar{X}$       Matrix after propagation
$Z$             Predicted probabilities of matrix $\bar{X}$

Semi-Supervised Node Classification
We describe a connected graph $G = (V, E)$, where $V = \{V_1, V_2, \cdots, V_n\}$ is a node set of $n$ nodes, and $E = \{e_{ij} \mid 1 \le i, j \le n\}$ is an edge set indicating the connections between nodes. $A \in \{0, 1\}^{n \times n}$ denotes the adjacency matrix of graph $G$. For an undirected graph, $A_{ij} = 1$ indicates that there is an edge $e_{ij}$ between nodes $V_i$ and $V_j$; otherwise, $A_{ij} = 0$. We use $X \in \mathbb{R}^{n \times d}$ to denote the feature matrix of graph $G$, wherein $X_i$ denotes the features of node $V_i$ and $d$ is the feature dimension.
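As a small illustration of this notation, the following sketch builds the adjacency matrix and a feature matrix for a toy graph. The helper name `build_adjacency` and the toy path graph are hypothetical and chosen only for illustration; nodes are indexed from 0 rather than 1.

```python
import numpy as np

def build_adjacency(n, edges):
    """Return the symmetric 0/1 adjacency matrix of an undirected graph."""
    A = np.zeros((n, n), dtype=np.int64)
    for i, j in edges:
        A[i, j] = 1
        A[j, i] = 1  # undirected graph: an edge e_ij is also an edge e_ji
    return A

# Hypothetical toy graph: the path 0 - 1 - 2 - 3, with d = 2 features per node.
edges = [(0, 1), (1, 2), (2, 3)]
A = build_adjacency(4, edges)
X = np.arange(8, dtype=float).reshape(4, 2)  # X[i] holds the features of node i
```

Here `A[i, j] == 1` exactly when nodes i and j are connected, and each row of `X` is one node's feature vector.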
We formalize the semi-supervised node classification task on graphs. In a graph, each vertex $V_i$ is associated with its label $Y_i \in Y$, where $Y$ denotes all possible labels. For semi-supervised node classification, $m$ nodes have known labels $Y_L \subset Y$, and the labels $Y_U = Y \setminus Y_L$ of the remaining $n - m$ nodes are unknown. The target is to design a function $F$ to predict the labels $Y_U$ of the unlabeled nodes via the corresponding feature matrix $X$. Therefore, the predictive function can be detailed as follows:
$$F : (G, X, Y_L) \rightarrow Y_U.$$
Traditional methods to solve this problem are mostly based on graph Laplacian regularization [20,21]. Recently, GNNs have emerged as promising approaches for semi-supervised node classification [22][23][24][25], which are briefly introduced below.

Graph Neural Networks
In this part, we introduce some representative GNN methods and the propagation rule of GNN models. In GNNs, each node propagates information to its neighbors with some deterministic propagation rule. For instance, in the graph convolutional network (GCN) [7] for semi-supervised learning on graphs, the graph propagation rule is formulated as follows:
$$H^{(l+1)} = \mathrm{ReLU}\left(\hat{A} H^{(l)} W^{(l)}\right),$$
where $\hat{A}$ denotes the symmetric normalized adjacency matrix, $W^{(l)}$ denotes the weight matrix in the lth layer, $H^{(l)}$ is the hidden node representation in the lth layer with $H^{(0)} = X$, and ReLU is the activation function. Some methods have been proposed to advance this architecture. For example, Hamilton et al. [26] defined the graph convolution as aggregating information from neighbors. Veličković et al. [27] applied an attention mechanism to assign different weights in the aggregation of neighbors' features. Xu et al. [28] added residual and jumping connections to adapt to neighbors' properties. Wu et al. [29] removed the nonlinear activation function to simplify GCN. Abu-El-Haija et al. [30] studied a class of neighborhood-mixing relationships. Tu et al. [31] chose hyperparameters automatically to improve effectiveness and efficiency. However, none of the above works considers the robustness of GNN models.
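The GCN propagation rule can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in our experiments: the normalization adds self-loops before symmetric normalization, as in the GCN paper [7], and the function names, the toy graph, and the random weights are assumptions made for the example.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    deg = A_tilde.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W):
    """One deterministic propagation step: ReLU(A_hat @ H @ W)."""
    return np.maximum(A_hat @ H @ W, 0.0)

rng = np.random.default_rng(0)
# Hypothetical toy graph: a path with 3 nodes and 4 features per node.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = rng.normal(size=(3, 4))   # H^(0) = X
W0 = rng.normal(size=(4, 2))  # weight matrix of the first layer
A_hat = normalize_adjacency(A)
H1 = gcn_layer(A_hat, X, W0)
```

Because `A_hat` is fixed, every node always aggregates from the same neighbors, which is the deterministic behavior discussed above.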

Ensemble Learning
Ensemble learning was proposed to combine the predictions of base learners into more accurate predictions [32]. Ensemble learning has shown its effectiveness in many real-world scenarios. It is also widely used in semi-supervised node classification tasks.
There are a few studies applying ensemble learning to graphs. For example, Hou et al. [33] leveraged a graph ensemble technique to help dependency-based approaches alleviate the influence of parsing errors in sentiment analysis. Zhang et al. [34] trained many GCN models and then created an ensemble of them in a way similar to BAN. Further works have adopted ensemble learning to improve robustness [14,15,17,35,36]. Liu et al. [18] used a voting ensemble to generate a high-accuracy output. Mun et al. [37] trained an ensemble of GNN classifiers with dependent codes to improve the robustness of the networks.
However, these works ignore the interaction among base models, which refers to the sharing of information, such as knowledge and experience, among base models. In other words, each model outputs its predictions independently, without taking the predictions of prior models as its own knowledge or experience. In contrast to these works, the goal of our framework is to utilize the predictions of prior models as knowledge to train a new one, so that the new model can avoid the performance degradation caused by the same perturbations as in prior models and thus achieve higher accuracy.

The Design of GEL
We designed a graph ensemble learning based on knowledge passing (GEL) framework for semi-supervised node classification tasks and for mitigating nonrobustness issues. As illustrated in Figure 1, a connected graph G with its adjacency matrix A and feature matrix X is given. To empower each node to extract information from diverse subsets of its neighbors, we utilized a multilayer DropNode propagation strategy for each model. Afterwards, the matrix after propagation was fed into an MLP classification model. Finally, we leveraged ensemble learning for better performance under a semi-supervised setting. Each step of our framework is explained in detail below.

Multilayer DropNode Propagation
Multilayer DropNode Propagation. There were two steps in the multilayer DropNode propagation. First, before propagation, the feature vector of each node was removed with probability p to obtain a perturbed matrix $\tilde{X}$. Second, during propagation, we performed label propagation to generate the matrix $\bar{X}$. Note that the above process was only applied during training. During inference, we directly used the original feature matrix X.
Formally, in the first step, we assigned a mask $p_i \sim \mathrm{Bernoulli}(1 - \varepsilon)$ to each node $v_i$, where $\varepsilon$ is the probability of removing the node's features (the DropNode probability, written $p$ in the rest of the paper). Then, we obtained the perturbed matrix $\tilde{X}$ by multiplying each node's feature vector by its mask, i.e., $\tilde{X}_i = p_i \cdot X_i$, where $X_i$ denotes the ith row of the feature matrix X. In doing so, we changed the graph structure before each propagation. Specifically, each node extracted information from diverse subsets of its neighbors rather than from the same set of neighbor nodes, reducing its dependence on specific neighbors. Note that the DropNode probability was different for each model. Generally, it decreased gradually from the first model to the last. Thus, we obtained different node representations and ensured the diversity of the models.
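The masking step can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `drop_node` is hypothetical, and kept rows are left unscaled because the text above does not state whether any rescaling is applied.

```python
import numpy as np

def drop_node(X, drop_prob, rng):
    """Zero out entire rows of X; each row is kept with probability 1 - drop_prob."""
    # One Bernoulli(1 - drop_prob) mask value per node (row), not per feature.
    mask = rng.binomial(1, 1.0 - drop_prob, size=X.shape[0])
    return X * mask[:, None], mask

rng = np.random.default_rng(0)
X = np.ones((6, 3))  # hypothetical feature matrix: 6 nodes, 3 features each
X_tilde, mask = drop_node(X, drop_prob=0.5, rng=rng)
```

Each row of `X_tilde` is either the original feature vector or all zeros, so a dropped node contributes nothing to its neighbors in the subsequent propagation.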
In the second step, we adopted a multilayer label propagation, i.e., directly using $\hat{A}^K$. This propagation rule empowered each node to aggregate more local information, mitigating the risk of oversmoothing. Compared with GRAND [38], we performed the DropNode strategy in every propagation layer, rather than only once before propagation. By doing so, we obtained stochastic subsets of neighbors for each node. Consequently, each node extracted information randomly from its diverse neighbors, and thus its features could be affected by more remote neighbors. Moreover, this reduced the distraction caused by noisy nodes' information, enhancing the robustness of the model while alleviating the oversmoothing problem.
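One possible reading of this step is sketched below: DropNode is applied before each of the K multiplications by the normalized adjacency matrix during training, and skipped at inference. The exact way the K layers are combined is an assumption of this sketch, as are the function name and the toy inputs; it is not a definitive implementation.

```python
import numpy as np

def multilayer_dropnode_propagation(A_hat, X, K, drop_prob, rng, training=True):
    """Propagate features for K steps, dropping whole node rows before each step."""
    H = X
    for _ in range(K):
        if training:
            # Fresh Bernoulli(1 - drop_prob) mask per node at every layer.
            mask = rng.binomial(1, 1.0 - drop_prob, size=H.shape[0])
            H = H * mask[:, None]
        H = A_hat @ H
    return H

rng = np.random.default_rng(0)
# Hypothetical normalized adjacency of a 3-node path with self-loops.
A_hat = np.array([[0.5, 0.4, 0.0], [0.4, 1 / 3, 0.4], [0.0, 0.4, 0.5]])
X = rng.normal(size=(3, 2))
X_bar_train = multilayer_dropnode_propagation(A_hat, X, K=4, drop_prob=0.3, rng=rng)
X_bar_eval = multilayer_dropnode_propagation(A_hat, X, K=4, drop_prob=0.3, rng=rng,
                                             training=False)
```

With `training=False` the function reduces to multiplying X by the Kth power of `A_hat`, which matches the statement that the original feature matrix is used directly at inference.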
Prediction. For each model, the matrix $\bar{X}$ was generated by performing the multilayer random DropNode propagation. Then, each matrix was fed into an MLP model to obtain the corresponding classification results:
$$Z = f_{mlp}(\bar{X}, \Theta),$$
where $Z \in (0, 1)^{n \times C}$ denotes the predicted probabilities of the matrix $\bar{X}$, $C$ is the number of classes, and $\Theta$ is the set of model parameters.
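The prediction step can be illustrated with a small MLP forward pass. The two-layer architecture, the ReLU activation, and the softmax output are assumptions of this sketch (the two-layer form is consistent with the prediction cost stated in the complexity analysis); the parameter names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def mlp_predict(X_bar, W1, b1, W2, b2):
    """Two-layer MLP: class probabilities for every node (row of X_bar)."""
    hidden = np.maximum(X_bar @ W1 + b1, 0.0)  # ReLU hidden layer
    return softmax(hidden @ W2 + b2)

rng = np.random.default_rng(0)
n, d, d_h, C = 5, 4, 8, 3  # hypothetical sizes
X_bar = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, d_h)) * 0.1, np.zeros(d_h)
W2, b2 = rng.normal(size=(d_h, C)) * 0.1, np.zeros(C)
Z = mlp_predict(X_bar, W1, b1, W2, b2)
```

Each row of `Z` is a probability distribution over the C classes for one node.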

Graph Ensemble Learning Based on Knowledge Passing
We designed a graph ensemble learning method based on knowledge passing. The main idea was to consider not just the performance of a single model, but also the interactions among base models. Our goal was for each model to learn the knowledge of prior models and thereby improve its own predictions.
Ensemble learning. We adopted an ensemble learning strategy that uses multiple MLPs to complete the classification task. For each model, we obtained the matrix $\bar{X}$ from the multilayer DropNode propagation. Afterwards, it was fed into an MLP model to output the corresponding predictions. Except for the first model, each base model used the classification results of the prior models as knowledge.
Specifically, we designed the serialized ensemble learning method shown in Figure 2. First, the classifier MLP$_1$ output the classification result $Z^{(1)}$, which was delivered to the next model MLP$_2$. When the classifier MLP$_2$ performed classification, it received the result delivered by the first model and gave its own predictions. In general, the $i$th model received knowledge from the $(i-1)$th model. The optimization objective that aligned the predictions of the $i$th model with those of the $(i-1)$th model can be formulated as
$$\min \ \mathrm{distance}\left(Z^{(i)}, Z^{(i-1)}\right),$$
where distance(·, ·) denotes the distance between the predictions of two models. In doing so, the $i$th model received the prior knowledge transferred from the $(i-1)$th model and thus performed better.
Meanwhile, we also designed a parallel ensemble learning method, entitled P-GEL, as one of the comparison algorithms, shown in Figure 3. Different from the serialized ensemble learning method, the first $N-1$ models were trained independently and output the classification results $Z^{(1)}$ to $Z^{(N-1)}$, respectively. When the $N$th model was trained, it received the classification results from the first model up to the $(N-1)$th model and output its own classification result.
It can be seen from Section 4.2 that the classification results of the serialized ensemble learning algorithm are better than those of the parallel ensemble learning algorithm, which indicates that the knowledge-passing strategy among base models is significant. When a single model encounters a perturbation and outputs poor classification results, the knowledge from prior models can compensate for the poor performance. Consequently, the robustness of the models is greatly strengthened.
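Training and Inference. In the whole algorithm, the loss function was divided into two parts: the loss within a model and the loss between models. The loss within a model was the supervised loss. With m labeled nodes among n nodes, the supervised loss of the $i$th model in every epoch was formulated as the cross-entropy loss:
$$\mathcal{L}_{sup} = -\sum_{j=1}^{m} Y_j^{\top} \log Z_j^{(i)},$$
where $Y_j$ is the one-hot label vector of the $j$th labeled node and $Z_j^{(i)}$ is the corresponding row of $Z^{(i)}$. The loss between models was the knowledge-passing (KP) loss. Concretely, when the $(i-1)$th model passed knowledge to the $i$th model, we minimized the distance between $Z^{(i)}$ and $Z^{(i-1)}$:
$$\mathcal{L}_{kp} = \mathrm{distance}\left(Z^{(i)}, Z^{(i-1)}\right).$$
The final loss of our algorithm was:
$$\mathcal{L} = \mathcal{L}_{sup} + \lambda \mathcal{L}_{kp},$$
where λ is a hyperparameter that controls the balance between the two losses. Algorithm 1 summarizes the training process of GEL.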
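The two-part loss can be sketched as follows. Two choices here are assumptions rather than details given above: the distance is taken to be the squared Euclidean distance averaged over nodes, and the cross-entropy is summed over the labeled nodes. All function names are hypothetical.

```python
import numpy as np

def supervised_loss(Z, Y_onehot, labeled_idx):
    """Cross-entropy summed over the labeled nodes."""
    eps = 1e-12  # guards against log(0)
    return -np.sum(Y_onehot[labeled_idx] * np.log(Z[labeled_idx] + eps))

def knowledge_passing_loss(Z_curr, Z_prev):
    """Assumed distance: mean squared Euclidean distance between predictions."""
    return np.mean(np.sum((Z_curr - Z_prev) ** 2, axis=1))

def total_loss(Z_curr, Z_prev, Y_onehot, labeled_idx, lam):
    """Supervised loss plus lam times the KP loss; the first model has no prior."""
    loss = supervised_loss(Z_curr, Y_onehot, labeled_idx)
    if Z_prev is not None:
        loss = loss + lam * knowledge_passing_loss(Z_curr, Z_prev)
    return loss

# Hypothetical predictions of two consecutive models on 3 nodes and 2 classes.
Y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
Z_prev = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
Z_curr = np.array([[0.8, 0.2], [0.3, 0.7], [0.5, 0.5]])
labeled = np.array([0, 1])  # only the first two nodes are labeled
loss_first = total_loss(Z_prev, None, Y, labeled, lam=0.7)
loss_next = total_loss(Z_curr, Z_prev, Y, labeled, lam=0.7)
```

The knowledge-passing term is computed over all nodes, labeled or not, so the unlabeled nodes also contribute to training through the predictions of the prior model.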

Algorithm 1 GEL.
Require: Graph G, adjacency matrix A, feature matrix $X \in \mathbb{R}^{n \times d}$, the number of models N, DropNode probability p, learning rate η, an MLP model $f_{mlp}(\bar{X}, \Theta)$.
Ensure: Prediction results Z.
while not converged do
    for n = 1 : N do
        Perturb the feature matrix: $\tilde{X} \sim \mathrm{DropNode}(X, p)$

Computational Complexity
For a single model, the time complexity of the multilayer DropNode propagation is $O(Kd(n + |E|))$, where K is the number of propagation steps, d is the dimension of the node features, n is the number of nodes, and |E| is the number of edges. The next step is the prediction module, which can be accomplished in $O(n d_h (d + C))$, where $d_h$ is the hidden size and C is the number of classes. The total computational complexity of GEL is $O(N(Kd(n + |E|) + n d_h (d + C)))$, where N denotes the number of models.
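To give a feeling for the relative size of the two terms, the expressions above can be evaluated directly. The function below simply evaluates the quantity inside the O(·) and ignores constant factors. The example values are assumptions: n, |E|, d, and C are the commonly reported Cora statistics, while K, d_h, and N are hypothetical settings.

```python
def gel_cost(N, K, d, n, num_edges, d_h, C):
    """Evaluate N * (K*d*(n + |E|) + n*d_h*(d + C)), constants ignored."""
    propagation = K * d * (n + num_edges)  # multilayer DropNode propagation
    prediction = n * d_h * (d + C)         # MLP prediction module
    return N * (propagation + prediction)

# Assumed Cora statistics with hypothetical K, d_h, and N.
cora_cost = gel_cost(N=6, K=8, d=1433, n=2708, num_edges=5429, d_h=32, C=7)
```

The cost grows linearly in the number of models N, so halving the ensemble size halves this estimate.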

Experiments
In this section, we begin by introducing the datasets, comparative baselines, variants, and experimental settings. Then, we present the overall experimental results on the semi-supervised node classification task and compare the performance on network visualization. Afterwards, we compare GEL with GCN [7], GAT [27], and GRAND [38] to validate the advantage of GEL with respect to robustness and oversmoothing. Finally, we conduct experiments to study the effects of different hyperparameters.

Datasets
To evaluate the performance of GEL on semi-supervised node classification tasks, we used three benchmark datasets: Cora [39], Citeseer [39], and Pubmed [40]. The statistics of the three datasets are summarized in Table 2. The details for Cora, Citeseer, and Pubmed are as follows:
• Cora [39] is a benchmark dataset of citations between machine learning papers. It is widely used in the field of graph learning. Each node represents a paper and the edges represent citations between papers. The label of a node indicates the research field of a paper.
• Similar to Cora, Citeseer [39] is another benchmark dataset, which represents the citations between computer science papers and has a similar configuration to Cora.
• Pubmed [40] is also a citation dataset, consisting of articles about diabetes. The node features are term frequency-inverse document frequency (TF-IDF) weighted vectors. The label of a node denotes the type of diabetes.

Baselines and Variants
We conducted experiments with other comparative algorithms to evaluate the performance of GEL. The details of the comparative baselines are listed as follows:
• GCN [7] is a semi-supervised learning approach which employs novel convolution operators on graph-structured data to learn node representations.
• GAT [27] performs better than GCN by incorporating an attention mechanism which assigns different weights to neighbor nodes.
• DGI [41] is an unsupervised learning approach for learning node representations. It maximizes the mutual information between patch representations and the corresponding high-level summaries.
• APPNP [42] improved GCN by connecting GCN with PageRank. A new propagation procedure based on personalized PageRank was proposed to make full use of neighbor information.
• MixHop [30] was proposed to study neighborhood-mixing relationships, such as difference operators. Sparsity regularization makes it possible to visualize which neighborhood information the network prioritizes.
• GraphSAGE [26] was proposed to learn node embeddings by sampling and aggregating information from a node's local neighborhood.
• GRAND [38] first designed random propagation to achieve data augmentation and then proposed a consistency loss to optimize the prediction loss of unlabeled nodes through data augmentation.
• RDD [34] was proposed to define node reliability and edge reliability to ensure the quality of a model. Moreover, a new ensemble learning method was proposed to combine the above optimization.
For each dataset, we tested the following variants of our method:
• M-GEL: The variant without the multilayer DropNode propagation mechanism.
• K-GEL: The variant without the graph ensemble learning based on knowledge passing mechanism.
• P-GEL: The variant with the multilayer DropNode propagation and graph ensemble learning based on parallel knowledge passing.

Settings
We used PyTorch to conduct our experiments. The preprocessing of the three datasets followed Planetoid [43]. The experimental settings on the three datasets were exactly the same as in prior work on semi-supervised learning tasks [7]. For Cora, the numbers of training, validation, and test nodes were 140, 500, and 1000, respectively. For Citeseer, they were 120, 500, and 1000, respectively. For Pubmed, they were 60, 500, and 1000, respectively. In addition, we employed early stopping with a patience of 200 as the termination criterion in the training process. The metric used to evaluate GEL on the classification task was accuracy. All experiments were conducted in PyCharm 2020. As for other software versions, we used Python 3.7.3, PyTorch 1.2.0, NumPy 1.16.4, and CUDA 11.2.
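The early-stopping criterion can be sketched as a small helper. The monitored quantity is assumed here to be the validation accuracy, which is not stated above, and the class name `EarlyStopping` is hypothetical.

```python
class EarlyStopping:
    """Stop training once the monitored value has not improved for `patience` epochs."""

    def __init__(self, patience=200):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_acc):
        """Record one epoch; return True when training should stop."""
        if val_acc > self.best:
            self.best = val_acc
            self.bad_epochs = 0  # improvement resets the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Tiny demonstration with a patience of 3 instead of 200.
stopper = EarlyStopping(patience=3)
decisions = [stopper.step(acc) for acc in [0.5, 0.6, 0.6, 0.6, 0.6]]
```

In a training loop, `stopper.step(...)` would be called once per epoch after evaluating on the 500 validation nodes.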

Node Classification Results
The accuracy of node classification achieved by GEL is shown in Table 3. The results of the baselines were all obtained under the same settings as our algorithm.
From Table 3, we can clearly observe that GEL consistently achieves stable improvements over the baselines across the three datasets. Specifically, GEL improves upon GCN by a margin of 5%, 5.7%, and 5.2% on Cora, Citeseer, and Pubmed, respectively. Compared to GAT, we gain 3.5%, 3.4%, and 5.2% improvements, respectively. Compared to GRAND, GEL achieves 1.1%, 0.4%, and 1.5% improvements, respectively. We observe that P-GEL also outperforms most of the baselines, though it remains below GEL. This indicates that the serialized ensemble learning method is better than the parallel ensemble learning method, and suggests that knowledge passing between models is significant.
We then conducted an ablation experiment to study the contributions of the different components of GEL. From the experimental results of the two variants M-GEL and K-GEL, we make two observations. First, the performance of every GEL variant with a component removed is significantly lower than that of the full model, demonstrating that every component of the design contributes to GEL's success. Second, GEL without the multilayer DropNode propagation (M-GEL) outperforms almost all baselines across the three datasets, illustrating the positive effect of the proposed ensemble learning based on knowledge passing for semi-supervised graph learning.

Network Visualization
We can explore the network structure in two-dimensional space through network visualization. In this experiment, we visualized the Cora network using GCN, GAT, GRAND, and the proposed method GEL and its variants. Seven visualized networks are illustrated in Figure 4, where each color denotes a class. We summarize the observations as follows:
• GCN heavily mixes red, purple, and green, as well as blue and lake blue. GAT poorly separates the boundary between green and peak green, and strongly mixes red, orange, and purple. GRAND fails to separate purple and red and cannot establish the boundary between green and peak green.
• GEL shows a remarkable ability to visualize the Cora network. It separates points of different colors and clusters points of the same color, although it still mixes red and purple.
We can also observe that the variants of GEL produce clearer visualizations than GCN, GAT, and GRAND. These results indicate that the multilayer DropNode propagation and ensemble learning on graphs are useful in network visualization.

Robustness Analysis
In this part, we examine the robustness of GEL by perturbing graphs with a random attack that adds fake edges at random. Figure 5 shows the classification accuracy of different methods when perturbing the Cora dataset with different perturbation rates. We can see that GEL outperforms GCN and GAT across all perturbation rates. Compared to the very recent GRAND model, the accuracy of the proposed method is slightly lower when the perturbation rate is below 100%, but GEL shows obvious advantages when the perturbation rate exceeds 100%. When adding 200% new random edges to the Cora dataset, the classification accuracy of GEL decreases by only 17.6%, while it decreases by 26.5% for GRAND, 28.4% for GAT, and 70.8% for GCN. This study indicates that the robustness advantage of the GEL model grows as the perturbation rate increases.
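A random attack of this kind can be sketched as follows. The sketch assumes that a perturbation rate r means adding r·|E| fake undirected edges sampled uniformly from the node pairs that are not yet connected; the function name is hypothetical, and the dense formulation is only practical for small graphs.

```python
import numpy as np

def add_random_edges(A, rate, rng):
    """Add round(rate * |E|) fake undirected edges chosen uniformly among non-edges."""
    n = A.shape[0]
    iu, ju = np.triu_indices(n, k=1)       # every unordered node pair once
    non_edge = A[iu, ju] == 0
    cand_i, cand_j = iu[non_edge], ju[non_edge]
    num_edges = int(A[iu, ju].sum())
    num_fake = min(int(round(rate * num_edges)), len(cand_i))
    chosen = rng.choice(len(cand_i), size=num_fake, replace=False)
    A_pert = A.copy()
    A_pert[cand_i[chosen], cand_j[chosen]] = 1
    A_pert[cand_j[chosen], cand_i[chosen]] = 1  # keep the matrix symmetric
    return A_pert

# Hypothetical toy graph: a path with 6 nodes (5 edges), perturbed at a 100% rate.
A = np.zeros((6, 6), dtype=np.int64)
for i in range(5):
    A[i, i + 1] = A[i + 1, i] = 1
rng = np.random.default_rng(0)
A_pert = add_random_edges(A, rate=1.0, rng=rng)
```

A rate of 2.0 corresponds to the 200% setting discussed above, where twice as many fake edges as original edges are added.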

Oversmoothing Analysis
Many GNNs suffer from oversmoothing problems. As the number of propagation steps grows, GNNs may make nodes with different labels indistinguishable. In this experiment, we studied how vulnerable GEL was to this problem on the Cora and Citeseer networks by increasing the number of propagation steps. Higher classification accuracy with more propagation steps indicates a less severe oversmoothing issue. Figure 6 presents the classification accuracy with different numbers of propagation steps on the Cora and Citeseer networks. In GEL and GRAND, the propagation step is adjusted with the hyperparameter K, while in GCN and GAT, it is controlled by the number of hidden layers. As shown in Figure 6, as the propagation step increases, the classification accuracy of GCN and GAT drops dramatically on both networks because of the oversmoothing problem. However, GEL and GRAND behave quite differently: neither is affected by the propagation step. Meanwhile, the performance of GEL is always better than that of GRAND as the propagation step increases on both networks. This suggests that GEL is more effective at alleviating the oversmoothing issue.
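The oversmoothing effect of plain deterministic propagation can be illustrated on a toy graph. The sketch below does not implement GEL; it only shows that repeated multiplication by a normalized adjacency matrix makes node features increasingly similar. The cycle graph, the one-hot features, and the function name are assumptions chosen so that the behavior is easy to verify.

```python
import numpy as np

def feature_spread(A_hat, X, K):
    """Mean distance of node features from their average after K propagation steps."""
    H = np.linalg.matrix_power(A_hat, K) @ X
    return float(np.linalg.norm(H - H.mean(axis=0, keepdims=True), axis=1).mean())

# Cycle of 7 nodes with self-loops: every node has degree 3, so the
# symmetric normalized adjacency matrix is simply (A + I) / 3.
n = 7
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
A_hat = (A + np.eye(n)) / 3.0
X = np.eye(n)  # one-hot features: all nodes start out maximally distinct
spreads = [feature_spread(A_hat, X, K) for K in (1, 2, 4, 8, 16)]
```

The values in `spreads` shrink toward zero as K grows, which is the loss of distinguishability that the experiment above measures through classification accuracy.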

Parameter Analysis
In this subsection, we investigate the parameter sensitivity. For GEL, one of the most crucial parameters is the number of models. We therefore first studied how it affected the performance of GEL on the three datasets. We set N ∈ {2, 3, 4, 5, 6, 7, 8, 9} with a fixed DropNode probability p and propagation step for each dataset. For the Cora and Citeseer networks, GEL performed best with six to eight models. For the Pubmed dataset, GEL performed best with two models.
We also studied some additional parameters, namely the propagation step K, the DropNode probability p, and the KP loss coefficient λ. We performed a hyperparameter search for each dataset. Concretely, we first searched K among {2, 3, 4, 5, 6, 7, 8, 9} for each dataset. With the best choice of K, we then searched the DropNode probability of each model. Since the models in an ensemble should be diverse, we set the DropNode probability of each model to be different. Afterwards, we studied the number of models for Citeseer and Cora again, from six to eight models. Finally, we selected λ from {0.5, 0.7, 1.0}. Other parameters included the early stopping patience, hidden layer size, L2 weight decay rate, dropout rate in the input layer, and dropout rate in the hidden layer. Tuning these parameters did not cost much time, because GEL was not sensitive to them. The optimal hyperparameters we used in our experiments are shown in Table 4.

Conclusions
In this paper, we studied the semi-supervised learning tasks on graphs and presented the graph ensemble learning based on knowledge passing (i.e., GEL) to mitigate the nonrobustness issues faced by most existing GNNs. We proposed the multilayer DropNode propagation, a strategy empowering each node to extract information from diverse subsets of its neighbors. Thus, each node's information depended not only on a single node, but also on multiple subsets of neighbors, alleviating oversmoothing issues. Then, we leveraged ensemble learning based on knowledge passing for considering the interaction among base models to avoid degrading the performance caused by the same perturbations as in prior models, alleviating nonrobustness issues. Experimental results on three datasets demonstrated that GEL outperformed other state-of-the-art baselines on semi-supervised node classification tasks, illustrating the importance of the ensemble learning and multilayer DropNode propagation used in our framework. Additional experiments on random attacks and different numbers of propagation layers showed the advantage of our algorithm with respect to robustness and oversmoothing issues.
In future work, we will improve the adaptability of our framework to various attacks, e.g., Metattack and other adversarial attacks. We will also explore whether our framework is applicable to other graph-based tasks, such as unsupervised learning tasks and some anomaly detection tasks. Another line of research would be to refine our framework so that it needs fewer models (e.g., a distillation model) for better performance.
Data Availability Statement: The three datasets used in this paper are publicly available.