Multi-Scale Aggregation Graph Neural Networks Based on Feature Similarity for Semi-Supervised Learning

The problem of extracting meaningful data through graph analysis spans a range of different fields, such as social networks, knowledge graphs, citation networks, the World Wide Web, and so on. As increasingly structured data become available, the importance of being able to effectively mine and learn from such data continues to grow. In this paper, we propose the multi-scale aggregation graph neural network based on feature similarity (MAGN), a novel graph neural network defined in the vertex domain. Our model provides a simple and general semi-supervised learning method for graph-structured data, in which only a very small part of the data is labeled as the training set. We first construct a similarity matrix by calculating the similarity of original features between all adjacent node pairs, and then generate a set of feature extractors utilizing the similarity matrix to perform multi-scale feature propagation on graphs. The output of multi-scale feature propagation is finally aggregated by using the mean-pooling operation. Our method aims to improve the model representation ability via multi-scale neighborhood aggregation based on feature similarity. Extensive experimental evaluation on various open benchmarks shows the competitive performance of our method compared to a variety of popular architectures.


Introduction
Convolutional neural networks (CNNs) [1] demonstrate state-of-the-art performance in a variety of learning tasks for processing 1D, 2D, and 3D Euclidean data, such as videos, acoustic signals, and images. However, the convolution operation is not applicable to deal with non-Euclidean data such as graphs, since each node may have a different number of adjacent nodes, and it is difficult to perform convolution operations using a convolution kernel of the same size.
In recent years, an increasing number of applications have represented data in the form of graphs. For example, in e-commerce, graph-based learning systems can leverage the interaction between users and products to make highly accurate recommendations. In chemistry, molecules are modeled as graphs whose biological activity must be identified for drug discovery. In citation networks, papers are linked through citations, so they need to be grouped. Graph deep learning models have achieved great success in modeling relational data, including link prediction [2-4], graph classification [5,6], and semi-supervised node classification [7,8].
There are many approaches for leveraging deep learning algorithms on graphs. Node embedding methods use random walks or matrix factorization to directly train an individual node embedding, often without using node features and usually in an unsupervised manner; examples include DeepWalk [9], LINE [10], and node2vec [11]. However, these are unsupervised algorithms that ignore the feature attributes of nodes, so they cannot perform node classification tasks in an end-to-end manner. Unlike these earlier methods based on random walks, the use of neural networks on graphs has been extensively studied in recent years; examples include ChebNet [12], MoNet [13], GCN [8], SSE [14], and SGC [15]. Among these, the class of message-passing algorithms has received special attention because of its flexibility and good performance.
Recently, there has been increasing research interest in applying convolutional operations on graphs. These graph convolutional networks (GCNs) [8,15] are based on a neighborhood aggregation scheme that combines information from neighborhoods to generate node embeddings. Compared with traditional methods, GCNs achieve promising performance in various tasks (e.g., node classification [8,16] and graph classification [12]). Nevertheless, GCN-based models are usually shallow, which limits the scale of the receptive field. GCN usually achieves its best classification performance when set up as a two-layer network. However, the two-layer GCN model only aggregates information from 1-hop and 2-hop neighbors for each node; due to the limitation of the receptive field, it is difficult for this model to obtain sufficient global information. Yet simply adding more layers to the GCN model degrades the classification performance. According to the explanation in [17], each GCN layer essentially acts as a form of Laplacian smoothing, and as the number of layers increases, the hidden layer representations of all nodes tend to converge to the same value, which leads to over-smoothing [17,18]. Although some methods [19,20] attempt to obtain more global information through deeper models, they are either unsupervised models or require many training examples. Consequently, they still cannot solve the semi-supervised node classification task well.
Furthermore, for the semi-supervised learning, GCN-based models use a symmetric normalized adjacency matrix as the aggregation matrix to aggregate local information. However, the normalized adjacency matrix can only simply aggregate feature information from neighboring nodes for the target node. The original feature distribution relationship between the target node and its neighboring nodes is not considered in the process of aggregating information, which will result in failure to distinguish the relative importance of neighboring nodes for the target node. These neighboring nodes whose original feature distribution is closer to the target node should have a larger aggregation weight when aggregating information.
To solve the above-mentioned issues, in this paper, we propose a multi-scale aggregation graph neural network based on feature similarity (MAGN). We first construct a similarity matrix by calculating the original feature similarity between adjacent node pairs. The similarity matrix is used as the aggregation matrix, which can distinguish the relative importance of neighboring nodes for the target node according to the original feature distribution relationship. We then utilize the similarity matrix to perform feature propagation (i.e., aggregation) of K steps. As the number of propagation steps increases, the scope of feature propagation centered on each node gradually expands, thereby capturing more global information for each node. Finally, an element-wise mean-pooling operation is applied to aggregate the outputs of feature propagation of different steps. This aggregation of multi-step (i.e., multi-scale) feature propagation can improve the model representation capability. We conducted extensive experiments on a variety of public datasets and show the competitive performance of our method compared to various popular architectures.
The rest of this article is organized as follows. Section 2 reviews the related work. In Section 3, we describe our proposed method, and then perform an experimental evaluation in Section 4. Finally, in Section 5, we summarize our contributions and future work.

Related Work
Given a graph G = (V, E), where V and E are the set of n nodes and the set of edges, respectively, let X_i denote the feature vector for node i and Y_i denote its true label. All node features (labeled and unlabeled) are represented by X = [X_1, X_2, ..., X_n]^T ∈ R^(n×c), with a c-dimensional feature vector per node. Let L denote the set of labeled nodes and Y ∈ R^(|L|×f) denote the one-hot label matrix, where f is the number of classes.

Graph-Based Semi-Supervised Learning
Generally, graph-based semi-supervised learning can be defined by the following loss function:

L = L_label + λ L_reg,

where L_label is the standard supervised loss for loss function l and L_reg is the graph Laplacian regularization term:

L_label = Σ_{i∈L} l( f(X_i), Y_i ),    L_reg = Σ_{i,j} A_ij || f(X_i) − f(X_j) ||².

L_reg ensures that connected nodes have similar model outputs, and λ ∈ R is the regularization coefficient. f(X_i) denotes the label prediction for node i, and f is learned from both labeled and unlabeled nodes simultaneously.
A represents an adjacency matrix or other graph construction and A ij denotes a certain relationship between graph nodes i and j.
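To make the regularized objective concrete, the following minimal sketch (pure Python, with a hypothetical 3-node toy graph, scalar predictions, and a squared-error supervised loss standing in for l) computes L = L_label + λ·L_reg under these definitions:

```python
# Hypothetical toy graph: 3 nodes, symmetric adjacency matrix A.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]

# Scalar model outputs f(X_i) for each node; only node 0 is labeled.
f = [0.9, 0.8, 0.1]
labeled = {0: 1.0}          # node index -> true label
lam = 0.5                   # regularization coefficient lambda

# Supervised loss over labeled nodes (squared error as loss function l).
L_label = sum((f[i] - y) ** 2 for i, y in labeled.items())

# Graph Laplacian regularization: penalizes differing outputs on connected nodes.
L_reg = sum(A[i][j] * (f[i] - f[j]) ** 2
            for i in range(3) for j in range(3))

L_total = L_label + lam * L_reg
```

Note how the regularizer pulls the unlabeled nodes' predictions toward those of their neighbors, which is exactly the smoothness assumption the paper describes.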
Graph-based semi-supervised learning has been a popular research field in the past few years. By using the graph structure to aggregate the feature information from the labeled and unlabeled nodes, learning can be done with very few labels. There are already many methods for graph-based semi-supervised learning. The label propagation algorithm [21] uses labeled node label information to predict unlabeled node label information and uses the relationship between samples to establish a complete graph model. ManiReg [22] calculates the supervised loss on the labeled nodes and calculates the unsupervised loss on all nodes using the graph Laplacian regularization. SemiEmb [23] regularizes a deep neural network with an embedding-based regularizer. Planetoid [7] is a method based on sampling, and the authors derived a sampling algorithm based on random walks to obtain the positive and negative contexts for each data point.

Graph Neural Networks
Graph neural networks (GNNs) are an extension of neural networks to structured data encoded as graphs. They update the features X_i^(t−1) of node i ∈ V in layer t − 1 by aggregating local information via

X_i^(t) = f_Θ^(t)( X_i^(t−1), { C_{i,w}^(t−1) X_w^(t−1) : w ∈ N(i) } ),

where N(i) is the set of neighbors of node i in the graph and f_Θ^(t) is a differentiable function parameterized by weights Θ^(t). In current implementations, the coefficients C_{i,w}^(t−1) are defined as either static [24], structure- [8], or data-dependent [25].
GNNs were originally introduced as extensions of recurrent neural networks. They learn a target node's representation by propagating neighbor information in an iterative manner until a stable fixed point is reached. However, as the weights are shared among all nodes, GNNs can also be interpreted as extensions of convolutional neural networks from a 2D grid to general graphs, and they aim at addressing graph-related tasks in an end-to-end manner. GNNs have been successfully applied in various applications, such as community detection [26,27], molecular activation prediction [28], matrix completion [29], combinatorial optimization [30], and detecting similar binary codes [31].

Semi-Supervised Learning with GCN
The GCN [8] model is a special case of GNNs; it is a simple but powerful architecture that stacks two layers of specific propagation and perceptron. Given the input feature matrix X and adjacency matrix A, the output of the two-layer GCN model can be defined as:

Z = softmax( Â ReLU( Â X Θ^(1) ) Θ^(2) ).

Here, Â = D̃^(−1/2) Ã D̃^(−1/2) is the symmetric normalized adjacency matrix with Ã = A + I, where I ∈ R^(n×n) is the identity matrix and D̃ is a diagonal degree matrix with D̃_ii = Σ_j Ã_ij. ReLU is the rectified linear activation function, ReLU(x) = max{0, x}, and softmax normalizes each output row into class probabilities, softmax(x)_i = exp(x_i) / Σ_j exp(x_j). The weight matrices Θ^(1) ∈ R^(c×h) and Θ^(2) ∈ R^(h×f) are trained to minimize the cross-entropy loss over all labeled examples L:

L = − Σ_{i∈L} Σ_{l=1}^{f} Y_il ln Z_il.

The GCN model combines graph structure and node features in the convolution, where the features of unlabeled nodes are mixed with those of nearby labeled nodes. Because the GCN model leverages the features of unlabeled nodes in training, it requires fewer labeled nodes to achieve good prediction results.
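As a minimal illustration of this two-layer forward pass, the sketch below (pure Python with a hypothetical toy graph and hand-picked weights; a real implementation would use a library such as PyTorch) computes Z = softmax(Â ReLU(Â X Θ^(1)) Θ^(2)):

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def relu(M):
    return [[max(0.0, x) for x in row] for row in M]

def softmax_rows(M):
    out = []
    for row in M:
        e = [math.exp(x) for x in row]
        s = sum(e)
        out.append([x / s for x in e])
    return out

def normalize(A):
    """Symmetric normalization: A_hat = D~^(-1/2) (A + I) D~^(-1/2)."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    d = [sum(row) for row in A_tilde]
    return [[A_tilde[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]

# Hypothetical toy inputs: 3 nodes, 2 features, hidden size 2, 2 classes.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Theta1 = [[1.0, -1.0], [0.5, 1.0]]   # c x h
Theta2 = [[1.0, 0.0], [0.0, 1.0]]    # h x f

A_hat = normalize(A)
H = relu(matmul(matmul(A_hat, X), Theta1))           # first GCN layer
Z = softmax_rows(matmul(matmul(A_hat, H), Theta2))   # second layer + softmax
```

Each row of Z is a probability distribution over the f classes for one node; the cross-entropy loss is then evaluated on the labeled rows only.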

The Proposed Method
In this section, we introduce our method in two steps. First, we introduce the process of calculating the similarity matrix. Then, we introduce the multi-scale aggregation graph neural network method based on the similarity matrix.
Compared with the previous graph convolution models, our method has two innovations: (i) We no longer use the adjacency matrix to participate in node feature update. We construct a similarity matrix to take the place of the adjacency matrix, which can distinguish the relative importance of neighbor nodes for the target node in the feature update; (ii) We use an average encoding with skip connections in the feature propagation of each layer, which is an aggregation of multi-scale feature propagation. Compared to previous single-scale feature propagation methods (e.g., GCN [8] and SGC [15]), this multi-scale aggregation can not only retain adequate lower-order neighbors' information, but also obtain more global information.
The flow of the proposed method is illustrated in Figure 1. The similarity matrix is obtained in advance; during training, it can be used directly. Since the adjacency matrix is readily available, we can compute the similarity matrix from the adjacency matrix and the original feature matrix and then normalize it. We then build the network architecture of the proposed method in three main steps: (i) the feature matrix is fed into a fully connected network for a linear transformation that reduces the feature dimensions; (ii) nonlinear activation is performed to obtain hidden node representations; (iii) multi-scale neighborhood aggregation is performed to generate node embeddings. A multi-layer network architecture can be generated by repeating these three steps.


Calculating Similarity Matrix
Following the notation in Section 2, X = [X_1, X_2, ..., X_n]^T ∈ R^(n×c) is the feature matrix, composed of the features of all labeled and unlabeled nodes, where X_i ∈ R^c is the c-dimensional feature vector of node i and n is the number of nodes. The graph structure is represented by the adjacency matrix A ∈ R^(n×n). Generally, the feature similarity between two nodes is measured by their feature distance: the smaller the feature distance, the greater the similarity, and vice versa. In our model, we use the Manhattan distance to calculate the feature similarity between two nodes.
For adjacent nodes i and j in the graph, the Manhattan distance between their features is

d_ij = || X_i − X_j ||_1 = Σ_{ε=1}^{c} | X_iε − X_jε |,   (6)

and the similarity coefficient between nodes i and j is computed via

α_ij = 1 / ( μ + exp(d_ij) ),   (7)

where μ is the smoothing parameter. By Equation (7), a smaller feature distance yields a larger similarity coefficient. The similarity matrix S ∈ R^(n×n) is defined by S_ij = α_ij if (i, j) ∈ E and S_ij = 0 otherwise, where (i, j) ∈ E means that nodes i and j are adjacent in the graph, and the similarity coefficient α_ij of adjacent nodes i and j is calculated by Equations (6) and (7). Algorithm 1 describes the process of calculating the similarity matrix S. Note that the input feature matrix needs to be normalized before calculating the similarity matrix; otherwise, if the similarity gap between different neighboring nodes is too great, classification accuracy drops. In fact, S can be regarded as a weighted adjacency matrix. S is used in feature propagation, where it distinguishes the relative importance of neighbors based on the similarity of original features between the target node and its neighbors; neighbors with higher similarity tend to play a more important role in feature propagation.

Algorithm 1. Calculate Similarity Matrix S
1: Input: feature matrix X ∈ R^(n×c), adjacency matrix A ∈ R^(n×n)
2: Output: similarity matrix S ∈ R^(n×n)
3: Perform normalization X ← D̂^(−1) X with diagonal matrix D̂_ii = Σ_{ε=1}^{c} X_iε
4: Initialize S with zeros
5: for i = 1 to n do
6:   for j = 1 to n do
7:     if A_ij = 1 then
8:       d_ij = Σ_{ε=1}^{c} | X_iε − X_jε |   // feature distance of nodes i and j
9:       S_ij = ( μ + exp(d_ij) )^(−1)   // feature similarity of nodes i and j
10:  end for
11: end for
12: return S

We use the similarity matrix (calculated in Algorithm 1) in the proposed architecture to perform multi-scale feature propagation, as shown in Figure 2, where X represents the feature matrix and S represents the similarity matrix.
First, we perform a linear transformation and nonlinear activation on the feature matrix X to obtain the hidden feature representation H ∈ R^(n×r), where n is the number of nodes in the graph and r is the hidden feature dimension. Next, we use the normalized similarity matrix to perform multi-scale feature propagation on the hidden representation H. Then, we use an aggregator to combine the outputs of multi-scale feature propagation into an embedding matrix H̃ ∈ R^(n×r). In Figure 2, "+" denotes the aggregator; here, we use mean-pooling. We describe the method in further detail in Section 3.2.
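The computation in Algorithm 1 can be sketched in pure Python as follows (dense matrices and a hypothetical toy graph for clarity; the paper's implementation stores S as a sparse matrix):

```python
import math

def similarity_matrix(X, A, mu=1.0):
    """Compute S per Algorithm 1: row-normalize X, then set
    S_ij = 1 / (mu + exp(d_ij)) for each adjacent pair (i, j),
    where d_ij is the Manhattan distance between normalized features."""
    n, c = len(X), len(X[0])
    # Step 3: normalize features so each row sums to 1.
    Xn = []
    for row in X:
        s = sum(row)
        Xn.append([x / s for x in row] if s else row[:])
    # Steps 4-11: similarity for adjacent node pairs only.
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if A[i][j] == 1:
                d = sum(abs(Xn[i][e] - Xn[j][e]) for e in range(c))  # Manhattan distance
                S[i][j] = 1.0 / (mu + math.exp(d))
    return S

# Hypothetical toy example: nodes 0 and 1 share the same feature profile,
# node 2 differs; both 1 and 2 are neighbors of 0.
A = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
X = [[1.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
S = similarity_matrix(X, A)
```

The more similar neighbor (node 1) receives a larger entry S[0][1] than the dissimilar neighbor (node 2), which is exactly the weighting behavior the similarity matrix is meant to provide.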

MAGN Model
In the GCN, hidden representations of each layer are aggregated among neighbors that are one hop away. This implies that after K layers, a node extracts feature information from all nodes that are K hops away in the graph. Each GCN layer has only a size-1 feature extractor, so more layers are needed to obtain adequate global information. Different from GCN, we explore a set of size-1 up to size-K feature extractors in each layer to extract multi-scale neighborhood features for node representations. If only a size-K feature extractor is used, the resulting model is linear; this linear approximation leads to information loss and degraded classification accuracy. For example, the SGC [15] model only uses a fixed size-K feature extractor. Although the training time of the SGC model is extremely low, its performance on some benchmark datasets is degraded compared with GCN. In contrast, using a set of size-1 up to size-K feature extractors (as in our MAGN) avoids the linear approximation and increases the representation ability. More importantly, our model needs fewer layers to obtain adequate global information.

We first normalize the similarity matrix S, and let S̃ denote the normalized similarity matrix:

S̃ = D^(−1) S,

where D is a diagonal matrix with D = diag(Σ_{j=1}^{n} S_1j, ..., Σ_{j=1}^{n} S_nj). For the overall model, we consider a multi-layer MAGN with the following layer-wise propagation rule:

H^(t) = (1/(K+1)) Σ_{k=0}^{K} S̃^k σ( H^(t−1) Θ^(t) ),

where H^(t−1) is the feature representation of the (t − 1)-th layer and H^(0) equals the input feature matrix X. Θ^(t) is a layer-specific trainable weight matrix, and σ denotes the ReLU activation function. Note that the input feature matrix X consists of all node features (i.e., labeled and unlabeled), so we can utilize the similarity matrix to combine feature information from labeled and unlabeled nodes to generate node embeddings. S̃^k denotes the k-th power of S̃, and we define S̃^0 as the identity matrix; S̃^1 to S̃^K represent a set of size-1 up to size-K feature extractors, which are used to extract multi-scale neighborhood features. When k > 1, calculating the k-th power of S̃ transfers similarity from 1-hop neighbors to k-hop neighbors, which is equivalent to adding an edge directly connecting each node to its k-hop neighbors. Therefore, our model can directly obtain feature information from k-hop neighbors for each node by learning the k-th power of S̃. As k increases, the scope of feature extraction (i.e., feature propagation) gradually expands, capturing more global information.
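A small sketch of this mechanism (pure Python, with a hypothetical weighted path graph and assuming the row normalization S̃ = D^(−1) S described above) shows how the powers S̃^k progressively reach k-hop neighbors:

```python
def row_normalize(S):
    """S~ = D^(-1) S, where D_ii is the i-th row sum of S."""
    return [[v / sum(row) if sum(row) else 0.0 for v in row] for row in S]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical similarity (weighted adjacency) matrix of a path graph 0-1-2-3.
S = [[0.0, 0.5, 0.0, 0.0],
     [0.5, 0.0, 0.4, 0.0],
     [0.0, 0.4, 0.0, 0.3],
     [0.0, 0.0, 0.3, 0.0]]

S1 = row_normalize(S)  # size-1 extractor: 1-hop neighbors
S2 = matmul(S1, S1)    # size-2 extractor: reaches 2-hop neighbors
S3 = matmul(S2, S1)    # size-3 extractor: reaches 3-hop neighbors
```

Node 0 has a nonzero entry for node 2 only from S̃² onward, and for node 3 only from S̃³ onward, illustrating how larger k widens the receptive field without adding layers.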
In each MAGN layer, feature representations are updated in four stages: linear transformation (i.e., feature learning), nonlinear activation, feature propagation, and multi-scale aggregation. We adopt a strategy of learning first and then propagating, using a trainable weight matrix to perform linear transformation to degrade the feature dimensions, and then perform multi-scale feature propagation on low-dimensional features. Compared with the strategy of propagating first and then learning, using this method can reduce computational complexity and shorten the training time. We describe each step in detail.
Linear transformation and nonlinear activation. Each MAGN layer first performs a linear transformation with a trainable weight matrix Θ^(t) to learn node features. Then, the nonlinear activation function ReLU is applied pointwise before outputting the hidden representation H^(t):

H^(t) = ReLU( H^(t−1) Θ^(t) ).

In particular, H^(0) = X and H^(1) = ReLU(X Θ^(1)) when t = 1.

Feature propagation and multi-scale aggregation.
After the feature transformation, we use the normalized similarity matrix S̃ to generate a set of size-1 up to size-K feature extractors for multi-scale feature propagation. Then, a mean-pooling operation is applied to aggregate the hidden representation H^(t) and the outputs of multi-scale feature propagation. In summary, the final feature-representation updating rule of the t-th layer is:

H^(t) ← (1/(K+1)) Σ_{k=0}^{K} S̃^k H^(t),

where S̃^1 H^(t) to S̃^K H^(t) denote feature propagation on different scales of the graph and can directly obtain feature information across near or distant neighbors. S̃^0 H^(t) = H^(t) is added to keep more of each node's own feature information. (S̃^k)_ij represents the probability of a k-step random walk starting at node i and ending at node j, so the k-th power of S̃ contains statistics from the k-th step of a random walk on the graph. Therefore, S̃^1 to S̃^K can combine information from different step sizes (i.e., graph scales). The output row vector of an individual node i is:

H_i^(t) = (1/(K+1)) Σ_{k=0}^{K} Σ_{j ∈ N_k(i) ∪ {i}} (S̃^k)_ij H_j^(t),

where N_k(i) is the empty set if k = 0; otherwise, it is the set of k-hop neighbors of node i.
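Putting the four stages together, one MAGN layer can be sketched as follows (pure Python with hypothetical toy sizes; a real implementation would use sparse matrix products):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def relu(M):
    return [[max(0.0, x) for x in row] for row in M]

def magn_layer(H_prev, S_norm, Theta, K):
    """One MAGN layer: linear transform, ReLU, multi-scale propagation with
    S~^0 ... S~^K, then element-wise mean-pooling over the K+1 outputs."""
    H = relu(matmul(H_prev, Theta))          # stages 1-2: feature learning
    n, r = len(H), len(H[0])
    total = [row[:] for row in H]            # k = 0 term: S~^0 H = H
    P = H
    for _ in range(K):                       # stage 3: propagate k = 1..K
        P = matmul(S_norm, P)
        total = [[total[i][j] + P[i][j] for j in range(r)] for i in range(n)]
    # stage 4: mean-pooling aggregation over the K+1 scales
    return [[v / (K + 1) for v in row] for row in total]

# Hypothetical toy example: 3 nodes on a path, 2 -> 2 features, K = 2.
S_norm = [[0.0, 1.0, 0.0],
          [0.5, 0.0, 0.5],
          [0.0, 1.0, 0.0]]            # row-normalized similarity matrix
H0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Theta = [[1.0, 0.0], [0.0, 1.0]]     # identity weights for readability
H1 = magn_layer(H0, S_norm, Theta, K=2)
```

Note that the propagation itself introduces no new parameters: only Θ^(t) is trained, while the successive products with S̃ reuse the precomputed similarity structure, which matches the parameter-efficiency point made below.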
For an individual node, its final feature representation in the t-th layer is the aggregation of multi-hops neighbors' features and its own features. It is worth noting that the propagation scheme of this model does not require any additional parameters (i.e., trainable weights) to train, in contrast to models such as GCN, which usually require more parameters for each additional propagation function. Therefore, each layer of this model can propagate farther with very few parameters.
Prediction function. The output layer is similar to GCN, and we use a softmax function to predict the labels. The class prediction Ẑ of a t-layer MAGN can be written as:

Ẑ = softmax( H^(t) ).

Loss function. The loss function is defined as the cross-entropy of the prediction over the labeled nodes:

L = − Σ_{i∈L} Σ_{l=1}^{f} Y_il ln Ẑ_il,

where L is the set of labeled nodes used as the training set and f is the number of classes. Y ∈ R^(|L|×f) is the corresponding true-label matrix for the training set: Y_il is 1 if node i belongs to class l and 0 otherwise. Ẑ_il is the predicted probability that node i is of class l.
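A minimal sketch of this loss (pure Python; hypothetical predicted probabilities for three nodes, of which only the first two are labeled):

```python
import math

# Hypothetical predicted class probabilities Z_hat (rows sum to 1) and
# one-hot labels Y for the labeled nodes only (nodes 0 and 1).
Z_hat = [[0.7, 0.3],
         [0.2, 0.8],
         [0.5, 0.5]]     # node 2 is unlabeled and does not enter the loss
Y = {0: [1, 0], 1: [0, 1]}

# Cross-entropy over the labeled set L: -sum_i sum_l Y_il * ln(Z_il)
loss = -sum(Y[i][l] * math.log(Z_hat[i][l])
            for i in Y for l in range(len(Y[i])))
```

Only the probability assigned to each labeled node's true class contributes, so the loss here reduces to −ln(0.7) − ln(0.8).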
Our model learns from the features of both labeled and unlabeled nodes simultaneously, and only uses the training set labels to calculate the loss (i.e., only the training set labels are used for learning). Therefore, our model is a semi-supervised learning method for graphs. The proposed MAGN model for semi-supervised learning is schematically depicted in Figure 3: on the left is an input graph, in the middle is a t-layer MAGN model, and on the right is an output graph, where S̃ is the normalized similarity matrix, I is the identity matrix and equals S̃^0, × is the matrix-matrix multiplication operator, and σ is the ReLU activation function. H^(t−1) is the input feature representation, H^(0) = X, and H^(t) is the output feature representation. Overall, Figure 3 shows that labeled and unlabeled nodes are used to predict the labels of unlabeled nodes via the MAGN model.

Experiments
In this section, we test our proposed MAGN model on semi-supervised node classification tasks. We first introduce the four datasets used in the experiments. Then, we list the compared methods and some implementation details. Finally, we test the classification accuracy of our model on fixed data splits and random data splits and compare it with some popular methods.


Datasets
For our experiments, we used three well-known citation network datasets: Cora and CiteSeer from [32], and PubMed from [33]. In the three citation network datasets, nodes represent documents and edges are citation links. We also introduce a co-author dataset for the node classification task: Coauthor CS (from [34]), a co-authorship graph in which nodes are authors, two authors are connected by an edge if they co-authored a paper, and the class labels indicate the most active research field of each author. All datasets use a bag-of-words representation of the papers' abstracts as features. These four datasets can be downloaded from https://github.com/shchur/gnn-benchmark/tree/master/data (accessed on 5 April 2020). The details of these four datasets are summarized in Table 1.

Implementation
In practice, we implement our method in PyTorch using sparse-dense matrix multiplications: the similarity matrix is stored as a sparse matrix, while the input feature matrix and the learnable weight matrices are dense. All experiments were conducted on a computer with an Nvidia GeForce RTX 2080 Ti GPU (11 GB GPU memory; Dell, Beijing, China). The experimental parameter settings are shown in Tables 2 and 3, where dropout [38], L2 regularization, and early stopping are used to avoid overfitting; all experimental methods used the Adam optimizer [39]. In addition, for the random data splits, we set different K values for our method on different datasets: K = 6 for Cora, K = 5 for CiteSeer and PubMed, and K = 4 for Coauthor CS.
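The core operation, multiplying the sparse similarity matrix by a dense feature matrix, can be illustrated in pure Python with a coordinate (COO) sparse layout (a hypothetical stand-in for the sparse-dense products PyTorch provides):

```python
def sparse_dense_matmul(indices, values, shape, X):
    """Multiply a sparse matrix (COO format: (i, j) index pairs with values)
    by a dense matrix X. Only nonzero entries contribute, so the cost scales
    with the number of edges rather than n^2."""
    n_rows, _ = shape
    r = len(X[0])
    out = [[0.0] * r for _ in range(n_rows)]
    for (i, j), v in zip(indices, values):
        for col in range(r):
            out[i][col] += v * X[j][col]
    return out

# Hypothetical toy similarity matrix with 3 nonzeros in a 3x3 graph.
indices = [(0, 1), (1, 0), (1, 2)]
values = [0.5, 0.5, 0.5]
X = [[2.0, 0.0], [4.0, 2.0], [0.0, 6.0]]
SX = sparse_dense_matmul(indices, values, (3, 3), X)
```

Because real citation graphs are very sparse, this edge-proportional cost is what makes repeated propagation with S̃ affordable in practice.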

Fixed Data Splits
In this first experiment, we use the fixed data splits from [7], as they are the open standard data splits in the literature. Fixed split is the most commonly used data splitting method. Many works use it as a standard split to test the classification performance of their methods. The fixed split has only one split; according to [7], all experiments use 20 nodes per class as the training set. The size of the training set is determined by the number of classes, which can ensure that the labels of all types of nodes are used for training. According to the number of classes in Table 1, we can obtain the size of training sets for Cora, CiteSeer, and PubMed to be 140, 120, and 60, respectively. The number of training nodes in these three datasets accounts for a very small proportion of the total number of nodes, which are 5.2% (Cora), 3.6% (CiteSeer), and 0.3% (PubMed), respectively. There is no standard fixed split for Coauthor CS, and we use it in random data splits. Furthermore, the validation and test sets of these three datasets keep the same size, with 500 nodes for the validation and 1000 nodes for the test. We ran 20 different random initializations for our method and the parameter settings are provided in Table 2. The experimental results with a fixed split are reported in Table 4 in percentages. The classification accuracy of all the compared methods is collected from [8,13,15,25,35,36].
From Table 4, we can see that although our method performs lower than GAT on CiteSeer, it surpasses all compared methods on Cora and PubMed. It is worth noting that in Table 4, the classification accuracies of these graph neural network algorithms are much higher than that of MLP and traditional graph-based semi-supervised learning methods. MLP is a kind of multi-layer fully connected network and cannot use graph structure information for learning. Therefore, the classification accuracy is relatively low. These traditional graph-based semi-supervised learning methods only leverage graph structure information and known label information to train the semi-supervised classifier, but the feature information of nodes is ignored in training, which leads to a lower classification accuracy. For fixed data splits, we also report the classification accuracy of our method with different K and different t (t represents the number of network layers). Except for K and t, the other experimental settings are the same as in Table 4 and are provided in Table  2. Experimental results are shown in Figure 4 in percent. Here, we report the average accuracy of running 20 different random initializations for our models. From the figure, it can be found that the model performs better when t = 2 and K ∈ [2,7]. Although the model has only two layers, it can obtain adequate global information by adjusting the value of K. From Table 4, we can see that although our method performs lower than GAT on CiteSeer, it surpasses all compared methods on Cora and PubMed. It is worth noting that in Table 4, the classification accuracies of these graph neural network algorithms are much higher than that of MLP and traditional graph-based semi-supervised learning methods. MLP is a kind of multi-layer fully connected network and cannot use graph structure information for learning. Therefore, the classification accuracy is relatively low. 
In addition, the smoothing parameter μ also has a certain impact on model performance. We therefore further test the model with different values of μ on the fixed data splits. The experimental results (in percent) are shown in Figure 5: the model performs best when μ is set to 1. A larger μ does not improve the accuracy, possibly because it makes the relative importance of neighbor nodes harder to distinguish. Figure 6 shows the t-SNE [40] visualization of the nodes of the Cora dataset: on the left, the visualization computed from the raw features; on the right, the visualization of the representations learned by a two-layer MAGN model trained with 5.2% of the labels. Colors denote node classes; we can see that the features of the different classes are well separated after training.
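The effect of a growing μ can be illustrated with a hypothetical additive-smoothing rule. The exact role of μ in MAGN follows the paper's definition of the similarity matrix; the formula below is only an assumption made for illustration:

```python
import numpy as np

def smoothed_weights(sims, mu):
    """Hypothetical additive smoothing: w_j proportional to (s_j + mu),
    row-normalized. This illustrates the trend discussed in the text;
    it is NOT the paper's exact definition of the similarity matrix."""
    w = np.asarray(sims, dtype=float) + mu
    return w / w.sum()

sims = [0.9, 0.5, 0.1]   # raw feature similarities to three neighbors
for mu in (0.1, 1.0, 100.0):
    print(mu, smoothed_weights(sims, mu))
```

As μ grows, the normalized weights approach the uniform distribution, so the relative importance of neighbors is no longer distinguished, which is consistent with the accuracy drop observed for large μ.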


Random Data Splits
Since the fixed split provides only one split, we use multiple random splits in this part to further demonstrate the competitiveness of our method. The training set differs across random splits. To ensure fair comparisons, we use the same random seeds so that our method and all compared methods have the same training, validation, and test sets in each random split.
Next, following the settings of Buchnik and Cohen [41], for Cora, CiteSeer, and PubMed we conducted experiments keeping the training, validation, and test sets the same size as in Table 4, but now selecting the nodes uniformly at random. For Coauthor CS, we similarly selected 20 nodes per class at random for training, 500 random nodes for validation, and 1000 random nodes for testing. We used 10 random seeds to generate 10 splits per dataset, and every model was run with five different random initializations on each split, for a total of 50 runs per model. All models share the same 10 random seeds, which guarantees identical training, validation, and test sets for each split. The experimental results (average accuracy and standard deviation) with random data splits are shown in Table 5; all of these experiments were run by us. For every model, we selected the experimental settings that achieved the best accuracy; these settings are provided in Table 3.

From Table 5, we can see that our method achieves the best accuracy on all datasets. MLP has the lowest classification accuracy on every dataset, mainly because it does not use graph structure information during learning and therefore cannot aggregate neighborhood information into node representations, which hurts its performance on graph node classification. The other compared methods are essentially shallow single-scale aggregation methods whose receptive field does not exceed the 2-hop neighborhood, so they have difficulty obtaining adequate global information. Our model is based on multi-scale neighborhood aggregation and needs only a small number of layers to obtain adequate global information, which leads to improved classification accuracy.
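The seeded random-split protocol described above can be sketched as follows; the helper name and implementation details are ours, not taken from the paper's code:

```python
import numpy as np

def random_split(labels, seed, per_class=20, n_val=500, n_test=1000):
    """Seeded random split: per_class training nodes from each class,
    then n_val validation and n_test test nodes from the remainder.
    A sketch of the protocol in the text, not the authors' implementation."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    # Stratified training set: per_class nodes sampled from every class.
    train = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), per_class, replace=False)
        for c in np.unique(labels)
    ])
    # Validation and test sets drawn from the remaining nodes.
    rest = rng.permutation(np.setdiff1d(np.arange(len(labels)), train))
    return train, rest[:n_val], rest[n_val:n_val + n_test]
```

Because the split is a deterministic function of the seed, passing the same seed to every model yields identical training, validation, and test sets, which is what makes the comparison fair.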
In Table 6, we further compare the number of parameters (i.e., trainable weights) that each method needs to train on each dataset. Except for SGC, the number of parameters in our method is on par with or lower than that of the other compared methods. The main reason is that our propagation scheme requires no additional trainable parameters, resulting in a relatively small total. SGC is usually a single-layer graph convolution method, as two or more layers degrade its performance, so it can be trained with few parameters.
We also tested the performance of the MAGN and GCN models with different numbers of network layers on the random data splits; the experimental results are shown in Figure 7. Both MAGN and GCN achieve their highest accuracy with two layers. Deeper networks do not improve the accuracy, which may be due to the simple bag-of-words features and the small training set size. However, it is worth noting that MAGN significantly outperforms GCN, especially when the number of network layers is 1.

Random Splits with Different Training Set Sizes
Semi-supervised learning aims to obtain good results with little training data, which can greatly reduce the cost of manual labeling. In the evaluations above, we selected 20 labeled nodes per class as the training set. To show that our method retains its classification accuracy with even less training data, we also used smaller training sets, i.e., fewer labeled nodes per class.
Next, we randomly selected 5, 10, and 15 nodes per class, respectively, as the training set and kept the validation and test sets the same size as in Table 5. Compared with Tables 4 and 5, fewer labeled nodes were used for training. We used the same 10 random seeds as in Table 5 to determine the splits, and each model was run with five different random initializations on each split. The experimental settings of all models are the same as in Table 5 and are provided in Table 3. The experimental results are shown in Tables 7-10.

Tables 7-10 show that the average accuracy of the proposed MAGN model outperforms the compared models on all datasets, and the model performance is relatively stable. This demonstrates that our model maintains good classification accuracy even with few labeled nodes.

Conclusions
In this paper, we proposed a novel method for semi-supervised learning on graph-structured data. We first constructed a similarity matrix based on the original feature distributions of adjacent node pairs. We then used the similarity matrix to generate a set of feature extractors that extract multi-scale neighborhood features. Compared with traditional graph convolution methods, our method can better distinguish the relative importance of neighbor nodes during feature propagation and, more importantly, requires only a small number of layers to obtain sufficient global information. In addition, our method can aggregate feature information from unlabeled nodes by encoding the graph structure together with node features, which benefits semi-supervised learning. Extensive experiments demonstrate that our method outperforms other state-of-the-art methods when labeled data are extremely scarce. For future work, we plan to extend our method to other (larger) graph datasets.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: All datasets of this study are publicly available at https://github.com/shchur/gnn-benchmark/tree/master/data (accessed on 23 March 2021).