Graph Dilated Network with Rejection Mechanism

Abstract: Recently, graph neural networks (GNNs) have achieved great success in dealing with graph-based data. The basic idea of GNNs is to iteratively aggregate information from neighbors, which is a special form of Laplacian smoothing. However, most GNNs suffer from the over-smoothing problem, i.e., when the model goes deeper, the learned representations become indistinguishable. This reflects the inability of current GNNs to explore the global graph structure. In this paper, we propose a novel graph neural network to address this problem. A rejection mechanism is designed to address the over-smoothing problem, and a dilated graph convolution kernel is presented to capture the high-level graph structure. Experimental results demonstrate that the proposed model outperforms the state-of-the-art GNNs and can effectively overcome the over-smoothing problem.


Introduction
Graph-structured data is ubiquitous in the real world, such as social networks [1][2][3][4][5], citation networks [6][7][8], wireless sensor networks [9], and graph-based molecules [10,11]. Recently, graph neural networks (GNNs) have attracted a surge of research interest. The goal of GNNs is to learn representation vectors of the nodes in a graph, which can then be used in many graph-based applications, such as link prediction and node classification [12][13][14][15][16]. The general idea of GNNs is "message propagation", i.e., each node iteratively passes, transforms, and aggregates messages (i.e., features) from its neighbors. After k iterations, each node captures the information of its neighbors within k hops.
Many graph neural networks have been developed. GCN [17] simplifies the localized spectral filters used in [18] into a weighted propagation of information. GraphSAGE [6] proposes several aggregation strategies to propagate messages from neighbors effectively. GAT [10] adopts self-attention [19] to dynamically propagate messages.
However, the majority of these models suffer from the "over-smoothing" problem [20,21]. Specifically, message propagation has been proved to be a type of Laplacian smoothing. Stacking too many layers (i.e., repeatedly applying Laplacian smoothing) may make the node representations indistinguishable and hurt the performance of GNNs [20]. Furthermore, a random walk view of message propagation shows that GNNs converge to a random walk distribution [21], which leads to a conclusion similar to the "over-smoothing" problem.
In fact, many GNNs are shallow neural networks with only two or three layers, so only limited neighborhood information is captured. Moreover, adding layers does not always improve the performance of GNNs, and may even hurt it [17,20]. These observations reflect the inability of current GNN models to explore the global graph structure. Therefore, although a sufficiently large neighborhood may help models capture high-level graph patterns [12,20,21], most GNNs suffer from the over-smoothing problem and can only capture the limited local structure of nodes.
In this paper, we focus on the limitations of current GNNs, i.e., the need and the bottleneck both introduced by the global information. In detail, we propose a rejection mechanism, which softly rejects information from distant nodes and frees our model from the over-smoothing problem. To further capture the global graph structure, we propose a graph dilated convolution kernel, which enlarges the size of the neighborhood at each layer.
The main contributions of this paper are summarized as follows.
• We propose a novel graph neural network model, i.e., Graph Dilated networks with Rejection mechanism (GraphDRej), to learn expressive node representations.
• We design a rejection mechanism, which is a simple but effective strategy to address the over-smoothing problem.
• We present multiple graph dilated convolution kernels to explore a sufficient size of neighborhoods in message propagation.
• Extensive experimental results show that the proposed model achieves state-of-the-art results. Also, the effectiveness of both the rejection mechanism and the graph dilated convolution kernel used in GraphDRej is demonstrated.
The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 presents the preliminaries. Section 4 analyzes the limitation of existing GNNs. Section 5 proposes the GraphDRej model, i.e., Graph Dilated Network with Rejection Mechanism. Section 6 reports the experiments. Section 7 concludes the paper.

Related Work
Recently, a vast amount of literature has focused on analyzing graph-based problems by leveraging graph neural networks [6,10,17,22,23,24]. GCN [17] is a pioneering work, which simplifies the localized spectral filters used in [18] and extracts the 1-localized information for each node in each convolution layer. Deeper relational features can then be captured by stacking multiple convolution layers. GraphSAGE [6] formalizes representation learning as aggregation and combination, and proposes several kinds of aggregation strategies. In fact, GCN can be taken as a special case of GraphSAGE. GAT [10] considers the diversity in neighborhoods, and leverages the self-attention mechanism [19] to effectively select important information in neighborhoods. GG-NN [25] designs a gate mechanism-based aggregation function, which provides a weighted average of the messages from neighbors and the center node. Besides, some works [26,27] consider imbalanced nodes in representation learning. Although these GNNs can characterize the node structure and learn the representations of nodes, most of them suffer from the over-smoothing problem.
Meanwhile, several works analyze the mechanism of graph neural networks in ways that are related to the over-smoothing problem. Li et al. (2018) [20] viewed graph neural networks as a special form of Laplacian smoothing, which is the main reason why GNNs work. At the same time, their analysis shows the limitation of current GNNs, i.e., GNNs cannot capture the global graph structure due to the over-smoothing problem. Although Li et al. (2018) [20] proposed to leverage co-training and self-training to improve GNN performance, both are designed to deal with the limited validation issue. Therefore, the over-smoothing problem mentioned by Li et al. (2018) [20] still needs to be solved. Xu et al. (2018) [21] provided another viewpoint of GNNs: the message propagation of GNNs can be considered as a modified random walk. The distribution of influence scores between two nodes converges to a stationary distribution, which also reflects the over-smoothing problem. Xu et al. (2018) [21] proposed a DenseNet-like [28] module to adaptively aggregate neighbors from different hops and to address the over-smoothing problem. The PN [29] model proposes a novel normalization layer, which is applied to the output of the graph convolution layer to address the over-smoothing problem. Normalizing the representations of nodes prevents them from becoming too similar. MADReg [30] provides a topological view of the over-smoothing problem and introduces a MADReg loss to prevent the representations of distant nodes from becoming similar to those of neighbor nodes. Experimental results (see Section 6.5) show that MADReg can only relieve rather than prevent this problem. Although some of these works can solve the over-smoothing problem, none of them considers how to capture a suitably larger neighborhood to learn better node representations once GNNs are free of the over-smoothing problem.
In this paper, GraphDRej can not only overcome the over-smoothing problem by a rejection mechanism, but also better characterize node structures by multiple graph dilated convolution kernels. Related experimental results are shown in Sections 6.4 and 6.5.

Preliminaries
We begin by summarizing the common GNN models and, along the way, introduce our notation. We represent a graph G as (V, E), where V denotes the node set and E is the edge set. Let A be the adjacency matrix of G. The k-hop neighborhood $N_k(v)$ of node v is the set of all nodes that reach node v in exactly k steps. By default, $N(v)$ is the 1-hop neighborhood of node v. Each node v ∈ V is associated with a feature vector $X_v$ and a label $y_v$.
Most GNNs aim to learn a node representation mapping function by using the graph structure and node features. Current GNNs follow a "message propagation" scheme, where at each iteration, each node aggregates the information (i.e., "messages") from its neighbors. Therefore, after k iterations of propagation, the node representation captures the k-hop structure information. Formally, at the k-th layer, a GNN can be expressed as

$$m_v^{(k)} = \mathrm{AGGREGATE}\left(\left\{h_u^{(k-1)} : u \in N(v)\right\}\right), \tag{1}$$

$$h_v^{(k)} = \mathrm{COMBINE}\left(h_v^{(k-1)}, m_v^{(k)}\right), \tag{2}$$

where $h_v^{(k)}$ is the representation (i.e., "message") of node v at the k-th layer, which captures the k-hop structure information, and $m_v^{(k)}$ can be taken as the integrated message representation from neighbors. The initial message is $h_v^{(0)} = X_v$. AGGREGATE is a message propagation function that aggregates information from neighbors, and COMBINE is a function that combines the information from different hops. An illustration is shown in Figure 1. All the information from the neighbors (i.e., nodes 2-6) of the center node 1 is merged by AGGREGATE. Then, COMBINE combines the information from the neighbors and the center node 1.

Different models have different propagation functions. For example, Graph Convolutional Network (GCN) [17] proposes a degree-normalized algorithm to aggregate neighbors' messages. The propagation rule (i.e., Equation (1)) of GCN can be presented as

$$h_v^{(k)} = \sigma\left(W^{(k)} \cdot \sum_{u \in \{v\} \cup N(v)} \frac{h_u^{(k-1)}}{\sqrt{\deg(v)\,\deg(u)}}\right), \tag{3}$$

where $W^{(k)}$ is the weight matrix and $\deg(\cdot)$ denotes the node degree. AGGREGATE and COMBINE are fused into one function (i.e., u ∈ N(v) in Equation (1) and u ∈ {v} ∪ N(v) in Equation (3)), and σ is a nonlinear function.
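To make the propagation scheme concrete, the following minimal sketch runs two rounds of AGGREGATE followed by COMBINE on a toy graph. It is an illustration under stated assumptions, not the implementation used in the paper: the graph, the features, the mean aggregator, and the plain element-wise sum used for COMBINE (weights and nonlinearity omitted) are all hypothetical choices.

```python
# Toy undirected graph as an adjacency list (hypothetical example).
neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0], 3: [0.0, 0.0]}


def aggregate(messages):
    """Mean AGGREGATE over a non-empty list of neighbor vectors."""
    dim = len(messages[0])
    return [sum(m[d] for m in messages) / len(messages) for d in range(dim)]


def combine(h_self, m_neigh):
    """COMBINE as an element-wise sum (an assumed, simplified form)."""
    return [a + b for a, b in zip(h_self, m_neigh)]


def propagate(h, neighbors):
    """One layer: aggregate the neighbors' messages, then combine with the node itself."""
    new_h = {}
    for v, nbrs in neighbors.items():
        m_v = aggregate([h[u] for u in nbrs])
        new_h[v] = combine(h[v], m_v)
    return new_h


h1 = propagate(features, neighbors)  # each node now sees its 1-hop neighbors
h2 = propagate(h1, neighbors)        # each node now sees nodes within 2 hops
```

After one round, node 1 has only absorbed node 0; node 3, which is three hops away, can influence node 1 only after three rounds.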

Limitation of Existing GNNs
Although GNN models significantly outperform many state-of-the-art methods on some benchmarks, there remain some problems that limit the performance of GNNs.
In fact, the message propagation scheme in GNNs can be viewed as a random walk [21]. The influence score of node u on node v, which measures the effect of the input feature of node u on the representation of node v, can be defined as the sum of the absolute values of the entries of the corresponding Jacobian matrix [21]:

$$I(v, u) = \sum_{i}\sum_{j} \left| \left[ \frac{\partial h_v^{(k)}}{\partial h_u^{(0)}} \right]_{ij} \right|. \tag{4}$$

Then, the expected distribution of the normalized influence score follows a slightly modified k-step random walk distribution P(u|v, k) starting at the root node v [21]. Thus, the message propagates from node u to node v in a random walk pattern.
As we know, the node sequence $(v_t : t = 0, 1, \ldots)$ generated by a random walk is a Markov chain. When k → ∞, the probability distribution P(u|v, k → ∞) converges to a stationary distribution (i.e., P(o|x) ≡ P(o|y) for any nodes x, y, o ∈ V) [31]. This means that, for the message propagation scheme, the representation of each node is influenced almost equally by every other node (i.e., I(x, o) ≈ I(y, o)) [21] when GNNs go deeper (i.e., for a large value of k). In other words, the node representations may be over-smoothed and lose their focus because of the information from distant nodes [20], which is called the over-smoothing problem.
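The convergence argument can be checked numerically. The sketch below is only an illustration: it assumes a plain random walk with self-loops on a hypothetical four-node path graph, which is a simplification of the "slightly modified" walk discussed in [21]. It compares how much the k-step distributions of different start nodes differ for a small and a large k.

```python
def transition_matrix(adj):
    """Row-normalize (A + I): a random walk that may also stay in place."""
    n = len(adj)
    rows = []
    for i in range(n):
        row = [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        total = sum(row)
        rows.append([x / total for x in row])
    return rows


def matmul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def k_step(p, k):
    """k-step transition probabilities, i.e., the k-th power of p (k >= 1)."""
    result = p
    for _ in range(k - 1):
        result = matmul(result, p)
    return result


def max_row_gap(m):
    """Largest entry-wise difference between any two rows (0 means identical rows)."""
    n = len(m)
    return max(abs(m[x][o] - m[y][o])
               for x in range(n) for y in range(n) for o in range(n))


# Path graph 0-1-2-3 (hypothetical toy example).
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
p = transition_matrix(adj)
shallow = k_step(p, 2)
deep = k_step(p, 64)
gap_shallow = max_row_gap(shallow)  # start nodes still see different distributions
gap_deep = max_row_gap(deep)        # all start nodes see nearly the same distribution
```

For small k the rows differ strongly, whereas for large k every row approaches the same stationary distribution, which mirrors P(o|x) ≡ P(o|y).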
Therefore, one of the limitations of existing GNNs is that most GNN models cannot go deeper. A deeper version of these models even performs worse than a shallow version, and the best performance of these models is usually achieved within two or three layers [17]. Although a sufficient size of neighborhoods is especially important, which allows the models to explore a more complex graph structure and aggregate useful neighbors' information [20,21], existing GNNs can only capture limited structure information in a small size of the neighborhoods.
In summary, on the one hand, when we stack too many layers in a GNN model, the node representations are over-smoothed and become indistinguishable. On the other hand, a sufficiently large neighborhood is always needed to learn a more effective representation [20,21]. Therefore, in this paper, we propose a new neural network architecture that tackles the over-smoothing problem and captures a larger neighborhood at the same time.

Graph Dilated Network with Rejection Mechanism
In this section, we introduce the proposed model, called Graph Dilated network with Rejection mechanism (GraphDRej). To overcome the conflict of GNNs just shown in the preceding section, we propose two main components, i.e., the rejection mechanism and the graph dilated convolution kernel. Generally speaking, at each layer, each node can aggregate messages from a large neighborhood via graph dilated convolution kernels, and a hop penalty parameter is introduced to implicitly reject the information from distant nodes (addressing the over-smoothing problem). At the last layer, a cross-entropy loss is adopted to optimize the parameters of the proposed model. Figure 2 illustrates the architecture of the proposed model.

Method
To address the over-smoothing problem, we propose a simple but effective Rejection Mechanism (RM). As discussed in Section 4, the over-smoothing problem is caused by averaging too much information from distant neighbors. Therefore, we introduce a learnable hop penalty parameter $c^{(k)} \in \mathbb{R}$ (optimized by backpropagation) at each layer to adaptively control the messages flowing from one layer to the next (i.e., from one hop to the next). In this way, the model can reject the messages from distant nodes to address the over-smoothing problem.
Different from GAT, which dynamically selects messages from 1-hop neighbors, the rejection mechanism is a layer-wise (i.e., hop-wise) operation. Specifically, the rejection mechanism focuses on the combination function (i.e., Equation (2)) introduced in the message propagation. At the k-th layer, the hop penalty parameter $c^{(k)}$ is applied to the integrated message representation from neighbors, and the combination function can be rewritten as

$$h_v^{(k)} = \sigma\left(W^{(k)} \cdot \left(h_v^{(k-1)} \oplus \left(\hat{c}^{(k)} \odot m_v^{(k)}\right)\right)\right), \tag{5}$$

where $\hat{c}^{(k)} \in [0, 1]$ is the rescaled penalty parameter, and $\odot$ means that each element of the vector $m_v^{(k)}$ is multiplied by the rescaled hop penalty parameter $\hat{c}^{(k)}$. Intuitively, the value of $\hat{c}^{(k)}$ can be regarded as a gate that controls the message propagation from neighbors to the center node. ⊕ is an element-wise addition operation.
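As a rough illustration of the combination with RM, the sketch below scales the neighbor message by a rescaled penalty before adding it to the center message. The sigmoid used for rescaling is an assumption (the text only requires the rescaled value to lie in [0, 1]), the weight matrix and nonlinearity are omitted, and the function names are hypothetical.

```python
import math


def rescale(c):
    """Map an unconstrained penalty to [0, 1]; the sigmoid is an assumed choice."""
    return 1.0 / (1.0 + math.exp(-c))


def combine_with_rejection(h_self, m_neigh, c):
    """Add the penalized neighbor message to the unchanged center message."""
    gate = rescale(c)
    return [h + gate * m for h, m in zip(h_self, m_neigh)]


h_v = [1.0, 2.0]   # center node message
m_v = [4.0, -2.0]  # integrated message from neighbors
open_gate = combine_with_rejection(h_v, m_v, c=20.0)     # gate close to 1
closed_gate = combine_with_rejection(h_v, m_v, c=-20.0)  # gate close to 0
```

With a large penalty parameter the neighbor message passes almost unchanged, and with a strongly negative one it is almost fully rejected, so the node keeps its own representation.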
To fully understand the proposed rejection mechanism, we provide an example (shown in Figure 3) to illustrate how it addresses the over-smoothing problem. For simplicity, we take a chain graph containing four nodes (nodes 0 to 3) as an example and consider only the predecessor of each node as its 1-hop neighbor. Consider a three-layer graph neural network that adopts the mean aggregation function and the proposed combination function with RM. The propagation in the k-th layer for this chain graph can be expressed as (ignoring the weight matrix and nonlinear function in Equation (5))

$$h_i^{(k)} = h_i^{(k-1)} + c^{(k)} h_{i-1}^{(k-1)},$$

where i, k ∈ {1, 2, 3}, the accent on the rescaled parameter in Equation (5) is omitted for convenience, and $c^{(k)}$ refers to the rescaled parameter. Node 0 has no predecessor, so its representation stays $h_0^{(0)}$. Then, the representation of node 3 obtained from the last layer can be written as

$$h_3^{(3)} = h_3^{(0)} + \left(c^{(1)} + c^{(2)} + c^{(3)}\right) h_2^{(0)} + \left(c^{(1)}c^{(2)} + c^{(1)}c^{(3)} + c^{(2)}c^{(3)}\right) h_1^{(0)} + \left(\prod_{k=1}^{3} c^{(k)}\right) h_0^{(0)}.$$

This reveals that the influence of the distant node 0 on the representation of node 3 is penalized by $\prod_{k=1}^{3} c^{(k)}$. Since $c^{(k)} \in [0, 1]$, repeated multiplication leads to a small value, which adaptively controls the messages flowing from distant nodes (e.g., node 0) to the target node (e.g., node 3).
Therefore, the benefits of the proposed rejection mechanism can be summarized as follows.
(1) When stacking layers to build a deep GNN, the information from distant nodes is more likely to be penalized or even rejected. It is a simple but effective way to address the over-smoothing problem.
(2) The information from distant nodes can still affect the node representation, which helps to capture the global structure.
(3) The combination of the hop penalty parameters adaptively aggregates information from different hops, which contributes to building a more powerful deep graph model.
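The chain-graph example can be replayed numerically. The sketch below tracks, for every node, the coefficient of each node's input feature after every layer, using the simplified update in which a node adds its predecessor's representation scaled by the layer penalty (weights and nonlinearity ignored, as in the example). The penalty values are hypothetical.

```python
def chain_coefficients(penalties, num_nodes=4):
    """Return coeffs, where coeffs[i][j] is the weight of node j's input
    feature in node i's final representation on a chain 0 -> 1 -> ... -> n-1.

    Each layer applies h_i <- h_i + c * h_(i-1); node 0 has no predecessor.
    """
    coeffs = [[1.0 if i == j else 0.0 for j in range(num_nodes)]
              for i in range(num_nodes)]
    for c in penalties:
        updated = [row[:] for row in coeffs]
        for i in range(1, num_nodes):
            updated[i] = [own + c * pred
                          for own, pred in zip(coeffs[i], coeffs[i - 1])]
        coeffs = updated
    return coeffs


c1, c2, c3 = 0.5, 0.4, 0.2  # hypothetical rescaled penalties, one per layer
node3 = chain_coefficients([c1, c2, c3])[3]
# node3[0] is the weight of the most distant node 0: the product c1 * c2 * c3.
# node3[2] is the weight of the direct predecessor: the sum c1 + c2 + c3.
```

The weight of node 0 is the product of the three penalties, while the weight of the direct predecessor is their sum, so distant information shrinks much faster than nearby information.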

Discussion
We notice that the graph neural network GG-NN [25] proposes a gate mechanism to aggregate neighbors, which is similar to the rejection mechanism proposed in our paper. The differences between these two mechanisms are summarized as follows.
(1) The motivations of these two mechanisms are different. For GG-NN, the gate mechanism is proposed to dynamically aggregate neighbors' information based on the information of neighbors and the center node. For GraphDRej, the rejection mechanism is proposed to reject the information from distant nodes to address the over-smoothing problem.
(2) The rejection mechanism is simpler and requires little computation, whereas for GG-NN, the gate computation involves several matrix multiplication and addition operations [25].
(3) The rejection mechanism is a more direct way that strictly limits the information from the distant nodes by the penalty parameters, whereas in GG-NN, according to Equation (6) in the original paper of GG-NN [25], not only the messages from the neighbors but also the message from the center node is rescaled. This means the gate mechanism in GG-NN is more like a weighted average of the messages from neighbors and the center node, and the messages from the center node may also be rejected. Experimental results (see Section 6.6) show that GG-NN still suffers from the over-smoothing problem.
To further analyze the connection between the rejection mechanism and the gate mechanism, we propose a gate-based version of GraphDRej (denoted as GraphDRej-Gate). The only difference between GraphDRej and GraphDRej-Gate is the computation of the penalty parameter. The penalty parameter $c^{(k)}$ in GraphDRej is shared by all the nodes in the same layer, whereas the penalty parameter in GraphDRej-Gate is computed by a gate function, where $W^{(k)}$ and $U^{(k)}$ are the transformation matrices. Compared with GG-NN, the rejection (gate) mechanism in GraphDRej-Gate is only applied to the neighbors' messages, which helps to address the over-smoothing problem (see Section 6.6). Note that although the gate mechanism is common in other research areas, how to apply it to GNNs to address the over-smoothing problem is still an open question. For example, GG-NN falls into the over-smoothing problem, while GraphDRej-Gate can solve this problem. In fact, the rejection mechanism is a more general idea that points to a direction for addressing this problem, and the penalty parameter can be a simple learnable parameter or can be computed by a gate function.
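One possible reading of the gate-based penalty is sketched below. The exact gate function of GraphDRej-Gate is not reproduced here; the form used (a sigmoid over a linear score of the center message and the neighbor message, with weight vectors `w` and `u` standing in for the transformation matrices) is an assumption made purely for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def node_gate(h_self, m_neigh, w, u):
    """Per-node penalty in [0, 1] (assumed form: sigmoid of a linear score)."""
    score = (sum(a * b for a, b in zip(w, h_self))
             + sum(a * b for a, b in zip(u, m_neigh)))
    return sigmoid(score)


def combine_with_gate(h_self, m_neigh, w, u):
    """Only the neighbor message is rescaled; the center message passes unchanged."""
    gate = node_gate(h_self, m_neigh, w, u)
    return [h + gate * m for h, m in zip(h_self, m_neigh)]


h_v = [1.0, 2.0]
m_v = [4.0, -2.0]
# With zero gate weights the score is 0, so the gate is exactly 0.5.
half_open = combine_with_gate(h_v, m_v, w=[0.0, 0.0], u=[0.0, 0.0])
```

The key design point carried over from the text is that the gate touches only the neighbors' message, in contrast to a weighted average that could also suppress the center node.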

Graph Dilated Convolution Kernel
To obtain a sufficiently large neighborhood in message propagation, we borrow the idea of the dilated convolution kernel [32], which is used in computer vision to enlarge the receptive field. We propose a graph dilated convolution kernel, which changes the standard, single message propagation scheme (i.e., one layer for searching 1-hop neighbors) and brings diversity to the propagation schemes.
Specifically, we introduce a graph dilation rate γ, which refers to the distance between the node and its neighbor nodes. Figure 4 illustrates examples of graph dilated convolution kernels with γ ∈ {1, 2, 3}. The standard graph convolution kernels used in GNNs can be taken as a special case of our graph dilated convolution kernel where γ = 1. A larger value of γ allows the model to explore nodes from a larger neighborhood. For example, a two-layer stacked graph dilated convolution (γ = 3) network can view neighbors within six hops, whereas a standard two-layer stacked graph convolution network only considers neighbors within two hops. Therefore, we can enlarge the neighborhood size via graph dilated convolution kernels.

As presented above, different graph dilated convolution kernels take nodes in different hops as the neighborhoods. To have a diverse and sufficient size of neighborhoods, multiple types of graph dilated convolution kernels are simultaneously applied to the center node. We therefore further propose Multi Graph Dilated Convolution Kernels (MGDCK). At each layer k, multiple types of graph dilated convolution kernels are adopted. Each node v may aggregate the messages not only from 1-hop neighbors (i.e., $N(v)$), but also from 2-hop neighbors (i.e., $N_2(v)$). Formally, when each layer contains T types of graph dilated convolution kernels, the AGGREGATE and COMBINE functions can be expressed as

$$m_{v,t}^{(k)} = \mathrm{AGGREGATE}\left(\left\{h_u^{(k-1)} : u \in N_{\gamma_t}(v)\right\}\right),$$

$$h_{v,t}^{(k)} = \mathrm{COMBINE}\left(h_v^{(k-1)}, m_{v,t}^{(k)}\right),$$

where t ∈ {1, 2, 3, . . . , T}, $\gamma_t$ is the dilation rate of the t-th kernel, COMBINE refers to the combination function with RM mentioned in Section 5.1, and AGGREGATE can be implemented in various ways. In this paper, we adopt a mean aggregation strategy. Other aggregation functions can also be applied.

Then, the representation of node v at the k-th layer is the mean of all the representation vectors produced by the T graph dilated convolution kernels:

$$h_v^{(k)} = \frac{1}{T}\sum_{t=1}^{T} h_{v,t}^{(k)}.$$

Note that other pooling strategies such as max-pool, min-pool, and attention-pool can also be used to summarize the representations from these T graph dilated convolution kernels. Here, we only take the mean-pool as an example.
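A minimal sketch of one layer with multiple dilated kernels follows. It makes several assumptions that are not fixed by the text: the γ-hop neighborhood is taken as the set of nodes at shortest-path distance exactly γ (found by breadth-first search), a kernel with an empty neighborhood contributes a zero message, the penalty is passed in already rescaled, and weights and nonlinearities are omitted. The graph and features are hypothetical.

```python
from collections import deque


def hop_neighbors(adj, v, gamma):
    """Nodes whose shortest-path distance from v is exactly gamma (via BFS)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        x = queue.popleft()
        if dist[x] == gamma:
            continue  # no need to expand beyond the requested distance
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return sorted(u for u, d in dist.items() if d == gamma)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]


def dilated_layer(h, adj, gammas, penalty):
    """One layer with T = len(gammas) kernels, mean-pooled over the kernels."""
    new_h = {}
    for v in adj:
        per_kernel = []
        for gamma in gammas:
            nbrs = hop_neighbors(adj, v, gamma)
            # Assumption: an empty gamma-hop neighborhood yields a zero message.
            m = mean_vector([h[u] for u in nbrs]) if nbrs else [0.0] * len(h[v])
            per_kernel.append([a + penalty * b for a, b in zip(h[v], m)])
        new_h[v] = mean_vector(per_kernel)
    return new_h


# Path graph 0-1-2-3-4 with scalar features equal to the node id (toy example).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
h0 = {i: [float(i)] for i in adj}
out = dilated_layer(h0, adj, gammas=[1, 2], penalty=1.0)
```

With γ ∈ {1, 2}, node 0 combines one message from node 1 and one from node 2 in a single layer, which a standard kernel would need two layers to reach.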
Although directly stacking layers can also enlarge the size of neighborhoods, too many layers may make the model difficult to train. The experimental results also show the effectiveness of graph dilated kernels compared with the direct stacking strategy (see Section 6.7).

Training
In this section, we introduce the training details of GraphDRej, including the loss function and the overall algorithm.

Loss Function
We follow the loss used in standard GNNs [6,10,17]. The loss function of GraphDRej is a supervised loss, i.e., making a prediction for the label of each training node. The supervised loss is defined as the cross-entropy between the label prediction $z_v$ of each training node v and its label $y_v$. Here, $W_z$ is a weight matrix, h is the output of the last layer and is taken as the learned node representation vector, and $z_v$ is the label prediction of node v in the training nodes.
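The supervised loss can be sketched as follows. This is a generic softmax cross-entropy under stated assumptions, not necessarily the exact formulation of the paper: the prediction is assumed to be a softmax over a linear map of the last-layer representation, no bias term is used, and the loss is averaged over the training nodes. The names `predict` and `cross_entropy` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(w_z, h_v):
    """Label prediction for one node: softmax of a linear map of h_v (assumed form)."""
    logits = [sum(w * x for w, x in zip(row, h_v)) for row in w_z]
    return softmax(logits)


def cross_entropy(predictions, labels):
    """Mean negative log-probability of the true class over the given nodes."""
    return -sum(math.log(z[y]) for z, y in zip(predictions, labels)) / len(labels)


# Three classes, two-dimensional node representations (toy values).
w_z = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
z_v = predict(w_z, [1.0, 2.0])   # uniform prediction because w_z is all zeros
loss = cross_entropy([z_v], [1])  # equals log(3) for a uniform prediction
```

An untrained classifier with zero weights predicts every class with equal probability, so its loss equals the logarithm of the number of classes; training should push the loss below this value.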

Overall Algorithm
The overall algorithm of GraphDRej is summarized in Algorithm 1. First, in each iteration, we sample a batch of nodes from the training nodes in Line 4. At each layer k, multiple graph dilated convolution kernels are applied to aggregate information from neighbors (Lines 8-9), and a rejection mechanism based combination is adopted to combine the information from the neighbors and the center node (Lines 10-11). Then, the node representation vector is updated by averaging the representations produced by different graph dilated convolution kernels (Line 13). We calculate the label prediction and the cross-entropy loss in Lines 16-17. After the forward propagation (Lines 4-17), backward propagation is carried out to update the parameters in Line 18. Finally, after convergence, we take the last layer output as the embeddings of nodes in Line 20.

Algorithm 1 (Lines 16-18):
16 Calculate the prediction label of each node in B according to Equation (13);
17 Calculate the loss function according to Equation (14);
18 Backward propagate and update parameters in GraphDRej;

Experiments
We evaluate the benefits of GraphDRej against a number of state-of-the-art graph neural networks with the goal of answering the following questions.

Q1. How does GraphDRej perform in comparison to the state-of-the-art GNNs?
Q2. Can the proposed rejection mechanism address the over-smoothing problem?
Q3. How much improvement is provided by multiple graph dilated convolution kernels?

Data Sets
To adequately evaluate the performance of our model and baselines, the experiments are conducted on three real-world data sets.
Cora is a research paper-based graph. It contains 2708 machine learning papers from seven classes and 5429 links between them. The links are citation relationships among the papers. Each paper is described by a binary vector of 1433 dimensions, indicating the presence of the corresponding words.
Citeseer is another research paper-based graph that contains 3327 publications from six classes and 4732 links between them. Similar to Cora, the links are citation relationships among these papers, and each paper is described by a binary vector of 3703 dimensions.
Pubmed is also a citation graph, and contains 19,717 papers from three classes as well as 44,338 links between them. Similar to Cora, the links are citation relationships among the papers, and each paper is described by a binary vector of 500 dimensions.
The statistics of the data sets are presented in Table 1.

Baselines
In the performance comparison, we consider the state-of-the-art baselines based on GNNs.
GCN [17] is a graph convolution neural network, which simplifies the localized spectral filters used in [18], and extracts the 1-localized information for each node in each convolution layer.
GraphSAGE [6] extends GCN in a more general way, and introduces two basic functions, i.e., the aggregation function and the combination function.
GAT [10] employs the idea of self-attention [19] to filter important messages from neighbors.
JK [21] deals with the over-smoothing problem by using a DenseNet-like [28] module to adaptively aggregate neighbors' information from different hops.
PN [29] proposes a novel normalization layer, which is applied to the output of the graph convolutional layer. In this way, PN can address the over-smoothing problem.
MADReg [30] also focuses on the over-smoothing problem. It takes the gap of MAD values (MADGap) as a regularization term to prevent the representations of distant nodes from becoming similar.
GG-NN [25] proposes a gate mechanism to aggregate the messages from neighbors.
GraphDRej-Gate is a gate version of GraphDRej, in which the penalty parameter is calculated by a gate function.

Experimental Setting
We set the dimension of the representation vector to 16 for all models. Moreover, we use the pooling aggregation in GraphSAGE, which achieves the best performance compared to other aggregation strategies (as discussed in [6]). For JK, we use max-pooling for the same reason. All models are trained using the Adam optimizer [33] with an initial learning rate of 0.01. We use a dropout rate of d = 0.2 for all layers. For all data sets, we take 20% of the nodes as the training nodes, 40% as the validation nodes, and the remaining 40% as the test nodes. We use an early stopping strategy on both the model loss and the accuracy score on the validation nodes, with a patience of 20 epochs.
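The data split and the early stopping rule can be sketched as below. Two details are assumptions because the text does not specify them: the split is drawn by a seeded random shuffle, and training stops once neither the validation loss nor the validation accuracy has improved for `patience` consecutive epochs. The helper names are hypothetical.

```python
import random


def split_nodes(num_nodes, train_frac=0.2, val_frac=0.4, seed=0):
    """Shuffle node ids and cut them into train / validation / test lists."""
    ids = list(range(num_nodes))
    random.Random(seed).shuffle(ids)
    n_train = int(num_nodes * train_frac)
    n_val = int(num_nodes * val_frac)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


class EarlyStopping:
    """Signal a stop when validation loss and accuracy both stop improving."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_acc = float("-inf")
        self.bad_epochs = 0

    def step(self, val_loss, val_acc):
        """Record one epoch; return True when training should stop."""
        improved = val_loss < self.best_loss or val_acc > self.best_acc
        self.best_loss = min(self.best_loss, val_loss)
        self.best_acc = max(self.best_acc, val_acc)
        self.bad_epochs = 0 if improved else self.bad_epochs + 1
        return self.bad_epochs >= self.patience


# Cora has 2708 nodes, so the 20/40/40 split yields 541 / 1083 / 1084 nodes.
train_ids, val_ids, test_ids = split_nodes(2708)
stopper = EarlyStopping(patience=20)
```

In a training loop, `stopper.step(val_loss, val_acc)` would be called once per epoch after evaluating on the validation nodes.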

Results for Node Classification Task
To fully evaluate the performance of GraphDRej (including RM and MGDCK) and the baselines, we report not only the classification accuracy of these models with the same two layers (shown in Table 2) but also the best results achieved by these models with different numbers of layers (presented in Table 3). These results answer question Q1 positively: GraphDRej significantly outperforms previous models both on the two-layer results and on the best results over all data sets. For the two-layer models, GraphDRej achieves an improvement ranging from 0.68% to 6.18% over all data sets. Also, compared with the best results, GraphDRej achieves gains from 0.15% to 4.77% over all data sets. Furthermore, compared with the GNNs that consider the over-smoothing problem (i.e., JK, PN, and MADReg), GraphDRej also achieves the best performance.
Although GG-NN adopts a gate mechanism that is similar to our proposed rejection mechanism, GraphDRej also outperforms GG-NN (further analysis can be found in Section 6.6). These analyses show the effectiveness of GraphDRej in learning more meaningful node representations.
Table 2. Node classification accuracy (%) for two-layer graph neural networks (GNNs). The list in parentheses next to the accuracy of GraphDRej refers to the value set of γ adopted by the corresponding GraphDRej on different data sets.
Table 3. The best performance for GNNs on the node classification task. For baselines, the number in parentheses next to the accuracy indicates the best performing number of layers. For GraphDRej, the first number in parentheses has the same meaning as that in baselines, and the second list in parentheses refers to the value set of γ adopted by the corresponding GraphDRej.

Evaluation of Rejection Mechanism
Actually, the proposed rejection mechanism (RM) is a general method to address the over-smoothing problem. Therefore, we apply the RM to other GNN models to evaluate whether RM can help these GNNs to prevent the over-smoothing problem (i.e., answering the question Q2). Additionally, we also provide the results of GraphDRej to show the effectiveness of RM. Furthermore, to avoid the influence of multiple dilated convolution kernels, for GraphDRej, we apply a single dilated convolution kernel with γ = 1 (i.e., a GCN-like kernel). The results on Cora in terms of classification accuracy are presented in Figure 5 (similar performance trends are also observed on other data sets).
Some observations can be summarized as follows.
Evaluation with the standard GNNs. First, we evaluate the RM with the standard GNNs, which suffer from the over-smoothing problem. As shown in Figure 5a-d, when we gradually increase the number of layers, most of the original models fall into the over-smoothing problem. Specifically, for a deeper model (e.g., 9 or 10 layers in Figure 5), the performance of most of these models decreases sharply, which also supports the discussion in Section 4. When we apply RM to these failed models, all of them overcome the over-smoothing problem in their deep versions and achieve high performance. This indicates that the rejection mechanism is indeed beneficial to solving the over-smoothing problem.
Evaluation with the GNNs that also address the over-smoothing problem. Then, we evaluate the RM with the GNNs (i.e., JK [21], PN [29], and MADReg [30]) that also address the over-smoothing problem.
(1) As seen in Figure 5e-f, both JK and PN can solve the over-smoothing problem well, as their performance does not drop sharply as the number of layers increases.
(2) As shown in Figure 5g, the over-smoothing problem still seems to exist in MADReg. Similar conclusions can be found in the original paper of MADReg [30]. Specifically, Table 5 in [30] shows that although MADReg can improve the performance of the standard GNNs, the performance of a deep GNN with MADReg is far below that of its shallow version.
(3) Then, we apply RM to these GNNs (including JK, PN, and MADReg). JK can still be improved by RM. PN-Rej achieves performance comparable to PN. As discussed above, MADReg cannot address the over-smoothing problem well. With the help of RM, MADReg-Rej achieves a large improvement and addresses the over-smoothing problem.

Evaluation between Rejection Mechanism and Gate Mechanism
To further analyze the rejection mechanism, we conduct experiments to compare the performance of the rejection mechanism and the gate mechanism. To avoid the influence of multiple graph dilated convolution kernels, the variants of GraphDRej (including GraphDRej and GraphDRej-Gate) used in this section only adopt a single kernel, i.e., γ = 1. The node classification results with different numbers of layers are reported in Table 4. The performance of GG-NN drops quickly when the number of layers increases. This indicates that although GG-NN adopts a gate mechanism, the over-smoothing problem still exists, whereas GraphDRej-Gate successfully overcomes this problem and achieves performance comparable to GraphDRej. This shows that the main factor in solving the over-smoothing problem is rejecting the messages from distant nodes with the penalty parameter, which can be implemented by a learnable parameter or a gate function.
Table 4. Rejection mechanism vs. gate mechanism. We report the accuracy (%) of node classification with different numbers of layers on Cora.

Evaluation of Multiple Graph Dilated Convolution Kernels
To address Q3, we design two types of classification tasks. In order to avoid the influence of RM, the variants of GraphDRej used in this subsection disable RM temporarily.

Evaluation of GraphDRej with the Same Number of Layers
We provide three variants of GraphDRej: GraphDRej-1, GraphDRej-2, and GraphDRej-3. All three models have two layers. The only difference among them is that GraphDRej-1 adopts a single graph dilated convolution kernel with γ = 1 at each layer, GraphDRej-2 adopts two types of graph dilated convolution kernels with γ ∈ {1, 2} at each layer, and GraphDRej-3 adopts three types of graph dilated convolution kernels with γ ∈ {1, 2, 3} at each layer. This means that although all of these models have the same number of layers, the total size of their neighborhoods is different. The statistics of these versions are summarized in Table 5. The results are reported in Table 6. We observe that enlarging the neighborhood size in a GNN layer improves the model performance: GraphDRej-2 and GraphDRej-3 achieve gains of 3.32% and 3.23%, respectively, over GraphDRej-1. Furthermore, comparing the results of GraphDRej-2 and GraphDRej-3 shows that too large a neighborhood may not further improve the performance. The reasons are as follows. (1) On the one hand, a larger neighborhood makes it easier for models to overfit the training data, which may hurt the generalization of the model and lead to worse performance. (2) On the other hand, not all nodes are useful for characterizing the representation of the center node, and the distant nodes are likely to bring noise [30]. Therefore, a larger neighborhood is more likely to introduce noise nodes (e.g., distant nodes) into the representation. This indicates that a suitable neighborhood size is needed, which enables GraphDRej to explore sufficient neighbor structure while avoiding overfitting and the negative effect of noise nodes.

Evaluation of GraphDRej with the Same Size of Neighborhoods
We also design another three variants of GraphDRej, denoted as GraphDRej-1-6, GraphDRej-2-3, and GraphDRej-3-2. The details of these versions can be found in Table 5.
Generally speaking, although these three variants contain different numbers of layers and different kinds of graph dilated convolution kernels, all of these models capture the same size of neighborhoods, i.e., 6-hop-based neighborhoods.
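The equal neighborhood size of the three variants follows from simple arithmetic: each layer extends the reachable distance by the largest dilation rate it uses. The sketch below encodes that rule. The layer counts and γ sets attached to each variant name are an assumed reading of the names (number of kernel types, then number of layers), since the table with the exact configurations is not reproduced here.

```python
def receptive_field(num_layers, gammas):
    """Farthest hop reachable after stacking: each layer adds max(gammas) hops."""
    return num_layers * max(gammas)


# Assumed configurations: (number of layers, dilation rates used at each layer).
variants = {
    "GraphDRej-1-6": (6, [1]),
    "GraphDRej-2-3": (3, [1, 2]),
    "GraphDRej-3-2": (2, [1, 2, 3]),
}
fields = {name: receptive_field(layers, gammas)
          for name, (layers, gammas) in variants.items()}
```

Under this reading, all three configurations reach nodes up to six hops away, which matches the two-layer, γ = 3 example given for the graph dilated convolution kernel.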
The results are reported in Table 7. Although these models capture the same size of neighborhoods, GraphDRej-2-3 outperforms the other two variants significantly, and can better characterize the structure information. It demonstrates the effectiveness of graph dilated convolution kernels, and also shows that a well-designed graph dilated convolution architecture is needed.
Table 7. The accuracy (%) for evaluation of GraphDRej with the same size of neighborhoods on Cora.

Conclusions
In this paper, we first analyze the limitations of existing GNNs, i.e., the need and the bottleneck both introduced by the global information. Then a rejection mechanism is designed, which is a concise but effective way to address the over-smoothing problem. Next, we propose a graph dilated convolution kernel, which enlarges the size of the neighborhood at each layer. Extensive experimental results demonstrate that the proposed model achieves state-of-the-art results. In the future, we will pay more attention to the analysis of how to design a good architecture for graph dilated networks with RM to capture graph information as much as possible.