Graph Attention Diffusion Method Combining Diffusion Mechanism and Graph Attention Mechanism

Li, Xing; Li, Jiaxin; Wang, Huijun; Xie, Yue; Jia, Shujuan; Dong, Zhijie; Yue, Zitong; Ma, Baoquan

doi:10.3390/a19060480


by Xing Li ¹, Jiaxin Li ¹, Huijun Wang ², Yue Xie ³, Shujuan Jia ¹, Zhijie Dong ¹, Zitong Yue ¹ and Baoquan Ma ¹,*

¹ National Computer System Engineering Research Institute of China, Beijing 100083, China
² DaXing ATMC, North China ATMB of CAAC, Beijing 102602, China
³ Beijing Capital International Airport Company Limited, Beijing 100621, China
* Author to whom correspondence should be addressed.

Algorithms 2026, 19(6), 480; https://doi.org/10.3390/a19060480

Submission received: 24 April 2026 / Revised: 30 May 2026 / Accepted: 9 June 2026 / Published: 15 June 2026

(This article belongs to the Special Issue Scalable Algorithms for Large-Scale Graph Neural Networks)


Abstract

Graph neural networks have attracted much attention and performed well in many downstream tasks. However, due to issues such as oversmoothing, existing graph neural networks are limited in their ability to quantitatively exploit higher-order neighborhood information. This paper introduces GAtD (Graph Attention Diffusion Method), which propagates attention over a wider range and aggregates higher-order information. We theoretically analyze the effectiveness of GAtD and establish its convergence and linear complexity. A series of experiments demonstrates that, by combining diffusion and attention mechanisms, our method can effectively capture deep-level relationships between nodes.

Keywords:

graph neural network; graph attention; graph diffusion; node classification

1. Introduction

Graph representation learning maps graph-structured data, which is often complex in its raw form, to low-dimensional vectors that are suitable for downstream processing. It plays an important role in various fields, such as biomedicine [1,2] and road traffic [3,4]. In recent years, with the development of neural network technology, graph neural networks (GNNs) have attracted the interest of researchers, which has led to a series of remarkable advances [5,6,7].

GAT (Veličković et al. [8]) is a representative method that combines attention mechanisms with graph convolution operations. Many new methods and applications build on GAT, such as SuperGAT [9], Social-BiGAT [10] and KGAT [11]. However, these methods operate within a limited neighborhood range, where the receptive field of a single attention layer is restricted to first-order neighbors. Although it is theoretically possible to obtain a larger receptive field by stacking multiple attention layers, this performs poorly in practice due to oversmoothing [12,13,14].

On the other hand, many works are dedicated to solving the oversmoothing problem, such as DeepGCNs [15], DeeperGCN [16], PPNP [17] and GDC [18]. However, these methods rely exclusively on graph convolution models, which means that only qualitative (structural) relationships are utilized, while quantitative (attention) relationships remain to be explored.

Present work. In this paper, we propose a novel Graph Attention Diffusion (GAtD) framework that diffuses attention over a wider range, thereby quantitatively aggregating information from higher-order neighborhoods while mitigating oversmoothing. We theoretically analyze the relationship between GAtD and personalized PageRank (PPR) and use two different graph attention mechanisms that can be applied in different situations. We also analyze the computational complexity and convergence properties of GAtD, demonstrating its efficiency and theoretical guarantees. In addition, we conduct experiments on various real-world datasets, and the experimental results show that GAtD captures and leverages deep-level information by diffusing attention across the graph structure. The main contributions can be summarized as follows.

We propose GAtD, which can effectively diffuse attention to a wider range and quantitatively aggregate information from higher-order neighborhoods. Additionally, it is worth noting that GAtD is a universal framework that is not limited to the attention mechanisms used in the article but can be combined with any attention mechanism.

We theoretically analyze the relationship between GAtD and personalized PageRank (PPR), obtain a practical approximation algorithm, and analyze the convergence and computational complexity of GAtD.

We conduct experiments on various real-world datasets, and GAtD achieves superior performance compared to the baseline methods. In addition, we conduct ablation experiments to provide a more in-depth analysis and offer a comprehensive perspective.

2. Related Work

In this part, we focus on previous work closely related to our method. First, some notation: let $G = (V, E, X, Y)$ denote a graph, where $V$ is the node set, and $E \subseteq V \times V$ is the edge set. $X$ denotes the node feature matrix, and $Y$ denotes the node label matrix. Let $A \in \{0, 1\}^{n \times n}$ denote the adjacency matrix of $G$, where $n = |V|$ is the number of nodes.

Graph Neural Networks (GNNs). GNNs have extended deep learning methods to graph structured data [19,20] and have played an important role in many fields [21,22,23,24]. A GNN generally follows a two-stage operation of “aggregation” ($\mathrm{Agg}(\cdot)$) and “update” ($\mathrm{Up}(\cdot)$), that is, first obtaining “messages” from neighbor nodes and then updating the central node:

$$h_{N(u)}^{(k+1)} = \mathrm{Agg}^{(k)}\bigl(h_{u}^{(k)}, h_{v}^{(k)}\bigr), \quad v \in N(u), \tag{1}$$

$$h_{u}^{(k+1)} = \mathrm{Up}^{(k)}\bigl(h_{u}^{(k)}, h_{N(u)}^{(k+1)}\bigr), \tag{2}$$

where $k = 0, 1, \dots, K$ indexes the layers of the GNN, $h_{u}^{(k)}$ represents the embedding of node $u$ in the $k$-th layer, and $N(u)$ is the neighbor node set of node $u$.
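The two-stage pattern in Equations (1) and (2) can be illustrated with a minimal sketch. The mean aggregator and the averaging update used here are illustrative assumptions, not the choices of any particular GNN, and the names `gnn_layer`, `neighbors` and `h` are hypothetical.

```python
def gnn_layer(h, neighbors):
    """One aggregate-then-update step (mean aggregation is an assumed, illustrative choice)."""
    new_h = {}
    for u, nbrs in neighbors.items():
        dim = len(h[u])
        # Aggregate, Eq. (1): mean of the neighbors' embeddings (zero vector if isolated).
        msg = [sum(h[v][d] for v in nbrs) / len(nbrs) for d in range(dim)] if nbrs else [0.0] * dim
        # Update, Eq. (2): average the node's own embedding with the aggregated message.
        new_h[u] = [0.5 * (h[u][d] + msg[d]) for d in range(dim)]
    return new_h

# Toy star graph: node 0 is connected to nodes 1 and 2.
neighbors = {0: [1, 2], 1: [0], 2: [0]}
h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.0, 3.0]}
h1 = gnn_layer(h, neighbors)
```

Stacking such layers widens the receptive field by one hop per layer, which is the setting in which oversmoothing arises.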

GAT [8] and APPNP [17]. GAT introduces the attention mechanism into GNNs, and its computational complexity is $O(|V| \times d_{1} \times d_{2} + |E| \times d_{2})$, where $d_{1}$ is the number of input features, and $d_{2}$ is the number of output features (Veličković et al. [8]).

The attention coefficients that weight the aggregation of node messages are computed as follows:

$$\alpha_{i,j} = \frac{\exp\bigl(\mathrm{LeakyReLU}(\langle \vec{a}, W h_{i} \,\|\, W h_{j} \rangle)\bigr)}{\sum_{v \in N(i)} \exp\bigl(\mathrm{LeakyReLU}(\langle \vec{a}, W h_{i} \,\|\, W h_{v} \rangle)\bigr)}, \tag{3}$$

where $\alpha_{i,j}$ represents the attention coefficient between nodes $i$ and $j$, $\vec{a}$ and $W$ represent the attention parameter vector and convolution parameter matrix, respectively, and $\langle x, y \rangle$ represents the inner product of $x$ and $y$.

APPNP decouples the feature transformation from the propagation step and then diffuses messages on the graph in a “jump back” manner (returning to the source node), which lets node information spread over long distances without additional parameters. The general form of APPNP is as follows:

$$Z^{(0)} = H = f_{\theta}(X), \tag{4}$$

$$Z^{(k)} = (1 - \alpha) P Z^{(k-1)} + \alpha H, \quad k = 1, 2, \dots, K, \tag{5}$$

where $f_{\theta}$ is a neural network with parameters $\theta$, $P$ is the information diffusion matrix, $\alpha$ is the restart parameter, and $K$ is the number of diffusion layers.
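As a rough illustration of Equations (4) and (5), the propagation step can be sketched in plain Python. The sparse-row representation `{i: {j: p_ij}}` for $P$ and the function name `appnp_propagate` are assumptions made for this example; $H$ is taken as given rather than produced by a trained network.

```python
def appnp_propagate(H, P, alpha, K):
    """Iterate Z^(k) = (1 - alpha) * P @ Z^(k-1) + alpha * H for K steps.

    P is stored as sparse rows {i: {j: p_ij}} (an assumed layout for this sketch).
    """
    Z = [row[:] for row in H]  # Z^(0) = H, Eq. (4)
    dim = len(H[0])
    for _ in range(K):
        # Eq. (5): propagate along edges, then jump back to the original features.
        Z = [[(1 - alpha) * sum(p * Z[j][d] for j, p in P[i].items()) + alpha * H[i][d]
              for d in range(dim)]
             for i in range(len(H))]
    return Z

# Toy 2-node graph where each node passes everything to the other node.
P = {0: {1: 1.0}, 1: {0: 1.0}}
H = [[1.0], [0.0]]
Z1 = appnp_propagate(H, P, alpha=0.5, K=1)
Z2 = appnp_propagate(H, P, alpha=0.5, K=2)
```

No parameters are learned inside the loop, which is why the propagation depth $K$ can grow without increasing the model size.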

In addition, the Multi-hop Attention Graph Neural Network (MAGNA) [25] takes an approach similar to ours: it combines a graph attention mechanism and a diffusion model to obtain long-range attention on the graph. However, the two methods have key differences. Firstly, the diffusion in MAGNA is equivalent to computing powers of the attention matrix, while GAtD calculates the attention between nodes only once and uses the graph structure to propagate it to distant nodes. In other words, GAtD spreads attention directly on the graph, which reduces both the computational complexity and the entanglement between parameters, making the model easier to optimize. Secondly, because GAtD computes high-order attention more directly, it can flexibly adopt different attention mechanisms. In summary, GAtD adopts a lightweight architecture and can serve as a plug-and-play module, whereas MAGNA and other diffusion-based graph Transformer models are more suitable for resource-intensive scenarios.

3. Methods

3.1. Graph Attention Diffusion (GAtD)

The global overview of GAtD is shown in Figure 1. Specifically, GAtD first projects the node representation $x_{i}$ into a latent space using a multilayer perceptron (MLP) to obtain the latent representation $h_{i}$:

$$h_{i} = \mathrm{MLP}(x_{i}). \tag{6}$$

Then, we calculate the attention $\alpha_{i,j}$ between each pair of neighboring nodes $i, j$ in the latent space (Section 3.1.1). Note that we calculate attention weights only between neighboring nodes, as shown in Figure 1A (taking the red node $i$ as the center node as an example).

Next, the adjacency matrix is used to diffuse attention on the graph (Section 3.1.2), which yields the deep-level attention relationship $\alpha_{i,j}^{'}$ between nodes, as shown in Figure 1B, and features are aggregated based on the final attention weights, as shown in Figure 1C:

$$h_{i}^{'} = \sum_{j \in N(i) \cup \{i\}} \alpha_{i,j}^{'} h_{j}, \tag{7}$$

where $N(i)$ is the neighbor set of node $i$. Finally, the predicted vector is obtained:

$$z_{i} = \mathrm{softmax}\bigl(\mathrm{MLP}(H^{'})\bigr)_{i}, \tag{8}$$

where $h_{i}^{'}$ is the $i$-th row of $H^{'}$, and the MLP and the softmax function project the node representations into the label space. We then calculate the loss based on the label $y_{i}$ and conduct training:

$$\mathrm{loss} = -\sum_{v \in \mathrm{trainset}} y_{v} \cdot \log(z_{v}), \tag{9}$$

where $y_{v}$ is the label vector (one-hot encoded), and $z_{v}$ is the predicted vector.

See Algorithm A1 in Appendix A for the full procedure; the details are described below.

3.1.1. Graph Attention

In this subsection, we introduce in detail the way to calculate attention between nodes.

After obtaining the latent representation $h_{i}$ of node $i$, we first calculate the attention $\alpha_{i,j}$ between nodes $i$ and $j$. Here, we implement two attention mechanisms, Model I and Model II, as follows:

$$\alpha_{i,j} = \frac{\exp\bigl(\mathrm{LeakyReLU}(\langle h_{i}, h_{j} \rangle)\bigr)}{\sum_{v \in N(i) \cup \{i\}} \exp\bigl(\mathrm{LeakyReLU}(\langle h_{i}, h_{v} \rangle)\bigr)}, \tag{Model I}$$

$$\alpha_{i,j} = \frac{\exp\bigl(\mathrm{LeakyReLU}(\langle \vec{a}, h_{i} \,\|\, h_{j} \rangle)\bigr)}{\sum_{v \in N(i) \cup \{i\}} \exp\bigl(\mathrm{LeakyReLU}(\langle \vec{a}, h_{i} \,\|\, h_{v} \rangle)\bigr)}, \tag{Model II}$$

where $\vec{a}$ is the parameter vector, and $x \,\|\, y$ represents the concatenation of vectors $x$ and $y$.

As we can see, no additional parameters are used in Model I, and only one parameter vector is introduced in Model II; so, the number of parameters of GAtD is of the same order of magnitude as that of GCN.

In addition, GAtD uses a multi-head mechanism to compute attention in different latent spaces, which increases stability, namely:

$$\alpha_{i,j}^{t} = f(h_{i}^{t}, h_{j}^{t}), \quad t = 1, 2, \dots, T, \tag{10}$$

where $f$ represents an attention mechanism, and $T$ represents the number of different latent spaces.

It is worth noting that our method GAtD is a general framework that can be combined with any attention mechanism. In this paper, we implement the above two attention mechanisms as examples.
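The two attention mechanisms can be sketched for a single head as follows. This is a minimal illustration, not the authors' implementation: the adjacency-dictionary layout, the function names, and the LeakyReLU negative slope of 0.2 are assumptions made for the example.

```python
import math

def leaky_relu(x, slope=0.2):  # slope 0.2 is an assumed value
    return x if x > 0 else slope * x

def dot(x, y):
    return sum(p * q for p, q in zip(x, y))

def softmax_over(scores):
    """Softmax over a dict of scores, shifted by the maximum for numerical stability."""
    m = max(scores.values())
    exps = {j: math.exp(s - m) for j, s in scores.items()}
    total = sum(exps.values())
    return {j: e / total for j, e in exps.items()}

def attention(h, neighbors, a=None):
    """Model I when a is None; Model II when a parameter vector a is given."""
    alpha = {}
    for i, nbrs in neighbors.items():
        candidates = list(nbrs) + [i]  # N(i) together with i itself
        if a is None:
            scores = {j: leaky_relu(dot(h[i], h[j])) for j in candidates}
        else:
            # h[i] + h[j] concatenates the two Python lists, i.e. h_i || h_j.
            scores = {j: leaky_relu(dot(a, h[i] + h[j])) for j in candidates}
        alpha[i] = softmax_over(scores)
    return alpha

neighbors = {0: [1, 2], 1: [0], 2: [0]}
h = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [-1.0, 0.0]}  # made-up latent vectors
alpha_I = attention(h, neighbors)
alpha_II = attention(h, neighbors, a=[0.0, 0.0, 1.0, 0.0])
```

In this toy example node 0 assigns more weight to node 1, whose latent vector points the same way, than to node 2. A multi-head variant in the spirit of Equation (10) would call `attention` once per latent space.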

3.1.2. Attention Diffusion

After calculating the attention between neighbor nodes, we obtain the attention matrix $A_{\mathrm{att}}$. We then use the following formulas to diffuse the attention and obtain the deep attention relationship between nodes:

$$A^{(0)} = A_{\mathrm{att}}, \tag{11}$$

$$A^{(1)} = (1 - \alpha) \cdot A^{(0)} + \alpha \cdot I, \tag{12}$$

$$A^{(k)} = (1 - \alpha) \cdot \hat{\tilde{A}} A^{(k-1)} + \alpha \cdot I, \quad k = 2, 3, \dots, K, \tag{13}$$

$$A^{'} = A^{(K)}, \tag{14}$$

where $\alpha \in (0, 1]$ is the jump-back probability, $I$ is the identity matrix, $\hat{\tilde{A}} = \tilde{D}^{-\frac{1}{2}} (A + I) \tilde{D}^{-\frac{1}{2}}$ is the normalized adjacency matrix with self-loops ($\tilde{D}$ being the degree matrix of $A + I$), $K \geq 1$ is the number of diffusion operations performed, and $A^{'}$ is the final deep attention matrix. Finally, the aggregation operation is performed over all nodes covered by the diffused attention matrix to integrate multi-hop neighbor information.

It can be seen that the jump-back probability $\alpha$ guarantees that the attention matrix $A^{(K)}$ does not become a zero matrix. When $K = 1$, GAtD aggregates features with the first-order attention matrix $A_{\mathrm{att}}$ (mixed with the identity through $\alpha$); when $K > 1$, GAtD directly uses the graph structure for attention diffusion, which reduces the complexity (see Complexity Analysis).

In addition, by combining the diffusion mechanism with the attention mechanism and incorporating self-loops, GAtD captures deep structural dependencies through attention diffusion, while convergence is controlled via the jump-back probability $\alpha$ (see Convergence analysis).
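The diffusion in Equations (11) to (14) can be sketched with dense matrices stored as lists of lists. This is a small illustration under stated assumptions: the toy attention values are made up, and dense products are used only for readability, whereas a practical implementation would exploit sparsity.

```python
def matmul(A, B):
    """Dense matrix product for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def diffuse_attention(A_att, A_hat, alpha, K):
    """Return A^(K); A_hat is the normalized adjacency matrix with self-loops."""
    n = len(A_att)

    def restart(M):
        # Computes (1 - alpha) * M + alpha * I.
        return [[(1 - alpha) * M[r][c] + (alpha if r == c else 0.0) for c in range(n)]
                for r in range(n)]

    A_k = restart(A_att)                   # Eq. (12)
    for _ in range(2, K + 1):
        A_k = restart(matmul(A_hat, A_k))  # Eq. (13)
    return A_k                             # Eq. (14)

# Toy 2-node graph with a single edge: A + I is all ones and both degrees are 2,
# so every entry of the normalized adjacency equals 0.5.
A_hat = [[0.5, 0.5], [0.5, 0.5]]
A_att = [[0.8, 0.2], [0.4, 0.6]]  # made-up first-order attention
A1 = diffuse_attention(A_att, A_hat, alpha=0.5, K=1)
A2 = diffuse_attention(A_att, A_hat, alpha=0.5, K=2)
```

Every diagonal entry of the result is at least $\alpha$, which reflects the remark that the jump-back term keeps the diffused matrix away from zero.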

3.2. Theoretical Analysis

Relationship with Personalized PageRank. The original PageRank can be obtained by solving $\pi_{pr} = A_{pr} \pi_{pr}$, with $A_{pr} = A D^{-1}$, where $\pi_{pr}$ represents the limiting distribution [26]. Considering $\pi_{pr}$ as information and ignoring nonlinearity, we recover the way GCN propagates information. Now, consider personalized PageRank with a jump-back probability $\alpha$; its limiting distribution satisfies the following formula:

$$\pi_{ppr} = (1 - \alpha) \cdot A_{pr} \pi_{ppr} + \alpha \cdot e, \tag{15}$$

where $e$ represents the uniform distribution. From this, we get

$$\pi_{ppr} = \alpha \bigl(I - (1 - \alpha) \cdot A_{pr}\bigr)^{-1} \cdot e. \tag{16}$$

Observe Formula (15): when $\hat{\tilde{A}}$ is used instead of $A_{pr}$, and the distribution $\pi_{ppr}$ is taken to be each column of the attention matrix (the initial distribution is then the one-hot vector of the corresponding node), it is consistent with the attention diffusion form of GAtD in Formula (13). That is, the element $\alpha_{i,j}^{'}$ in the limit matrix $A^{'}$ (when $K \to \infty$ in Formula (14)) can be regarded as the influence of node $j$ on node $i$ at the level of attention features. Therefore, GAtD can be considered a promising approach to modeling the deep attention relationships between nodes.
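The step from Formula (15) to Formula (16) can be spelled out as below. This is a sketch that assumes $A_{pr}$ is column-stochastic (no isolated nodes), so that its spectral radius is 1 and the inverse exists for every $\alpha \in (0, 1]$.

```latex
\begin{aligned}
\pi_{ppr} &= (1 - \alpha) A_{pr} \pi_{ppr} + \alpha e \\
\bigl(I - (1 - \alpha) A_{pr}\bigr) \pi_{ppr} &= \alpha e \\
\pi_{ppr} &= \alpha \bigl(I - (1 - \alpha) A_{pr}\bigr)^{-1} e,
\qquad \text{since } \rho\bigl((1 - \alpha) A_{pr}\bigr) = 1 - \alpha < 1 .
\end{aligned}
```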

Complexity Analysis. As described in Section 3.1, the GAtD method comprises five steps: feature projection, computation of initial attention, diffusion, aggregation and prediction. In the feature projection stage, the input is the node feature matrix $X \in \mathbb{R}^{|V| \times d_{1}}$, the output is $H = X W_{1} \in \mathbb{R}^{|V| \times d_{2}}$, and the computational complexity is $O(|V| \times d_{1} \times d_{2})$.

In the initial attention stage, the computational complexity is $O(|E| \times d_{2} \times M)$ according to GAT [8], where $M$ is the number of attention heads. For the diffusion stage, we employ a practical technique: we diffuse the projection matrix $H$ together with $A^{(k)}$. This is justified because the following iterative formula holds:

$$A^{(k)} H = (1 - \alpha) \hat{\tilde{A}} A^{(k-1)} H + \alpha H. \tag{17}$$

In each diffusion layer, only the product of $\hat{\tilde{A}}$ and the matrix obtained in the previous layer needs to be computed, so the computational complexity of the diffusion stage is $O(K|E|)$.

In the aggregation and prediction steps, the computational complexity is $O(|V| \times d_{2} \times d_{3})$, where $d_{3}$ is the dimension of the label vector.

Since $d_{2}, d_{3}, M$ are constants, and generally $|E| \gg |V|$, the overall computational complexity simplifies to

$$O(|V| \times d_{1} \times d_{2}) + O(|E| \times d_{2} \times M) + O(K|E|) + O(|V| \times d_{2} \times d_{3}) = O(|V| \times d_{1} + K|E|) = O(K|E|). \tag{18}$$

In practice, $K$ is a hyperparameter; see the experiments section for detailed settings.
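The technique behind Formula (17) can be illustrated with a short sketch that never forms $A^{(k)}$ explicitly and only propagates the feature matrix along stored nonzero entries. The sparse-row layout `{i: {j: w_ij}}` and the function names are assumptions made for this example.

```python
def spmm(rows, M):
    """Multiply a sparse matrix stored as {i: {j: w_ij}} by a dense matrix M."""
    dim = len(M[0])
    return [[sum(w * M[j][d] for j, w in rows[i].items()) for d in range(dim)]
            for i in range(len(M))]

def diffuse_features(att_rows, hat_rows, H, alpha, K):
    """Compute A^(K) @ H by iterating Eq. (17) on features only."""
    def restart(M):
        # Computes (1 - alpha) * M + alpha * H, entry by entry.
        return [[(1 - alpha) * m + alpha * h for m, h in zip(m_row, h_row)]
                for m_row, h_row in zip(M, H)]

    Z = restart(spmm(att_rows, H))      # A^(1) H = (1 - alpha) A_att H + alpha H
    for _ in range(2, K + 1):
        Z = restart(spmm(hat_rows, Z))  # Eq. (17)
    return Z

# Toy 2-node graph; the attention values are made up for illustration.
att_rows = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}}
hat_rows = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}
H = [[1.0], [0.0]]
Z = diffuse_features(att_rows, hat_rows, H, alpha=0.5, K=2)
```

Each call to `spmm` touches every stored nonzero once per feature dimension, so $K$ diffusion steps cost on the order of $K|E|$ when the feature dimension is treated as a constant.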

Convergence analysis. GAtD uses the iterative equations:

$$A^{(0)} = A_{\mathrm{att}}, \tag{19}$$

$$A^{(1)} = (1 - \alpha) \cdot A^{(0)} + \alpha \cdot I, \tag{20}$$

$$A^{(k)} = (1 - \alpha) \cdot \hat{\tilde{A}} A^{(k-1)} + \alpha \cdot I, \quad k = 2, 3, \dots, K. \tag{21}$$

After $k$ diffusion steps, we have

$$A^{(k)} = (1 - \alpha)^{k-1} \hat{\tilde{A}}^{k-1} A^{(1)} + \alpha \sum_{i=0}^{k-2} (1 - \alpha)^{i} \hat{\tilde{A}}^{i}, \quad k \geq 2. \tag{22}$$

We investigate the asymptotic behavior as $k \to \infty$.

For the first term, $(1 - \alpha)^{k-1} \hat{\tilde{A}}^{k-1} A^{(1)}$: $\hat{\tilde{A}}$ is the normalized adjacency matrix with self-loops, so all its eigenvalues satisfy $|\lambda_{i}| \leq 1$. Therefore, the spectral radius $\rho(\hat{\tilde{A}}) \leq 1$. Since $\alpha \in (0, 1]$, we have $(1 - \alpha)^{k-1} \hat{\tilde{A}}^{k-1} A^{(1)} \to 0$.

For the second term, $\sum_{i=0}^{k-2} (1 - \alpha)^{i} \hat{\tilde{A}}^{i}$, the convergence condition is $\rho((1 - \alpha) \hat{\tilde{A}}) < 1$. Given that $\rho(\hat{\tilde{A}}) \leq 1$, it suffices that $(1 - \alpha) < 1$, which holds since $\alpha \in (0, 1]$; hence, the sum converges.

The eigenvalues of $I - (1 - \alpha) \hat{\tilde{A}}$ are $1 - (1 - \alpha) \lambda_{i}$ with $|\lambda_{i}| \leq 1$. Since $\alpha \in (0, 1]$, $1 - (1 - \alpha) \lambda_{i} \neq 0$. Therefore, the matrix $I - (1 - \alpha) \hat{\tilde{A}}$ is invertible. Then, we get

$$A^{(\infty)} = \alpha \bigl(I - (1 - \alpha) \hat{\tilde{A}}\bigr)^{-1}. \tag{23}$$

In practical applications, $K$ typically does not exceed 20. That is, the attention matrix $A^{(K)}$ is a weighted combination of the original attention matrix and the graph structure matrix and does not completely forget the original attention.
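For readers who want the intermediate steps, the closed form in Formula (22) and the limit in Formula (23) can be obtained by unrolling the recursion. The following is a sketch of that argument, writing $\hat{\tilde{A}}$ as in the text.

```latex
\begin{aligned}
A^{(2)} &= (1-\alpha)\,\hat{\tilde{A}} A^{(1)} + \alpha I, \\
A^{(3)} &= (1-\alpha)\,\hat{\tilde{A}} A^{(2)} + \alpha I
         = (1-\alpha)^{2}\,\hat{\tilde{A}}^{2} A^{(1)}
           + \alpha \bigl( I + (1-\alpha)\,\hat{\tilde{A}} \bigr), \\
A^{(k)} &= (1-\alpha)^{k-1}\,\hat{\tilde{A}}^{k-1} A^{(1)}
           + \alpha \sum_{i=0}^{k-2} (1-\alpha)^{i}\,\hat{\tilde{A}}^{i}
           \quad \text{(by induction on } k\text{)}, \\
\lim_{k \to \infty} A^{(k)} &= 0 + \alpha \sum_{i=0}^{\infty} (1-\alpha)^{i}\,\hat{\tilde{A}}^{i}
         = \alpha \bigl( I - (1-\alpha)\,\hat{\tilde{A}} \bigr)^{-1},
\end{aligned}
```

where the last equality is the Neumann series, valid because $\rho\bigl((1-\alpha)\hat{\tilde{A}}\bigr) < 1$. Note that the limit no longer depends on $A_{\mathrm{att}}$, which is why a finite $K$ is needed for the original attention to be retained.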

4. Experiments

4.1. Datasets

GAtD is evaluated on datasets from different fields (summarized in Table 1).

Cora, Citeseer and Pubmed [27]: Cora, Citeseer and Pubmed are three standard citation network datasets, where nodes represent documents, and edges (undirected) represent citation links. The three citation network datasets contain sparse feature vectors for each document, and each document has a category label. Dataset website: https://paperswithcode.com/datasets (accessed on 1 January 2025).

CS and Physics [28]: Physics and CS are co-authorship datasets based on the Microsoft Academic Graph of the 2016 KDD Cup Challenge. The nodes represent the authors, and connected nodes indicate that they are co-authors of the same paper. Node features are extracted from the keyword information of each author in the paper. Dataset website: https://github.com/shchur/gnn-benchmark#datasets (accessed on 1 January 2025).

Chameleon [29]: Chameleon is a network dataset of Wikipedia pages about chameleons, where nodes represent articles in Wikipedia, edges (directed) reflect links between web pages, and node features indicate the presence of specific nouns in the articles. Dataset website: https://github.com/radrumond/Chameleon (accessed on 1 January 2025).

These datasets are all standard benchmarks for evaluating graph neural networks. We perform node classification tasks on them to test the performance of GAtD. Next, we introduce the specific experimental settings and baseline methods.

4.2. Experimental Setup and Baselines

Our method has a simple structure, and its main hyperparameters are the same as those of GAT, except for $\alpha$ and $K$ (Section 3). For $\alpha$ and $K$, we perform a grid search over [0.05, 0.1, 0.2, 0.3, 0.4, 0.5] and [2, 5, 10, 15, 20], respectively. The other hyperparameters follow the settings in GAT [8].

Additionally, Cora, Citeseer and Pubmed employ the standard dataset split (following GCN [29]); for the CS and Physics datasets, 20 nodes per class are randomly sampled for the training set, 30 nodes per class for the validation set, and the remaining nodes for the test set; Chameleon follows the split in Geom-GCN [28].

In experiments on each dataset, 10 runs are performed, and the average accuracy and standard deviation are reported. During training, the maximum number of epochs is 200, with early stopping applied when the validation accuracy does not improve for 50 consecutive epochs. All baselines are evaluated under identical conditions.
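The protocol described above (grid search over $\alpha$ and $K$, at most 200 epochs, early stopping after 50 epochs without improvement) can be sketched as follows. The function names, the `evaluate` callback and the list of per-epoch validation accuracies are hypothetical stand-ins for a real training loop, which this sketch does not implement.

```python
import itertools

ALPHAS = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5]
KS = [2, 5, 10, 15, 20]

def early_stopping(val_acc_per_epoch, max_epochs=200, patience=50):
    """Return (best validation accuracy, epoch at which it was reached)."""
    best_acc, best_epoch = float("-inf"), -1
    for epoch, acc in enumerate(val_acc_per_epoch[:max_epochs]):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_acc, best_epoch

def grid_search(evaluate):
    """Pick the (alpha, K) pair with the highest score returned by `evaluate`."""
    return max(itertools.product(ALPHAS, KS), key=lambda cfg: evaluate(*cfg))

# Made-up validation curve: the peak at epoch 1 is never beaten afterwards.
curve = [0.1, 0.5] + [0.4] * 100
best = early_stopping(curve)

# Hypothetical scoring function that peaks at alpha = 0.2, K = 10.
chosen = grid_search(lambda alpha, K: -abs(alpha - 0.2) - abs(K - 10))
```

In an actual experiment, `evaluate` would train the model for one configuration and return its validation accuracy, and the mean and standard deviation over 10 runs would be computed from the resulting test accuracies.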

Baselines. In addition to MLP, GAT and APPNP, GAtD is compared with the following methods:

GCN [29] is a semi-supervised GNN model, which introduces the graph convolution operation.
GraphSAGE [30] determines node neighborhoods by sampling and can generalize to unseen data.
GIN [31] is a GNN framework with a simple structure which is also based on GCN. Its discriminative and representational power is comparable to that of the WL test [32].
GDC [18] also uses the idea of graph diffusion, which enhances GNNs using the generalized graph diffusion matrix.

For each method, the hyperparameters follow their original optimal settings.

4.3. Results and Analysis

Main Results. We conduct 10 experiments on each dataset, and the experimental results are shown in Table 2. GAtD-I and GAtD-II represent the GAtD methods combined with attention mechanism Model I and attention mechanism Model II, respectively.

The results show the average accuracy ± standard deviation of node classification across 10 independent runs. Higher accuracy and smaller standard deviation indicate better classification performance. The best result is shown in bold, the second best is underlined, and the third best has a wavy underline. GAtD-I/GAtD-II achieves strong performance on all datasets, especially on the CS dataset, where GAtD has a 2.2% advantage over the best baseline. These results demonstrate the effectiveness of GAtD.

The hyperparameters ($\alpha$, $K$) that achieve the best results are listed in Table 3 and Table 4.

It can be seen that, except for the Chameleon dataset, when the best effect is achieved, the number of diffusion layers of the other datasets is more than 10. At the same time, combined with the comparison with APPNP, the GAtD method can make full use of the diffusion mechanism and attention mechanism to capture the deep relationship between nodes.

In addition, the jump-back probability $\alpha$ at the best setting is significantly higher on the co-authorship datasets CS and Physics than on the other datasets, which suggests that CS and Physics rely more on the feature information of the node itself, while the other datasets benefit more from neighbor information.

Ablation Experiment. GAtD introduces two main hyperparameters: the jump-back probability $\alpha$ and the number of diffusion steps $K$. We therefore further explore the impact of $\alpha$ and $K$ on the results.

First, we study how node classification accuracy changes with $\alpha$; the results are shown in Figure 2.

Figure 2 shows how the performance of GAtD-I and GAtD-II changes with $\alpha$ on various datasets when $K$ remains unchanged ($K$ is fixed to the corresponding optimal setting). The horizontal axis represents the value of $\alpha$, and the vertical axis represents the node classification accuracy.

It can be seen that GAtD-I and GAtD-II show regular trends that are consistent across all datasets, which indicates that the GAtD method can model and learn the deep attention relationship between nodes in the graph (otherwise, the trends would be erratic and inconsistent across datasets). In addition, on the Cora, CS, and Chameleon datasets, the accuracy fluctuates strongly (by more than 5%), which may suggest that the node characteristics in these three graphs are quite different.

Then, we study how the accuracy changes with $K$; the results are shown in Figure 3.

Figure 3 shows the performance of GAtD-I and GAtD-II as $K$ changes when $\alpha$ remains unchanged on each dataset. The horizontal axis represents the value of $K$, and the vertical axis represents the node classification accuracy.

As $K$ increases, GAtD achieves better performance on most datasets, indicating that it alleviates the oversmoothing problem. Once again, the trends are regular and consistent across datasets, which supports the effectiveness of GAtD.

Finally, we show the calculation time of GAtD in Figure 4.

Figure 4 shows the calculation time of GAtD at different diffusion layers on Cora, Citeseer, and Pubmed. GAT is also shown as a baseline control, and the vertical axis represents time (in seconds).

It can be seen that as the number of diffusion layers increases, the calculation time of GAtD increases linearly. When the number of diffusion layers is 2 ($K = 2$), the calculation time of GAtD is basically the same as that of GAT. These observations are consistent with the theoretical analysis in Section 3.2 (Complexity Analysis).

To check the reliability of the performance improvements, we use a t-test to assess the statistical significance of the difference between GAtD and the second-best method on each dataset.

Table 5 shows the statistical results. GAtD significantly outperforms the second-best method on two out of three datasets. On Citeseer, although the difference does not reach statistical significance, GAtD achieves a higher average performance.

Additionally, we computed the 95% confidence intervals (Table 6) to illustrate the precision of performance estimates. GAtD exhibits narrower confidence intervals, indicating higher stability.
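As an illustration of how such statistics can be computed from 10 runs, the sketch below derives a 95% confidence interval and a t statistic using only the standard library. The text does not state which variant of the t-test was used, so Welch's unequal-variance form is an assumption here, and the accuracy values are invented toy numbers rather than results from the paper. A p-value would additionally require the Student t distribution, which is not computed in this sketch.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 9 degrees of freedom (10 runs).
T_CRIT_95_DF9 = 2.262

def confidence_interval_95(accs):
    """95% confidence interval for the mean accuracy over exactly 10 runs."""
    if len(accs) != 10:
        raise ValueError("the hard-coded critical value assumes 10 runs")
    mean = statistics.mean(accs)
    half_width = T_CRIT_95_DF9 * statistics.stdev(accs) / math.sqrt(len(accs))
    return mean - half_width, mean + half_width

def welch_t(a, b):
    """Welch's t statistic for two independent samples (assumed test variant)."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(var_a / len(a) + var_b / len(b))

# Invented toy accuracies (percent) for two hypothetical methods.
method_a = [84.0, 84.2, 83.8, 84.1, 83.9, 84.0, 84.2, 83.8, 84.1, 83.9]
method_b = [x - 1.0 for x in method_a]
low, high = confidence_interval_95(method_a)
t_stat = welch_t(method_a, method_b)
```

A narrower interval corresponds to a smaller standard deviation across runs, which is the sense in which the text relates interval width to stability.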

5. Conclusions

In this paper, we propose a novel graph attention diffusion framework, GAtD, which propagates attention over a wider range and aggregates higher-order information. Moreover, GAtD can be combined with any attention mechanism, and this paper provides two attention mechanisms as examples. In addition, we theoretically analyze the relationship between GAtD and personalized PageRank (PPR), demonstrate its effectiveness, and analyze the convergence and computational complexity of GAtD. The experiments support our analysis, and the competitive performance demonstrates that GAtD can combine diffusion and attention mechanisms to capture deep-level relationships between nodes, making it a promising framework for deep-level attention modeling.

Future work includes combining more types of attention mechanisms to adapt to data with different features or extending GAtD to more complex types of graphs.

Author Contributions

Conceptualization, X.L. and B.M.; methodology, X.L.; software, X.L. and J.L.; validation, H.W. and Y.X.; formal analysis, S.J.; investigation, Z.D. and Z.Y.; resources, B.M.; data curation, J.L.; writing—original draft, X.L.; writing—review and editing, J.L. and H.W.; visualization, Y.X.; supervision, S.J.; project administration, Z.D. and Z.Y.; funding acquisition, B.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. U2241213).

Data Availability Statement

The original data presented in the study are openly available at https://huggingface.co/papers/trending, accessed on 8 June 2026, https://github.com/shchur/gnn-benchmark#datasets, accessed on 8 June 2026, https://github.com/radrumond/Chameleon, accessed on 8 June 2026.

Conflicts of Interest

Author Huijun Wang was employed by the company DaXing ATMC, North China ATMB of CAAC. Author Yue Xie was employed by the company Beijing Capital International Airport Company Limited. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A

Algorithm A1. Graph Attention Diffusion (GAtD) algorithm

Input: Graph $G = (V, E)$; node feature matrix $X$; adjacency matrix $A$; number of latent spaces $T$; jump-back probability $\alpha$; number of diffusion layers $K$
Output: Prediction vector $z_{i}$

1: Compute representation in latent space: $H = \mathrm{MLP}(X)$
2: Using attention mechanism Model I or Model II, calculate the attention $\alpha_{i,j}$ between neighbor nodes $i$ and $j$
3: $H_{0} \leftarrow H$
4: $A^{(0)} \leftarrow A_{\mathrm{att}}$
5: for $k = 1, 2, \dots, K$ do
6:     if $k = 1$ then
7:         $A^{(k)} = (1 - \alpha) \cdot A^{(k-1)} + \alpha \cdot I$
8:     else
9:         $A^{(k)} = (1 - \alpha) \cdot \hat{\tilde{A}} A^{(k-1)} + \alpha \cdot I$
10:    end if
11: end for
12: Aggregate features: $h_{i}^{'} = \sum_{j \in N(i) \cup \{i\}} \alpha_{i,j}^{'} h_{j}$
13: Get the prediction vector: $z_{i} = \mathrm{softmax}\bigl(\mathrm{MLP}(H^{'})\bigr)_{i}$

The GAtD algorithm is shown in Algorithm A1, where “←” represents the assignment operation. First, the representation of node features in latent space is calculated through the MLP network, and the attention between neighbor nodes is calculated through attention mechanism Model I or Model II (lines 1–2); then, the attention is diffused layer by layer to obtain the deep relationship between nodes (lines 3–11); finally, the node features are aggregated using the deep attention relationship between nodes, and the final prediction vector is obtained (lines 12–13).

References

1. Li, H.; Han, Z.; Sun, Y.; Wang, F.; Hu, P.; Gao, Y.; Bai, X.; Peng, S.; Ren, C.; Xu, X.; et al. CGMega: Explainable graph neural network framework with attention mechanisms for cancer gene module dissection. Nat. Commun. 2024, 15, 5997.
2. Zhang, J.; Liu, Z.; Pan, Y.; Lin, H.; Zhang, Y. IMAEN: An interpretable molecular augmentation model for drug-target interaction prediction. Expert Syst. Appl. 2024, 238, 121882.
3. Wang, Z.; Sun, P.; Hu, Y.; Boukerche, A. A novel hybrid method for achieving accurate and timeliness vehicular traffic flow predict. Comput. Commun. 2023, 209, 378–386.
4. Jiang, W.; Luo, J. Graph neural network for traffic forecasting: A survey. Expert Syst. Appl. 2022, 207, 117921.
5. Song, X.; Lian, J.; Huang, H.; Luo, Z.; Zhou, W.; Lin, X.; Wu, M.Q.; Li, C.Z.; Xie, X.; Jin, H. XGCN: An Extreme Graph Convolutional Network for Large-scale Social Link Prediction. In Proceedings of the ACM Web Conference; Association for Computing Machinery: New York, NY, USA, 2023; pp. 349–359.
6. Xv, G.P.; Lin, C.; Guan, W.X.; Gou, J.P.; Li, X.B.; Deng, H.B.; Xu, J.; Zheng, B. E-commerce Search via Content Collaborative Graph Neural Network. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery: New York, NY, USA, 2023; pp. 2885–2897.
7. Liu, Z.; Yang, L.; Fan, Z.; Peng, H.; Yu, P.S. Federated Social Recommendation with Graph Neural Network. ACM Trans. Intell. Syst. Technol. 2022, 13, 55.
8. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph Attention Networks. arXiv 2017, arXiv:1710.10903.
9. Kim, D.; Oh, A. How to Find Your Friendly Neighborhood: Graph Attention Design with Self-Supervision. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022.
10. Kosaraju, V.; Sadeghian, A.; Martín-Martín, R.; Reid, I.; Rezatofighi, H.; Savarese, S. Social-BiGAT: Multimodal trajectory forecasting using Bicycle-GAN and graph attention networks. arXiv 2019, arXiv:1907.03395.
11. Wang, X.; He, X.; Cao, Y.; Liu, M.; Chua, T.-S. KGAT: Knowledge Graph Attention Network for Recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery: New York, NY, USA, 2019; pp. 950–958.
12. Wang, G.; Ying, R.; Huang, J.; Leskovec, J. Improving graph attention networks with large margin-based constraints. arXiv 2019, arXiv:1910.11945.
13. Liu, Z.; Chen, C.; Li, L.; Zhou, J.; Li, X.; Song, L.; Qi, Y. GeniePath: Graph neural networks with adaptive receptive paths. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence; AAAI Press: Palo Alto, CA, USA, 2019; pp. 4424–4431.
14. Oono, K.; Suzuki, T. Graph Neural Networks Exponentially Lose Expressive Power for Node Classification. In Proceedings of the International Conference on Learning Representations, Virtual, 26–30 April 2020; Available online: https://openreview.net/forum?id=S1ldO2EFPr (accessed on 8 June 2026).
15. Li, G.; Müller, M.; Qian, G.; Delgadillo, I.C.; Abualshour, A.; Thabet, A.; Ghanem, B. DeepGCNs: Making GCNs Go as Deep as CNNs. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 6923–6939.
16. Li, G.; Xiong, C.; Qian, G.; Thabet, A.; Ghanem, B. DeeperGCN: Training Deeper GCNs With Generalized Aggregation Functions. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13024–13034.
17. Klicpera, J.; Bojchevski, A.; Günnemann, S. Predict then Propagate: Graph Neural Networks meet Personalized PageRank. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
18. Gasteiger, J.; Weißenberger, S.; Günnemann, S. Diffusion Improves Graph Learning. arXiv 2022, arXiv:1911.05485.
19. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The Graph Neural Network Model. IEEE Trans. Neural Netw. 2009, 20, 61–80.
20. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the 30th International Conference on Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2016; pp. 3844–3852.
Febrinanto, F.G.; Liu, M.; Xia, F. Balanced Graph Structure Information for Brain Disease Detection. Knowl. Manag. Acquis. Intell. Syst. 2023, 14317, 134–143. [Google Scholar]
Wang, T.; Bai, J.; Nabavi, S. Single-cell classification using graph convolutional networks. BMC Bioinform. 2021, 22, 364. [Google Scholar] [CrossRef]
Wang, Y.; Zhou, S.; Liu, Y.; Wang, K.; Fang, F.; Qian, H. ConGNN: Context-consistent cross-graph neural network for group emotion recognition in the wild. Inf. Sci. 2022, 610, 707–724. [Google Scholar] [CrossRef]
Ying, R.; He, R.; Chen, K.; Eksombatchai, P.; Hamilton, W.L.; Leskovec, J. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2018; pp. 974–983. [Google Scholar] [CrossRef]
Wang, G.; Ying, R.; Huang, J.; Leskovec, J. Multi-hop Attention Graph Neural Networks. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 19–26 August 2021; pp. 3089–3096. [Google Scholar] [CrossRef]
Page, L.; Brin, S.; Motwani, R.; Winograd, T. The PageRank Citation Ranking: Bringing Order to the Web. In The Web Conference; Elsevier: Amsterdam, The Netherlands, 1999; Available online: https://api.semanticscholar.org/CorpusID:1508503 (accessed on 8 June 2026).
Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2016, arXiv:1609.02907. [Google Scholar] [CrossRef]
Shchur, O.; Mumme, M.; Bojchevski, A.; Günnemann, S. Pitfalls of Graph Neural Network Evaluation. arXiv 2018, arXiv:1811.05868. [Google Scholar] [CrossRef]
Pei, H.; Wei, B.; Chang, K.C.-C.; Lei, Y.; Yang, B. Geom-GCN: Geometric Graph Convolutional Networks. In Proceedings of the International Conference on Learning Representations, Virtual, 26–30 April 2020; Available online: https://openreview.net/forum?id=S1e2agrFvS (accessed on 8 June 2026).
Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. arXiv 2017, arXiv:1706.02216. [Google Scholar] [CrossRef]
Xu, K.; Hu, W.; Leskovec, J.; Jegelka, S. How powerful are graph neural networks? arXiv 2018, arXiv:1810.00826. [Google Scholar]
Morris, C.; Ritzert, M.; Fey, M.; Hamilton, W.L.; Lenssen, J.E.; Rattan, G.; Grohe, M. Weisfeiler and leman go neural: Higher-order graph neural networks. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence; AAAI Press: Palo Alto, CA, USA, 2019; pp. 4602–4609. [Google Scholar]

Figure 1. Overview of GAtD.

Figure 2. Accuracy changes with $α$.

Figure 3. Accuracy changes with $K$.

Figure 4. Calculation time of GAtD.

Table 1. Summary of the datasets.

Datasets	Nodes	Edges	Features	Classes
Cora	2708	5429	1433	7
Citeseer	3327	4732	3703	6
Pubmed	19,717	44,338	500	3
CS	18,333	81,894	6805	15
Physics	34,493	247,962	8415	5
Chameleon	2277	36,101	2325	5

Table 2. Main results.

Method	Cora	Citeseer	Pubmed	CS	Physics	Chameleon
MLP	57.2 ± 1.03	56.9 ± 1.41	72.7 ± 0.71	87.5 ± 0.65	88.6 ± 0.54	51.6 ± 1.80
GCN	81.0 ± 0.33	70.9 ± 0.34	79.0 ± 0.26	89.4 ± 0.90	92.3 ± 0.76	62.4 ± 2.84
GraphSAGE	81.9 ± 0.57	70.9 ± 0.41	78.4 ± 0.32	86.2 ± 1.85	91.3 ± 0.82	62.9 ± 1.51
GIN	81.8 ± 0.61	70.4 ± 0.57	78.6 ± 0.44	89.6 ± 0.96	91.4 ± 1.53	58.8 ± 1.32
GAT	82.4 ± 0.59	71.4 ± 0.49	78.6 ± 0.28	89.9 ± 0.68	92.1 ± 0.88	58.9 ± 2.26
APPNP	82.6 ± 1.14	71.7 ± 0.91	79.2 ± 0.52	90.5 ± 0.33	90.3 ± 0.92	56.8 ± 3.47
GDC	82.3 ± 0.84	71.4 ± 0.75	79.6 ± 0.49	90.8 ± 1.13	91.7 ± 0.86	60.6 ± 2.37
GAtD-I (ours)	83.7 ± 0.73	72.3 ± 0.85	79.9 ± 0.23	92.9 ± 0.47	93.2 ± 0.57	64.7 ± 2.17
GAtD-II (ours)	84.0 ± 0.57	72.3 ± 0.82	79.8 ± 0.20	93.0 ± 0.37	93.1 ± 0.94	64.5 ± 2.55

The best result is denoted in “bold”, the second in “underline”, and the third in “wavy underline”.
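Because the bold and underline markers mentioned in the note do not survive in plain text, the ranking can be recovered from the mean accuracies alone. The following Python sketch copies the means from Table 2 (standard deviations omitted) and reports, for each dataset, the strongest baseline and the margin of the better GAtD variant over it. The variable and function names are illustrative and do not come from the paper.

```python
# Mean accuracies (%) copied from Table 2; standard deviations are omitted.
DATASETS = ["Cora", "Citeseer", "Pubmed", "CS", "Physics", "Chameleon"]
MEANS = {
    "MLP":       [57.2, 56.9, 72.7, 87.5, 88.6, 51.6],
    "GCN":       [81.0, 70.9, 79.0, 89.4, 92.3, 62.4],
    "GraphSAGE": [81.9, 70.9, 78.4, 86.2, 91.3, 62.9],
    "GIN":       [81.8, 70.4, 78.6, 89.6, 91.4, 58.8],
    "GAT":       [82.4, 71.4, 78.6, 89.9, 92.1, 58.9],
    "APPNP":     [82.6, 71.7, 79.2, 90.5, 90.3, 56.8],
    "GDC":       [82.3, 71.4, 79.6, 90.8, 91.7, 60.6],
    "GAtD-I":    [83.7, 72.3, 79.9, 92.9, 93.2, 64.7],
    "GAtD-II":   [84.0, 72.3, 79.8, 93.0, 93.1, 64.5],
}
OURS = ("GAtD-I", "GAtD-II")  # the two proposed variants


def margins_over_best_baseline(means, datasets, ours):
    """Return {dataset: (best baseline name, margin of best GAtD variant)}."""
    result = {}
    for i, name in enumerate(datasets):
        baselines = {m: v[i] for m, v in means.items() if m not in ours}
        best_baseline = max(baselines, key=baselines.get)
        best_ours = max(means[m][i] for m in ours)
        margin = round(best_ours - baselines[best_baseline], 1)
        result[name] = (best_baseline, margin)
    return result


MARGINS = margins_over_best_baseline(MEANS, DATASETS, OURS)
for dataset, (baseline, gain) in MARGINS.items():
    print(f"{dataset}: best baseline {baseline}, GAtD margin +{gain} points")
```

On these means, the margin over the strongest baseline ranges from 0.3 points on Pubmed to 2.2 points on CS.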

Table 3. Hyperparameter ($α$, $K$) settings of GAtD-I.

Hyperparameter	Cora	Citeseer	Pubmed	CS	Physics	Chameleon
$α$	0.1	0.05	0.1	0.5	0.2	0.05
$K$	15	20	15	10	15	2

Table 4. Hyperparameter ($α$, $K$) settings of GAtD-II.

Hyperparameter	Cora	Citeseer	Pubmed	CS	Physics	Chameleon
$α$	0.1	0.1	0.1	0.5	0.3	0.05
$K$	20	15	20	10	20	2
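To make the roles of $α$ and $K$ concrete, the sketch below runs a truncated personalized-PageRank-style propagation of the kind used by APPNP, $Z^{(k+1)} = (1-α) A Z^{(k)} + α Z^{(0)}$, for $K$ steps. This update is an assumption chosen for illustration; it is not claimed to be the exact GAtD rule. The row-stochastic matrix `A` stands in for a normalized attention matrix, and the toy graph and all names are hypothetical.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def diffuse(A, H, alpha, K):
    """Run K steps of Z <- (1 - alpha) * A @ Z + alpha * H, starting from Z = H."""
    Z = [row[:] for row in H]
    for _ in range(K):
        AZ = matmul(A, Z)
        Z = [[(1 - alpha) * AZ[i][j] + alpha * H[i][j]
              for j in range(len(H[0]))] for i in range(len(H))]
    return Z


# Toy 3-node path graph with self-loops; each row of A sums to 1,
# as softmax-normalized attention weights would (assumed, not from the paper).
A = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
H = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]  # one-hot node features

Z_one_step = diffuse(A, H, alpha=0.05, K=1)  # only direct neighbors contribute
Z_shallow = diffuse(A, H, alpha=0.05, K=2)   # alpha, K as listed for Chameleon
Z_deep = diffuse(A, H, alpha=0.1, K=20)      # alpha, K as listed for Cora in Table 4
```

In this sketch, node 0 receives no weight from node 2 after one step but a positive weight after two, so a larger $K$ widens the receptive field, while $α$ re-injects a fixed share of each node's own input features at every step.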

Table 5. Results of t-tests comparing GAtD with the strongest baseline on each dataset.

Datasets	GAtD vs. Strongest Baseline	Accuracy Improvement	t-Statistic	p-Value
Cora	vs. APPNP	+1.4%	3.26	0.005
Citeseer	vs. APPNP	+0.6%	1.08	0.154
Pubmed	vs. GDC	+0.3%	2.00	0.038

Table 6. 95% confidence intervals.

Datasets	GAtD	Strongest Baseline
Cora	83.98 [83.55, 84.41] %	82.56 [81.70, 83.42] %
Citeseer	72.30 [71.68, 72.92] %	71.72 [71.03, 72.41] %
Pubmed	79.91 [79.74, 80.08] %	79.59 [79.22, 79.96] %
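The accuracy improvements in Table 5 follow from the means in Table 6, and the intervals can be compared directly. The Python sketch below copies the Table 6 values, computes each difference in means, and checks whether the two 95% intervals overlap. The `"baseline"` key is an illustrative label for the comparison method in each row (APPNP for Cora and Citeseer, GDC for Pubmed, per Table 5); the function name is likewise hypothetical.

```python
# (mean, lower bound, upper bound) in percent, copied from Table 6.
TABLE6 = {
    "Cora":     {"GAtD": (83.98, 83.55, 84.41), "baseline": (82.56, 81.70, 83.42)},
    "Citeseer": {"GAtD": (72.30, 71.68, 72.92), "baseline": (71.72, 71.03, 72.41)},
    "Pubmed":   {"GAtD": (79.91, 79.74, 80.08), "baseline": (79.59, 79.22, 79.96)},
}


def summarize(table):
    """Return {dataset: (difference in means, whether the intervals overlap)}."""
    out = {}
    for name, row in table.items():
        g_mean, g_lo, g_hi = row["GAtD"]
        b_mean, b_lo, b_hi = row["baseline"]
        gain = round(g_mean - b_mean, 2)
        # Two closed intervals overlap when each starts before the other ends.
        overlap = g_lo <= b_hi and b_lo <= g_hi
        out[name] = (gain, overlap)
    return out


SUMMARY = summarize(TABLE6)
for dataset, (gain, overlap) in SUMMARY.items():
    print(f"{dataset}: +{gain} points, intervals overlap: {overlap}")
```

Only the Cora intervals are disjoint. Overlapping intervals do not by themselves imply a non-significant difference, which is consistent with Pubmed having overlapping intervals but a p-value of 0.038 in Table 5.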


