Entropy
  • Article
  • Open Access

4 July 2024

Graph Adaptive Attention Network with Cross-Entropy

School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430081, China

Abstract

Non-Euclidean data, such as social networks and citation relationships between documents, contain both node and structural information. The Graph Convolutional Network (GCN) can automatically learn node features and the association information between nodes. The core idea of the GCN is to aggregate node information along edges, thereby generating new node features. Two factors govern the update of node features: the number of neighbors of the central node, and the contribution of each neighbor to the central node. Because previous GCN methods do not simultaneously consider both the number of neighboring nodes and their differing contributions to the central node, we design an adaptive attention mechanism (AAM). To further enhance the representational capability of the model, we utilize Multi-Head Graph Convolution (MHGC). Finally, we adopt the cross-entropy (CE) loss function to measure the difference between the predicted node categories and the ground truth (GT); combined with backpropagation, this ultimately achieves accurate node classification. Based on the AAM, MHGC, and CE, we construct the novel Graph Adaptive Attention Network (GAAN). Experiments show that the GAAN achieves outstanding classification accuracy on the Cora, Citeseer, and Pubmed datasets.

1. Introduction

Much real-world data has an irregular spatial structure, known as non-Euclidean data: social networks, recommendation systems, citation relationships between documents, transportation planning, and natural language processing. This type of data carries both node information and structural information, which traditional deep learning networks such as the CNN, RNN, and Transformer cannot represent well. The Graph Convolutional Network (GCN) [1], shown in Figure 1, is a class of deep learning models for processing graph data, and significant progress has been made on graph data in recent years. In Figure 1, X1–X6 represent the nodes in the input-layer graph and X’1–X’6 the nodes in the output-layer graph; between them lies a hidden layer, and Y1–Y6 represent the predicted labels of the corresponding nodes. In the real world, many complex systems can be modeled as graph structures, such as social networks, recommendation systems, and the automatic modulation classification (AMC) of underwater acoustic communication signals [2]. The nodes and edges of these graphs represent entities and their relationships, which is of great significance for tasks such as information transmission, node classification, and graph classification. However, compared with traditional regular data such as images and text, graph data are more complex to process. Traditional Convolutional Neural Networks (CNNs) [3,4] and Recurrent Neural Networks (RNNs) [5,6] cannot be applied directly to graph structures, because the number of nodes and connections in a graph may be dynamic. Therefore, researchers have begun exploring new graph neural network models to process graph data effectively. Graph Convolutional Networks (GCNs) were proposed in this context, making an important breakthrough on graph data.
The main idea of GCNs is to use the neighbor information of nodes to update their representations, similar to traditional convolution operations but applied on graph structures. By weighted averaging of neighboring node features, the GCN achieves information transmission and node feature updates, allowing the model to better capture the local and global structures in the graph. More broadly, Graph Convolutional Networks (GCNs) are a special case of graph neural networks (GNNs).
Figure 1. The structure of GCN.
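The neighbor-averaging idea above can be sketched in a few lines of NumPy. This is a minimal illustration of the standard GCN propagation rule with symmetric degree normalization [1], not the GAAN itself; the ReLU activation and the toy chain graph are our own illustrative choices:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W).

    A: (N, N) adjacency matrix; H: (N, F) node features; W: (F, F') weights.
    Self-loops are added so each node also keeps its own features.
    """
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)                     # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)    # ReLU nonlinearity

# Toy 3-node chain graph 0-1-2 with one-hot input features
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.eye(3)
W = np.ones((3, 2))   # toy weight matrix
H_next = gcn_layer(A, H, W)
print(H_next.shape)   # (3, 2)
```

Because the two end nodes of the chain are structurally identical, they receive identical updated features, which makes the aggregation easy to verify by hand.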
Previous graph convolution methods have not fully considered the number and importance of neighboring nodes. To solve these problems, we propose the novel Graph Adaptive Attention Network (GAAN), and our main contributions are in the following three areas:
(1)
To generate different weights for each neighbor node of the central node, we design the novel adaptive attention mechanism (AAM).
(2)
Based on the AAM, we utilize Multi-head Graph Convolution (MHGC) to model and represent features better.
(3)
We adopt a cross-entropy loss function to model the bias between the predicted values and the ground truth, which greatly improves the classification accuracy.

3. Methodology

3.1. Overall

The overall structure of the Graph Adaptive Attention Network (GAAN) is illustrated in Figure 2. The process starts with a graph where each node represents an entity and edges denote relationships between them. In the encoding stage, the input graph’s information is transformed into features for each node, which the input layer then processes. Within the graph layer and hidden layer, the network uses an adaptive attention mechanism (AAM) to assign varying weights $\alpha_{ij}$ to the neighbors of a central node $h_i$. These weights signify how much each neighboring node $h_j$ contributes to the central node’s updated features, which are computed as an attention-weighted average of the neighboring node features. Finally, the output layers aggregate the refined node features to perform node classification, as illustrated in the output graph with colored nodes.
Figure 2. The structure of GAAN. The different color arrows represent Multi-head Graph Convolution (MHGC) and the different color circles represent separate categories of nodes.

3.2. Graph Layers with AAM

The specific computational process can be represented by Equations (1)–(5).
$h_{ij} = \mathrm{concat}(W h_i, W h_j), \quad i, j \in N$ (1)
where $h_i$ and $h_j$ represent the ith and jth nodes in the graph, respectively, and $W$ is the shared weight matrix that unifies the node features; $h_{ij}$ fuses the features of the ith and jth nodes. Based on $h_{ij}$, we can calculate the basic AAM between the ith and jth nodes with Equation (2).
$e_{ij} = \dfrac{a^{\top} h_{ij}}{\sqrt{D_i D_j}}$ (2)
where $D_i$ and $D_j$ represent the degrees of the ith and jth nodes, respectively, and $a$ is the weight vector that reshapes $h_{ij}$; $e_{ij}$ is the weight coefficient between the ith and jth nodes. Then, we adopt the softmax function to normalize $e_{ij}$, as shown in Equation (3).
$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \dfrac{\exp(\mathrm{LeakyReLU}(e_{ij}))}{\sum_{k \in N_i} \exp(\mathrm{LeakyReLU}(e_{ik}))}$ (3)
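Equations (1)–(3) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: we read the degree term in Equation (2) as division by sqrt(D_i · D_j), use a LeakyReLU negative slope of 0.2 (as in GAT [17]), and mask non-adjacent pairs with −inf so that the softmax in Equation (3) runs only over neighbors:

```python
import numpy as np

def attention_coefficients(h, W, a, A):
    """Sketch of Eqs. (1)-(3): degree-normalized attention over neighbors.

    h: (N, F) node features; W: (F, F') shared weight matrix;
    a: (2*F',) attention weight vector; A: (N, N) adjacency with self-loops.
    """
    Wh = h @ W                    # shared linear transform of all nodes
    D = A.sum(axis=1)             # node degrees
    N = h.shape[0]
    e = np.full((N, N), -np.inf)  # -inf masks non-neighbors in the softmax
    for i in range(N):
        for j in range(N):
            if A[i, j] > 0:
                h_ij = np.concatenate([Wh[i], Wh[j]])        # Eq. (1)
                e[i, j] = (a @ h_ij) / np.sqrt(D[i] * D[j])  # Eq. (2), our reading
    e = np.where(e > 0, e, 0.2 * e)                   # LeakyReLU (assumed slope 0.2)
    alpha = np.exp(e - e.max(axis=1, keepdims=True))  # Eq. (3): stable softmax
    return alpha / alpha.sum(axis=1, keepdims=True)

# Toy chain graph 0-1-2 with self-loops
rng = np.random.default_rng(0)
A = np.array([[1., 1., 0.],
              [1., 1., 1.],
              [0., 1., 1.]])
h = rng.standard_normal((3, 4))
W = rng.standard_normal((4, 2))
a = rng.standard_normal(4)
alpha = attention_coefficients(h, W, a, A)
print(alpha.sum(axis=1))   # each row sums to 1
```

Non-neighbors end up with weight exactly zero, so each row of `alpha` is a probability distribution over the node's neighborhood, ready to be used as the weights in Equation (4).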
Based on the normalized weight coefficients between nodes obtained from Equations (1)–(3), we can perform the graph convolution layer computation, as shown in Equation (4).
$h_i' = \sigma\left(\sum_{j \in N_i} \alpha_{ij} W h_j\right)$ (4)
Building on Equation (4), we utilize Multi-head Graph Convolution (MHGC), which performs the computation of Equation (4) with K independent heads and then averages the results to obtain the node features of the next layer. The computation process is shown in Equation (5).
$h_i' = \sigma\left(\dfrac{1}{K}\sum_{k=1}^{K} \sum_{j \in N_i} \alpha_{ij}^{k} W^{k} h_j\right)$ (5)
Based on the aforementioned formulas, we have defined the normalized weight coefficients and the inter-layer computation of the GAAN. We design a single hidden layer, thus constructing a GAAN with two layers in total.
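The layer computation of Equations (4) and (5) can be sketched as follows: each head applies Equation (4) with its own attention matrix and weight matrix, and the K head outputs are averaged before the nonlinearity. The ReLU choice for σ and the randomly generated, row-normalized attention matrices are illustrative assumptions:

```python
import numpy as np

def multi_head_layer(h, alphas, Ws, sigma=lambda x: np.maximum(x, 0.0)):
    """Sketch of Eqs. (4)-(5): K attention heads, averaged before sigma.

    h: (N, F) features; alphas: K attention matrices of shape (N, N), each
    row summing to 1 over neighbors; Ws: K weight matrices of shape (F, F').
    """
    heads = [alpha @ h @ W for alpha, W in zip(alphas, Ws)]  # Eq. (4) per head
    return sigma(np.mean(heads, axis=0))                     # Eq. (5): average, then sigma

# Toy example: K = 2 heads, 3 nodes, hypothetical random attention matrices
rng = np.random.default_rng(1)
h = rng.standard_normal((3, 4))
alphas, Ws = [], []
for _ in range(2):
    raw = rng.random((3, 3))
    alphas.append(raw / raw.sum(axis=1, keepdims=True))  # rows normalized as in Eq. (3)
    Ws.append(rng.standard_normal((4, 2)))
h_next = multi_head_layer(h, alphas, Ws)
print(h_next.shape)   # (3, 2)
```

Averaging (rather than concatenating) the heads keeps the output dimensionality equal to a single head's, which is convenient for the final output layer.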

3.3. Cross Entropy Loss

In the node classification task, our goal is to correctly categorize each node into predefined categories. Suppose the model outputs the predicted probability that node $v_i$ belongs to each category as $\hat{y}_i$, and the true category label is $y_i$, where $\hat{y}_i \in \mathbb{R}^C$, $y_i \in \mathbb{R}^C$, C is the number of categories, and N is the number of nodes. The cross-entropy loss function is defined as Equation (6).
$L = -\dfrac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\log(\hat{y}_{i,c})$ (6)
where N is the total number of nodes, C is the number of categories, $y_i$ represents the true category distribution of node $v_i$, and $\hat{y}_i$ is the predicted category distribution.
The derivation of the cross-entropy loss function is as follows. At the final output layer of the network, the feature representation $h_i$ of each node $v_i$ passes through a fully connected layer, and the softmax function is applied to generate the predicted probability distribution $\hat{y}_i$, as described in Equation (7).
$\hat{y}_{i,c} = \dfrac{\exp(h_i W_c + b_c)}{\sum_{c'=1}^{C} \exp(h_i W_{c'} + b_{c'})}$ (7)
where $W_c$ and $b_c$ are the weight and bias of category c, respectively, and $h_i$ is the feature representation of node $v_i$.
The cross-entropy loss measures the difference between the true category distribution and the predicted probability distribution. For each node v i , the loss is defined as Equation (8).
$L_i = -\sum_{c=1}^{C} y_{i,c}\log(\hat{y}_{i,c})$ (8)
To measure the classification performance over the whole graph, the average loss over all nodes is calculated as in Equation (6).
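The output layer and loss of Equations (6)–(8) can be checked with a small NumPy sketch. With zero output-layer weights the logits are uniform, so the loss reduces to log C, a handy sanity check:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    ez = np.exp(z)
    return ez / ez.sum(axis=1, keepdims=True)

def cross_entropy_loss(H, W, b, y_true):
    """Eqs. (6)-(8): softmax output layer followed by the averaged CE loss.

    H: (N, F) final node features; W: (F, C) and b: (C,) output-layer
    parameters; y_true: (N, C) one-hot ground-truth labels.
    """
    y_hat = softmax(H @ W + b)                        # Eq. (7)
    per_node = -(y_true * np.log(y_hat)).sum(axis=1)  # Eq. (8)
    return per_node.mean()                            # Eq. (6)

# Sanity check: zero weights give uniform predictions, so L = log(C)
H = np.array([[1.0, 0.0],
              [0.0, 1.0]])
W = np.zeros((2, 3))
b = np.zeros(3)
y = np.array([[1., 0., 0.],
              [0., 1., 0.]])
loss = cross_entropy_loss(H, W, b, y)
print(round(loss, 4))   # 1.0986 = log(3)
```

Because $y_i$ is one-hot, the inner sum in Equation (8) picks out only the log-probability of the true class, which is exactly what the code computes.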
The overall process can be described in the following four steps.
(1)
Encoding features: encode the node features of the input graph to obtain the initial feature representation of the nodes.
(2)
Attention mechanism: in the graph and hidden layers, use the adaptive attention mechanism to compute a weighted average of neighboring node features, obtaining the updated node representation $h_i'$.
(3)
Fully connected layer: in the output layer, apply a fully connected transformation to the updated node representation, followed by the softmax function, to obtain the predicted probabilities $\hat{y}_i$.
(4)
Calculate the loss: use the cross-entropy loss function to compute the difference between the true category distribution $y_i$ and the predicted distribution $\hat{y}_i$, and average over all nodes to obtain the overall loss L.
Through the above process, we can effectively classify the nodes in the graph and optimize the network parameters through backpropagation to achieve the accurate classification of nodes.

4. Experiments

4.1. Datasets

Cora, Citeseer, and Pubmed are widely used citation network datasets in graph-based machine learning research. In this paper, we use the Cora, Citeseer, and Pubmed datasets, the specific statistics of which are shown in Table 1. Cora consists of 2708 machine learning publications categorized into 7 classes, with each paper cited by or citing other papers. Each node represents a publication, and edges denote citation relationships; nodes have 1433 binary word attributes. Citeseer includes 3327 research papers grouped into 6 categories. Similar to Cora, nodes signify publications and edges indicate citations; each node has a 3703-dimensional binary word vector representing the presence or absence of specific words. Pubmed contains 19,717 scientific publications from the PubMed database, classified into 3 diabetes-related categories. The nodes represent papers, connected by citation edges, and each node is described by a TF-IDF-weighted word vector from a 500-word dictionary.
Table 1. Summary of the datasets used in our experiments.

4.2. Ablation Experiments

As shown in Table 2, we conduct ablation experiments on the Cora dataset. Table 2 shows the accuracies (Accuracy%) achieved by different experimental groups (Group1 to Group4) when applying specific modules (AAM and MHGC) and setting different dimensionalities of the node feature vectors in hidden layers (n_hidden = 64, 96, and 128). Group1 used neither of the two modules; under the three n_hidden settings its accuracies were 83.9%, 83.5%, and 83.0%. Group2 and Group3 used the AAM and MHGC modules, respectively, with accuracies ranging from approximately 84.6% to 85.1% across n_hidden settings; the AAM and MHGC increased accuracy by about 1% and 1.2%, respectively. Group4 combined both modules and achieved the highest accuracies under all n_hidden settings, reaching up to 85.6%. These results show that our GAAN significantly improves accuracy.
Table 2. Ablation experimental results on Cora. ‘n’ and ‘n_hidden’ represent the number of graph convolution heads and the dimensionality of node feature vectors in hidden layers, respectively. “×” represents not including the corresponding module and “√” represents including the corresponding module.

4.3. Comparison with Other Methods

Table 3 presents the classification accuracy (%) of various methods on three datasets: Cora, Citeseer, and Pubmed. The methods compared include MLP, SemiEmb, DeepWalk, ICA, Planetoid, Chebyshev, GCN, MoNet, and GAT. GAAN (no MHGC) achieves an accuracy of 84.9% on Cora, 74.3% on Citeseer, and 79.5% on Pubmed. GAAN (with MHGC) further improves the performance, achieving the highest accuracy of 85.6% on Cora, 75.5% on Citeseer, and 80.5% on Pubmed, indicating the effectiveness of integrating MHGC. GAT also shows strong results with 83.7% on Cora, 73.2% on Citeseer, and 79.3% on Pubmed. According to Table 3, we can see that GAAN achieves better performance than other methods on Cora and Citeseer datasets.
Table 3. Compared experimental results on Cora, Citeseer, and Pubmed.

5. Conclusions and Future Work

We have proposed an AAM and MHGC to construct the GAAN, which accounts for the differing contributions of neighboring nodes to the central node. The experimental results show that our method achieves superior accuracy. The Graph Convolutional Network (GCN) is a powerful deep learning model specifically designed to handle graph-structured data. By extending convolution operations from traditional Euclidean data spaces to non-Euclidean structures such as graphs, the GCN has a wide range of real-life applications. For example, in social network analysis, the GCN can help identify structural patterns within communities and relationships between nodes. In recommendation systems, the GCN can help understand user behavior and preferences, thereby generating more accurate recommendations. In chemical molecule classification, the GCN can help predict the properties and behaviors of molecules, thereby accelerating the discovery and development of new drugs.
Over-smoothing occurs when multiple layers are stacked, causing the features of all nodes to become nearly identical. However, many tasks require capturing the features of distant neighbors, so stacking multiple GCN layers is unavoidable. Designing multi-layer stacking strategies that prevent over-smoothing is therefore an urgent direction for future work.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Data Availability Statement

The Cora dataset is available at https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz, accessed on 29 June 2020. The CiteSeer dataset is available at https://linqs-data.soe.ucsc.edu/public/lbc/citeseer.tgz, accessed on 29 June 2020. The PubMed dataset is available at https://linqs-data.soe.ucsc.edu/public/lbc/pubmed_diabetes.tgz, accessed on 29 June 2020.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  2. Yao, X.; Yang, H.; Sheng, M. Feature Fusion Based on Graph Convolution Network for Modulation Classification in Underwater Communication. Entropy 2023, 25, 1096. [Google Scholar] [CrossRef] [PubMed]
  3. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  4. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  5. Zaremba, W.; Sutskever, I.; Vinyals, O. Recurrent neural network regularization. arXiv 2014, arXiv:1409.2329. [Google Scholar]
  6. Graves, A. Long short-term memory. In Supervised Sequence Labelling with Recurrent Neural Networks; Springer: Berlin, Heidelberg, 2012; pp. 37–45. [Google Scholar]
  7. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv 2013, arXiv:1312.6203. [Google Scholar]
  8. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2016; Volume 29. [Google Scholar]
  9. Hammond, D.K.; Vandergheynst, P.; Gribonval, R. Wavelets on graphs via spectral graph theory. Appl. Comput. Harmon. Anal. 2011, 30, 129–150. [Google Scholar] [CrossRef]
  10. Henaff, M.; Bruna, J.; LeCun, Y. Deep convolutional networks on graph-structured data. arXiv 2015, arXiv:1506.05163. [Google Scholar]
  11. Levie, R.; Monti, F.; Bresson, X.; Bronstein, M.M. Cayleynets: Graph convolutional neural networks with complex rational spectral filters. IEEE Trans. Signal Process. 2018, 67, 97–109. [Google Scholar] [CrossRef]
  12. Shuman, D.I.; Narang, S.K.; Frossard, P.; Ortega, A.; Vandergheynst, P. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Process. Mag. 2013, 30, 83–98. [Google Scholar] [CrossRef]
  13. Bianchi, F.M.; Grattarola, D.; Livi, L.; Alippi, C. Graph neural networks with convolutional arma filters. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3496–3507. [Google Scholar] [CrossRef] [PubMed]
  14. Defferrard, M.; Milani, F.; Gusset, F.; Perraudin, N. Deep Networks on Toric Graphs. arXiv 2018, arXiv:1808.03965. [Google Scholar]
  15. Spielman, D. Spectral graph theory. Comb. Sci. Comput. 2012, 18, 18. [Google Scholar]
  16. Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30. [Google Scholar]
  17. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
  18. Monti, F.; Bronstein, M.; Bresson, X. Geometric matrix completion with recurrent multi-graph neural networks. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30. [Google Scholar]
  19. Monti, F.; Boscaini, D.; Masci, J.; Rodola, E.; Svoboda, J.; Bronstein, M.M. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5115–5124. [Google Scholar]
  20. Wu, F.; Souza, A.; Zhang, T.; Fifty, C.; Yu, T.; Weinberger, K. Simplifying graph convolutional networks. In Proceedings of the 36th International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 10–15 June 2019. [Google Scholar]
  21. Xu, K.; Li, C.; Tian, Y.; Sonobe, T.; Kawarabayashi, K.I.; Jegelka, S. Representation learning on graphs with jumping knowledge networks. In Proceedings of the 35th International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
  22. Xu, K.; Hu, W.; Leskovec, J.; Jegelka, S. How powerful are graph neural networks? arXiv 2018, arXiv:1810.00826. [Google Scholar]
  23. Liao, R.; Zhao, Z.; Urtasun, R.; Zemel, R.S. Lanczosnet: Multi-scale deep graph convolutional networks. arXiv 2019, arXiv:1901.01484. [Google Scholar]
  24. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. No. 1. [Google Scholar]
  25. Atwood, J.; Towsley, D. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2016; Volume 29. [Google Scholar]
  26. Taud, H.; Mas, J.F. Multilayer perceptron (MLP). In Geomatic Approaches for Modeling Land Change Scenarios; Springer: Cham, Switzerland, 2018; pp. 451–455. [Google Scholar]
  27. Weston, J.; Ratle, F.; Collobert, R. Deep learning via semi-supervised embedding. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008. [Google Scholar]
  28. Perozzi, B.; Al-Rfou, R.; Skiena, S. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014. [Google Scholar]
  29. Bandyopadhyay, S.; Maulik, U.; Holder, L.B.; Cook, D.J. Link-based classification. In Advanced Methods for Knowledge Discovery from Complex Data; Springer: Dordrecht, The Netherlands, 2005; pp. 189–207. [Google Scholar]
  30. Yang, Z.; Cohen, W.; Salakhudinov, R. Revisiting semi-supervised learning with graph embeddings. In Proceedings of the 33rd International Conference on Machine Learning, PMLR, New York, NY, USA, 19–24 June 2016. [Google Scholar]
  31. Mohamadi, Y.; Chehreghani, M.H. Strong Transitivity Relations and Graph Neural Networks. arXiv 2024, arXiv:2401.01384. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
