3.4.2. Recurrent Graph Neural Networks
One of the first models applying deep neural networks to graph representation learning was the graph neural network (GNN). The main idea of GNNs is that messages are exchanged between target nodes and their neighbors until a stable equilibrium is reached.
Table 9 summarizes graph recurrent autoencoder models.
Scarselli et al. [44,45] proposed a GNN model which could learn embeddings directly for different kinds of graphs, such as acyclic/cyclic and directed/undirected graphs. These models assumed that if nodes are directly connected in a graph, the distance between them should be minimized in the latent space. The GNN models used a data diffusion mechanism to aggregate signals from neighbor nodes (units) to target nodes. Therefore, the state of a node describes the context of its neighbors and can be used to learn embeddings. Mathematically, given a node $v$ in a graph, the state of $v$ and its output can be defined as:
\[ h_v = f_w\!\left(x_v, x_{co[v]}, h_{ne[v]}, x_{ne[v]}\right), \qquad o_v = g_w\!\left(h_v, x_v\right), \]
where $f_w$ and $g_w$ are transition functions, and $x_v$, $x_{co[v]}$ denote the labels of node $v$ and of its incident edges, respectively. By considering the state $h_v^{(l)}$ that is revised by the iterative propagation process, the state of $v$ and its output at layer $l$ could be defined as:
\[ h_v^{(l)} = f_w\!\left(x_v, x_{co[v]}, h_{ne[v]}^{(l-1)}, x_{ne[v]}\right), \qquad o_v^{(l)} = g_w\!\left(h_v^{(l)}, x_v\right). \]
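The fixed-point propagation behind this recurrent GNN can be illustrated with a minimal sketch: node states are repeatedly updated from the node labels and the neighbor states until they stop changing. This is a toy illustration under assumed shapes and a tanh transition, not the authors' implementation.

```python
import numpy as np

def recurrent_gnn_states(adj, x, W_self, W_neigh, tol=1e-5, max_iter=100):
    """Iterate h_v <- tanh(W_self x_v + W_neigh * sum of neighbor states)
    until a (near) fixed point is reached, as in recurrent GNNs."""
    n, d = x.shape[0], W_self.shape[0]
    h = np.zeros((n, d))
    for _ in range(max_iter):
        neigh = adj @ h                                   # sum of neighbor states (adj is n x n)
        h_new = np.tanh(x @ W_self.T + neigh @ W_neigh.T)
        if np.max(np.abs(h_new - h)) < tol:               # steady state reached
            return h_new
        h = h_new
    return h

# toy usage: a 4-node cycle graph with 3-dimensional node labels
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], float)
x = rng.normal(size=(4, 3))
# small weights keep the update a contraction, so the iteration converges as the GNN theory requires
W_self, W_neigh = 0.1 * rng.normal(size=(8, 3)), 0.1 * rng.normal(size=(8, 8))
h = recurrent_gnn_states(adj, x, W_self, W_neigh)
```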
However, one of the limitations of GNNs is that the model learns node embeddings as a single output, which causes problems when sequence outputs are required. Several studies tried to improve GNNs using recurrent graph neural networks [17,48,49]. Unlike GNNs, which produce a single output for each entity in a graph, Li et al. [17] attempted to output sequences by applying gated recurrent units. The model used two gated graph neural networks, one to predict the output and one to predict the following hidden states. Therefore, the output of node $v$ at layer $l$ could be computed as:
\[ h_v^{(l)} = \mathrm{GRU}\!\left( h_v^{(l-1)}, \sum_{u \in \mathcal{N}(v)} W h_u^{(l-1)} \right), \]
where $\mathcal{N}(v)$ denotes the set of neighbors of node $v$.
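A gated update of this kind can be sketched with a GRU cell. The snippet below is a simplified illustration (single edge type, hypothetical dimensions), not the exact architecture of [17]:

```python
import torch
import torch.nn as nn

class GatedGraphLayer(nn.Module):
    """One GGNN-style propagation step: aggregate neighbor states,
    then update each node state with a GRU cell."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.msg = nn.Linear(hidden_dim, hidden_dim, bias=False)  # W in the update rule
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)

    def forward(self, adj, h):
        # adj: (n, n) adjacency matrix, h: (n, hidden_dim) node states
        m = adj @ self.msg(h)          # summed messages from neighbors
        return self.gru(m, h)          # GRU treats messages as input, states as hidden

# toy usage: a few propagation steps on a random 5-node graph
n, d = 5, 16
adj = (torch.rand(n, n) < 0.4).float()
h = torch.randn(n, d)
layer = GatedGraphLayer(d)
for _ in range(3):
    h = layer(adj, h)
```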
Wang et al. [49] proposed the Topo-LSTM model to capture the diffusion structure by representing graphs as diffusion cascades that distinguish active and inactive nodes. Given a cascade sequence $s$, the hidden state of a node $v$ at time $t$ is computed from two input aggregations, $p$ and $q$, over the active nodes connected with $v$ and not connected with $v$, respectively, where the connected nodes are drawn from the precedent set of active nodes at time $t$ and the disconnected ones from the set of nodes activated before time $t$.
Figure 15 presents an example of the Topo-LSTM model. However, these models could not capture the global graph structure since they only capture the structure within a $k$-hop distance. Several models have been proposed that combine recurrent graph neural network architectures with random-walk sampling to capture higher-order structural information [48,93]. Huang et al. [93] introduced the GraphRNA model, which combines a joint random-walk strategy on attributed graphs with recurrent graph networks. One strength of the random-walk sampling strategy is that it captures the global structure. By treating the node attributes as a bipartite network, the model can perform joint random walks on the bipartite attribute matrix to capture the global structure of graphs. After sampling the node attributes and graph structure through joint random walks, the model uses recurrent graph neural networks to learn embeddings. Similar to the GraphRNA model, Zhang et al. [48] presented the SHNE model to analyze attribute semantics and global structure in attributed graphs. The SHNE model also used a random-walk strategy to capture the global structure of graphs. However, the main difference between SHNE and GraphRNA is that the SHNE model first applies a GRU (gated recurrent unit) model to learn the attributes and then combines them with the graph structure via random-walk sampling.
Since the strength of the autoencoder architecture is to learn compressed representations, several studies [57,205] aimed to combine RGNNs and autoencoders to learn node embeddings in weighted graphs. For instance, Seo and Lee [57] adopted an LSTM autoencoder to learn node embeddings for weighted graphs. They used the BFS algorithm to traverse the nodes of a graph and extract node-weight sequences as inputs for the LSTM autoencoder. The model could then leverage graph structure reconstruction through the autoencoder architecture and the node attributes through the LSTM model.
Figure 16 presents the sampling strategy of this model, which lists the nodes and their respective weighted edges. To capture local and global graph structure, Aynaz et al. [205] proposed a sequence-to-sequence autoencoder model, which can represent inputs of arbitrary length. The LSTM-based autoencoder architecture consists of two main parts: the encoder layer $f_{\mathrm{enc}}$ and the decoder layer $f_{\mathrm{dec}}$. For the sequence-to-sequence autoencoder, at each time step $t$, the hidden vectors in the encoder and decoder layers can be defined as:
\[ h_t^{\mathrm{enc}} = f_{\mathrm{enc}}\!\left( x_t, h_{t-1}^{\mathrm{enc}} \right), \qquad h_t^{\mathrm{dec}} = f_{\mathrm{dec}}\!\left( \hat{x}_{t-1}, h_{t-1}^{\mathrm{dec}} \right), \]
where $h_t^{\mathrm{enc}}$ and $h_t^{\mathrm{dec}}$ are the hidden states at step $t$ in the encoder and decoder layers, respectively. To generate the sequences of nodes, the model implemented different sampling strategies, including random walks, shortest paths, and breadth-first search with the WL algorithm to encode the information of node labels.
Since the aforementioned models learn node embeddings for static graphs, Shima et al. [203] presented the LSTM-Node2Vec model, which combines an LSTM-based autoencoder architecture with the Node2Vec model to learn embeddings for dynamic graphs. The idea of LSTM-Node2Vec is to use an LSTM autoencoder to preserve the history of node evolution via temporal random-walk sampling, and then adopt the Node2Vec model to generate the vector embeddings for the new graphs. Figure 17 presents a temporal random-walk sampling strategy for traversing a dynamic graph.
Jinyin et al. [204] presented the E-LSTM-D (Encoder-LSTM-Decoder) model to learn embeddings for dynamic graphs by combining the autoencoder architecture and LSTM layers. Given a set of graph snapshots, the objective of the model is to learn a mapping function from past snapshots to the next snapshot. The model takes the adjacency matrix of the $i$-th graph in the series of snapshots as the input of the autoencoder, and the encoder layer compresses it through fully connected layers with a nonlinear activation function. For the decoder layer, the model tries to reconstruct the original adjacency matrix from the vector embeddings produced by the stacked LSTM, which captures the structure of the current graph. Similar to the E-LSTM-D model, Palash et al. [201] proposed a variant of the Dyngraph2Vec model, named Dyngraph2VecAERNN (Dynamic Graph to Vector Autoencoder Recurrent Neural Network), which also takes the adjacency matrix as the model input. However, the critical difference between the E-LSTM-D model and the Dyngraph2VecAERNN model is that the latter feeds the LSTM layers directly into the encoder part to learn embeddings, and its decoder is composed of fully connected neural network layers that reconstruct the inputs.
There are several advantages of recurrent graph neural networks compared to shallow learning techniques:
Diffusion pattern and multiple relations: RGNNs show superior learning ability when dealing with diffused information, and they can handle multi-relational graphs in which a single node has many relations. This is achieved through the ability to update the state of each node in each hidden layer.
Parameter sharing: RGNNs share parameters across different locations, which enables them to handle sequential node inputs. This reduces computational complexity during training by using fewer parameters and increases the performance of the models.
However, one of the disadvantages of RGNNs is that these models use recurrent layers with the same weights during the weight-update process. This leads to inefficiencies in representing different relationship constraints between neighbor and target nodes. To overcome this limitation, convolutional GNNs, which use different weights in each hidden layer, have shown remarkable ability in recent years.
3.4.3. Convolutional Graph Neural Networks
CNNs have achieved remarkable success in the image processing area. Since image data can be considered a special case of graph data, convolution operators can be defined and applied to graph mining. There are two strategies for applying convolution operators to the graph domain. The first strategy is based on graph spectral theory, which transforms graph entities from the spatial domain to the spectral domain and applies convolution filters in the spectral domain. The other strategy directly employs the convolution operators in the graph (spatial) domain.
Table 10 summarizes spectral CGNN models.
When computing power is insufficient to implement convolution operators directly on the graph domain, several studies transform the graph data to the spectral domain and apply filtering operators there to reduce computational time [18,55,213]. The signal filtering process acts as feature extraction on the Laplacian matrix. Most models consider simple, undirected graphs and represent the graph data as a (normalized) Laplacian matrix:
\[ L = I_N - D^{-1/2} A D^{-1/2}, \]
where $D$ denotes the diagonal matrix of node degrees and $A$ is the adjacency matrix. The matrix $L$ is a symmetric positive semidefinite matrix describing the graph structure. Considering a matrix $U$ as the graph Fourier basis, the Laplacian matrix can then be decomposed into three components:
\[ L = U \Lambda U^{T}, \]
where $\Lambda$ is the diagonal matrix of eigenvalues, which denotes the spectral representation of the graph topology, and $U$ is the eigenvector matrix. The filter function $g_\theta(\Lambda)$ resembles a $k$-order polynomial, and the spectral convolution acts as a diffusion convolution in the graph domain. The spectral graph convolution of an input $x$ with a filter $g_\theta$ is defined as:
\[ g_\theta * x = U\, g_\theta(\Lambda)\, U^{T} x, \]
where $*$ is the convolution operation. Bruna et al. [56] transformed the graph data to the spectral domain and applied filter operators on the Fourier basis. The hidden state at layer $l$ could be defined as:
\[ h_j^{(l+1)} = \sigma\!\left( V \sum_{i=1}^{f_l} F_{i,j}^{(l)} V^{T} h_i^{(l)} \right), \qquad j = 1, \ldots, f_{l+1}, \]
where $F_{i,j}^{(l)}$ is a diagonal matrix of learnable filter parameters at layer $l$, $f_l$ denotes the number of filters at layer $l$, and $V$ denotes the eigenvectors of the matrix $L$. Typically, most of the useful spectral information is concentrated in the first $d$ eigenvectors. Therefore, we can keep only the first $d$ columns of the matrix $V$, and the number of parameters that has to be trained per filter is reduced to $d$.
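The spectral filtering pipeline (eigendecomposition, filtering in the Fourier domain, transforming back) can be sketched as follows. This is an illustrative example with a freely chosen low-pass filter, not a specific published model:

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for an undirected adjacency matrix A."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    return np.eye(len(A)) - (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

def spectral_filter(A, x, g):
    """Apply a spectral filter g(lambda) to the graph signal x: U g(Lambda) U^T x."""
    L = normalized_laplacian(A)
    lam, U = np.linalg.eigh(L)              # eigenvalues and graph Fourier basis
    x_hat = U.T @ x                         # graph Fourier transform of the signal
    return U @ (g(lam)[:, None] * x_hat)    # filter in the spectral domain, transform back

# toy usage: a low-pass filter that attenuates high graph frequencies
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], float)
x = np.random.randn(4, 2)                   # a 2-channel signal on 4 nodes
x_smooth = spectral_filter(A, x, g=lambda lam: np.exp(-2.0 * lam))
```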
Several studies focused on improving the spectral filters to reduce computational time and capture more graph structure in the spectral domain [210,216]. For instance, Defferrard et al. [216] presented a strategy to re-design the convolutional filters for graphs. Since the spectral filter $g_\theta$ indeed generates a kernel on graphs, the key idea is to consider $g_\theta$ as a polynomial that yields a $k$-localized kernel:
\[ g_\theta(\Lambda) = \sum_{i=0}^{k-1} \theta_i \Lambda^{i}, \]
where $\theta \in \mathbb{R}^{k}$ is a vector of polynomial coefficients. This $k$-localized kernel provides a circular distribution of weights in the kernel from a target node to its $k$-hop neighbors in the graph.
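A polynomial (Chebyshev-style) filter avoids the explicit eigendecomposition, because it only needs repeated multiplications with the Laplacian. A minimal sketch, using the recurrence $T_i(x) = 2xT_{i-1}(x) - T_{i-2}(x)$ on a rescaled Laplacian, is shown below; the `normalized_laplacian` helper from the previous sketch can supply $L$:

```python
import numpy as np

def chebyshev_filter(L, x, theta, lmax=2.0):
    """Apply sum_i theta_i T_i(L_scaled) x, where T_i are Chebyshev polynomials
    and L_scaled = 2 L / lmax - I keeps the spectrum inside [-1, 1].
    Assumes len(theta) >= 2."""
    n = L.shape[0]
    L_s = 2.0 * L / lmax - np.eye(n)
    t_prev, t_curr = x, L_s @ x                   # T_0 x = x, T_1 x = L_s x
    out = theta[0] * t_prev + theta[1] * t_curr
    for i in range(2, len(theta)):
        t_next = 2.0 * L_s @ t_curr - t_prev      # Chebyshev recurrence
        out += theta[i] * t_next
        t_prev, t_curr = t_curr, t_next
    return out
```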
Unlike the above models, Zhuang and Ma [211] tried to capture the local and global graph structure by introducing two convolutional filters. The first convolutional operator, a local-consistency convolution, captures the local graph structure. The output of a hidden layer could then be defined as:
\[ H^{(l+1)} = \sigma\!\left( \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{(l)} W^{(l)} \right), \]
where $\hat{A}$ denotes the adjacency matrix with self-loops and $\hat{D}$ is the diagonal matrix presenting the degree information of the nodes. In addition to the first filter, the second filter aims to capture the global structure of graphs, which could be defined as:
\[ H^{(l+1)} = \sigma\!\left( D_P^{-1/2} P D_P^{-1/2} H^{(l)} W^{(l)} \right), \]
where $P$ denotes the PPMI matrix, which can be calculated from a frequency matrix obtained via random-walk sampling.
Most of the above models learn node embeddings by transforming the graph data to the spectral domain and applying convolutional filters, which increases computational complexity. In 2016, Kipf and Welling [18] introduced graph convolutional networks (GCNs), which are considered a bridge between the spectral and spatial approaches. The spectral filter $g_\theta$ and the hidden layers of the GCN model, following the layer-wise propagation rule, can be defined as follows:
\[ g_\theta * x \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{L})\, x, \qquad H^{(l+1)} = \sigma\!\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right), \]
where $\tilde{L} = \frac{2}{\lambda_{\max}} L - I_N$ and $\lambda_{\max}$ is the largest eigenvalue of the Laplacian matrix $L$, $\tilde{A} = A + I_N$ and $\tilde{D}$ is its degree matrix, $\theta$ is the vector of Chebyshev coefficients, and $T_k$ are the Chebyshev polynomials, which could be defined as:
\[ T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x), \]
where $T_0(x) = 1$ and $T_1(x) = x$. Consequently, with the first-order approximation ($K = 1$, $\lambda_{\max} \approx 2$), the convolution filter of an input $x$ is defined as:
\[ g_\theta * x \approx \theta\, \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}\, x. \]
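The propagation rule above amounts to multiplying the feature matrix by the symmetrically normalized adjacency matrix with self-loops. A minimal sketch (random weights, not a trained model):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_tilde = A + np.eye(A.shape[0])                      # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    A_hat = (A_tilde * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ H @ W, 0.0)                 # ReLU activation

# toy usage: two stacked layers on a 4-node graph with 3-dimensional features
rng = np.random.default_rng(42)
A = np.array([[0, 1, 0, 0], [1, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]], float)
X = rng.normal(size=(4, 3))
H1 = gcn_layer(A, X, rng.normal(size=(3, 8)))
H2 = gcn_layer(A, H1, rng.normal(size=(8, 2)))            # 2-dimensional node embeddings
```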
Although spectral CGNNs are effective in applying convolution filters on the spectral domain, they have several limitations as follows:
Computational complexity: The spectral decomposition of the Laplacian matrix into matrices containing eigenvectors is time-consuming. During the training process, the products of the $U$, $g_\theta(\Lambda)$, and $U^{T}$ matrices also increase the training time.
Difficulties in handling large-scale graphs: The number of parameters of the kernels scales with the number of nodes in the graph; therefore, spectral models are not suitable for large-scale graphs.
Difficulties in handling graph dynamicity: To apply convolution filters to graphs and train the model, the graph data must be transformed into the spectral domain in the form of a Laplacian matrix. Therefore, when the graph data change, as in dynamic graphs, the model cannot capture the changes.
Motivated by the limitations of spectral domain-based CGNNs, spatial models apply convolution operators directly to the graph domain and learn node embeddings effectively. Recently, various spatial CGNNs have been proposed, showing remarkable results in handling different graph structures compared to spectral models [52,95]. Based on the aggregation mechanism and on how the convolution operators are applied, we divide CGNN models into the following main groups: (i) aggregation mechanism improvement, (ii) training efficiency improvement, (iii) attention-based models, and (iv) autoencoder-CGNN models. Table 11 and Table 12 present a summary of spatial CGNN models for all types of graphs, ranging from homogeneous to heterogeneous graphs.
Gilmer et al. [222] presented the MPNN (Message-Passing Neural Network) model to employ the concept of messages passed between nodes in a graph. Given a pair of nodes $(u, v)$, a message from $u$ to $v$ is calculated by a message function $M_l$. During the message-passing phase, the hidden state of a node $v$ at layer $l$ is calculated from the messages passed by its neighbors, which could be defined as:
\[ h_v^{(l+1)} = \sigma\!\left( h_v^{(l)}, \sum_{u \in \mathcal{N}(v)} M_l\!\left( h_v^{(l)}, h_u^{(l)}, e_{uv} \right) \right), \]
where $M_l$ denotes the message function at layer $l$, which could be an MLP, $\sigma$ is an update (activation) function, and $\mathcal{N}(v)$ denotes the set of neighbors of node $v$.
Most previous graph embedding models work in a transductive setting, which cannot handle unseen nodes. In 2017, Hamilton et al. [22] introduced the GraphSAGE model (SAmple and aggreGatE) to generate inductive node embeddings in an unsupervised manner. The hidden state at layer $l$ of a node $v$ could be defined as:
\[ h_v^{(l)} = \sigma\!\left( W^{(l)} \cdot \mathrm{CONCAT}\!\left( h_v^{(l-1)}, \mathrm{AGG}_l\!\left( \left\{ h_u^{(l-1)} : u \in \mathcal{N}(v) \right\} \right) \right) \right), \]
where $\mathcal{N}(v)$ denotes the set of neighbors of node $v$ and $h_u^{(l-1)}$ is the hidden state of node $u$ at layer $l-1$. The function $\mathrm{AGG}_l$ is a differentiable aggregator function. There are three aggregators (Mean, LSTM, and Pooling) to aggregate information from neighboring nodes, and the nodes are separated into mini-batches. Algorithm 1 presents the algorithm of the GraphSAGE model.
Algorithm 1: GraphSAGE algorithm. The model first takes the node features as inputs. For each layer, the model aggregates the information from the neighbors of each node and then updates its hidden state.
Input: the graph $G = (V, E)$; the input features $x_v$ of each node $v \in V$; the depth $L$ of hidden layers; differentiable aggregator functions $\mathrm{AGG}_l$, $l \in \{1, \ldots, L\}$; the set of neighbors $\mathcal{N}(v)$ of each node $v$. Output: vector representations $z_v$ for all $v \in V$.
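A mean-aggregator variant of this procedure can be sketched as follows. This is illustrative only; the published model additionally normalizes embeddings and samples a fixed number of neighbors per node:

```python
import numpy as np

def graphsage_mean_layer(neighbors, H, W_self, W_neigh):
    """One GraphSAGE-style layer with the mean aggregator:
    h_v = ReLU(W_self h_v + W_neigh mean({h_u : u in N(v)}))."""
    out = np.zeros((H.shape[0], W_self.shape[0]))
    for v, nbrs in neighbors.items():
        agg = H[nbrs].mean(axis=0) if nbrs else np.zeros(H.shape[1])
        out[v] = np.maximum(W_self @ H[v] + W_neigh @ agg, 0.0)
    return out

# toy usage: adjacency lists for a 4-node graph, 3-dimensional input features
neighbors = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}
rng = np.random.default_rng(1)
H = rng.normal(size=(4, 3))
H = graphsage_mean_layer(neighbors, H, rng.normal(size=(8, 3)), rng.normal(size=(8, 3)))
```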
Lo et al. [231] applied the GraphSAGE model to detect attackers in computer network systems with a model named E-GraphSAGE. The main difference between the two models is that E-GraphSAGE uses the edges of the graph as the aggregated information for learning embeddings; the edge information between two nodes is the data flow between a source IP address (client) and a destination IP address (server).
By evaluating the contribution of neighboring nodes to target nodes, Tran et al. [229] proposed convolutional filters with different parameters. The key idea of this model is to rank the contributions of neighbors at different distances from the target node using shortest-path sampling. Formally, the hidden state of a node at layer $l$ is the concatenation ($\|$) of multiple graph convolutional filters, one for each $r$-hop (shortest-path) distance $j$. Ying et al. [225] considered random-walk sampling as the aggregation information that can be fed into the hidden state of CGNNs. To collect the neighbors of node $v$, the idea of the model is to gather a set of random-walk paths starting from node $v$ and then select the top $k$ nodes with the highest visit probability.
For hypergraphs, several GNN models have been proposed to learn the high-order graph structure [27,44,234]. Feng et al. [27] proposed the HGNN (Hypergraph Neural Network) model to learn the hypergraph structure based on spectral convolution. They first learn each hyperedge feature by aggregating all the nodes connected by the hyperedge. Then, each node's attribute is updated with a vector embedding based on all the hyperedges connected to the node. By contrast, Yadati [234] presented the HyperGCN model to learn hypergraphs based on spectral theory. Since each hyperedge can connect several nodes, the idea of this model is to filter out far-apart nodes. Therefore, they first adopt the Laplacian operator to learn node embeddings and filter the edges that connect two nodes at a large distance; GCNs can then be used to learn the node embeddings.
One of the limitations of GNN models is that they consider the set of neighbors as permutation invariant, which means the models cannot distinguish between certain isomorphic subgraphs. Considering the message-passing set from the neighbors of a node as permutation invariant, several works aimed to improve the message-passing mechanism with simple aggregation functions. Xu et al. [24] proposed the GIN (Graph Isomorphism Network) model, which aims to learn vector embeddings that are as powerful as the 1-dimensional WL isomorphism test. Formally, the hidden state of node $v$ at layer $l$ could be defined as:
\[ h_v^{(l)} = \mathrm{MLP}^{(l)}\!\left( \left( 1 + \epsilon^{(l)} \right) h_v^{(l-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(l-1)} \right), \]
where MLP denotes a multilayer perceptron and $\epsilon^{(l)}$ is a parameter that can be learnable or a fixed scalar. Another problem of GNNs is the over-smoothing problem that appears when stacking more layers in the models. DeeperGCN [98] was a similar approach that aims to solve the over-smoothing problem via generalized aggregations and skip connections. The DeeperGCN model defined a simple normalized message-passing, which could be defined as:
\[ m_{vu}^{(l)} = \mathrm{ReLU}\!\left( h_u^{(l)} + \mathbb{1}(e_{vu}) \cdot e_{vu}^{(l)} \right) + \epsilon, \]
where $m_{vu}^{(l)}$ denotes the message passed from node $u$ to node $v$, $e_{vu}^{(l)}$ is the feature of edge $(v, u)$, and $\mathbb{1}(\cdot)$ is an indicator function that equals 1 if the two nodes $v$ and $u$ are connected.
Le et al. [233] presented the PHC-GNN model, which improves the message-passing compared to the GIN model. The main difference between the PHC-GNN and GIN models is that PHC-GNN adds edge embeddings to the messages and applies a residual connection after the message-passing step to form the hidden state of a node $v$ at layer $l$.
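The sum-aggregation update that GIN and its descendants rely on can be sketched as follows (a minimal illustration with a two-layer MLP and a fixed epsilon; not the authors' code):

```python
import numpy as np

def gin_layer(A, H, W1, W2, eps=0.0):
    """GIN update: h_v = MLP((1 + eps) * h_v + sum of neighbor states)."""
    agg = (1.0 + eps) * H + A @ H                 # (1 + eps) self term + sum over neighbors
    return np.maximum(agg @ W1, 0.0) @ W2         # two-layer MLP with ReLU in between

# toy usage on a 4-node path graph
rng = np.random.default_rng(7)
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
H = rng.normal(size=(4, 3))
H = gin_layer(A, H, rng.normal(size=(3, 16)), rng.normal(size=(16, 16)))
```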
A few studies focused on building pre-trained GNN models, which can be used to initialize other tasks [209,246,247]. These pre-trained models are also beneficial when few node labels are available. For example, the main objective of the GPT-GNN model [247] is to reconstruct the graph structure and the node features by masking attributes and edges. Given a permutation order, the model maximizes the likelihood of the node attributes based on the observed edges and then generates the remaining masked edges.
Since learning node embeddings over whole graphs is time-consuming, several approaches apply standard clustering algorithms (e.g., METIS, K-means) to group nodes into different subgraphs and then use GCNs to learn node embeddings. Chiang et al. [95] proposed the Cluster-GCN model to increase the computational efficiency of training CGNNs. Given a graph $G$, the model first separates $G$ into $c$ clusters using the METIS clustering algorithm [248]. The model then aggregates information within each cluster. The GraphSAINT model [53] has a structure similar to Cluster-GCN and the model of [249]. GraphSAINT aggregates neighbor information and samples nodes directly on a subgraph at each hidden layer; the probability of keeping a connection from a node $u$ at one layer to a node $v$ at the next layer is based on the node degrees. Figure 18 presents an example of the aggregation strategy of the GraphSAINT model. By contrast, Jiang et al. [54] presented the hi-GCN (hierarchical GCN) model, which can effectively model brain networks with two levels of GCNs. Since individual brain networks have multiple functions, the first GCN level aims to capture the graph structure, while the second GCN level provides the correlation between network structure and contextual information to improve the semantic information. The work of Huang et al. [250] is similar to the GraphSAGE and FastGCN models. However, instead of using node-wise sampling at each hidden layer, the model provides two strategies: a layer-wise sampling strategy and a skip-connection strategy that directly shares aggregation information between hidden layers and improves the message-passing. The main idea of the skip-connection strategy is to reuse information from previous layers that would usually be forgotten in dense graphs.
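The cluster-then-convolve idea can be sketched as follows: partition the nodes and run GCN layers inside each induced subgraph. K-means on the node features is used here only as a stand-in for METIS, which is an assumption made for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

def gcn_layer(A, H, W):
    """Symmetrically normalized GCN layer with self-loops and ReLU."""
    A_tilde = A + np.eye(len(A))
    d = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return np.maximum(((A_tilde * d[:, None]) * d[None, :]) @ H @ W, 0.0)

def cluster_gcn_embeddings(A, X, W, n_clusters=2):
    """Partition nodes into clusters and apply a GCN layer per induced subgraph."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    Z = np.zeros((len(A), W.shape[1]))
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        Z[idx] = gcn_layer(A[np.ix_(idx, idx)], X[idx], W)  # convolution restricted to the cluster
    return Z

# toy usage on a random undirected 8-node graph
rng = np.random.default_rng(3)
A = np.triu((rng.random((8, 8)) < 0.3).astype(float), 1)
A = A + A.T
X = rng.normal(size=(8, 5))
Z = cluster_gcn_embeddings(A, X, rng.normal(size=(5, 4)))
```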
One of the limitations of CGNNs is that, at each hidden layer, the model updates the states of all neighboring nodes. This can lead to slow training and updating because of inactive nodes. Some models aimed to enhance CGNNs by improving the sampling strategy [52,223,224]. For example, Chen et al. [52] presented the FastGCN model to improve the training time and performance compared with CGNNs. One of the problems of existing GNN models is scalability, since the expanding neighborhood increases the computational complexity. FastGCN learns a neighborhood sampling at each convolution layer that focuses mainly on essential neighbor nodes; therefore, the model only needs to process the essential neighbor nodes in every batch.
By considering each hidden layer as an embedding layer of independent nodes, FastGCN subsamples the receptive field at each hidden layer. For each layer $l$, it chooses $t_l$ i.i.d. nodes $u_1^{(l)}, \ldots, u_{t_l}^{(l)}$ according to a sampling distribution $q$ and computes the hidden state, which could be defined as:
\[ h^{(l+1)}(v) = \sigma\!\left( \frac{1}{t_l} \sum_{j=1}^{t_l} \frac{\hat{A}\!\left(v, u_j^{(l)}\right) h^{(l)}\!\left(u_j^{(l)}\right) W^{(l)}}{q\!\left(u_j^{(l)}\right)} \right), \]
where $\hat{A}$ denotes the kernel and $\sigma$ denotes the activation function. Wu et al. [214] introduced the SGC (Simple Graph Convolution) model, which improves on first-order proximity in the GCN model. The model removes the nonlinear activation functions at each hidden layer; instead, a final SoftMax function at the last layer produces probabilistic outputs. Chen et al. [224] presented a model to improve the updating of node states. Instead of collecting all the information from the neighbors of each node, the model keeps track of the activation history of node states to reduce the receptive scope; it maintains a history state $\bar{h}_v^{(l)}$ for each state $h_v^{(l)}$ of each node $v$.
Similar to [250], Chen et al. [28] presented the GCNII model, which uses an initial residual connection and identity mapping to overcome the over-smoothing problem while maintaining the structural identity of target nodes. They introduced an initial residual connection to the first convolution layer, $H^{(0)}$, and an identity mapping $I_n$. Mathematically, the hidden state at layer $l$ could be defined as:
\[ H^{(l+1)} = \sigma\!\left( \left( (1 - \alpha_l)\, \tilde{P} H^{(l)} + \alpha_l H^{(0)} \right) \left( (1 - \beta_l) I_n + \beta_l W^{(l)} \right) \right), \]
where $\tilde{P} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ denotes the convolutional filter with normalization. The two parameters $\alpha_l$ and $\beta_l$ are added for the purpose of tackling the over-smoothing problem.
Several models aim to maximize the mutual information between node representations and the graph structure by matching a prior distribution. A few studies have adapted the idea of Deep InfoMax [227] from image processing to learn graph embeddings [26,242]. For example, Velickovic et al. [26] introduced the Deep Graph Infomax (DGI) model, which adopts a GCN as the encoder. The main idea of the mutual-information objective is that the model trains the GCN encoder to maximize the agreement between local and global graph structure in real graphs and to minimize it in corrupted (fake) graphs. The DGI model has four components (a minimal sketch is given after this list), including:
A corruption function $\mathcal{C}$: this function generates negative examples from an original graph by changing parts of its structure and properties.
An encoder $\mathcal{E}$: the goal of this function is to encode nodes into the vector space, so that $\mathcal{E}(X, A)$ gives the vector embeddings of all nodes in the graph.
A readout function $\mathcal{R}$: this function maps all node embeddings into a single summary vector (supernode).
A discriminator $\mathcal{D}$: it compares node embeddings against the global summary vector of the graph by assigning a score between 0 and 1 to each node embedding.
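A compact sketch of these four components is given below, reusing a GCN layer as the encoder, feature shuffling as the corruption, and a bilinear discriminator; all of these concrete choices are illustrative assumptions rather than the published implementation:

```python
import numpy as np

rng = np.random.default_rng(5)

def gcn_encoder(A, X, W):
    """Encoder E: one normalized GCN layer producing node embeddings."""
    A_tilde = A + np.eye(len(A))
    d = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return np.maximum(((A_tilde * d[:, None]) * d[None, :]) @ X @ W, 0.0)

def corruption(X):
    """Corruption C: shuffle rows of the feature matrix to build a negative graph."""
    return X[rng.permutation(len(X))]

def readout(H):
    """Readout R: mean of node embeddings followed by a sigmoid (graph summary vector)."""
    return 1.0 / (1.0 + np.exp(-H.mean(axis=0)))

def discriminator(H, s, W_d):
    """Discriminator D: bilinear score in (0, 1) between each node embedding and the summary."""
    return 1.0 / (1.0 + np.exp(-(H @ W_d @ s)))

# positive scores should be driven up, scores on the corrupted graph driven down
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], float)
X = rng.normal(size=(3, 4))
W, W_d = rng.normal(size=(4, 8)), rng.normal(size=(8, 8))
H_pos, H_neg = gcn_encoder(A, X, W), gcn_encoder(A, corruption(X), W)
s = readout(H_pos)
pos_scores, neg_scores = discriminator(H_pos, s, W_d), discriminator(H_neg, s, W_d)
```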
One of the limitations of the DGI model is that it only works with attributed homogeneous graphs. Several studies have extended DGI to heterogeneous graphs with attention and semantic mechanisms [242,243]. Similar to the DGI model, Park et al. [243] presented the DMGI model (Deep Multiplex Graph Infomax) for attributed multiplex graphs. For a specific relation type $r$, the hidden state of a node is computed by a relation-specific GCN encoder with trainable weights and an activation function; as in DGI, a readout function summarizes the node embeddings of each relation, and a discriminator with a trainable scoring matrix contrasts node embeddings against this summary. The attention mechanism is adopted from [251] to capture the importance of each node type when generating the vector embeddings at the last layer. Similarly, Jing et al. [242] proposed the HDMI (High-order Deep Multiplex Infomax) model, which is conceptually similar to DGI; HDMI optimizes high-order mutual information to process different relation types.
Increasing the number of hidden layers to aggregate more structural information can lead to the over-smoothing problem [97,252]. Previous models treated the weights of messages as playing the same role when aggregating information from the neighbors of a node. In recent years, various studies have focused on attention mechanisms to extract the valuable information from the neighborhoods of nodes [19,253,254]. Table 13 presents a summary of attentive GNN models.
Velickovic et al. [19] presented the GAT (graph attention network) model, one of the first models to apply an attention mechanism to graph representation learning. The purpose of the attention mechanism is to compute a weighted message for each neighbor node during the message-passing of GNNs. Formally, GAT proceeds in three steps, which can be explained as follows (a minimal sketch is given after these steps):
Attention score: At layer $l$, the model takes a set of node features as inputs, $\{h_1^{(l-1)}, \ldots, h_N^{(l-1)}\}$, and produces the outputs $\{h_1^{(l)}, \ldots, h_N^{(l)}\}$. An attention score measuring the importance of a neighbor node $u$ to the target node $v$ could be computed as:
\[ e_{vu} = \mathrm{LeakyReLU}\!\left( a^{T} \left[ W h_v \,\|\, W h_u \right] \right), \]
where $a$ and $W$ are trainable weights and $\|$ denotes the concatenation.
Normalization: The score is then normalized to be comparable across all neighbors of node $v$ using the SoftMax function:
\[ \alpha_{vu} = \frac{\exp\!\left(e_{vu}\right)}{\sum_{k \in \mathcal{N}(v)} \exp\!\left(e_{vk}\right)}. \]
Aggregation: After normalization, the embedding of node $v$ could be computed by aggregating the states of its neighbor nodes:
\[ h_v^{(l)} = \sigma\!\left( \sum_{u \in \mathcal{N}(v)} \alpha_{vu} W h_u^{(l-1)} \right). \]
Furthermore, the GAT model uses multi-head attention to enhance the model's power and stabilize learning. Since the GAT model takes the attention coefficients between nodes as inputs and ranks attention unconditionally, its capacity to summarize the global graph structure is limited.
In recent years, various models have been proposed based on the GAT idea. Most of them aim to improve the self-attention mechanism's ability to capture more of the global graph structure [253,254]. Zhang et al. [253] presented the GaAN (Gated Attention Networks) model, which controls the importance of neighbor nodes by regulating the amount of attention assigned to each head. The main idea of GaAN is to measure the different weights carried by the different heads of a target node: the gated attention aggregator applies a simple linear transformation to compute a gate value for the $m$-th head of each node, which scales that head's contribution.
To capture a coarser graph structure, Kim and Oh [258] considered attention based on the importance of nodes to each other, where the importance is based on whether two nodes are directly connected. By defining different attention from target nodes to context nodes, the model can solve the permutation-equivariance issue and capture more of the global graph structure. Based on this idea, they proposed the SuperGAT model with two variants, scaled dot-product (SD) and mixed GO and DP (MX), to enhance the attention of the original model. The attention scores $e_{vu}$ between two nodes $v$ and $u$ can be defined as follows:
\[ e_{vu}^{\mathrm{SD}} = \frac{(W h_v)^{T} (W h_u)}{\sqrt{d}}, \qquad e_{vu}^{\mathrm{MX}} = e_{vu}^{\mathrm{GAT}} \cdot \sigma\!\left( (W h_v)^{T} (W h_u) \right), \]
where $d$ denotes the number of features at layer $l$. The two attention scores softly down-weight the nodes that are not connected to the target node $v$.
Wang et al. [259] introduced a margin-based constraint to control the over-fitting and over-smoothing problems. By constraining the attention weight that each neighbor assigns to target nodes across all nodes in the graph, the proposed model can reduce the influence of the smoothing problem and drop unimportant edges.
Extending the GAT model to capture more global structural information using attention, Haonan et al. [256] introduced the GraphStar model, which uses a virtual node (a virtual "star") to maintain global information at each hidden layer. The main difference between GraphStar and GAT is that GraphStar introduces three different types of relations: node-to-node (self-attention), node-to-star (global attention), and node-to-neighbors (local attention). Using these relation types, GraphStar can mitigate the over-smoothing problem when stacking more neural network layers. Formally, the attention coefficients of the $m$-th head of a node are computed separately for the node-to-node, node-to-star, and node-to-neighbor relations and then combined.
One of the problems of the GAT model is that it only provides static attention, which concentrates the high attention weights on a few neighbor nodes; as a result, GAT cannot learn universal attention for all nodes in a graph. Motivated by this limitation, Brody et al. [58] proposed the GATv2 model, which uses dynamic attention to learn the graph structure more effectively from a target node $v$ to a neighbor node $u$. The attention score is computed with a slight modification:
\[ e_{vu} = a^{T}\, \mathrm{LeakyReLU}\!\left( W \left[ h_v \,\|\, h_u \right] \right). \]
Similar to Wang et al. [259], Zhang et al. [260] presented the ADSF (ADaptive Structural Fingerprint) model, which can moderate the attention weights from each neighbor of the target node. However, the difference between the model of Wang et al. [259] and the ADSF model is that ADSF introduces two attention scores for each node, which capture the graph structure and the context, respectively.
Besides the GAT-based models applied to homogeneous graphs, several models have applied attention mechanisms to heterogeneous and knowledge graphs [25,261,262]. For example, Wang et al. [25] presented a hierarchical attention model to learn the importance of nodes in graphs. One advantage of this model is that it handles heterogeneous graphs with different types of nodes and edges by deploying attention at both the local and global levels; the model proposes two levels of attention, node-level and semantic-level. The node-level attention captures the attention between two nodes within a meta-path. Given a node pair $(v, u)$ in a meta-path $P$, the attention score under $P$ could be defined as:
\[ e_{vu}^{P} = \mathrm{att}_{\mathrm{node}}\!\left( h'_v, h'_u; P \right), \]
where $h'_v$ and $h'_u$ denote the projected features of nodes $v$ and $u$, obtained from their original features via a projection function, and $\mathrm{att}_{\mathrm{node}}$ is a function that scores the node-level attention. To make the coefficients comparable across the other nodes in a meta-path $P$, which contains a set of neighbors $\mathcal{N}^{P}(v)$ of a target node $v$, the attention score $\alpha_{vu}^{P}$ and the node embedding with $k$ multi-head attention can be defined as:
\[ \alpha_{vu}^{P} = \frac{\exp\!\left(e_{vu}^{P}\right)}{\sum_{j \in \mathcal{N}^{P}(v)} \exp\!\left(e_{vj}^{P}\right)}, \qquad z_v^{P} = \Big\Vert_{i=1}^{k} \sigma\!\left( \sum_{u \in \mathcal{N}^{P}(v)} \alpha_{vu}^{P} h'_u \right). \]
The score $\alpha_{vu}^{P}$ indicates how much the set of neighbors under meta-path $P$ contributes to node $v$. Furthermore, the semantic-level aggregation scores the importance of the meta-paths: given an attention coefficient for each meta-path, the importance of meta-path $P$ is computed and then normalized with a SoftMax function across all meta-paths.
In addition to applying CGNNs to homogeneous graphs, several studies have applied CGNNs to heterogeneous and knowledge graphs [224,241,243,263,264,266]. Since heterogeneous graphs have different types of edges and nodes, the main problem when applying CGNN models is the aggregation of messages over different edge types. Schlichtkrull et al. [241] introduced the R-GCN model (Relational Graph Convolutional Networks) to model relational entities in knowledge graphs. R-GCN was the first model applied to learn node embeddings in heterogeneous graphs for several downstream tasks, such as link prediction and node classification; it also uses parameter sharing to learn the node embeddings efficiently. Formally, for a node $v$ under relation $r$, the hidden state at layer $l$ could be defined as:
\[ h_v^{(l+1)} = \sigma\!\left( \sum_{r \in \mathcal{R}} \sum_{u \in \mathcal{N}_r(v)} \frac{1}{c_{v,r}} W_r^{(l)} h_u^{(l)} + W_0^{(l)} h_v^{(l)} \right), \]
where $c_{v,r}$ is a normalization constant and $\mathcal{N}_r(v)$ denotes the set of neighbors of node $v$ under relation $r$. Wang et al. [265] introduced the HANE (Heterogeneous Attributed Network Embedding) model to learn embeddings for heterogeneous graphs. The key idea of the HANE model is to measure attention scores for the different types of nodes in heterogeneous graphs. Formally, given a node $v$, the attention coefficients, the attention scores, and the hidden state at layer $l$ are computed over the set of neighbors of $v$, the features of $v$, and a weight matrix for each node type.
Several studies have focused on applying CGNNs to recommendation systems [228,267,268,269]. For instance, Wang et al. [267] presented the KGCN (Knowledge Graph Convolutional Network) model to extract user preferences in recommendation systems. Since most existing models suffer from the cold-start problem and the sparsity of user–item interactions, the proposed model captures users' side information (attributes) on knowledge graphs, and the users' preferences are captured by a multilayer receptive field in the GCN. Formally, given a user $u$, an item $v$, and the set of items connected to $u$, the user–item interaction score is computed from an inner product between the user and relation representations together with the representation $e$ of item $v$.
Since the power of the autoencoder architecture is to learn low-dimensional node representations in an unsupervised manner, several studies have integrated convolutional GNNs into the autoencoder architecture to leverage this power [72,270]. Table 14 summarizes graph convolutional autoencoder models for static and dynamic graphs.
Most graph autoencoder models are designed based on the VAE (variational autoencoder) architecture to learn embeddings [274]. Kipf and Welling [72] introduced the GAE model, one of the first studies applying the autoencoder architecture to graph representation learning. The GAE model [72] aims to reconstruct the adjacency matrix $A$ and feature matrix $X$ of the original graph by adopting CGNNs as the encoder and an inner product as the decoder. Figure 19 presents the details of the GAE model. Formally, the output embedding $Z$ and the reconstruction of the adjacency matrix input could be defined as:
\[ Z = \mathrm{GCN}(X, A), \qquad \hat{A} = \sigma\!\left( Z Z^{T} \right), \]
where the $\mathrm{GCN}$ function could be defined by Equation (65) and $\sigma$ is an activation function. The model aims to reconstruct the adjacency matrix $A$ through the inner-product decoder:
\[ \hat{A}_{ij} = \sigma\!\left( z_i^{T} z_j \right), \]
where $\sigma$ is the sigmoid function and $\hat{A}_{ij}$ is the value at row $i$ and column $j$ of the reconstructed adjacency matrix $A$. In the training process, the (variational) model tries to minimize the following loss function by gradient descent:
\[ \mathcal{L} = -\mathbb{E}_{q(Z \mid X, A)}\!\left[ \log p(A \mid Z) \right] + \mathrm{KL}\!\left[ q(Z \mid X, A) \,\|\, p(Z) \right], \]
where $\mathrm{KL}[q \,\|\, p]$ is the Kullback–Leibler divergence between the two distributions $q$ and $p$.
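A non-variational GAE forward pass can be sketched in a few lines (GCN encoder, inner-product decoder; random weights and no training loop, for illustration only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gae_forward(A, X, W1, W2):
    """GAE: two-layer GCN encoder producing Z, inner-product decoder sigma(Z Z^T)."""
    A_tilde = A + np.eye(len(A))
    d = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    A_hat = (A_tilde * d[:, None]) * d[None, :]
    Z = A_hat @ np.maximum(A_hat @ X @ W1, 0.0) @ W2       # node embeddings
    return Z, sigmoid(Z @ Z.T)                             # reconstructed adjacency probabilities

# toy usage: reconstruction probabilities for a 4-node graph
rng = np.random.default_rng(11)
A = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]], float)
X = rng.normal(size=(4, 5))
Z, A_rec = gae_forward(A, X, rng.normal(size=(5, 16)), rng.normal(size=(16, 8)))
```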
Several models have attempted to incorporate the autoencoder architecture into GNN models to reconstruct graphs. For example, the MGAE model [270] combined the message-passing mechanism from GNNs with the GAE architecture for graph clustering. The primary purpose of MGAE is to capture information about the node features by randomly corrupting parts of the feature matrix with noise when training the GAE model.
GNNs have shown outstanding performance in learning complex graph structures that shallow models could not handle [245,275,276]. The main advantages of deep neural network models are:
Parameter sharing: Deep neural network models share weights during the training phase, which reduces training time and the number of parameters while increasing the performance of the models. In addition, the parameter-sharing mechanism allows the models to learn multiple tasks.
Inductive learning: The outstanding advantage of deep models over shallow models is that deep models support inductive learning. This makes deep-learning models capable of generalizing to unseen nodes, which gives them practical applicability.
However, although CGNNs are considered the most advantageous in the line of GNNs, they still have limitations in graph representation learning:
Over-smoothing problem: When capturing the graph structure and entity relationships, CGNNs rely on an aggregation mechanism that gathers information from neighboring nodes into target nodes. Capturing higher-order graph structure therefore requires stacking multiple graph convolutional layers, but increasing the depth of the convolution layers can lead to the over-smoothing problem [252]. To overcome this drawback, models based on the transformer architecture, using self-attention, have shown several improvements over CGNNs.
Ability on disassortative graphs: Disassortative graphs are graphs in which nodes with different labels tend to be linked together. The aggregation mechanism in GNNs, however, aggregates the features of all neighboring nodes even when they have different labels. Therefore, the aggregation mechanism is a limitation and a challenge of GNNs for classification tasks on disassortative graphs.
3.4.4. Graph Transformer Models
Transformers [277] have achieved tremendous success in many natural language processing [278,279] and image processing tasks [280,281]. For documents, transformer models tokenize sentences into sets of tokens and represent them as one-hot encodings. For image processing, transformer models adopt image patches and use two-dimensional encodings to tokenize the image data. However, the tokenization of graph entities is non-trivial since graphs have irregular structures and unordered nodes. Therefore, whether graph transformer models are suitable for graph representation learning is still an open question.
The transformer architecture consists of two main parts: a self-attention module and a position-wise feedforward network. Mathematically, the input of the self-attention module at layer $l$ can be written as $H^{(l)} = [h_1^{(l)}, \ldots, h_n^{(l)}]^{T}$, where $h_i^{(l)}$ denotes the hidden state of the position of node $i$. Then, the self-attention could be formulated as:
\[ Q = H^{(l)} W_Q, \quad K = H^{(l)} W_K, \quad V = H^{(l)} W_V, \qquad \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{T}}{\sqrt{d}} \right) V, \]
where $Q$, $K$, and $V$ depict the query matrix, key matrix, and value matrix, respectively, and $d$ is the hidden embedding dimension. The matrix $S = \frac{Q K^{T}}{\sqrt{d}}$ measures the similarity between the queries and keys.
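A minimal single-head self-attention over node states (random projection matrices, row-wise softmax) might look like this:

```python
import numpy as np

def self_attention(H, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over node states H (n x d_in)."""
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    S = Q @ K.T / np.sqrt(K.shape[1])             # similarity between queries and keys
    S = S - S.max(axis=1, keepdims=True)          # numerical stability before softmax
    A = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)
    return A @ V                                  # each node attends to every other node

# toy usage: 5 nodes with 8-dimensional states, 16-dimensional heads
rng = np.random.default_rng(13)
H = rng.normal(size=(5, 8))
out = self_attention(H, rng.normal(size=(8, 16)), rng.normal(size=(8, 16)), rng.normal(size=(8, 16)))
```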
The architecture of graph transformer models differs from that of GNNs. GNNs use message-passing to aggregate information from neighbor nodes into target nodes, whereas graph transformer models use a self-attention mechanism to capture the context of target nodes in graphs, which usually reflects the similarity between nodes. The self-attention mechanism helps control how much information is aggregated between two nodes in a specific context. In addition, the models use multi-head self-attention, which allows several information channels to flow into the target nodes. Transformer models then learn the correct aggregation patterns during training without a pre-defined graph structure sampling.
Table 15 lists a summary of graph transformer models.
In this section, we divide graph transformer models for graph representation learning into three main groups based on the strategy of applying graph transformer models.
Structural encoding-based graph transformers: These models focus on various positional encoding schemes to capture absolute and relative information about entity relationships and graph structure. Structural encoding strategies are mainly suitable for tree-like graphs, since the models must capture the hierarchical relations between target nodes and their parents as well as the interactions with other nodes at the same level.
GNNs as an auxiliary module: GNNs provide a powerful mechanism for aggregating local structural information. Therefore, several studies integrate message-passing and GNN modules with a graph transformer encoder as auxiliary components.
Edge channel-based attention: The graph structure can be viewed as the combination of node and edge features and the ordered/unordered connections between them. From this perspective, GNNs are not needed as an auxiliary module. Recently, several models have been proposed that capture the graph structure in depth while applying the graph transformer architecture based on the self-attention mechanism.
Several models have tried to apply vanilla transformers to tree-like graphs to capture node positions [64,65,277,288]. Preserving the tree structure means preserving a node's relative and absolute structural positions in the tree. The absolute structural position describes the positional relationship of the current node to its parent (root) nodes, whereas the relative structural position describes the positional relationship of the current node to its neighbors.
Shiv and Quirk [64] proposed a positional encoding (PE) strategy for programming-language translation tasks. The significant advantage of tree-based models is that they can explore nonlinear dependencies. By custom positional encodings of the nodes in the graph in a hierarchical manner, the model strengthens the transformer's ability to capture the relationship between node pairs in the tree. The key idea is to represent programming-language data as a binary tree and encode target nodes based on the location of their parent nodes and the relationship with neighboring nodes at the same level; specifically, binary matrices encode the relationship of target nodes with their parents and neighbors.
Similarly, Wang et al. [65] introduced structural position representations for tree-like graphs. However, they combine sequential and structural positional encodings to enrich the contextual and structural language data. The absolute and relative position encodings of each word $w_i$ could be defined as:
\[ \mathrm{PE}(w_i, 2k) = \sin\!\left( \frac{\mathrm{Abs}(w_i)}{10000^{2k/d}} \right), \qquad \mathrm{PE}(w_i, 2k+1) = \cos\!\left( \frac{\mathrm{Abs}(w_i)}{10000^{2k/d}} \right), \]
where Abs is the absolute position of the word in the sentence, $d$ denotes the hidden size of the $K$ and $Q$ matrices, the sine/cosine functions are applied to the even/odd dimensions, respectively, and a matrix $R$ presents the relative position representation.
The sentences are also represented as a dependency tree, which captures the structural relations between words. For the structural position encoding, the absolute and relative structural positions of a word are encoded from the distance between the root node and the target node; a linear function then combines the sequential PE and the structural PE as inputs to the transformer encoder.
To capture more global structural information in tree-like graphs, Cai and Lam [282] also proposed an absolute position encoding to capture the relation between target and root nodes. For the relative positional encoding, they use an attention score to measure the relationship between nodes on the same shortest path sampled from the graph. The power of using the shortest path is that it captures both the hierarchical proximity and the global structure of the graph. Given two nodes $v_i$ and $v_j$, the attention score between them can be calculated as:
\[ s_{ij} = \frac{\left( W_Q x_i \right)^{T} \left( W_K x_j \right)}{\sqrt{d}}, \]
where $W_Q$ and $W_K$ are trainable projection matrices and $x_i$ and $x_j$ depict the representations of nodes $v_i$ and $v_j$, respectively. To define the relationship $r_{i \to j}$ between two nodes $v_i$ and $v_j$, they adopt a bi-directional GRU model, which could be defined as follows:
\[ r_{i \to j} = \left[ \overrightarrow{\mathrm{GRU}}\!\left( sp_{i \to j} \right) ;\ \overleftarrow{\mathrm{GRU}}\!\left( sp_{i \to j} \right) \right], \]
where $sp_{i \to j}$ denotes the shortest path from node $v_i$ to node $v_j$, and $\overrightarrow{\mathrm{GRU}}$ and $\overleftarrow{\mathrm{GRU}}$ are the states of the forward and backward GRU, respectively.
Several models have tried to encode the positional information of nodes based on subgraph sampling [63,283]. Zhang et al. [63] proposed the Graph-Bert model, which samples the subgraph structure and feeds it through absolute and relative positional encoding layers. For subgraph sampling, they adopt a top-$k$ intimacy sampling strategy to extract subgraphs as inputs for the positional encoding layers. Four layers in the model are responsible for positional encoding. Since several strategies are implemented to capture the structural information in graphs, the advantage of Graph-Bert is that it can be trained with various types of subgraphs; in addition, Graph-Bert can be further fine-tuned for various downstream tasks. For each node $v_j$ in a sampled subgraph, the raw features are first embedded with a linear function. Three further layers then encode the positional information of the node: an absolute role embedding, computed from the WL code that labels node $v_j$ and can be calculated from the whole graph; a relative positional embedding, computed from the intimacy-based position metric used for sampling; and a hop-based relative distance embedding, computed from the distance metric between the node and the target node of the subgraph. All of these vector embeddings are aggregated together as the initial embedding vectors for the graph transformer encoder, which then updates the node representations following the standard transformer architecture described above.
Similar to Graph-Bert, Jeon et al. [283] tried to represent subgraphs for paper citation networks and capture the citation context of each paper. Each paper is considered a subgraph whose nodes are the referenced papers. To extract the citation context, they encode the order of the referenced papers in the target paper based on their position and order, and they use WL labels to capture the structural role of the references. The approach by Liu et al. [289] is conceptually similar to [283], with one significant difference: they proposed an MCN sampling strategy to capture the contextual neighbors from a subgraph, where the importance of the target node is based on its frequency of occurrence during sampling.
In several types of graphs, such as molecular networks, the edges carry features representing, for instance, the chemical bonds between atoms. Several models adopt Laplacian eigenvectors to encode the positional node information together with edge features [29,284]. Dwivedi and Bresson [29] proposed a positional encoding strategy that uses node positions and edge channels as inputs to the transformer model. The idea of this model is to use Laplacian eigenvectors to encode the positional information of nodes and then define edge channels to capture the global graph structure. The advantage of using the Laplacian eigenvectors is that they help the transformer model learn the proximity of neighbor nodes by maximizing the dot product between the $Q$ and $K$ matrices. They first pre-compute the Laplacian eigenvectors from the factorization of the Laplacian matrix:
\[ L = I_N - D^{-1/2} A D^{-1/2} = U^{T} \Lambda U, \]
where $L$ is the Laplacian matrix and $\Lambda$ and $U$ denote the eigenvalues and eigenvectors, respectively. The smallest non-trivial eigenvectors then provide the positional encoding of each node $v_i$. Given a node $v_i$ with feature $x_i$ and an edge feature $e_{ij}$, the first hidden layer adds the (linearly projected) positional encoding to the projected node feature, and the edge channel is initialized from the projected edge feature. The hidden state of node $v_i$ and the edge channel at layer $l$ are then updated with multi-head attention, where $Q$, $K$, $V$, and $E$ are learned projection matrices and $H$ denotes the number of attention heads.
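Computing Laplacian-eigenvector positional encodings can be sketched with a few lines of numpy, taking the $k$ eigenvectors with the smallest non-zero eigenvalues; the sign ambiguity of eigenvectors is usually resolved by random flipping during training, which is omitted here:

```python
import numpy as np

def laplacian_positional_encoding(A, k):
    """Return the k eigenvectors of the normalized Laplacian with the smallest
    non-zero eigenvalues; row i is the positional encoding of node i."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    L = np.eye(len(A)) - (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    eigval, eigvec = np.linalg.eigh(L)            # eigenvalues in ascending order
    return eigvec[:, 1:k + 1]                     # skip the trivial constant eigenvector

# toy usage: 2-dimensional positional encodings for a 5-node ring graph
A = np.zeros((5, 5))
for i in range(5):
    A[i, (i + 1) % 5] = A[(i + 1) % 5, i] = 1.0
pe = laplacian_positional_encoding(A, k=2)        # concatenated with node features downstream
```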
Similar to [29], Kreuzer et al. [284] also add edge channels, but for all pairs of nodes in the input graph. The critical difference between the two models is that Kreuzer et al. combine full-graph attention with sparse attention. One advantage of the model is that it can capture more global structural information, since self-attention is applied to the nodes of the sparse graph. Therefore, they use two different similarity matrices to guide the transformer in distinguishing local and global connections between nodes, re-defining the similarity between each pair of nodes with separate key, query, and edge projections for connected and disconnected pairs.
In some specific cases where graphs are sparse, small, or fully connected, the self-attention mechanism can lead to over-smoothing and structure loss, since it cannot learn the graph structure. To overcome these limitations, several models adopt GNNs as an auxiliary module to maintain the local structure around target nodes [99,100,285]. Rong et al. [100] proposed the Grover model, which integrates the message-passing mechanism into the transformer encoder for self-supervised tasks. They used a dynamic message-passing mechanism to adapt the number of hops to different graph structures, and a long-range residual connection to strengthen the awareness of local structures and avoid over-smoothing.
Several models have attempted to integrate GNNs on top of the multi-head attention sublayers to preserve the local structure between neighboring nodes [63,99,290]. For instance, Lin et al. [99] presented the Mesh Graphormer model to capture the global and local information of 3D human meshes. Unlike the Grover model, they insert a graph residual sublayer with two GCN layers on top of the multi-head attention layer to capture more local connections between connected node pairs. Hu et al. [285] integrated message-passing with a transformer model for heterogeneous graphs. Since heterogeneous graphs have different types of node and edge relations, they proposed an attention score that captures the importance of nodes. Given a source node $s$ and a target node $t$ connected by an edge $e$, the attention score of the $m$-th attention head is computed from linear key and query projections of the (typed) source and target nodes, an attentive trainable weight matrix for each edge type, and a scalar encoding the importance of each relationship.
Nguyen et al. [61] introduced the UGformer model, which uses a convolution layer on top of the transformer layer to work with sparse and small graphs. Applying only self-attention can result in structure loss in small-sized and sparse graphs, so a GNN layer is stacked after the output of the transformer encoder to maintain the local structure. One advantage of this GNN layer is that it helps the transformer model retain the local structure information, since all the nodes of the input graph are otherwise treated as fully connected.
In graphs, the nodes are arranged chaotically and without order compared to sentences in documents and pixels in images; they can live in a multidimensional space and interact with each other through their connections. Therefore, the structural information around a node can be extracted from the centrality of the node and its edges without the need for a positional encoding strategy. Recently, several studies have shown remarkable results in understanding graph structure in this way.
Several graph transformer models have been proposed to capture structural relations in the natural language processing area. Zhu et al. [62] presented a transformer model to encode abstract meaning representation (AMR) graphs into word sequences. This was the first transformer model aiming to integrate structural knowledge of AMR graphs; the model adds a sequence of edge features to the similarity matrix and the attention score to capture the graph structure. Formally, the attention score and the vector embedding could be defined as: