Review

Graph Representation Learning and Its Applications: A Survey

1 Department of Artificial Intelligence, The Catholic University of Korea, 43, Jibong-ro, Bucheon-si 14662, Gyeonggi-do, Republic of Korea
2 Data Assimilation Group, Korea Institute of Atmospheric Prediction Systems (KIAPS), 35, Boramae-ro 5-gil, Dongjak-gu, Seoul 07071, Republic of Korea
3 Department of Social Welfare, Dongguk University, 30, Pildong-ro 1-gil, Jung-gu, Seoul 04620, Republic of Korea
4 Semiconductor Devices and Circuits Laboratory, Advanced Institute of Convergence Technology (AICT), Seoul National University, 145, Gwanggyo-ro, Yeongtong-gu, Suwon-si 16229, Gyeonggi-do, Republic of Korea
* Author to whom correspondence should be addressed.
Sensors 2023, 23(8), 4168; https://doi.org/10.3390/s23084168
Submission received: 8 March 2023 / Revised: 16 April 2023 / Accepted: 17 April 2023 / Published: 21 April 2023
(This article belongs to the Special Issue Application of Semantic Technologies in Sensors and Sensing Systems)

Abstract: Graphs are data structures that effectively represent relational data in the real world. Graph representation learning is a significant task since it facilitates various downstream tasks, such as node classification, link prediction, etc. Graph representation learning aims to map graph entities to low-dimensional vectors while preserving graph structure and entity relationships. Over the past decades, many models have been proposed for graph representation learning. This paper aims to provide a comprehensive picture of graph representation learning models, covering both traditional and state-of-the-art models on various graphs in different geometric spaces. First, we begin with five types of graph embedding models: graph kernels, matrix factorization models, shallow models, deep-learning models, and non-Euclidean models. We also discuss graph transformer models and Gaussian embedding models. Second, we present practical applications of graph embedding models, from constructing graphs for specific domains to applying models to solve tasks. Finally, we discuss challenges for existing models and future research directions in detail. As a result, this paper provides a structured overview of the diversity of graph embedding models.

1. Introduction

Graphs are a common language for representing complex relational data, including social media, transportation networks, and biological protein–protein networks [1,2]. Since most graph data are complex and high-dimensional, it is difficult for researchers to extract valuable knowledge from them. Therefore, processing graph data and transforming them into a tractable form (fixed-dimensional vectors) is an important step that allows researchers to apply them to different downstream tasks [3]. The objective of graph representation learning is to obtain vector representations of graph entities (e.g., nodes, edges, subgraphs, etc.) that facilitate various downstream tasks, such as node classification [4], link prediction [5,6], community detection [7], etc. As a result, graph representation learning plays an important role since it can significantly improve the performance of these downstream tasks.
Representing graph data, however, is challenging and differs from representing image and text data [8]. In textual data, words are linked together in a sentence and have a fixed position in that sentence. In image data, pixels are arranged on an ordered grid and can be represented by a grid matrix. In contrast, the nodes and edges in graphs are unordered and carry their own features. This makes mapping graph entities to a latent space while preserving the graph structure and proximity relationships challenging. In the case of a social network, a user can have many friends (neighbors) and various personal information, such as hometown, education level, and hobbies, which makes preserving the graph structure and properties significantly problematic. In addition, many real-world networks show dynamic behaviors in which graph structures and node features change over time [9,10]. This further complicates capturing the graph structure and mapping graph entities into vector space.
Over the past decades, various graph representation learning models have been proposed to project graph entities into fixed-length vectors [11,12,13]. Graph embedding models are mainly divided into five main groups: graph kernels, matrix factorization models, shallow models, deep neural network models, and non-Euclidean models. Figure 1 presents the popularity of different graph representation learning models from 2010 to 2022. The number of graph representation learning studies increased considerably over this 12-year period. Furthermore, there was significant growth in the frequency of research studies on graph neural networks, graph convolutional networks, and graph transformer models, whereas the number of studies on graph kernels, graph autoencoders, and matrix factorization-based models increased only slightly over the same period. We obtained the frequency of academic publications including each keyword from Scopus (https://www.scopus.com (accessed on 16 April 2023)).
Historically, the first graph representation learning models were graph kernels. The idea of graph kernel methods arguably originates from the essential and well-known Weisfeiler–Lehman (WL) isomorphism test introduced in 1968 [31]. Graph kernels are kernel functions that aim to measure the similarity between graphs and their entities [32]. The main idea of graph kernels is to decompose original graphs into substructures and construct vector embeddings based on the substructure features. There are two main types of graph kernels: kernels for graphs and kernels on graphs. The former aims to measure the similarity between pairs of graphs, while the latter estimates the similarity between graph nodes. Starting in the 2000s, several strategies for estimating the similarity of graph pairs have been proposed to represent various graph structures, such as graphlet kernels, random walks, and shortest paths. Based on WL isomorphism testing, various graph kernels have been built to compute the similarity of pairs of graph entities, such as WL kernels [31], WL subtree kernels [33,34,35], and random walks [36,37]. However, one limitation of graph kernels is their computational complexity on large-scale graphs, since computing graph kernels is, in general, NP-hard.
Early models for graph representation learning primarily focused on matrix factorization methods, which are motivated by traditional dimensionality reduction techniques dating back to 2002 [38]. Several matrix factorization-based models have been proposed to handle large graphs with millions of nodes [39,40]. The objective of matrix factorization models is to decompose the proximity matrix into a product of small-sized matrices and then learn embeddings that fit the proximity. Based on the way vector embeddings are learned, there are two main lines of matrix factorization models: Laplacian eigenmaps and node proximity matrix factorization. Starting in the 2000s, Laplacian eigenmaps methods [41,42] represent each node by the Laplacian eigenvectors associated with the first k eigenvalues. In contrast, node proximity matrix factorization methods [5,15], emerging around 2015, obtain node embeddings via singular value decomposition. Various proximity matrix factorization models have successfully handled large graphs and achieved strong performance [15,43]. However, matrix factorization models struggle to capture high-order proximity due to the computational complexity of working with high-order transition matrices.
In 2014 and 2016, the early shallow models DeepWalk [14] and Node2Vec [4] were proposed, which learn node embeddings based on shallow neural networks. Their primary concept is to learn node embeddings by maximizing the neighborhood probability of target nodes using the skip-gram model, which originated in natural language processing. This objective can be optimized with SGD over shallow neural network layers, thus reducing computational complexity. Following this historic milestone, various models have been developed by improving the sampling strategies and training processes. Shallow models are embedding models that map graph entities to low-dimensional vectors by conducting an embedding lookup for each graph entity [3]. From this perspective, the embedding of node v i can be represented as Z i = M x i , where M denotes an embedding matrix of all nodes and x i is a one-hot vector of node v i . Various shallow models have been proposed to learn embeddings with different strategies for preserving graph structures and the similarity between graph entities. Structure-preservation models aim to preserve the structural connections between entities (e.g., DeepWalk [14], Node2Vec [4]). In 2015, Tang et al. [16] proposed the LINE model, a proximity reconstruction method that aims to preserve proximity between nodes in graphs. After that, various models have been proposed to preserve higher-order node proximity and capture more of the global graph structure. However, most of the above models focus on transductive learning and ignore node features, which limits their practical applicability.
Breakthroughs in deep learning have led to a new research perspective on applying deep neural networks to the graph domain. Since the 2000s, several early models on GNNs have been designed to learn node embeddings from neighborhood information using an aggregation mechanism [44,45]. Graph neural networks (GNNs) have shown a significant expressive capacity to represent graph embeddings in an inductive learning manner and address the limitations of the aforementioned shallow models [46,47]. Recurrent GNNs (RGNNs), dating back to 2005, are among the first studies on GNNs and are based on the recurrent neural network architecture [48,49]. These models learn node embeddings via recurrent layers that share the same weights in each hidden layer and run recursively until convergence. Several recurrent GNNs with different strategies have been proposed by leveraging the recurrent architecture in combination with various sampling strategies. However, using the same weights at each hidden layer may prevent RGNN models from distinguishing local and global structure. Since 2016, several graph autoencoder models have been proposed based on the original autoencoder architecture, which can learn complex graph structures by reconstructing the input graph structure [50,51]. Graph autoencoders comprise two main parts: encoder layers take the adjacency matrix as input and compress it to generate node embeddings, and decoder layers reconstruct the input data. By contrast, the idea of convolutional GNNs (CGNNs) is to use convolutional operators with different weights in each hidden layer, which are more efficient in capturing and distinguishing local and global structures [18,52,53,54]. Many variants of CGNNs have been proposed, including spectral CGNNs [55,56,57] starting in 2014, spatial CGNNs [22,24,52] starting in 2016, and attentive CGNNs [19,58] starting in 2017. Nevertheless, most GNNs suffer from limitations such as over-smoothing and noise from neighboring nodes when more GNN layers are stacked [59,60].
Motivated by the transformer architecture, which emerged from natural language processing applications in 2017, several graph transformer models adapting this architecture to the graph domain were proposed starting in 2019 [61,62]. Graph transformer models have shown competitive and even superior performance against GNNs in learning complex graph structures [30,63]. Graph transformer models can be divided into three main groups: transformers for tree-like graphs, transformers with GNNs, and transformers with global self-attention. First, early graph transformer models, appearing since 2019, mainly aim at learning node embeddings in tree-like graphs where nodes are arranged hierarchically [64,65]. These models encode node positions through relative and absolute positional encodings in trees, with constraints from root nodes and neighbor nodes at the same level. Second, several models leverage the power of GNNs as an auxiliary module for computing attention scores [66]. In addition, some models put GNN layers on top of the model to overcome the over-smoothing problem and let the model remember the local structure [61]. Most of the above graph transformer models adopt the vanilla transformer architecture, learning embeddings through multi-head self-attention. Third, several graph transformer models use a global self-attention mechanism to learn node embeddings, implementing self-attention independently without requiring constraints from the neighborhood [30,67]. These models work directly on input graphs and can capture the global structure with global self-attention.
Most of the above models learn embeddings in Euclidean space and represent graph entities as vector points in the latent space. However, real-world graphs can have complex structures and different forms, so that Euclidean space may be inadequate for representing the graph structure, ultimately leading to structural loss [68,69]. Early models, starting in 2017, learn complex graphs in non-Euclidean geometry by developing efficient algorithms for learning node embeddings based on manifold optimization [70]. Following this line, several models represent graph data in non-Euclidean space and obtain desirable results [68,69,71]. Two typical non-Euclidean spaces, spherical and hyperbolic geometry, have their own advantages: spherical space can represent graph structures with large cycles, while hyperbolic space is suitable for hierarchical graph structures. Most non-Euclidean models aim to design efficient algorithms for learning node embeddings, since it is challenging to implement operators directly in non-Euclidean space. Furthermore, to deal with uncertainty, several Gaussian graph models have been introduced, starting in 2016, to represent graph entities as density-based embeddings [23]. Node embeddings are then defined as continuous densities, mostly based on the Gaussian distribution [72].
To the best of our knowledge, no comparable paper in the literature covers such a wide range of graph embedding models for static and dynamic graphs in different geometric spaces. Most existing surveys present only specific approaches to graph representation learning. Wu et al. [8] focused on graph neural network models, which are covered as a single section in this paper. Several surveys [13,73,74] summarized graph embedding models for various types of graphs, but they did not mention either graph transformer models or non-Euclidean models. Regarding practical applications, several papers only list applications for specific and narrow tasks [12,75]. In contrast, we discuss how graphs are constructed in specific applications and how graph embedding models are implemented in various domains.
This paper presents a comprehensive picture of graph embedding models for static and dynamic graphs in different geometric spaces. In particular, we recognize five general categories of models for graph representation learning, including graph kernels, matrix factorization models, shallow models, deep neural network models, and non-Euclidean models. The contributions of this study can be summarized as follows:
  • This paper presents a taxonomy of graph embedding models based on various algorithms and strategies.
  • We provide readers with an in-depth overview of graph embedding models for different types of graphs, ranging from static to dynamic and from homogeneous to heterogeneous graphs.
  • This paper presents graph transformer models, which have achieved remarkable results in capturing graph structures in recent years.
  • We cover applications of graph representation learning in various areas, from constructing graphs to applying models in specific tasks.
  • We discuss the challenges and future directions of existing graph embedding models in detail.
Since abundant graph representation learning models have been proposed recently, we employed different approaches to find related studies. We built a search strategy by defining keywords and analyzing reliable sources. The list of keywords includes graph embedding, graph representation learning, graph neural networks, graph convolution, graph attention, graph transformer, graph embedding in non-Euclidean space, Gaussian graph embedding, and applications of graph embedding. We collected related studies from top-tier conferences and journals such as AAAI, IJCAI, SIGKDD, ICML, WSDM, Nature Machine Intelligence, Pattern Recognition, Intelligent Systems with Applications, the Web, and so on.
The remainder of this paper is organized as follows. Section 2 describes fundamental concepts and background related to graph representation learning. Section 3 presents the graph embedding models, including graph kernels, matrix factorization models, shallow models, deep neural network models, and non-Euclidean models. Section 4 discusses a wide range of practical applications of graph embedding models in the real world. Section 5 summarizes the latest benchmarks, downstream tasks, evaluation metrics, and libraries. Challenges for existing graph embedding models and future research directions are discussed in Section 6. Finally, Section 7 concludes the paper.

2. Problem Description

Graph representation learning aims to project graph entities into low-dimensional vectors while preserving the graph structure and the proximity of entities in graphs. To map graph entities into vector space, it is necessary to model the graph in mathematical form. Therefore, we begin with several fundamental definitions of graphs. The list of standard notations used in this survey is detailed in Table 1. Mathematically, a graph G can be defined as follows:
Definition 1
(Graph [3]). A graph is a discrete structure consisting of a set of nodes and the edges connecting those nodes. The graph can be described mathematically in the form G = ( V , E , A ) , where V = { v 1 , v 2 , … , v N } is the set of nodes, E = { ( v i , v j ) | ( v i , v j ) ∈ V × V } is the set of edges, and A is an adjacency matrix. A is a square matrix of size N × N, where N is the number of nodes in the graph. This can be formulated as follows:

$$A = \begin{pmatrix} A_{11} & \cdots & A_{1N} \\ \vdots & \ddots & \vdots \\ A_{N1} & \cdots & A_{NN} \end{pmatrix}, \qquad A_{ij} = \begin{cases} 1, & \text{if } e_{ij} \in E, \\ 0, & \text{otherwise}, \end{cases}$$

where A i j indicates the adjacency between node v i and node v j .
When A i j is binary, the matrix A represents only the existence of connections between nodes. By extending the definition of the matrix A, we can describe many different types of graphs G:
  • Directed graph: When A i j = A j i for any 1 ≤ i , j ≤ n, the graph G is called an undirected graph; otherwise, G is a directed graph.
  • Weighted graph: a graph in which each edge is assigned a specific weight value. The adjacency matrix can therefore be written as A i j = w i j , where w i j ∈ R is the weight of the edge e i j .
  • Signed graph: When the entries A i j can take positive or negative values, the graph G is called a signature/signed graph. The graph G has all positive signed edges when A i j > 0 for any 1 ≤ i , j ≤ n, and all negative signed edges otherwise.
  • Attributed graph: A graph G = ( V , E , A , X ) is an attributed graph, where V and E are the sets of nodes and edges, respectively, and X is the matrix of node attributes with size n × d. Furthermore, X can also be a matrix of edge attributes with size m × d, where m is the number of edges e i j ∈ E for any 1 ≤ i , j ≤ n.
  • Hyper graph: A hypergraph G can be represented as G = ( V , E , W ), where V denotes the set of nodes and E denotes a set of hyperedges. Each hyperedge e i j can connect multiple nodes and is assigned a weight w i j ∈ W. The hypergraph G can be represented by an incidence matrix H of size | V | × | E | with entries h ( v i , v j ) = 1 if e i j ∈ E, and h ( v i , v j ) = 0 otherwise.
  • Heterogeneous graph: A heterogeneous graph is defined as G = ( V , E , T , φ , ρ ), where V and E are the sets of nodes and edges, respectively, φ : V → T v and ρ : E → T e are mapping functions, T v and T e describe the sets of node types and edge types, respectively, and | T | = | T v | + | T e | is the total number of node types and edge types.
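To make Definition 1 and its weighted variant concrete, the following minimal sketch (our own illustration, not code from any surveyed model) builds a binary and a weighted adjacency matrix for a small toy graph with numpy; the node indices and edge weights are arbitrary.

```python
import numpy as np

# Toy undirected graph with N = 4 nodes and an illustrative edge list E.
N = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

# Binary adjacency matrix A (Definition 1): A[i, j] = 1 iff e_ij is in E.
A = np.zeros((N, N), dtype=int)
for i, j in edges:
    A[i, j] = A[j, i] = 1              # symmetric, since the graph is undirected

# Weighted variant: A[i, j] = w_ij, the weight assigned to edge e_ij.
weights = {(0, 1): 0.5, (0, 2): 2.0, (1, 2): 1.0, (2, 3): 0.3}
W = np.zeros((N, N))
for (i, j), w in weights.items():
    W[i, j] = W[j, i] = w

print(A)
print(W)
```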
According to the definitions of the graph G = ( V , E ) represented mathematically above, the idea of graph embedding is to map graph entities into low-dimensional vectors with dimensionality d, where d ≪ N. Mathematically, graph embedding is formulated as follows:
Definition 2
(Graph embedding [14]). Given a graph G = ( V , E ) where V is the set of nodes and E is the set of edges, a graph embedding is a projection function ϕ ( · ) , where ϕ : V → R^d ( d ≪ | V | ) and k ( v i , v j ) ≈ ⟨ ϕ ( v i ) , ϕ ( v j ) ⟩ describes the proximity of two nodes v i and v j in the graph, with ⟨ ϕ ( v i ) , ϕ ( v j ) ⟩ the distance between the two vectors ϕ ( v i ) and ϕ ( v j ) in the vector space.
Graph representation learning aims to project graph entities into the vector space while preserving the graph structure and entity proximity. For example, if two nodes v i and v j in the graph G are connected directly, then in the vector space the distance between the two vectors ϕ ( v i ) and ϕ ( v j ) must be minimal. Figure 2 shows an example of a graph embedding model that transforms nodes in a graph to low-dimensional vectors ( Z 1 , Z 2 , … , Z n ) in the vector space.
When mapping graph entities to latent space, preserving the proximity of graph entities is one of the most important factors in preserving the graph structure and the relationship between nodes. In other words, if two nodes v i and v j are connected or close in the graph, the distance between the two vectors Z i and Z j must be minimal in the vector space. Several models [16,76,77,78] aim to preserve k-order proximity between graph entities in vector space. Formally, the k-order proximity is defined as follows:
Definition 3
(k-order proximity [79]). Given a graph G = ( V , E ) where V is the set of nodes, and E is the set of edges, k-order proximity describes the similarity of nodes with the distance captured from the k-hop in the graph G. When k = 1 , it is 1st-order proximity that captures the local pairwise proximity of two nodes in graphs. When k is higher, it could capture the global graph structure.
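As a small illustration of k-order proximity (our own sketch, not tied to any specific model in this survey), the k-hop structure around a node can be read off the k-th power of the row-normalized adjacency matrix:

```python
import numpy as np

# Adjacency matrix of a small undirected toy graph.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Row-normalized transition matrix M = D^{-1} A.
M = A / A.sum(axis=1, keepdims=True)

def k_order_proximity(M, k):
    """Entry (i, j) of M^k is the probability of reaching v_j from v_i in k steps:
    k = 1 captures local pairwise (1st-order) proximity, while larger k captures
    more global structure, in the spirit of Definition 3."""
    return np.linalg.matrix_power(M, k)

print(np.round(k_order_proximity(M, 1), 2))   # 1st-order proximity
print(np.round(k_order_proximity(M, 3), 2))   # 3rd-order (3-hop) proximity
```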
There is another way to define graph embedding from the perspective of Encoder-Decoder architecture [3]. From this perspective, the task of the encoder part is to encode graph entities into low-dimensional vectors, and the decoder part tries to reconstruct the graph from the latent space. In the real world, many graphs show dynamic behaviors, including node and edge evolution, and feature evolution [80]. Dynamic graphs are found widely in many applications [81], such as social networks where connections between friends could be added or removed over time.
Definition 4
(Dynamic graph [80]). A dynamic graph G is formed of three entities: G = ( V , E , T ), where V = { V ( t ) } is the group of node sets, E = { E ( t ) } with t ∈ T is the group of edge sets over the time span T, and T denotes the time span. From a static perspective, we can also consider a dynamic graph G = { G ( t 0 ) , G ( t 1 ) , … , G ( t n ) } as a collection of static graphs G ( t k ), where G ( t k ) = ( V ( t k ) , E ( t k ) ) denotes the static graph G at time t k , and V ( t k ) and E ( t k ) denote the set of nodes and the set of edges at time t k , respectively.
Figure 3a presents an example of dynamic graph representation. At time t + 1 , there are several changes in the graph G ( t + 1 ): the edge e 23 is removed, and node v 6 and a new edge e 56 are added. Casteigts et al. [80] proposed an alternative definition of a dynamic graph with five components: G = ( V , E , T , ρ , ζ ), where ρ : V × T → { 0 , 1 } describes the existence of each node at time t, and ζ : E × T → Ψ describes the existence of an edge at time t.
There is another way to model a dynamic graph based on the changes of the graph entities (edges, nodes) taking place on the graph G over a time span t, namely as an edge stream. From this perspective, a dynamic graph G can be modeled as G = ( V , E t , T ), where E t presents the collection of edges of the dynamic graph G at time t, and a function f : E → R + maps edges to (integer) time labels. Note that all the edges at time t have the same label. Figure 3b describes the evolution of the edges of a graph from time ( t ) to ( t + 1 ).
Definition 5
(Dynamic graph embedding [82]). Given a dynamic graph G = ( V , E , T ), where V = { V ( t ) } is the group of node sets and E = { E ( t ) } is the group of edge sets over the time span T, a dynamic graph embedding is a projection function ϕ ( · ) : G × T → R^d × T, where T describes the time domain in the latent space and T is the time span. When G is represented as a collection of snapshots G = { G ( t 0 ) , G ( t 1 ) , … , G ( t n ) }, the projection function ϕ is defined as ϕ = { ϕ ( 0 ) , ϕ ( 1 ) , … , ϕ ( n ) }, where ϕ ( t ) is the vector embedding of the graph G ( t ) at time t.
There are two ways to embed a dynamic graph G: temporal dynamic graph embedding (capturing changes over a period of time) and topological dynamic graph embedding (capturing changes in the graph structure over time).
  • Temporal dynamic graph embedding: A temporal dynamic embedding is a projection function ϕ ( · ) , where ϕ t : G t k , t × T → R^d × T and G t k , t = { G t k , G t k + 1 , … , G t } describes the collection of graphs G during the time interval [ t k , t ].
  • Topological dynamic graph embedding: A topological dynamic graph embedding for the nodes of a graph G is a mapping function ϕ , where ϕ : V × T → R^d × T.
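As a minimal illustration of the snapshot view in Definition 4 (our own sketch, built from a hypothetical timestamped edge stream), a dynamic graph can be materialized as a time-indexed list of adjacency matrices:

```python
import numpy as np

# Hypothetical edge stream: (v_i, v_j, t) triples, one per observed interaction.
edge_stream = [(0, 1, 0), (1, 2, 0), (1, 2, 1), (2, 3, 1), (3, 4, 2)]
N, T = 5, 3

# Snapshot view G = {G(t_0), ..., G(t_n)}: one adjacency matrix per time step.
snapshots = [np.zeros((N, N), dtype=int) for _ in range(T)]
for i, j, t in edge_stream:
    snapshots[t][i, j] = snapshots[t][j, i] = 1

for t, A_t in enumerate(snapshots):
    print(f"G(t_{t}): {int(A_t.sum() // 2)} edges")
```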

3. Graph Representation Learning Models

This section presents a taxonomy of existing graph representation learning models in the literature. We categorize the existing graph embedding models into five main groups based on strategies to preserve graph structures and proximity of entities in graphs, including graph kernels, matrix factorization-based models, shallow models, deep neural network models, and non-Euclidean models. Figure 4 presents the proposed taxonomy of the graph representation learning models. Furthermore, we deliver open-source implementations of graph embedding models in Appendix A.
Graph kernels and matrix factorization-based models are among the pioneering models for graph representation learning. Graph kernels are widely used to learn graph embeddings through a deterministic mapping function, particularly for graph classification tasks [83,84,85]. There are two types of graph kernels: kernels for graphs, which aim to compare the similarity between graphs, and kernels on graphs, which aim to find the similarity between nodes in graphs. Second, matrix factorization-based models represent the graph as matrices and obtain embeddings by decomposing these matrices [5,86]. There are several strategies for factorization modeling, and most of these models aim to approximate high-order proximity between nodes. However, graph kernels and matrix factorization-based models suffer from computational complexity when handling large graphs and capturing high-order proximity.
Shallow models aim to construct an embedding matrix to transform each graph entity into vectors. We categorize shallow models into two main groups: structure preservation and proximity reconstruction. Structure-preservation strategies aim to conserve structural relationships between nodes in graphs [4,14,87]. Depending on specific tasks, several sampling strategies could be employed to capture graph structures, such as random walks [4,14], graphlets [88], motifs [89,90,91], etc. By contrast, the objective of the proximity reconstruction models is to preserve the proximity of nodes in graphs [16,92]. The proximity strategies can vary across different models based on their objectives. For example, the LINE model [16] aims to preserve 1st-order and 2nd-order proximity between nodes, while PALE [77] preserves pairwise similarities.
Graph neural networks have shown great performance in learning complex graph structures [18,50]. GNNs can be categorized into three main groups: graph autoencoders [50,51], recurrent GNNs [17,93], and convolutional GNNs. Graph autoencoders and recurrent GNNs are mostly pioneering GNN studies based on autoencoder architectures and recurrent neural networks, respectively. Graph autoencoders are composed of an encoder layer and a decoder layer: the encoder layer compresses a proximity graph matrix into vector embeddings, and the decoder layer reconstructs the proximity matrix. Most graph autoencoder models employ multilayer perceptron-based layers or recurrent GNNs as the core of the autoencoder architecture. Recurrent GNNs learn node embeddings based on the recurrent neural network architecture, in which connections between neurons can form a cycle. Therefore, earlier RGNNs mainly aimed to learn embeddings on directed acyclic graphs [94]. Recurrent GNNs employ the same weights in all hidden layers to capture local and global structures. Recently, convolutional GNNs have proven much more efficient and achieve outstanding performance compared to RGNNs. The main difference between RGNNs and CGNNs is that CGNNs use different weights in each hidden layer, which allows them to distinguish local and global structures. Various CGNN models have been proposed and mainly fall into two categories: spectral CGNNs and spatial CGNNs [22,52,95]. Spectral CGNNs transform graph data to the frequency domain and learn node embeddings in that domain [56,96]. By contrast, spatial CGNNs work directly on the graph using convolutional filters [53,54]. By stacking multiple GNN layers, the models can learn node embeddings more efficiently and capture higher-order structural information [97,98]. However, stacking many layers can cause the over-smoothing problem, which most GNNs have not yet fully solved.
Recently, several models have adapted the transformer architecture to learn graph structures, achieving significant results compared to other deep-learning models [30,46,99]. We categorize graph transformer models into three main groups: transformers for tree-like graphs [64,65], transformers with GNNs [99,100], and transformers with global self-attention [30,67]. Different types of graph transformer models handle distinct types of graphs. The transformer for tree-like graphs aims to learn node embeddings in tree-like hierarchical graphs [64,65,101]. The hierarchical relationships from the target nodes to their parents and neighbors are presented as absolute and relative positional encodings, respectively. Several graph transformer models employ the message-passing mechanism from GNNs as an auxiliary module in computing the attention score matrix [61,100]. GNN layers can be used to aggregate information as input to graph transformer models or be put on top of the model to preserve local structures. In addition, some graph transformer models can process graph data directly without support from GNN layers [30,67]. These models implement global self-attention to learn local and global structures of an input graph without neighborhood constraints.
Most existing graph embedding models learn embeddings in Euclidean space, which may not provide good geometric representations and metrics. Recent studies have shown that non-Euclidean spaces are more suitable for representing complex graph structures. Non-Euclidean models can be categorized as hyperbolic, spherical, and Gaussian. Hyperbolic and spherical spaces are two types of non-Euclidean geometry that can represent different graph structures: hyperbolic space [102] is more suitable for representing hierarchical graph structures that follow the power law, while spherical space excels at representing large circular graph structures [103]. Moreover, since information about the embedding space is unknown and uncertain, several models learn node embeddings as Gaussian distributions [23,104].

3.1. Graph Kernels

Graph kernels aim to compare graphs or their substructures (e.g., nodes, subgraphs, and edges) by measuring their similarity [105]. The problem of measuring the similarity of graphs is, therefore, at the core of learning graphs in an unsupervised manner. Measuring the similarity of large graphs is problematic since the graph isomorphism problem belongs to the NP (nondeterministic polynomial time) class, and the subgraph isomorphism problem is NP-complete. Table 2 summarizes graph kernel models.
Kernel methods applied to the graph embedding problem can be understood in two forms, including the isomorphism testing of N graphs (kernels for graphs) and embedding entities of graphs to Hilbert space (kernels on graphs).
  • Kernels for graphs: Kernels for graphs aim to measure the similarity between graphs. The similarity between two graphs (isomorphism) can be explained as follows: Given two undirected graphs G 1 = ( V 1 , E 1 ) and G 2 = ( V 2 , E 2 ), G 1 and G 2 are isomorphic if there exists a bijective mapping function ϕ : V 1 → V 2 such that, for all a , b ∈ V 1 , a and b are adjacent in G 1 if and only if ϕ ( a ) and ϕ ( b ) are adjacent in G 2 .
  • Kernels on graphs: To embed nodes in graphs, kernel methods refer to finding a function that maps pairs of nodes to a latent space using particular similarity measures. Formally, graph kernels can be defined as follows: Given a graph G = ( V , E ), a function K : V × V → R is a kernel on G if there is a mapping function ϕ : V → H such that K ( v i , v j ) = ⟨ ϕ ( v i ) , ϕ ( v j ) ⟩ for any node pair ( v i , v j ).
There are several strategies to measure the similarity of pairs of graphs, such as graphlet kernels, WL kernels, random walks, and shortest paths [31,83]. Graphlet kernels are among the simplest kernel methods: they measure the similarity between graphs by counting subgraphs of a limited size k [83,106]. For instance, Shervashidze et al. [83] introduced a graphlet kernel whose main idea is to build graph features by counting the number of different graphlets in graphs. Formally, given an unlabeled graph G, a graphlet list V k = ( G 1 , G 2 , … , G n k ) is the set of graphlets of size k, where n k denotes the number of such graphlets. The graphlet kernel for two unlabeled graphs G and G′ can be defined as:

$$K(G, G') = \langle \phi(G), \phi(G') \rangle,$$

where ϕ ( G ) and ϕ ( G′ ) are vectors whose entries count the occurrences of each graphlet in G and G′, respectively. Counting all graphlets of size k in a graph is computationally expensive, since it requires enumerating $\binom{n}{k}$ size-k node subsets, where n denotes the number of nodes in G. One practical solution to overcome this limitation is to design the feature map ϕ ( G ) more effectively, as in the Weisfeiler–Lehman approach.
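A minimal sketch of this idea for size-3 graphlets, assuming small unlabeled undirected graphs given as networkx objects (normalization and the sampling tricks used in practice are omitted):

```python
from itertools import combinations
import networkx as nx
import numpy as np

def graphlet3_features(G):
    """phi(G): counts of the four possible 3-node graphlets, distinguished by
    the number of edges (0, 1, 2, or 3) among each node triple."""
    counts = np.zeros(4)
    for triple in combinations(G.nodes(), 3):
        counts[G.subgraph(triple).number_of_edges()] += 1
    return counts

def graphlet_kernel(G1, G2):
    """K(G, G') = <phi(G), phi(G')>: inner product of graphlet count vectors."""
    return float(np.dot(graphlet3_features(G1), graphlet3_features(G2)))

print(graphlet_kernel(nx.cycle_graph(5), nx.path_graph(5)))
```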
The Weisfeiler–Lehman (WL) test [31] is a classical strategy to test the isomorphism of two graphs using color refinement. Figure 5 presents the main idea of the WL isomorphism test for two graphs in detail. By iteratively updating node labels, the structural information of each node, both local and global depending on the number of iterations, is stored at that node. We can then compute histograms or other summary statistics over these labels as a vector representation for graphs.
Several models have improved on the WL isomorphism test [34,84]. The concept of the WL isomorphism test later inspired various GNN models, which aim to be as expressive as the WL test in distinguishing different graph structures. Shervashidze et al. [33] presented three instances of WL kernels, including the WL subtree kernel, WL edge kernel, and WL shortest-path kernel, with an enrichment strategy for labels. The key idea of [33] is to represent a graph G as a WL sequence of height h. The WL kernel of two graphs G and G′ can then be defined as:

$$k_{WL}^{(h)}(G, G') = k(G_0, G_0') + k(G_1, G_1') + \cdots + k(G_h, G_h'),$$

where $k(G_i, G_i') = \langle \phi(G_i), \phi(G_i') \rangle$. For N graphs, the WL subtree kernel can be computed in a runtime of O ( N h m + N^2 h n ), where h and m are the numbers of iterations and edges in G, respectively. Therefore, the algorithm can capture more information about the graph G after h iterations and compare graphs at different levels.
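The label-refinement step behind the WL subtree kernel can be sketched compactly. The following is a simplified illustration of the idea (our own code, using node degrees as initial labels and uncompressed string labels rather than the hashed relabeling of the original algorithm):

```python
from collections import Counter
import networkx as nx

def wl_label_histogram(G, h=3):
    """Run h rounds of Weisfeiler-Lehman label refinement and return the
    histogram of all labels observed across iterations (the feature map phi)."""
    labels = {v: str(G.degree(v)) for v in G.nodes()}          # initial node labels
    hist = Counter(labels.values())
    for _ in range(h):
        new_labels = {}
        for v in G.nodes():
            neigh = sorted(labels[u] for u in G.neighbors(v))
            new_labels[v] = labels[v] + "|" + ",".join(neigh)  # refined label
        labels = new_labels
        hist.update(labels.values())
    return hist

def wl_subtree_kernel(G1, G2, h=3):
    """k_WL(G, G'): dot product of the two label histograms."""
    h1, h2 = wl_label_histogram(G1, h), wl_label_histogram(G2, h)
    return sum(h1[l] * h2[l] for l in h1.keys() & h2.keys())

print(wl_subtree_kernel(nx.cycle_graph(6), nx.path_graph(6)))
```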
However, the vanilla WL isomorphism test requires massive resources, since such methods fall into an NP-hard class. Following the WL isomorphism idea, Morris et al. [34] considered the set of k-element subsets of V(G) and built local and global neighborhoods over these k-sets. Instead of working on each node in the graph, the model calculates and updates labels based on the k-sets. The feature vectors of a graph G can then be calculated by counting the number of occurrences of k-sets. Several models [84,114] improved the Wasserstein distance based on the WL isomorphism test, and models can also estimate the weights of subtree patterns before kernel construction [35]. Several models adopted a random-walk sampling strategy to capture the graph structure, which helps reduce the computational complexity when handling large graphs [36,37,85,107].
However, the above methods only focus on homogeneous graphs in which nodes do not have side information. In the real world, graph nodes could contain labels and attributes and change over time, making it challenging to learn node embeddings. Several models have been proposed with slight variations from the traditional WL isomorphism test and random walk methods [109,110,111,112,113]. For example, Borgwardt et al. [109] presented random-walk sampling on attributed edges to capture the graph structure. Since existing kernel models primarily work on small-scale graphs or a subset of graphs, improving similarity based on shortest paths could achieve better computational efficiency for graph kernels in polynomial time. An all-paths kernel K could be defined as:
$$K(P_{G_1}, P_{G_2}) = \sum_{p_1 \in P(G_1)} \sum_{p_2 \in P(G_2)} k_{path}(p_1, p_2),$$

where P ( G 1 ) and P ( G 2 ) are the sets of random-walk paths in G 1 and G 2 , respectively, and $k_{path}(p_1, p_2)$ denotes a positive definite kernel on the two paths p 1 and p 2 . The model then applies the Floyd–Warshall algorithm [115] to compute shortest-path kernels in graphs. One disadvantage of this model is its runtime complexity of about O ( k × n^4 ), where n denotes the number of nodes in the graph. Morris et al. [108] introduced a variation of the WL subtree kernel for attributed graphs by improving existing shortest-path kernels. The key idea of this model is to use a hash function that maps continuous attributes to label codes, and then normalize the discrete label codes.
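A common simplification of the shortest-path view compares histograms of all-pairs shortest-path lengths. The sketch below (our own illustration, not the exact kernel of [109] or [108]) uses networkx's Floyd–Warshall routine on small connected graphs:

```python
from collections import Counter
import networkx as nx

def sp_length_histogram(G):
    """Histogram of all-pairs shortest-path lengths, computed with Floyd-Warshall."""
    D = nx.floyd_warshall_numpy(G)                  # dense distance matrix
    n = D.shape[0]
    return Counter(int(D[i, j]) for i in range(n) for j in range(i + 1, n))

def shortest_path_kernel(G1, G2):
    """Compare two graphs via the dot product of their path-length histograms."""
    h1, h2 = sp_length_histogram(G1), sp_length_histogram(G2)
    return sum(h1[l] * h2[l] for l in h1.keys() & h2.keys())

print(shortest_path_kernel(nx.cycle_graph(6), nx.star_graph(5)))
```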
To sum up, graph kernels are effective models and bring several advantages:
  • Coverage: The graph kernels are one of the most useful functions to measure the similarity between graph entities by performing several strategies to find a kernel in graphs. This could be seen as a generalization of the traditional statistical methods [116].
  • Efficiency: Several kernel tricks have been proposed to reduce the computational cost of kernel methods on graphs [117]. Kernel tricks could reduce the number of spatial dimensions and computational complexity on substructures while still providing efficient kernels.
Although kernel methods have several advantages, several disadvantages make the kernels difficult to scale:
  • Missing entities: Most kernel models could not learn node embeddings for new nodes. In the real world, graphs are dynamic, and their entities could evolve. Therefore, the graph kernels must re-learn graphs every time a new node is added, which is time-consuming and difficult to apply in practice.
  • Dealing with weights: Most graph kernel models do not consider weighted edges, which can lead to structural information loss and reduce the expressiveness of the graph representation in the latent space.
  • Computational complexity: Computing graph kernels belongs to an NP-hard class of problems [109]. Although several kernel-based models aim to reduce the computational time by considering the distribution of substructures, this may increase the complexity and reduce the ability to capture the global structure.
Although graph kernels deliver good results on small graphs, they still have limitations when working with large and complex graphs [118]. To address this issue, matrix factorization-based models can offer advantages for learning node embeddings by decomposing the large original graphs into small-sized components. Therefore, we discuss matrix factorization-based models for learning node embeddings in the next section.

3.2. Matrix Factorization-Based Models

Matrix factorization aims to reduce the high-dimensional matrix that describes graphs (e.g., adjacency matrix, Laplacian matrix) into a low-dimensional space. Several well-known decomposition models (e.g., SVD, PCA, etc.) are widely applied in graph representation learning and recommendation system problems. Table 3 and Table 4 present matrix factorization-based models for static and dynamic graphs, respectively. Based on the strategy to preserve the graph structures, matrix factorization models could be categorized into two main groups: graph embedding Laplacian eigenmaps and node proximity matrix factorization.
  • The Laplacian eigenmaps: To learn representations of a graph G = ( V , E ), these approaches first represent G as a Laplacian matrix L, where L = D − A and D is the degree matrix [41]. In the matrix L, the positive diagonal values carry the node degrees, and the negative off-diagonal values correspond to the edge weights. The matrix L can be decomposed, and the eigenvectors associated with the smallest eigenvalues are taken as node embeddings (a minimal numerical sketch is given after this list). The optimal node embedding Z* can therefore be computed using the objective function:

    $$Z^* = \arg\min_{Z} Z^{\top} L Z.$$
  • Node proximity matrix factorization: The objective of these models is to decompose the node proximity matrix directly into small-sized matrices. In other words, the proximity of nodes in graphs is preserved in the latent space. Formally, given a proximity matrix M, the models optimize the distance between pairs of nodes v i and v j , which can be defined as:

    $$Z^* = \arg\min_{Z} \sum_{(v_i, v_j)} \left\lVert M_{ij} - Z_i Z_j^{\top} \right\rVert_2^2.$$
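As a concrete numerical sketch of the Laplacian eigenmaps idea (our own minimal example, not code from the surveyed models), the embeddings can be taken as the eigenvectors of L = D − A associated with the smallest non-trivial eigenvalues:

```python
import numpy as np

# Toy undirected graph: adjacency matrix A, degree matrix D, Laplacian L = D - A.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A

# Eigendecomposition of L; np.linalg.eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(L)

# Skip the trivial constant eigenvector (eigenvalue 0) and keep the next d
# eigenvectors as d-dimensional node embeddings Z (one row per node).
d = 2
Z = eigvecs[:, 1:1 + d]
print(np.round(Z, 3))
```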
Hofmann et al. [119] proposed the MSDC (Multidimensional Scaling and Data Clustering) model based on matrix factorization. The key idea of MSDC is to represent data points as a bipartite graph and then learn node embeddings based on node similarity in the graph. This method requires a symmetric proximity matrix M ∈ R^{N × N} as input and learns a latent representation of the data in Euclidean space by minimizing a loss that can be defined as:

$$Z^* = \arg\min_{Z} \frac{1}{2|V|} \sum_{(v_i, v_j) \in E} \left( \lVert Z_i - Z_j \rVert_2 - M_{ij} \right)^2.$$
However, the limitation of the MSDC model is that it focuses only on pairwise nodes, so it cannot capture the global graph structure. Furthermore, the model investigates the proximity of all data points in the graph, which can increase the computational complexity on large graphs. Several models [39,120] adopted k-nearest-neighbor methods to select neighboring nodes, which can capture more of the graph structure. The k-nearest methods therefore bring the advantage of reducing computational complexity, since the models only take k neighbors as inputs. For example, Han et al. [120] defined the similarity S i j between two nodes v i and v j as:
$$S_{ij} = \begin{cases} \exp\left( - \dfrac{\lVert v_i - v_j \rVert^2}{\delta^2} \right), & \text{if } v_j \in N_k(v_i), \\ 0, & \text{otherwise}, \end{cases}$$
where N k ( v i ) denotes the set of k nearest neighbors of v i in the graph. The model can measure violations of the constraints between pairs of nodes with respect to the label distribution. In addition, the model can estimate the correlation between features, which is beneficial for combining common features during the training process.
Several models [7,40,120,121,122] have been proposed to capture side information in graphs, such as attributes and labels. He et al. [42] used the locality-preserving projection (LPP) technique, a linear approximation of the nonlinear Laplacian eigenmaps, to preserve the local structural information in graphs. The model first constructs an adjacency matrix with the k nearest neighbors of each node. The model then optimizes the objective function:
$$\mathbf{a}^* = \arg\min_{\mathbf{a}} \ \mathbf{a}^{\top} X L X^{\top} \mathbf{a} \quad \text{subject to} \quad \mathbf{a}^{\top} X D X^{\top} \mathbf{a} = 1,$$
where D is a diagonal matrix, L = D − A is the Laplacian matrix, and a is the transformation vector defining the linear embedding x i → y i = a^⊤ x i . Nevertheless, the idea from [42] only captures the structure within the k nearest neighbors, which fails to capture the global similarity between nodes in the graph. Motivated by these limitations, Cao et al. [15] introduced the GraRep model, which considers a k-hop neighborhood of each target node. Accordingly, GraRep can capture global structural information in graphs. The model works with the k-order probability transition matrix (proximity matrix) M^k, which can be defined as:
$$M^{k} = \underbrace{M \cdots M}_{k \text{ times}},$$

where M = D^{−1} A, D is the degree matrix, A is the adjacency matrix, and M^k_{ij} represents the k-step transition probability from node v i to node v j . The loss function is thus the sum of k transition loss functions:

$$L_k(v_i) = \sum_{v_j \in N(v_i)} M_{ij}^{k} \log \sigma\!\left( Z_i^{\top} Z_j \right) - N_{neg} \sum_{v_m \sim P_n(v)} M_{im}^{k} \log \sigma\!\left( - Z_i^{\top} Z_m \right).$$
To construct the vector embeddings, GraRep decomposed the transition matrix into small-sized matrices using SVD matrix factorization. Similarly, Li [123] introduced NECS (Learning network embedding with the community) to capture the high-order proximity using Equation (11).
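A minimal sketch of this pipeline under simplifying assumptions (dense matrices, and the log-shifted positive matrix of the original GraRep replaced by a plain clipped log): for each step k, build the transition matrix M^k, form a non-negative log matrix, factorize it with a truncated SVD, and concatenate the per-step embeddings.

```python
import numpy as np

def grarep_like_embeddings(A, k_max=3, d=2, eps=1e-8):
    """Simplified GraRep-style embeddings: factorize a clipped log transition
    matrix for each step k with truncated SVD and concatenate the results."""
    M = np.diag(1.0 / A.sum(axis=1)) @ A              # 1-step transition matrix
    Mk = np.eye(A.shape[0])
    blocks = []
    for _ in range(k_max):
        Mk = Mk @ M                                    # k-step transition matrix M^k
        X = np.maximum(np.log(Mk + eps) - np.log(1.0 / A.shape[0]), 0.0)
        U, S, _ = np.linalg.svd(X)
        blocks.append(U[:, :d] * np.sqrt(S[:d]))       # rank-d factor as embeddings
    return np.concatenate(blocks, axis=1)              # concatenate over all k

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(grarep_like_embeddings(A).shape)                 # (4, k_max * d)
```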
Table 3. A summary of matrix factorization-based models for static graphs. C indicates the number of clusters in graphs, $\mathcal{N}(Z_i \mid \mu_c, \Sigma_c)$ refers to the multivariate Gaussian distribution for each cluster, L denotes the Laplacian matrix, $H \in \mathbb{R}^{n \times k}$ is the probability matrix that a node belongs to a cluster, U denotes the coefficient vector, and $W_{ij}$ is the weight on edge $(v_i, v_j)$.

| Models | Graph Types | Tasks | Loss Function |
|---|---|---|---|
| SLE [39] | Static graphs | Node classification | $\sum_{v_i \in V} \sum_{v_j \in V} W_{ij} \lVert Z_i - s_{ij} Z_j \rVert_2^2$ |
| [120] | Attributed graphs | Node classification | $\arg\min_{W} \mathrm{Tr}(U^{\top} X L X^{\top} U) + \alpha \sum_{v_i \in V} \sum_{v_j \in N(v_i)} \lVert Z_i - Z_j \rVert_2^2 + \alpha_1 L_1 + \alpha_2 L_2$ |
| [7] | Attributed graphs | Community detection | $\sum_{(v_i, v_j) \in E} \log \sigma(Z_i^{\top} Z_j) - \sum_{v_i \in V} \sum_{v_j \in N(v_i)} \log \sigma(Z_i^{\top} Z_j) - N_{neg} \sum_{v_k \sim P_n(v)} \log \sigma(Z_i^{\top} Z_k) - \sum_{v_i \in V} \sum_{c=1}^{C} \mathcal{N}(Z_i \mid \mu_c, \Sigma_c)$ |
| LPP [42] | Attributed graphs | Node classification | $\frac{1}{\lvert V \rvert} \sum_{v_i \in V} \lVert y_i - \hat{y}_i \rVert_2^2$ |
| [121] | Attributed graphs | Graph reconstruction | $\frac{1}{\lvert V \rvert} \sum_{v_i \in V} \lVert y_i - \hat{y}_i \rVert_2^2$ |
| [40] | Static graphs | Node clustering | $\sum_{v_i \in V} \sum_{c=1}^{C} \lVert Z_i - \mu_c \rVert$ |
| GLEE [122] | Attributed graphs | Graph reconstruction, Link prediction | $\lVert L - \hat{L} \rVert^2$ |
| LPP [42] | Static graphs | Node classification | $\sum_{(v_i, v_j) \in E} \lVert Z_i - Z_j \rVert_2^2$ |
| GraRep [15] | Static graphs | Node classification, Node clustering | $\sum_{v_i \in V} \sum_{v_j \in N(v_i)} A_{ij}^{l} \log \sigma(Z_i^{\top} Z_j) - \lvert N_{neg} \rvert \sum_{v_k \sim P_n(v)} A_{ik}^{l} \log \sigma(Z_i^{\top} Z_k)$ |
| NECS [123] | Static graphs | Graph reconstruction, Link prediction, Node classification | $\lVert M - \hat{M} \rVert_F^2 + \alpha_1 \lVert H - \hat{H} \rVert_F^2 + \alpha_2 \lVert H \hat{H}^{\top} - I \rVert_F^2$ |
| HOPE [5] | Static graphs | Graph reconstruction, Link prediction, Node classification | $\lVert M - Z \cdot Z^{\top} \rVert_F^2$ |
| [124] | Static graphs | Link prediction | $\sum_{(v_i, v_j) \in S} \lVert A_{ij} - Z_i^{\top} Z_j \rVert_2^2$ |
| AROPE [86] | Static graphs | Graph reconstruction, Link prediction, Node classification | $\lVert M - Z \cdot Z^{\top} \rVert_F^2$ |
| ProNE [43] | Static graphs | Node classification | $\sum_{v_i \in V} \sum_{v_j \in N(v_i)} \log \sigma(Z_i^{\top} Z_j) + \sum_{v_k \sim P_n(v)} \log \sigma(-Z_i^{\top} Z_k)$ |
| ATP [6] | Static graphs | Link prediction | $\lVert M - Z \cdot Z^{\top} \rVert_F^2$ |
| [125] | Static graphs | Graph partition | $\sum_{(v_i, v_j) \in E} \sum_{v_i, v_j \in V_k} W_{ij} \lVert Z_i^{(k)} - Z_j^{(k)} \rVert^2 + \sum_{v_i \in V_k} \lVert Z_i - \hat{Z}_i \rVert_2^2$ |
| NRL-MF [126] | Static graphs | Node classification | $\sum_{v_i \in V} \sum_{v_j \in N(v_i)} \lVert Z_i - Z_j \rVert_2^2$ |
In terms of node proximity based on neighbor relations, Ou et al. [5] presented HOPE, an approach for preserving structural information in graphs using k-order proximity. In contrast to GraRep, HOPE tries to solve the asymmetric transitivity problem in directed graphs by approximating high-order proximity. The objective function to be minimized for the proximity approximation can be defined as:

$$Z^* = \arg\min_{Z} \sum_{i,j} \left\lVert M_{ij} - Z_i Z_j^{\top} \right\rVert_2^2,$$

where M is the high-order proximity matrix, whose entry M i j represents the proximity of two nodes v i and v j , and Z i and Z j denote the vector embeddings of v i and v j , respectively. The proximity matrix M can be measured by decomposing it into two small-sized matrices, M = M_g^{−1} · M_l. Several common criteria can measure the node proximity, such as the Katz Index [127], Rooted PageRank [128], Adamic–Adar [129], and Common Neighbors. Coskun and Mustafa [124] suggested changes to the proximity measure formulas of the HOPE model: for nodes with a small degree, singular values can become zero after measuring the node proximity. To solve this problem, they added a parameter σ to regularize the graph Laplacian.
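A minimal sketch of this idea with the Katz index as the proximity measure (our own simplification; the original HOPE uses a generalized SVD that avoids forming M explicitly): M = M_g^{−1} M_l with M_g = I − βA and M_l = βA, factorized by a truncated SVD into source and target embeddings.

```python
import numpy as np

def hope_katz_embeddings(A, d=2, beta=0.05):
    """HOPE-style embeddings with Katz proximity M = (I - beta*A)^(-1) (beta*A).
    A truncated SVD of M gives source embeddings Zs and target embeddings Zt,
    so that Zs @ Zt.T approximates M (useful for directed graphs)."""
    n = A.shape[0]
    M = np.linalg.inv(np.eye(n) - beta * A) @ (beta * A)
    U, S, Vt = np.linalg.svd(M)
    Zs = U[:, :d] * np.sqrt(S[:d])        # source embeddings
    Zt = Vt[:d, :].T * np.sqrt(S[:d])     # target embeddings
    return Zs, Zt

# Small directed toy graph (beta must stay below 1 / spectral radius of A).
A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)
Zs, Zt = hope_katz_embeddings(A)
print(Zs.shape, Zt.shape)
```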
A few models have been proposed with the same idea as HOPE and GraRep [43,86]. For example, the ProNE model [43] uses a k-term Chebyshev expansion to avoid eigendecomposition, instead of the k-order proximity used in HOPE. Sun et al. [6] introduced a similar approach for preserving asymmetric transitivity with high-order proximity. The significant difference, however, is that they proposed a strategy to break directed acyclic graphs while preserving the graph structure; non-negative matrix factorization can then be applied to produce an embedding matrix. Several models [125,130,131] mainly focused on the pointwise mutual information (PMI) of nodes in graphs, which calculates the connection between nodes in terms of linear and nonlinear independence. Equation (5) is used to learn node embeddings.
Several models aimed to reduce the computational complexity of matrix factorization by improving the sampling strategies [126,132,133]. For instance, the key idea of the NRL-MF model [126] was to use a hashing function for computing dot products. Each node is presented as a binarized vector by a hashing function, which can be computed faster with XOR operators. The model can learn the binary and quantized codes based on matrix factorization while preserving high-order proximity. Qiu et al. [133] targeted sparse matrix factorization: they implemented random-walk sampling on graphs to construct a NetMF matrix sparsifier. The RNP model [132] explored in-depth vector embeddings based on personalized PageRank (PPR) values, and then approximated the PPR matrices.
Table 4. A summary of matrix factorization-based models for heterogeneous graphs and dynamic graphs. $H \in \mathbb{R}^{n \times k}$ is the probability matrix that a node belongs to a cluster, $E^{(t)}$ is the edge matrix with type t, $W_{ij}$ is the weight on $(v_i, v_j)$, r denotes the relation type, and $E^{(1,2)}$ is the set of edges in the two component graphs $G_1$ and $G_2$.

| Models | Graph Types | Tasks | Loss Function |
|---|---|---|---|
| DBMM [134] | Dynamic graphs | Node classification, Node clustering | $\lVert A - \hat{A} \rVert_F^2$ |
| [135] | Dynamic graphs | Link prediction | $\lVert A - \hat{A} \rVert_2^2 + \alpha L_2$ |
| [136] | Dynamic graphs | Link prediction | $\lVert A - \hat{A} \rVert_F^2$ |
| LIST [137] | Dynamic graphs | Link prediction | $\lVert A - \hat{A} \rVert_F^2 + L_2$ |
| TADW [131] | Attributed graphs | Node classification | $\lVert M - W^{\top} H X \rVert_F^2 + \alpha \left( \lVert W \rVert_F^2 + \lVert H \rVert_F^2 \right)$ |
| PME [138] | Heterogeneous graphs | Link prediction | $\sum_{(v_i, v_j) \in E^{(r)}} \sum_{(v_i, v_k) \notin E^{(r)}} \left[ \lVert Z_i - Z_j \rVert_2^2 - \lVert Z_i - Z_{n_k} \rVert_2^2 + m \right]$ |
| EOE [139] | Heterogeneous graphs | Node classification | $\sum_{(v_i, v_j) \in E^{(1,2)}} \sigma(Z_i^{\top} Z_j) - \sum_{(v_l, v_k) \notin E^{(1,2)}} \sigma(Z_i^{\top} Z_j) + \alpha_1 L_1 + \alpha_2 L_2$ |
| [130] | Heterogeneous graphs | Link prediction | $\sum_{(v_i, v_j) \in E^{(t)}} \lVert Z_i^{(t)} - \hat{Z}_i^{(t)} \rVert_F^2 + \alpha_1 L_1 + \alpha_2 L_2$ |
| ASPEM [140] | Heterogeneous graphs | Node classification, Link prediction | $\sum_{v_i \in V} \sum_{(v_i, v_j, r) \in E} \log p(v_i \mid v_j, r)$ |
| MELL [141] | Heterogeneous graphs | Link prediction | $\sum_{(v_i, v_j) \in E} \sigma(Z_i^{\top} Z_j) - \sum_{(v_i, v_k) \notin E} \sigma(1 - Z_i^{\top} Z_k) + \alpha L_2$ |
| PLE [142] | Attributed graphs | Node classification | $\sum_{(v_i, v_j) \in E} \log \sigma(Z_i^{\top} Z_j) + N_{neg} \, \mathbb{E}_{v_k \sim P_n(v_k)} \log \sigma(-Z_i^{\top} Z_k)$ |
In the real world, graphs often contain attributes for nodes and edges, such as user profiles in a social network. These attributes provide helpful information that improves node representation and helps in learning node embeddings. Yang et al. [131] proposed the TADW model by expressing the DeepWalk model as a matrix factorization and integrating text features into the factorization. Ren et al. [142] introduced the PLE model to jointly learn different types of nodes and edges with text attributes. Since existing models often ignore label noise, PLE is the first work to investigate noisy type labels by measuring the similarity between entities and type labels.
Beyond static and homogeneous graphs, several models have been proposed to learn embeddings in dynamic and heterogeneous graphs. The embedding models for dynamic graphs are essentially the same as for static graphs, including Laplacian eigenmaps methods and node proximity matrix factorization, applied to model relations in dynamic graphs over time. For Laplacian eigenmaps methods, Li et al. [81] presented the DANE (Dynamic Attributed Network Embedding) model to learn node embeddings in dynamic graphs. The main idea of the DANE model is to represent a Laplacian matrix as $L_A^{(t)} = D_A^{(t)} - A^{(t)}$, where $A^{(t)} \in \mathbb{R}^{n \times n}$ is the adjacency matrix of the dynamic graph at time t and $D_A^{(t)}$ is the diagonal degree matrix; the model can then learn node embeddings over time in an online manner. To preserve the node proximity, the DANE model minimizes the loss function:

$$L(v_i, v_j) = \sum_{(v_i, v_j),\, i \neq j} A_{ij}^{(t)} \, \lVert Z_i - Z_j \rVert_2^2 .$$

The eigenvectors a of the Laplacian matrix, with eigenvalues λ, can be calculated by solving the generalized eigenproblem $L_A^{(t)} a = \lambda D_A^{(t)} a$, where $a = [a_0, a_1, \ldots, a_N]$ collects the eigenvectors.
Several models apply node proximity matrix factorization directly to dynamic graphs by updating the proximity matrix between entities as the dynamic graph evolves. Rossi et al. [134] represented dynamic graphs as a set of static graph snapshots G = { G ( t 0 ) , G ( t 1 ) , … , G ( t N ) }. The model then learns a transition proximity matrix T, which describes all transitions of the dynamic graph. For evaluation, they predict the graph G at time t + 1 as $\hat{G}^{t+1} = G^{t} T^{t+1}$ and estimate the error using the Frobenius loss $\lVert \hat{G}^{t+1} - G^{t+1} \rVert_F$. Zhu et al. [135,137] aimed to preserve the graph structure based on temporal matrix factorization during the network evolution. Given an adjacency matrix A(t) at time t, two temporal rank-k factors U and V(t) are learned such that A(t) = f(U V(t)), and the objective is to minimize the loss function L_A, which can be defined as:

$$L_A = \sum_{t=1}^{T} \frac{D(t)}{2} \left\lVert A(t) - \hat{A}(t) \right\rVert_F^2 .$$
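A toy sketch of this temporal factorization (our own simplification: f is taken as the identity, V(t) is a free per-snapshot factor rather than a function of t, and regularization is omitted), fitted by plain gradient descent:

```python
import numpy as np

def temporal_mf(snapshots, k=2, steps=2000, lr=0.01, seed=0):
    """Fit one shared factor U and one per-snapshot factor V_t by gradient
    descent so that U @ V_t.T approximates each adjacency snapshot A(t)."""
    n = snapshots[0].shape[0]
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n, k))
    Vs = [rng.normal(scale=0.1, size=(n, k)) for _ in snapshots]
    for _ in range(steps):
        for A_t, V_t in zip(snapshots, Vs):
            R = U @ V_t.T - A_t          # residual for this snapshot
            U -= lr * (R @ V_t)          # gradient step for U (constants folded into lr)
            V_t -= lr * (R.T @ U)        # gradient step for V_t
    return U, Vs

A0 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
A1 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
U, Vs = temporal_mf([A0, A1])
print(np.round(U @ Vs[1].T, 2))          # reconstruction of the second snapshot
```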
Matrix factorization models have been successfully applied to graph embedding, mainly for the node embedding problem. Most models are based on singular value decomposition to find eigenvectors in the latent space. There are several advantages of matrix factorization-based models:
  • Training data requirement: Matrix factorization-based models do not need much data to learn embeddings. Compared to other methods, such as neural network-based models, they are advantageous when little training data is available.
  • Coverage: Since the graph is represented as a Laplacian matrix L or a transition matrix M, the models can capture the proximity of all pairs of nodes in the graph. Every pair of nodes is observed at least once in the matrix, which enables the models to handle sparse graphs.
Although matrix factorization is widely used in graph embedding problems, it still has several limitations:
  • Computational complexity: Matrix factorization suffers from high time and memory complexity for large graphs with millions of nodes, mainly because of the cost of decomposing the matrix into a product of smaller matrices [15].
  • Missing values: Models based on matrix factorization cannot handle incomplete graphs with unseen or missing values [143,144]. When graph data are insufficient, these models cannot learn generalized vector embeddings; neural network models are therefore needed to generalize over graphs and better predict graph entities.

3.3. Shallow Models

This section focuses on shallow models for mapping graph entities into vector space. These models mainly aim to map nodes, edges, and subgraphs as low-dimensional vectors while preserving the graph structure and entity proximity. Typically, the models first implement a sampling technique to capture graph structure and proximity relation and then learn embeddings based on shallow neural network algorithms. Several sampling strategies could be taken to capture the local and global information in graphs [14,145,146]. Based on the sampling strategy, we divide shallow models into two main groups: structure preservation and proximity reconstruction.
  • Structure preservation: The primary concept of these approaches is to define sampling strategies that could capture the graph structure within fixed-length samples. Several sampling techniques could capture both local and global graph structures, such as random-walk sampling, role-based sampling, and edge reconstruction. The model then applies shallow neural network algorithms to learn vector embeddings in the latent space in an unsupervised learning manner. Figure 6a shows an example of a random-walk-based sampling technique in a graph from a source node v s to a target node v t .
  • Proximity reconstruction: It refers to preserving a k-hop relationship between nodes in graphs. The relation between neighboring nodes in the k-hop distance should be preserved in the latent space. For instance, Figure 6b presents a 3-hop proximity from the source node v s .
In general, shallow models have achieved many successes in the past decade [4,14,21]. However, there are several disadvantages of shallow models:
  • Unseen nodes: When a new node appears in a graph, shallow models cannot learn an embedding for it. To obtain embeddings for new nodes, the models must generate new samples, for example, by re-executing random-walk sampling to produce paths for the new nodes, and must then be re-trained. These re-sampling and re-training procedures make shallow models difficult to apply in practice.
  • Node features: Shallow models such as DeepWalk and Node2Vec mainly work on homogeneous graphs and ignore the attributes and labels of nodes. In the real world, however, many graphs carry attributes and labels that could be informative for graph representation learning. Only a few studies have investigated node and edge attributes and labels, and the domain knowledge required for heterogeneous and dynamic graphs makes such models inefficient and increases their computational complexity.
  • Parameter sharing: Shallow models cannot share parameters during training. From a statistical perspective, parameter sharing could reduce the computational time and the number of weight updates required during training.

3.3.1. Structure-Preservation Models

Choosing a strategy to capture the graph structure is essential for shallow models to learn vector embeddings. The graph structure can be sampled through connections between nodes in graphs or substructures (e.g., subgraphs, motifs, graphlets, roles, etc.). Table 5 briefly summarizes structure-preservation models for static and homogeneous graphs.
Over the last decade, various models have been proposed to capture the graph structure and learn embeddings [4,21,147,148]. Among them, random-walk-based strategies can be considered one of the most typical ways to sample the graph structure [4,14]. The main idea of the random-walk strategy is to gather information about the graph structure by generating paths that can be treated like sentences in documents. A random walk can be defined as follows:
Definition 6
(Random walk [14]). Given a graph $G = (V, E)$, where V is the set of nodes and E is the set of edges, a random walk of length l is a process that starts at a node $v_i \in V$ and moves to one of its neighbors at each time step. The step is repeated until the length l is reached.
Two models, DeepWalk [14] and Node2Vec [4], can be considered pioneering models that opened a new direction for learning node embeddings.
Motivated by the disadvantages of matrix factorization-based models, DeepWalk preserves node neighborhoods through random-walk sampling, which can capture global information in graphs. Moreover, both DeepWalk and Node2Vec maximize the probability of observing node neighbors by stochastic gradient descent on a single-layer neural network, which reduces running time and computational complexity. DeepWalk [14] is a simple node embedding model that uses random-walk sampling to generate node sequences and treats them as word sentences. The objective of DeepWalk is to maximize the probability of the set of neighbor nodes $N(v_i)$ given a target node $v_i$. Formally, the optimization problem could be defined as:
$\phi^{*}(\cdot) = \underset{\phi(\cdot)}{\arg\min}\; -\log p\big(N(v_i) \mid \phi(v_i)\big)$
where $v_i$ denotes the target node, $N(v_i)$ is the set of neighbors of $v_i$, which can be generated by random-walk sampling, and $\phi(v_i)$ is the mapping function $\phi : v_i \in V \rightarrow \mathbb{R}^{|V| \times d}$. Two strategies can be used to find the neighbors of a source node, based on Breadth-First Search (BFS) and Depth-First Search (DFS). The BFS strategy provides a microscopic view that captures the local structure, whereas the DFS strategy delivers global structural information about the graph. DeepWalk then uses a skip-gram model and stochastic gradient descent (SGD) to learn the latent representations.
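As an illustration of this pipeline, the following sketch (our own simplified example, not the authors' reference implementation) generates uniform random walks with networkx and feeds them to a skip-gram model via gensim's Word2Vec; the example graph, walk length, and hyperparameters are arbitrary choices.

```python
# A minimal DeepWalk-style sketch: uniform random walks fed to a skip-gram model.
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walk(G, start, length=10):
    walk = [start]
    while len(walk) < length:
        neighbors = list(G.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return [str(v) for v in walk]          # the skip-gram model expects string tokens

G = nx.karate_club_graph()                  # small example graph
walks = [random_walk(G, v) for v in G.nodes() for _ in range(10)]

# sg=1 selects the skip-gram architecture; window plays the role of the context size.
model = Word2Vec(sentences=walks, vector_size=64, window=5,
                 min_count=0, sg=1, epochs=5)
embedding_of_node_0 = model.wv["0"]         # 64-dimensional embedding of node 0
```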
Table 5. A summary of structure-preservation models for homogeneous and static graphs. K indicates the number of clusters in the graph, and μ k refers to the mean value of cluster k.
| Models | Graph Types | Tasks | Loss Function |
| --- | --- | --- | --- |
| DeepWalk [14] | Static graphs | Node classification | $-\sum_{v_i \in V} y_i \log(\hat{y}_i)$ |
| Node2Vec [4] | Static graphs | Node classification, Link prediction | $-\sum_{v_i \in V} y_i \log(\hat{y}_i)$ |
| WalkLets [147] | Static graphs | Node classification | $-\sum_{v_i \in V} y_i \log(\hat{y}_i)$ |
| Div2Vec [149] | Static graphs | Link prediction | $-\sum_{v_i \in V} y_i \log(\hat{y}_i)$ |
|  | Static graphs | Node classification | $-\sum_{v_i \in V} y_i \log(\hat{y}_i)$ |
| Node2Vec+ [148] | Static graphs | Node classification | $-\sum_{v_i \in V} y_i \log(\hat{y}_i)$ |
| Struct2Vec [21] | Static graphs | Node classification | $-\sum_{v_i \in V, (v_i, v_j) \in E} \log \sigma(Z_i^{\top} Z_j) - \lvert N_{neg} \rvert \, \mathbb{E}_{v_k \sim P_n(v)} \log \sigma(-Z_i^{\top} Z_k)$ |
| DiaRW [150] | Static graphs | Node classification, Link prediction | $-\sum_{v_i \in V} y_i \log(\hat{y}_i) - (1 - y_i) \log(1 - \hat{y}_i)$ |
| Role2Vec [151] | Attributed graphs | Link prediction | $\sum_{k=1}^{K} \sum_{v_i \in V_k} \Vert Z_i - \mu_k \Vert_2^2$ |
| NERD [152] | Directed graphs | Link prediction, Graph reconstruction, Node classification | $-\sum_{(v_i, v_j) \in E} \big[ \log \sigma(Z_i^{\top} Z_j) + N_{neg} \, \mathbb{E}_{v_k \sim P_n(v_k)} \log \sigma(-Z_i^{\top} Z_k) \big]$ |
| Sub2Vec [153] | Static graphs | Community detection, Graph classification | $\sum_{k=1}^{K} \sum_{v_i \in V_k} \Vert Z_i - \mu_k \Vert_2^2$ |
| Subgraph2Vec [145] | Static graphs | Graph classification, Clustering | $-\sum_{v_i \in V} y_i \log(\hat{y}_i)$ |
| RUM [89] | Static graphs | Node classification, Graph reconstruction | $-\sum_{v_i \in V} y_i \log(\hat{y}_i)$ |
| Gat2Vec [154] | Attributed graphs | Node classification, Link prediction | $-\sum_{v_i \in V, (v_i, v_j) \in E} \log \sigma(Z_i^{\top} Z_j) - \lvert N_{neg} \rvert \, \mathbb{E}_{v_k \sim P_n(v)} \log \sigma(-Z_i^{\top} Z_k)$ |
| ANRLBRW [155] | Attributed graphs | Node classification | $-\sum_{v_i \in V, (v_i, v_j) \in E} \log \sigma(Z_i^{\top} Z_j) - \lvert N_{neg} \rvert \, \mathbb{E}_{v_k \sim P_n(v)} \log \sigma(-Z_i^{\top} Z_k)$ |
| Gl2Vec [88] | Static graphs | Node classification | $-\sum_{v_i \in V} y_i \log(\hat{y}_i)$ |
One limitation of DeepWalk is that it can only capture the graph structure as-is and cannot steer the random-walk sampling to enrich the quality of the sampled structure. To overcome this limitation, Grover and Leskovec introduced Node2Vec [4], which uses a flexible random-walk sampling strategy to guide the walk at each time step. The key difference between DeepWalk and Node2Vec is that, instead of truncated random walks, Node2Vec uses a biased random-walk sampling process with two parameters (p and q) to adjust the walks on the graph. Figure 7a presents the two parameters p and q of the Node2Vec model in detail. By constraining how the next visited node is chosen, the model can capture more local and global information about the graph structure.
Perozzi et al. [147] presented the WalkLets model, which was extended from the DeepWalk model. They modified the random-walk sampling strategy to capture more graph structure information by skipping and passing over multiple nodes at each time step. Therefore, these sampling strategies can capture more global graph structure by the power of the transition matrix when passing over multiple nodes. The main idea of the WalkLets model is to represent the random-walk paths as pairs of nodes in the multi-scale direction. Figure 7b depicts the sampling strategy of the WalkLets model using multi-scale random-walk paths. However, one of the limitations of the WalkLets is that the model could not distinguish local and global structures when passing and skipping over nodes in graphs. Jisu et al. [149] presented a variation of DeepWalk, named Div2Vec model. The main difference between the two models is the way that Div2Vec chooses the next node in the random-walk path, which will be visited based on the degree of neighboring nodes. The focus on the degree of neighboring nodes could help the models learn the importance of nodes that are popular in social networks. Therefore, at the current node v i , the probability of choosing the next node v j in a random-walk path is calculated as:
$p(v_j \mid v_i) = \frac{f(\deg(v_j))}{\sum_{v_k \in N(v_i)} f(\deg(v_k))}$
where $\deg(v_j)$ denotes the degree of node $v_j$, and $f(\deg(v_j)) = \frac{1}{\deg(v_j)}$. Renming et al. [148] presented Node2Vec+, an improved version of Node2Vec. One limitation of the Node2Vec model is that it cannot determine the following nodes based on the state of the target nodes. In contrast, the Node2Vec+ model can determine the state of the potential edge for a given node, thereby enhancing the navigability of Node2Vec to capture more graph structure. In particular, they introduced three neighboring edge states from a current node (out edge, noisy edge, and in edge), which are evaluated to decide the next step. For a potential out edge $(v_i, v_j) \in E$ reached from the previous node $t$, the in–out parameters p and q of Node2Vec are re-defined as a bias factor $\alpha$:
$\alpha_{pq}(t, v_i, v_j) = \begin{cases} \frac{1}{p} & \text{if } v_j = t, \\ 1 & \text{if } w(v_j, t) \geq \tilde{d}(v_j), \\ \min\left\{1, \frac{1}{p}\right\} & \text{if } w(v_j, t) < \tilde{d}(v_j) \text{ and } w(v_j, t) < \tilde{d}(v_i), \\ \frac{1}{q} + \left(1 - \frac{1}{q}\right) \frac{w(v_j, t)}{\tilde{d}(v_i)} & \text{if } w(v_j, t) < \tilde{d}(v_j) \text{ and } w(v_j, t) \geq \tilde{d}(v_i), \end{cases}$
where $\tilde{d}(\cdot)$ denotes a noisy-edge threshold, which is used to judge the state of the potential next node $v_j$ reached from the previous node $t$ and can be viewed in terms of edge weights, and $w(v_i, v_j)$ is the weight of the edge between $v_i$ and $v_j$.
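For intuition, the sketch below implements only the vanilla Node2Vec bias with the in–out parameters p and q (it does not include the Node2Vec+ edge states or edge weights); the sampling helper and parameter values are illustrative assumptions.

```python
# A simplified sketch of the 2nd-order biased (p, q) transition used by Node2Vec.
# The full model also handles edge weights and precomputes alias tables.
import random
import networkx as nx

def next_node(G, prev, curr, p=1.0, q=2.0):
    """Sample the next node of a biased walk that just moved prev -> curr."""
    candidates, weights = [], []
    for x in G.neighbors(curr):
        if x == prev:                 # returning to the previous node
            w = 1.0 / p
        elif G.has_edge(x, prev):     # staying close to prev (BFS-like move)
            w = 1.0
        else:                         # moving outward (DFS-like move)
            w = 1.0 / q
        candidates.append(x)
        weights.append(w)
    return random.choices(candidates, weights=weights, k=1)[0]

G = nx.karate_club_graph()
walk = [0, list(G.neighbors(0))[0]]
for _ in range(8):
    walk.append(next_node(G, walk[-2], walk[-1]))
print(walk)
```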
In contrast to preserving graph topology, which mainly focuses on distance relations, several models aim to preserve the role and importance of nodes in graphs. In social networks, for example, this makes it possible to discover influencers who can impact the activities of entire communities. In contrast to random-walk-based techniques, several studies [21,150] used the term “role-based” to describe preserving the roles of nodes, which fixed-length random-walk sampling strategies cannot capture. By preserving node roles, role-based models can capture structural equivalence. Ribeiro et al. [21] introduced the Struc2Vec model to capture graph structure based on node roles: nodes with the same degree should be encoded close to each other in the vector space. Given a graph G, they construct k graphs, each of which can be considered one layer; each layer describes weighted node-degree similarity at a different hop distance. Specifically, at layer $L_k$, for each node $v_i \in V$, there are three probabilities: moving to a node $v_j$ in the same layer, jumping to the previous layer $L_{k-1}$, or jumping to the next layer $L_{k+1}$:
$p_k(v_i^k, v_j^k) = \frac{e^{-f_k(v_i^k, v_j^k)}}{Z_k(v_i^k)}$
$p_k(v_i^k, v_i^{k+1}) = \frac{w(v_i^k, v_i^{k+1})}{w(v_i^k, v_i^{k+1}) + w(v_i^k, v_i^{k-1})}$
$p_k(v_i^k, v_i^{k-1}) = 1 - p_k(v_i^k, v_i^{k+1})$
where $f_k(v_i, v_j)$ denotes the role-based distance between nodes $v_i$ and $v_j$, and $w(\cdot)$ denotes the edge weight. Zhang et al. [150] presented the DiaRW model, which uses a random-walk strategy based on node degree. The difference between other role-based models and DiaRW is that DiaRW uses random walks whose length can vary with the node degree. One limitation of Struc2Vec is that it cannot preserve the similarity of nodes in graphs. Motivated by this limitation, the DiaRW model captures structural identity based on node degree and on neighborhoods containing high-degree nodes. The purpose of this model is to collect structural information around higher-order nodes, which is a limitation of models based on fixed-length random walks. Ahmed et al. [151] introduced the Role2Vec model, which captures node similarity and structure by introducing a node-type parameter to guide the random-walk paths. The core idea of Role2Vec is that nodes in the same cluster should be sampled together in the random-walk path. By sampling only nodes in the same cluster, Role2Vec can learn the correct patterns with reduced computational complexity. The model then uses the skip-gram model to learn node embeddings. Unlike Role2Vec, the NERD model [152] considers the asymmetric roles of nodes in directed graphs. The model samples neighboring nodes using an alternating random walk. The probability of moving to the next node $v_{i+1}$ from the current node $v_i$ in the random-walk path could be defined as:
$p(v_{i+1} = v_j \mid v_i) = \begin{cases} \frac{1}{d_{out}(v_i)} \cdot w(v_i, v_j) & \text{if } (v_i, v_j) \in E, \\ \frac{1}{d_{in}(v_i)} \cdot w(v_j, v_i) & \text{if } (v_j, v_i) \in E, \\ 0 & \text{otherwise}, \end{cases}$
where $w(v_i, v_j)$ is the weight of the edge $e_{ij}$, and $d_{in}(v_i)$ and $d_{out}(v_i)$ denote the total in-degree and out-degree of node $v_i$, respectively.
In some types of graphs, nodes in the same subgraph tend to have similar labels, and learning only low-level node representations may not generalize well. Instead of embedding individual nodes, several studies aim to learn the similarity of subgraphs or of whole graphs. Inspired by representations of sentences and documents in natural language processing (NLP), Bijaya et al. [153] proposed the Sub2Vec model to embed each subgraph into a vector.
To learn embeddings for a set of subgraphs $S = \{G_1, G_2, \ldots, G_n\}$ of an original graph G, two properties should be preserved: the similarity property and the structural property. The former ensures the connectivity among subgraph nodes by collecting sets of paths within a subgraph. The latter ensures that each node in a subgraph is densely connected to all other nodes in the same subgraph. Figure 8 presents the two subgraph properties that capture the connectivity and structure of each subgraph.
In contrast to Sub2Vec, Subgraph2Vec [145] aims to learn rooted subgraph embeddings for detecting Android malware. One advantage of Subgraph2Vec over Sub2Vec is that it considers rooted subgraphs of different degrees surrounding the target subgraph, whereas Sub2Vec is oriented toward community detection. Annamalai et al. [156] targeted embedding the entire graph into the latent space. With the same idea as Subgraph2Vec, they extracted the set of subgraphs from the original graph using the WL relabeling strategy. The difference, however, is that they used the Doc2Vec model, treating graphs as documents, to learn graph embeddings.
Most of the models mentioned above capture the graph structure through low-level node representations, which can fail to represent higher-level structure. Finding community structure can therefore be difficult for models based on random-walk sampling strategies. Motif-based models are one way to preserve the local structure while also discovering the global structure of graphs. Yanlei et al. [89] proposed the RUM (network Representation learning Using Motifs) model to learn small groups of nodes in graphs. The main idea of RUM is to build a new graph $G' = (V', E')$ from the original graph by constructing new nodes and edges as follows:
  • Generating nodes of graph $G'$: Each new node of $G'$ is a triple $v_{ijk} = (v_i, v_j, v_k)$ of nodes from the original graph G. In this way, the triangle patterns of the original graph are mapped to nodes of the new graph for structure preservation.
  • Generating edges of graph $G'$: Each edge of the new graph connects two new nodes whose triangles share an edge in the original graph. For example, the edge $e' = (v_{ijk}, v_{ijl})$ indicates that the edge $(v_i, v_j) \in E$ exists in the original graph G.
The model then used the skip-gram model to learn the node and motif embeddings. Figure 9b depicts the details of the random-walk sampling strategy based on motifs.
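The following sketch (a simplified illustration under the assumption that only triangle motifs are used) builds such a motif graph with networkx: every triangle of the original graph becomes a node, and two motif nodes are connected when their triangles share an edge.

```python
# A minimal sketch of the motif-graph construction described above
# (assumption: triangle motifs only; quadratic pairwise check for clarity).
from itertools import combinations
import networkx as nx

def triangle_motif_graph(G: nx.Graph) -> nx.Graph:
    triangles = [tuple(sorted(c)) for c in nx.enumerate_all_cliques(G) if len(c) == 3]
    M = nx.Graph()
    M.add_nodes_from(triangles)           # each motif (v_i, v_j, v_k) becomes one node
    for t1, t2 in combinations(triangles, 2):
        if len(set(t1) & set(t2)) == 2:   # the two triangles share an edge of G
            M.add_edge(t1, t2)
    return M

G = nx.karate_club_graph()
M = triangle_motif_graph(G)
print(M.number_of_nodes(), M.number_of_edges())
```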
There are also several models based on motifs for heterogeneous graphs [87,90,91]. For instance, Qian et al. [90] proposed the MBRep model (Motif-based representation) with the same idea from the RUM model to generate a hyper-network based on a triangle motif. However, the critical difference is that the MBRep model could extract motifs based on various node and edge types in heterogeneous graphs.
Most of the above models learn node embeddings without side information, even though such information can be helpful for learning the graph structure. Graphs in the real world, however, often carry side information such as node and edge attributes. Several models have tried to learn node embeddings in attributed graphs by incorporating node properties. Nasrullah et al. [154] proposed the Gat2Vec model to capture the contextual attributes of nodes. Given a graph $G = (V, E, X)$, where X is the attribute function mapping each node to a set of attributes, $X : V \rightarrow 2^{\mathcal{X}}$, with $\mathcal{X}$ the attribute set, they generate a structural graph $G_s$ and a bipartite attribute graph $G_a$ as:
$G_s = (V_s, E)$
$G_a = (V_a \cup \mathcal{X}, E_a)$
where $V_s \subseteq V$, $V_a = \{v_i : X(v_i) \neq \emptyset\} \subseteq V$, and $E_a = \{(v_i, a) : a \in X(v_i)\}$. They then use a random-walk sampling strategy to capture the graph structure in both graphs. Similar to Gat2Vec, Wei et al. [155] introduced the ANRLBRW model (Attributed Network Representation Learning Based on Biased Random Walk), which likewise splits the original graph G into a topological graph and an attribute graph. There is, however, a slight difference between the two models: ANRLBRW uses a biased random-walk sampling inspired by Node2Vec, which includes the two parameters p and q in the sampling strategy. Kun et al. [88] introduced the Gl2Vec model to learn node embeddings based on graphlets. To generate the feature representation of a graph, they measure the proportion of graphlet occurrences in the graph compared with random graphs.
In social networks, the connections among nodes can be far more complex than pairwise node-to-node edges, which gives rise to hypergraphs. In contrast to homogeneous graphs, edges in hypergraphs can connect more than two nodes, which makes learning node embeddings more difficult. Several models have been proposed to learn node and edge embeddings in hypergraphs [157,158]. For example, Yang et al. [157] proposed the LBSN2Vec (Location-Based Social Networks) model, a hypergraph embedding model that learns hyperedges covering both user–user connections and user–check-in locations over time. Since most existing models fail to capture mobility features and co-location rates dynamically, this model can learn the impact of user mobility in social networks for prediction tasks. The model applies a random-walk-based sampling strategy on hyperedges with a fixed sequence length to capture the hypergraph structure, and then uses cosine similarity to preserve node proximity within the random-walk sequences. Table 6 lists a summary of representative models for heterogeneous graphs.
Several types of graphs in the real world are heterogeneous, with different node and edge types, and most of the above models fail to capture them. Several models have been proposed to capture the structure of heterogeneous graphs [159,164,166]. Dong et al. [20] introduced the Metapath2Vec model, which learns node embeddings in heterogeneous graphs with meta-path-guided random walks. One strength of meta-paths is that they can capture the relationships between various types of nodes and edges in heterogeneous graphs. To capture the structure of heterogeneous graphs with different node and edge types, they defined meta-path random walks P of length l:
$P : v_1 \xrightarrow{t_1} v_2 \xrightarrow{t_2} \cdots \xrightarrow{t_{k-1}} v_k \xrightarrow{t_k} \cdots \xrightarrow{t_{l-1}} v_l$
where $t_i$ denotes the relation type between nodes $v_i$ and $v_{i+1}$. The transition probability to node $v_{i+1}$ given node $v_i$ under the meta-path P could then be defined as:
$p(v_{i+1} \mid v_i^t, P) = \begin{cases} \frac{1}{\lvert N_{t+1}(v_i^t) \rvert} & \text{if } (v_{i+1}, v_i^t) \in E \text{ and } v_{i+1} \text{ has type } t+1, \\ 0 & \text{otherwise}, \end{cases}$
where $N_{t+1}(v_i^t)$ is the set of neighbors of node $v_i$ with node type $t+1$. Then, similar to the DeepWalk and Node2Vec models, they used the skip-gram model to learn node embeddings. The approach of JUST [159] is conceptually similar to Metapath2Vec, but the sampling strategy is performed differently: the model introduces a biased random-walk strategy with two parameters (jumping and staying), which either changes the current domain or stays in the same domain for the next step.
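A toy example of a meta-path-guided walk is sketched below (our own simplified illustration, not the Metapath2Vec reference code); node types are stored as a networkx node attribute, and the symmetric meta-path A–P–A as well as the toy bibliographic graph are assumptions for demonstration.

```python
# An illustrative meta-path walk on a toy heterogeneous graph.
import random
import networkx as nx

def metapath_walk(G, start, metapath=("A", "P", "A"), length=9):
    # For a symmetric meta-path (A-P-A) the last type equals the first,
    # so the type pattern simply cycles: A, P, A, P, ...
    pattern = metapath[:-1] if metapath[0] == metapath[-1] else metapath
    walk = [start]
    for i in range(length - 1):
        wanted = pattern[(i + 1) % len(pattern)]
        candidates = [v for v in G.neighbors(walk[-1]) if G.nodes[v]["type"] == wanted]
        if not candidates:
            break
        walk.append(random.choice(candidates))
    return walk

# Toy bibliographic graph: authors (A) connected to papers (P).
G = nx.Graph()
G.add_nodes_from(["a1", "a2", "a3"], type="A")
G.add_nodes_from(["p1", "p2"], type="P")
G.add_edges_from([("a1", "p1"), ("a2", "p1"), ("a2", "p2"), ("a3", "p2")])
print(metapath_walk(G, "a1"))
```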
Since the vanilla meta-path sampling strategy fails to capture certain types of graphs, such as multiplex graphs and sparse graphs, several sampling strategies based on meta-paths have been proposed for heterogeneous graphs. The work of Zhang et al. [160] is similar to Metapath2Vec and implements random-walk sampling across all node types in a multiplex network. Lee et al. [161] introduced the BHIN2vec model, which uses a random-walk strategy to capture sparse and rare patterns in heterogeneous graphs. Some models [162,163,164,165] have been applied to biological domains based on random-walk strategies. Lee et al. [166] used the WL relabeling strategy to capture temporal substructures of graphs; the model targets the proximity of substructures rather than node proximity to learn bibliographic entities in heterogeneous graphs. Several models [167,168,176,177] aim to capture entities from multiple networks. Du and Tong [167] presented the MrMine (Multi-resolution Multi-network) model to learn embeddings at multiple resolutions. They first used WL label transformation to label nodes by their degree sequences, then adopted a dynamic time warping measure [21] to calculate the distance between sequences and generate a relation network; a truncated random-walk sampling strategy is then adopted to capture the graph structure. In contrast to the MrMine model, Lee and colleagues [168,176,177] explored multi-layered structures in depth to represent the relations and proximity of individual characters, substructures, and the story network as a whole. To embed the substructures and the story network, they first used WL relabeling [33] to extract substructures in the story network and then used the Subgraph2Vec and Doc2Vec models to learn node embeddings.
Several types of graphs in the real world, however, show dynamic behavior. Since most graph embedding models learn node embeddings for static graphs, several models have been developed for dynamic graphs [10,92,173,174,175]. Most of them build on the ideas of DeepWalk and Node2Vec to capture the graph structure. By representing a dynamic graph as a set of static graphs, some models capture changes in the dynamic graph structure and update the random walks over time; the skip-gram model is then used to learn node embeddings. For instance, the key idea of Sajjad et al. [169] is to generate random-walk paths on the first snapshot and then update the random-walk corpus over time. Whereas most existing models re-generate node embeddings for each graph snapshot to capture dynamic behavior, this model maintains a set of dynamic random walks that are updated whenever the dynamic graph changes, which reduces the computational complexity on large graphs. Figure 10 shows an example of how random-walk paths are updated in dynamic graphs.
Since the evolution of a graph affects only a few nodes and a limited range of their neighbors at a time, updating all random walks is time-consuming. Several models [169,170,171,172,174,175] therefore update only the walks involving the changed nodes and their local neighborhoods over time. For example, Sedigheh et al. [174] presented the Dynnode2Vec model to capture the temporal evolution from graph $G_t$ to $G_{t+1}$ through a set of new nodes and edges $(V_{new}, E_{new})$ and a set of removed nodes and edges $(V_{del}, E_{del})$. Built on the Node2Vec architecture, Dynnode2Vec learns the dynamic structure by generating an adequate set of random walks only for the dynamic nodes, which makes the random-walk strategy more computationally efficient on large graphs. Furthermore, the proposed dynamic skip-gram model learns node embeddings at time t by adopting the results of the previous time step $t-1$ as initial weights, allowing it to learn dynamic behavior over time.
Therefore, the changes in nodes at time t + 1 could be described as:
$\Delta V_t = V_{add} \cup \{ v_i \in V_{t+1} \mid \exists\, e = (v_i, v_j) \in (E_{add} \cup E_{del}) \} .$
In summary, structure-preservation methods have succeeded in learning embeddings over the past decade. There are several key advantages of these models:
  • Computational complexity: Unlike kernel models and matrix factorization-based models, which require considerable computational costs, structure-preservation models can learn embeddings efficiently. This efficiency comes from search-based sampling strategies and the generalizability gained during training.
  • Classification tasks: Since these models find structural neighborhood relationships around a target node, they are powerful for node classification problems. In almost all graphs, nodes with the same label tend to be connected within a small, fixed-length distance, which is a strength of structure-preservation models for classification tasks.
However, there are a few limitations that these models suffer when preserving the graph structure:
  • Transductive learning: Most models cannot produce embeddings for nodes that were not seen in the training data. To learn embeddings for new nodes, the model must re-sample the graph structure and learn the new samples again, which can be time-consuming.
  • Missing connection problem: Many real-world graphs have sparse or missing connections between nodes. Most structure-preservation models cannot handle missing connections because the sampling strategies cannot capture them. With a random-walk-based sampling strategy, for example, the model only captures graph structure where nodes are actually linked.
  • Parameter sharing: These models learn embeddings for individual nodes and do not share parameters. The absence of parameter sharing reduces the effectiveness of representation learning.

3.3.2. Proximity Reconstruction Models

The purpose of graph embedding models is not only to preserve the graph structure but also to preserve the proximity of nodes in graphs. Most proximity reconstruction-based models are used for link prediction or node recommendation tasks [178,179,180] due to the nature of the similarity strategies. In this part, we discuss various models attempting to preserve the proximity of entities in graphs. Table 7 describes a summary of representative proximity reconstruction-based graph embedding models.
One of the typical models is LINE [16], which aims to preserve the symmetric proximity of node pairs in graphs. The advantage of the LINE model is that it can learn node similarity, which most structure-preservation models cannot represent. The main goal of LINE is to preserve the 1st-order and 2nd-order proximity of node pairs in graphs. The 1st-order proximity can be defined as follows:
Definition 7
(1st-order proximity [16]). The 1st-order proximity describes the local pairwise similarity between two nodes in a graph. Let $w_{ij}$ be the weight of the edge between two nodes $v_i$ and $v_j$; the 1st-order proximity is defined as $w_{ij}$ when the two nodes are connected, and as 0 when there is no link between them.
In the case of binary graphs, $w_{ij} = 1$ if two nodes $v_i$ and $v_j$ are connected, and $w_{ij} = 0$ otherwise. To preserve the 1st-order proximity, the distance between the two distributions $\hat{p}_1(v_i, v_j)$ and $p_1(v_i, v_j)$ should be minimized:
$\mathcal{L}_1(\theta) = \underset{\theta}{\arg\min}\; d\big(\hat{p}_1(v_i, v_j),\, p_1(v_i, v_j \mid \theta)\big)$
$\hat{p}_1(v_i, v_j) = \frac{w_{ij}}{\sum_{(v_k, v_l) \in E} w_{kl}}, \qquad p_1(v_i, v_j) = \frac{\exp(Z_i^{\top} Z_j)}{\sum_{(v_k, v_l) \in E} \exp(Z_k^{\top} Z_l)}$
where $\hat{p}_1(v_i, v_j)$ and $p_1(v_i, v_j)$ denote the empirical probability and the model probability of the 1st-order proximity, respectively, $v_i$ and $v_j$ are two nodes in G, $Z_i$ and $Z_j$ are the corresponding embedding vectors in the latent space, and $d(\cdot, \cdot)$ is the distance between the two distributions. The Kullback–Leibler divergence [181] is usually used to measure the difference between two distributions. In addition to preserving the proximity of two directly connected nodes, the LINE model also introduced the 2nd-order proximity, which could be defined as follows:
Definition 8
(2nd-order proximity [16]). The 2nd-order proximity captures the relationship between the neighborhoods of each pair of nodes in the graph G. The idea of the 2nd-order proximity is that two nodes should be close if they share the same neighbors.
Let $Z_i$ and $Z_j$ be the vector embeddings of nodes $v_i$ and $v_j$, respectively; the probability of a specific context $v_j$ given the target node $v_i$ could be defined as:
$p_2(v_j \mid v_i) = \frac{\exp(Z_j^{\top} Z_i)}{\sum_{v_k \in V} \exp(Z_k^{\top} Z_i)} .$
Therefore, the minimization of the objective function $\mathcal{L}_2$ could be defined as:
$\mathcal{L}_2(\theta) = \underset{\theta}{\arg\min} \sum_{v_i \in V} D_{KL}\big(\hat{p}_2(\cdot \mid v_i),\, p_2(\cdot \mid v_i; \theta)\big)$
where $\hat{p}_2(v_j \mid v_i) = \frac{w_{ij}}{\sum_{k \in N(i)} w_{ik}}$ is the observed distribution, and $w_{ij}$ is the weight of the edge between $v_i$ and $v_j$.
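For concreteness, the following PyTorch sketch (an illustrative approximation, not the official LINE implementation) optimizes a 1st-order-style objective with negative sampling; the embedding size, number of negative samples, and the randomly generated edge batch are placeholder assumptions.

```python
# A compact PyTorch sketch of a LINE-style objective with negative sampling
# (1st-order proximity only; all hyperparameters are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

num_nodes, dim, num_neg = 100, 32, 5
emb = nn.Embedding(num_nodes, dim)
optimizer = torch.optim.Adam(emb.parameters(), lr=0.01)

def line_first_order_loss(src, dst):
    z_i, z_j = emb(src), emb(dst)
    pos = F.logsigmoid((z_i * z_j).sum(-1))                          # observed edges
    neg_nodes = torch.randint(0, num_nodes, (src.size(0), num_neg))  # noise samples
    z_k = emb(neg_nodes)
    neg = F.logsigmoid(-(z_i.unsqueeze(1) * z_k).sum(-1)).sum(-1)
    return -(pos + neg).mean()

# One toy update on a random batch of "edges".
src = torch.randint(0, num_nodes, (64,))
dst = torch.randint(0, num_nodes, (64,))
loss = line_first_order_loss(src, dst)
loss.backward()
optimizer.step()
```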
Table 7. A summary of proximity reconstruction models. $v_i^{(t)}$ denotes a node $v_i$ of type t, $w_{ij}$ is the weight between nodes $v_i$ and $v_j$, P is a meta-path in heterogeneous graphs, $N_2$ is the set of 1st-order and 2nd-order neighbors of a node $v_i$, and $P_n(v)$ is the noise distribution for negative sampling.
| Models | Graph Types | Objective | Loss Function |
| --- | --- | --- | --- |
| LINE [16] | Static graphs | Node classification | $-\sum_{v_i \in V, (v_i, v_j) \in E} \log \sigma(Z_i^{\top} Z_j) - \lvert N_{neg} \rvert \, \mathbb{E}_{v_k \sim P_n(v)} \log \sigma(-Z_i^{\top} Z_k)$ |
| APP [76] | Static graphs | Link prediction | $-\sum_{v_i \in V, (v_i, v_j) \in E} \log \sigma(Z_i^{\top} Z_j) - \lvert N_{neg} \rvert \, \mathbb{E}_{v_k \sim P_n(v)} \log \sigma(-Z_i^{\top} Z_k)$ |
| PALE [77] | Static graphs | Link prediction | $-\sum_{v_i \in V, (v_i, v_j) \in E} \log \sigma(Z_i^{\top} Z_j) - \lvert N_{neg} \rvert \, \mathbb{E}_{v_k \sim P_n(v)} \log \sigma(-Z_i^{\top} Z_k)$; $\sum_{(v_i, v_j) \in E} \Vert Z_i - Z_j \Vert_F^2$ |
| CVLP [182] | Attributed graphs | Link prediction | $-\sum_{(v_i, v_j) \in E, (v_i, v_k) \notin E} \log \sigma(Z_i^{\top} Z_j - Z_i^{\top} Z_k) + \alpha_1 \Vert Z_i - Z_j \Vert_2^2 + \alpha_2 \mathcal{L}_1 + \alpha_3 \mathcal{L}_2$ |
| [183] | Static graphs | Link prediction | $-\sum_{(v_i, v_j) \in E} w_{ij} \log p_1(v_i \mid v_j) - \sum_{(v_i, v_j) \in E} w_{ij} \log p_2(v_j \mid v_i)$ |
| HARP [178] | Static graphs | Node classification | $-\sum_{v_i \in V, (v_i, v_j) \in E} \log \sigma(Z_i^{\top} Z_j) - \lvert N_{neg} \rvert \, \mathbb{E}_{v_k \sim P_n(v)} \log \sigma(-Z_i^{\top} Z_k)$ |
| PTE [179] | Heterogeneous graphs | Link prediction | $-\sum_{(v_i^{(t)}, v_j^{(t)}) \in E^{(t)}} w_{ij} \log p(v_i^{(t)} \mid v_j^{(t)})$ |
| Hin2Vec [180] | Heterogeneous graphs | Node classification, Link prediction | $-\sum_{v_i \in V} y_i \log(\hat{y}_i) - (1 - y_i) \log(1 - \hat{y}_i)$ |
| [78] | Heterogeneous graphs | Node classification | $-\sum_{(v_i, v_j) \in E} \big[ \log \sigma(Z_i^{\top} Z_j) + N_{neg} \, \mathbb{E}_{v_k \sim P_n(v_k)} \log \sigma(-Z_i^{\top} Z_k) \big]$ |
| [184] | Signed graphs | Link prediction | $-\sum_{v_i \in V} y_i \log(\hat{y}_i) - (1 - y_i) \log(1 - \hat{y}_i)$ |
| [185] | Heterogeneous graphs | Node classification, Node clustering | $\sum_{(v_i, v_j) \in P} \log\big(1 + e^{-Z_i^{\top} Z_j}\big) + N_{neg} \, \mathbb{E}_{v_k \sim P_n(v_k)} \log\big(1 + e^{Z_i^{\top} Z_k}\big)$ |
| [186] | Heterogeneous graphs | Link prediction | $-\sum_{(v_i, v_j) \in N_2} \big[ \log \sigma(Z_i^{\top} Z_j) + N_{neg} \, \mathbb{E}_{v_k \sim P_n(v_k)} \log \sigma(-Z_i^{\top} Z_k) \big]$ |
| [187] | Static graphs | Node classification | $-\sum_{(v_i, v_j) \in N_2} \big[ \log \sigma(Z_i^{\top} Z_j) + N_{neg} \, \mathbb{E}_{v_k \sim P_n(v_k)} \log \sigma(-Z_i^{\top} Z_k) \big]$ |
| [188] | Heterogeneous graphs | Graph reconstruction, Link prediction, Node classification | $\sum_{v_i \in V} \Vert (Z_i - \hat{Z}_i) \odot B \Vert_2^2 + \alpha \mathcal{L}_2$ |
| ProbWalk [189] | Static graphs | Node classification, Link prediction | $-\sum_{v_i \in V, (v_i, v_j) \in E} \log \sigma(Z_i^{\top} Z_j) - \lvert N_{neg} \rvert \, \mathbb{E}_{v_k \sim P_n(v)} \log \sigma(-Z_i^{\top} Z_k)$ |
| [190] | Static graphs | Node classification, Link prediction | $-\frac{1}{\lvert V \rvert} \sum_{v_i \in V} \big[ y_i \log \hat{y}_i + \alpha (1 - y_i) \log(1 - \hat{y}_i) \big]$ |
| NEWEE [191] | Static graphs | Node classification, Link prediction | $-\sum_{v_i \in V, (v_i, v_j) \in E} \log \sigma(Z_i^{\top} Z_j) - \lvert N_{neg} \rvert \, \mathbb{E}_{v_k \sim P_n(v)} \log \sigma(-Z_i^{\top} Z_k)$ |
| DANE [192] | Attributed graphs | Node classification, Link prediction | $\sum_{v_i \in V} \Vert X_i - \hat{X}_i \Vert_2^2 + \sum_{v_i \in V} \Vert M_i - \hat{M}_i \Vert_2^2 - \sum_{(v_i, v_j) \in E} \log p_{ij} - \sum_{(v_i, v_j) \in E} \log p_{ij} - \sum_{(v_i, v_j) \notin E} \log(1 - p_{ij})$ |
| CENE [193] | Attributed graphs | Node classification | $-\sum_{v_i \in V} y_i \log(\hat{y}_i) - (1 - y_i) \log(1 - \hat{y}_i)$ |
| HSCA [194] | Attributed graphs | Node classification | $\Vert M - W^{\top} H X \Vert_F^2 + \alpha \big( \Vert W \Vert_F^2 + \Vert H \Vert_F^2 \big)$ |
However, the LINE model has several limitations: it only handles the symmetric proximity of node pairs, and the proximity of node pairs is only considered up to the 2nd order. To deal with directed graphs, Chang et al. [76] introduced the APP model, which can preserve the asymmetric proximity of node pairs. They introduced two roles for each node $v_i \in V$: a source role $s_{v_i}$ and a target role $t_{v_i}$. The probability of a pair of nodes from a source node to a target node could be defined as:
$p(v_i \mid v_j) = \frac{\exp(s_{v_j} \cdot t_{v_i})}{\sum_{v_k \in V} \exp(s_{v_j} \cdot t_{v_k})} .$
Tong et al. [77] presented the PALE (Predicting Anchor Links via Embedding) model to predict the anchor links in social networks. The idea of the PALE model was the same as that of the LINE model, but they sampled only 1st-order proximity. The loss function with the negative sampling could be defined as:
$\mathcal{L}(V) = -\sum_{(v_i, v_j) \in E} \log \sigma(Z_i^{\top} Z_j) - N_{neg} \, \mathbb{E}_{v_k \sim P_n(v_k)} \log \sigma(-Z_i^{\top} Z_k) .$
Wei et al. [182] presented the CVLP (Cross-View Link Prediction) model, which can predict connections between nodes in the presence of missing and noisy attributes. Given a triplet $(v_i, v_j, v_k)$ where $(v_i, v_j) \in E$ and $(v_i, v_k) \notin E$, the probability of proximity preservation is defined as:
$P(s_{ij} > s_{ik} \mid U^g) = \sigma(s_{ij} - s_{ik})$
where $U^g$ is the latent representation, $s_{ij}$ is the inner product of the representations, $s_{ij} = U_i^g (U_j^g)^{\top}$, and $\sigma(\cdot)$ is the sigmoid function. Li et al. [183] performed a similar study to learn follower-ship and followee-ship between users across different social networks. The main idea of this model is that the proximity between nodes in one social network should be preserved in another social network. Each node $v_i$ has three vector representations: a node vector $Z_i$, an input context vector $Z_i^{(1)}$, and an output context vector $Z_i^{(2)}$. In particular, if a node $v_i$ follows a node $v_j$ in a social network, then the vector $Z_i$ should contribute to the input context $Z_j^{(1)}$, and the vector $Z_j$ should contribute to the output context $Z_i^{(2)}$. Therefore, given a node $v_i$, the input and output context probabilities of node $v_j$ could be defined as follows:
$p_{input}(v_j \mid v_i) = \frac{\exp\big(Z_j^{(1)\top} Z_i\big)}{\sum_{k=1}^{N} \exp\big(Z_k^{(1)\top} Z_i\big)}, \qquad p_{output}(v_i \mid v_j) = \frac{\exp\big(Z_i^{(2)\top} Z_j\big)}{\sum_{k=1}^{N} \exp\big(Z_k^{(2)\top} Z_j\big)} .$
Haochen et al. [178] presented the HARP (Hierarchical Representation) model, a meta-strategy for capturing more global proximity between node pairs in graphs. The critical difference between the HARP and LINE models is that HARP represents the original graph G as a series of graphs $G_1, G_2, \ldots, G_L$, where each graph is obtained by collapsing adjacent edges and nodes of the previous one. Figure 11 shows how edges and nodes are collapsed in a graph. By producing L graphs through repeated collapsing of edges and nodes, the method compresses node proximity into supernodes.
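The sketch below illustrates one level of such coarsening (a rough approximation of HARP's hierarchy, assuming only edge collapsing over a maximal matching; the published model alternates edge and star collapsing):

```python
# A rough sketch of one coarsening level: edges of a maximal matching are
# collapsed into supernodes (illustrative assumption, not HARP's exact scheme).
import networkx as nx

def coarsen_once(G: nx.Graph) -> nx.Graph:
    H = G.copy()
    for u, v in nx.maximal_matching(G):
        if H.has_node(u) and H.has_node(v):
            H = nx.contracted_nodes(H, u, v, self_loops=False)  # merge v into supernode u
    return H

G0 = nx.karate_club_graph()
G1 = coarsen_once(G0)          # one level of the series G_1, G_2, ..., G_L
print(G0.number_of_nodes(), "->", G1.number_of_nodes())
```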
Several variations and extensions of the LINE model have been applied to heterogeneous and dynamic graphs. Jian et al. [179] presented the PTE model to preserve the 1st-order and 2nd-order proximity in heterogeneous graphs. By considering a heterogeneous graph as a set of bipartite graphs, they can construct the 1st-order and 2nd-order proximity for each bipartite graph independently. Specifically, a bipartite graph G can be defined as $G = (V_A \cup V_B, E)$, where $V_A$ and $V_B$ are sets of nodes of different types. The probability of a node $v_i$ in $V_A$ given a node $v_j$ in $V_B$ could be defined as follows:
$p(v_i \mid v_j) = \frac{\exp(Z_i \cdot Z_j)}{\sum_{v_k \in V_A} \exp(Z_k \cdot Z_j)} .$
The PTE model decomposes a heterogeneous graph into K bipartite graphs, and the loss function is the sum of the component loss functions, which could be formulated as:
$\mathcal{L}(V) = -\sum_{t=1}^{K} \sum_{(v_i^{(t)}, v_j^{(t)}) \in E^{(t)}} w_{ij} \log p\big(v_i^{(t)} \mid v_j^{(t)}\big),$
where K is the number of bipartite graphs extracted from the heterogeneous graph. Similar to the PTE model, Tao-yang et al. [180] proposed the Hin2Vec model to capture the 2nd-order proximity in heterogeneous graphs. However, instead of treating a heterogeneous graph as a set of bipartite graphs, Hin2Vec captures the relationships between two nodes within a 2-hop distance. For instance, in the DBLP network, the relationship set is $R = \{P\text{-}P, P\text{-}A, A\text{-}P, P\text{-}P\text{-}P, P\text{-}P\text{-}A, P\text{-}A\text{-}P, A\text{-}P\text{-}P, A\text{-}P\text{-}A\}$, where P is the paper node type and A is the author node type. Zhipeng and Nikos [185] presented the HINE (Heterogeneous Information Network Embedding) model to preserve the truncated proximity of nodes. They defined an empirical joint probability of two entities in a graph as:
$\hat{p}(v_i, v_j) = \frac{s(v_i, v_j)}{\sum_{v_k \in V} s(v_i, v_k)}$
where $v_i$ and $v_j$ are nodes, and $s(v_i, v_j)$ denotes the proximity between $v_i$ and $v_j$ in G. The proximity score $s(v_i, v_j)$ can be measured by counting the number of meta-path instances containing the two nodes, or as the probability obtained from random-walk sampling from $v_i$ to $v_j$.
Graphs in the real world, however, can contain attributes that several existing models, such as LINE and APP, fail to capture. Several models have been proposed to learn structural similarity in attributed graphs [193,195]. Sun et al. [193] proposed the CENE (content-enhanced network embedding) model to jointly learn the graph structure and side information. The objective of CENE is to preserve the similarity between node–node pairs and node–content pairs. Zhang et al. [194] proposed the HSCA (Homophily, Structure, and Content Augmented network) model to learn the homophily property of node sequences. To obtain node sequences, HSCA uses the DeepWalk model with short random walks, which represent the node context; the model then learns node embeddings through matrix factorization by decomposing the probability transition matrix.
Most of the models mentioned above consider only the existence of edges and ignore the dissimilarities between edges. Beyond preserving node topology and proximity, a variety of studies have focused on edge reconstruction. The main idea of edge-initialization-based models is that edge weights can be transformed into transition probabilities. Wu et al. [189] introduced the ProbWalk model, which performs weight-aware random walks and applies the skip-gram model to learn edge embeddings. The advantage of random walks on weighted edges is that they help the model generate more accurate node sequences and capture more useful structural information. To compute the probability associated with weighted edges, they introduced a joint distribution:
$p(v_1, v_2, \ldots, v_k \mid v_i) = \prod_{v_j \in C} \frac{e^{Z_j \cdot Z_i}}{\sum_{m=1}^{n} e^{Z_m \cdot Z_i}}$
where $v_i$ is the target node, $C = \{v_1, v_2, \ldots, v_k\}$ is the context of node $v_i$, and $Z_i$ is the vector embedding of node $v_i$.
Alternatively, several tasks need to preserve the proximity between different relationship types of nodes. Qi et al. [190,191] proposed the NEWEE model, which first learns edge embeddings and then adopts a biased random-walk sampling to capture the graph structure. To learn edge embeddings, they first build a self-centered network for each graph node. In this setting, the model can exploit the similarity between edges within a self-centered network, since their similarity scores tend to be higher than those of edges in different self-centered networks. Given a node $v_i$ in G, its self-centered network is the set of nodes containing $v_i$ and its neighbors. For example, Figure 12 depicts two self-centered networks $C_1$ and $C_2$ of node $v_1$. The objective of the model is that all edges in the same self-centered network should be embedded close together in the vector space. Therefore, given a self-centered network $G' = (V', E')$, the objective function maximizes the proximity between edges in the same network, which could be defined as:
$\mathcal{L}_E = \sum_{v_i \in V'} \sum_{e_{ij} \in E'} \sum_{e_{ik} \notin E'} \Big[ \log \sigma\big(e_{ij}^{\top} Z_i\big) + \log\big(1 - \sigma(e_{ik}^{\top} Z_k)\big) \Big]$
where $e_{ij}$ denotes the edge between nodes $v_i$ and $v_j$ in a self-centered network $G'$, and $e_{ik}$ denotes a negative edge whose endpoints $v_i$ and $v_k$ come from different self-centered networks.
In summary, compared with structure-preservation models, the proximity construction models bring several advantages:
  • Inter-graph proximity: Proximity-based models not only explore proximity between nodes within a single graph but can also be applied to proximity reconstruction across different graphs with common nodes [183]. These methods can thus preserve the structural similarity of nodes across graphs, which sets them apart from other models; structure-preservation models, in contrast, must re-learn node embeddings for each new graph.
  • Proximity of nodes belonging to different clusters: In the context of clusters with different densities and sizes, proximity reconstruction-based models could capture nodes that are close to each other but in different clusters. This feature shows an advantage over structure reconstruction-based models, which tend to favor searching for neighboring nodes in the same cluster.
  • Link prediction and node classification problem: Since structural identity is based on proximity between nodes, two nodes with similar neighborhoods should be close in the vector space. For instance, the LINE model considered preserving the 1st-order and 2nd-order proximity between two nodes. As a result, proximity reconstruction provides remarkable results for link prediction and node classification tasks [16,76,77].
However, besides the advantages of these models, there are also a few disadvantages of the proximity-based models:
  • Weighted edges problems: Most proximity-based models do not consider the weights of edges between nodes. They measure proximity based only on the number of shared connections, without weights, which can lead to a loss of structural information.
  • Capturing the whole graph structure: Proximity-based models mostly focus on 1st-order and 2nd-order proximity, which cannot describe the global structure of graphs. A few models try to capture higher-order proximity, but at the cost of computational complexity.
To overcome these limitations, shallow models should be replaced by models based on deep neural networks, which can generalize better and capture more of the relationships between graph entities and the graph structure.

3.4. Deep Neural Network-Based Models

In recent years, large-scale graphs have challenged the capability of many graph embedding models. Traditional models, such as shallow neural networks or statistical methods, cannot efficiently capture complex graph structures because of their simple architectures. Recently, there have been various studies on deep graph neural networks, which have grown rapidly thanks to their ability to handle complex and large graphs [11,14,23,196]. Based on the model architecture, we separate deep graph neural networks into four main groups: graph autoencoders, recurrent GNNs, convolutional GNNs, and graph transformer models. This section provides a detailed picture of deep neural network-based methods.
Unlike earlier models, most deep neural network-based models adopt the graph structure (represented as A) and node attributes/features (represented as X) to learn node embeddings. For instance, users in the social network could have text data, such as profile information. For nodes with missing attribute information, the attributes/features could be represented as node degree or one-hot vectors [72].

3.4.1. Graph Autoencoders

Graph autoencoder models are unsupervised learning algorithms that aim to encode graph entities into the latent space and reconstruct these entities from the encoded information. Based on the encoder and decoder architecture, we can classify graph autoencoder models into multilayer perceptron-based models and recurrent graph neural networks.
Early-stage graph autoencoder models are primarily based on multilayer perceptron (MLP) to learn embeddings [50,51,196]. Table 8 lists a summary of fully connected graph autoencoder models. Daixin et al. [50] introduced the SDNE model (Structural Deep Network Embedding) to capture the graph structure based on autoencoder architecture. Similar to the LINE model, the SDNE model aimed to preserve the 1st-order and 2nd-order proximity between two nodes in graphs, but it used the autoencoder-based architecture. Figure 13 presents the general architecture of the SDNE model with the corresponding encoder and decoder layers. The joint loss function that combines two loss functions for 1st-order proximity and 2nd-order proximity can be formulated as:
$\mathcal{L}(Z, X) = \big\| (\hat{X} - X) \odot B \big\|_F^2 + \lambda \sum_{i,j=1}^{n} s_{ij} \big\| Z_i - Z_j \big\|_2^2 + \mathcal{L}_2$
where $s_{ij}$ denotes the proximity between two nodes $v_i$ and $v_j$, B is a penalty matrix, and $\mathcal{L}_2$ is a regularization term. However, the SDNE model was proposed for homogeneous graphs only. Extensions of SDNE to heterogeneous graphs have been suggested by several graph autoencoder models [51,196]. Ke et al. [51] presented the DHNE (Deep Hyper-Network Embedding) model to preserve neighborhood structures, ensuring that nodes with similar neighborhood structures have similar embeddings. The autoencoder layer adopts an adjacency matrix A of a hypergraph as an input, which can be formulated as:
$A = H H^{\top} - D_v$
where $D_v$ is the diagonal matrix of node degrees, and H is the incidence matrix of size $|V| \times |E|$ that represents the relations between nodes and hyperedges. The autoencoder includes two main layers: an encoder layer and a decoder layer. The encoder takes the adjacency matrix as input and compresses it to generate node embeddings, and the decoder then tries to reconstruct the input. Formally, the output of the encoder and decoder layers for node $v_i$ could be defined as follows:
$Z_i = \sigma(W A_i + b), \qquad \hat{A}_i = \sigma(\hat{W} Z_i + \hat{b}) .$
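A bare-bones PyTorch sketch of this encoder/decoder scheme is shown below; it reconstructs adjacency rows with a single hidden layer and a plain mean-squared error, whereas the published SDNE/DHNE models are deeper and add proximity-preservation and penalty terms, so the architecture and toy data here are assumptions for illustration only.

```python
# A minimal adjacency-reconstructing graph autoencoder sketch (illustrative only).
import torch
import torch.nn as nn

class GraphAutoencoder(nn.Module):
    def __init__(self, num_nodes: int, dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(num_nodes, dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(dim, num_nodes), nn.Sigmoid())

    def forward(self, a_row):
        z = self.encoder(a_row)        # Z_i = sigma(W A_i + b)
        return z, self.decoder(z)      # A_hat_i = sigma(W_hat Z_i + b_hat)

A = (torch.rand(50, 50) < 0.1).float()           # toy random adjacency matrix
A = ((A + A.t()) > 0).float()                    # symmetrize
model = GraphAutoencoder(num_nodes=50)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

z, a_hat = model(A)                              # encode/decode every node at once
loss = ((a_hat - A) ** 2).mean()                 # simple reconstruction error
loss.backward()
optimizer.step()
```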
One of the limitations of SDNE and DHNE models is that these models cannot handle signed graphs. Shen and Chung [197] proposed the DNE-SBP model (Deep Network Embedding with Structural Balance Preservation) to preserve the proximity of nodes in signed graphs. The DNE-SBP model constructed the input and output of the autoencoder which could be defined as:
$H^{(l)} = \sigma\big(X^{(l)} (W_1^{(l)})^{\top} + B_1^{(l)}\big), \qquad \hat{X}^{(l)} = \sigma\big(\hat{H}^{(l)} (W_2^{(l)})^{\top} + B_2^{(l)}\big)$
where $X^{(1)} = A$, $X^{(l)} = H^{(l-1)}$, and $\sigma$ is an activation function. The joint loss function is then composed of the reconstruction errors together with must-link (ML) and cannot-link (CL) pairwise constraints [200].
For dynamic graphs, graph autoencoder models take graph snapshots as inputs and try to rebuild them; in several models, the output predicts future graphs by reconstructing upcoming snapshots. Inspired by the SDNE model for static graphs, Palash et al. [198] presented the DynGEM model for dynamic graph embedding. Figure 14 presents the overall architecture of the DynGEM model. Given a sequence of graph snapshots $G = \{G_1, G_2, \ldots, G_T\}$ and a sequence of mapping functions $\phi = \{\phi_1, \phi_2, \ldots, \phi_T\}$, the DynGEM model aims to generate an embedding $Z_{t+1} = \phi_{t+1}(G_{t+1})$. The stability of the embeddings is the ratio of the difference between embeddings to the difference between adjacency matrices over time, which could be defined as:
$A_{abs}(\phi; t) = \frac{\big\| Z_{t+1}(V_t) - Z_t(V_t) \big\|_F}{\big\| A_{t+1}(V_t) - A_t(V_t) \big\|_F}$
where A t is the weighted adjacency matrix of graph G t , Z t ( V t ) presents embeddings of all nodes V t at time t. The model learns parameter θ for each graph snapshot G t at time t. Similar to the SDNE model, the loss function of the DynGEM model could be defined as:
$\mathcal{L}(Z, X) = \big\| (\hat{A} - A) \odot B \big\|_F^2 + \lambda \sum_{i,j=1}^{n} s_{ij} \big\| Z_i - Z_j \big\|_2^2 + \mathcal{L}_1 + \mathcal{L}_2$
where $\mathcal{L}_1$ and $\mathcal{L}_2$ are regularization terms that prevent over-fitting, and $s_{ij}$ is the similarity between $v_i$ and $v_j$. Similar to SDNE, Palash et al. [201] used an autoencoder architecture and adopted the adjacency matrices of graph snapshots as input to the encoder layer. However, they updated the parameters $\theta_t$ at time t based on the parameters $\theta_{t-1}$ of the previous graph $G_{t-1}$.
Unlike the aforementioned models, Wenchao et al. [199] presented the NetWalk model that composes initial embeddings first and then updates the embeddings by learning paths in graphs, which are sampled by a reservoir sampling strategy. NetWalk model sampled the graph structure using a random-walk strategy as input to the autoencoder model. If there are any changes in dynamic graphs, the Netwalk model first updates the list of neighbors for each node and corresponding edges and then only learns embeddings again for the changes.
The aforementioned autoencoder models, which are based on feedforward neural networks, only focus on preserving pairs of nodes in graphs. Several models focus on integrating recurrent neural networks and LSTM into the autoencoder architecture, bringing prominent results, which we cover in the following section.

3.4.2. Recurrent Graph Neural Networks

One of the first models applying deep neural networks to graph representation learning was based on graph neural networks (GNNs). The main idea of GNNs is to exchange messages between target nodes and their neighbors until a stable equilibrium is reached. Table 9 summarizes graph recurrent autoencoder models.
Scarselli et al. [44,45] proposed a GNN model which could learn embeddings directly for different graphs, such as acyclic/cyclic and directed/undirected graphs. These models assumed that if nodes are directly connected in graphs, the distance between them should be minimized in the latent space. The GNN models used a data diffusion mechanism to aggregate signals from neighbor nodes (units) to target nodes. Therefore, the state of a node describes the context of its neighbors and can be used to learn embeddings. Mathematically, given a node v i in a graph, the state of v i and its output can be defined as:
$H_i = \sum_{v_j \in N(v_i)} f_w(y_i, e_{ij}, H_j, y_j), \qquad Z_i = g_w(H_i, y_i),$
where $f_w(\cdot)$ is the transition function, $g_w(\cdot)$ is the output function, and $y_i$ and $e_{ij}$ denote the label of node $v_i$ and of edge $(v_i, v_j)$, respectively. Since the state $H_i$ is revised iteratively, $H_i$ and its output at layer l could be defined as:
$H_i^{(l)} = f_w\big(y_i, e_{ij}, H_j^{(l-1)}, y_j\big), \qquad Z_i^{(l)} = g_w\big(H_i^{(l)}, y_i\big) .$
However, one limitation of GNNs is that the model produces a single output per node, which causes problems when sequence outputs are needed. Several studies have tried to improve GNNs using recurrent graph neural networks [17,48,49]. Unlike GNNs, which produce a single output for each entity in a graph, Li et al. [17] output sequences by applying gated recurrent units. The model uses two gated graph neural networks $\mathcal{F}_x^{(l)}$ and $\mathcal{F}_o^{(l)}$ to predict the output $O^{(l)}$ and the following hidden states. The hidden state of node $v_i$ at layer $l+1$ could then be computed as:
$H_i^{(l+1)} = \sigma\Big( H_i^{(l)}, \sum_{v_j \in N(v_i)} W H_j^{(l)} \Big),$
where N ( v i ) denotes the set of neighbors of node v i .
Wang et al. [49] proposed the Topo-LSTM model to capture the diffusion structure by representing graphs as diffusion cascades that distinguish active and inactive nodes. Given a cascade sequence $s = \{(v_1, 1), \ldots, (v_T, T)\}$, the hidden states can be represented as follows:
$h_t^{(p)} = \phi\big(\{ h_v \mid v \in P_v \}\big),$
$h_t^{(q)} = \phi\big(\{ h_v \mid v \in Q_v \setminus P_v \}\big),$
where p and q denote the input aggregation for active nodes connected with v t and not connected with the node v t , respectively, P v depicts the precedent sets of active nodes at time t, and Q v depicts the set of activated nodes before time t. Figure 15 presents an example of the Topo-LSTM model. However, these models could not capture global graph structure since they only capture the graph structure within k-hop distance. Several models have been proposed by combining graph recurrent neural network architecture with random-walk sampling structure to capture higher structural information [48,93]. Huang et al. [93] introduced the GraphRNA model to combine a joint random-walk strategy on attributed graphs with recurrent graph networks. One of the powers of the random-walk sampling strategy is to capture the global structure. By considering the node attributes as a bipartite network, the model could perform joint random walks on the bipartite matrix containing attributes to capture the global structure of graphs. After sampling the node attributes and graph structure through joint random walks, the model uses graph recurrent neural networks to learn embeddings. Similar to GraphRNA model, Zhang et al. [48] presented the SHNE model to analyze the attributes’ semantics and global structure in attributed graphs. The SHNE model also used a random-walk strategy to capture the global structure of graphs. However, the main difference between SHNE and GraphRNA is that the SHNE model first applied GRU (gated recurrent units) model to learn the attributes and then combined them with graph structure via random-walk sampling.
Since the power of autoencoders architecture is to learn compressed representations, several studies [57,205] aimed to combine RGNNs and autoencoders with learning node embeddings in weighted graphs. For instance, Seo and Lee [57] adopted an LSTM autoencoder to learn node embeddings for weighted graphs. They used the BFS algorithm to travel nodes in graphs and extract the node-weight sequences of graphs as inputs for the LSTM autoencoder. The model then could leverage the graph structure reconstruction based on autoencoder architecture and the node attributes by the LSTM model. Figure 16 presents the sampling strategy of this model, which lists the nodes and their respective weighted edges. To capture local and global graph structure, Aynaz et al. [205] proposed a sequence-to-sequence autoencoder model, which could represent inputs with arbitrary lengths. The LSTM-based autoencoder model architecture consists of two main parts: The encoder layer LSTM e n c and the decoder layer LSTM d e c . For the sequence-to-sequence autoencoder, at each time step l, the hidden vectors in the encoder and decoder layers can be defined as:
$h_{enc}^{t} = \mathrm{LSTM}_{enc}\left(Z_i^{(t)}, h_{enc}^{t-1}\right), \quad h_{dec}^{t} = \mathrm{LSTM}_{dec}\left(Z_i^{(t-1)}, h_{dec}^{t-1}\right),$
where $h_{enc}^{t}$ and $h_{dec}^{t}$ are the hidden states at step $t$ in the encoder and decoder layers, respectively. To generate the sequences of nodes, the model implemented different sampling strategies, including random walks, shortest paths, and breadth-first search with the WL algorithm to encode the information of node labels.
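As an illustration of this family of models, the following PyTorch sketch shows a small sequence-to-sequence LSTM autoencoder over node sequences (e.g., random walks). The teacher-forcing scheme, dimensions, and loss are assumptions made for the example, not the exact architecture of [205].

```python
import torch
import torch.nn as nn

class SeqGraphAutoencoder(nn.Module):
    """Sequence-to-sequence LSTM autoencoder over node sequences (e.g., random walks).
    The encoder compresses a walk into a hidden state; the decoder reconstructs the
    sequence of node feature vectors from that state."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, feat_dim)

    def forward(self, walks):                    # walks: (batch, walk_len, feat_dim)
        _, (h, c) = self.encoder(walks)          # final encoder state summarizes the walk
        dec_in = torch.zeros_like(walks)
        dec_in[:, 1:] = walks[:, :-1]            # teacher forcing: shift inputs by one step
        dec_out, _ = self.decoder(dec_in, (h, c))
        return self.out(dec_out)                 # reconstructed node feature sequence

# toy usage: 8 random walks of length 5 over 16-dimensional node features
model = SeqGraphAutoencoder(feat_dim=16, hidden_dim=32)
walks = torch.randn(8, 5, 16)
loss = nn.functional.mse_loss(model(walks), walks)
loss.backward()
```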
Since the aforementioned models learn node embeddings for static graphs, Shima et al. [203] presented the LSTM-Node2Vec model, which combines an LSTM-based autoencoder architecture with the Node2Vec model to learn embeddings for dynamic graphs. The idea of LSTM-Node2Vec is to use an LSTM autoencoder to preserve the history of node evolution with temporal random-walk sampling, and then adopt the Node2Vec model to generate the vector embeddings for the new graphs. Figure 17 presents a temporal random-walk sampling strategy for traversing a dynamic graph.
Jinyin et al. [204] presented the E-LSTM-D (Encoder-LSTM-Decoder) model to learn embeddings for dynamic graphs by combining the autoencoder architecture with LSTM layers. Given a set of graph snapshots $S = \{G_{t-k}, G_{t-k+1}, \ldots, G_{t-1}\}$, the objective of the model is to learn a mapping function $\phi: \phi(S) \rightarrow G_t$. The model takes the adjacency matrix as the input of the autoencoder, and the output of the encoder layer could be defined as:
$H_{e,i}^{(1)} = \mathrm{ReLU}\left(W_e^{(1)} s_i + b_e^{(1)}\right),$
$H_{e,i}^{(l)} = \mathrm{ReLU}\left(W_e^{(l)} H_{e,i}^{(l-1)} + b_e^{(l)}\right),$
$H_e^{(l)} = \left[H_{e,0}^{(l)}, H_{e,1}^{(l)}, \ldots, H_{e,N-1}^{(l)}\right],$
where $s_i$ denotes the $i$-th graph in the series of graph snapshots and $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$ is the activation function. For the decoder layer, the model tries to reconstruct the original adjacency matrix from the vector embeddings, which could be defined as follows:
$H_d^{(1)} = \mathrm{ReLU}\left(W_d^{(1)} H_e + b_d^{(1)}\right),$
$H_d^{(l)} = \mathrm{ReLU}\left(W_d^{(l)} H_d^{(l-1)} + b_d^{(l)}\right),$
where $H_e$ depicts the output of the stacked LSTM model, which captures the structure of the current graph $G_t$. Similar to the E-LSTM-D model, Palash et al. [201] proposed a variant of the Dyngraph2Vec model, named Dyngraph2VecAERNN (Dynamic Graph to Vector Autoencoder Recurrent Neural Network), which also takes the adjacency matrix as input. However, the critical difference is that Dyngraph2VecAERNN feeds the LSTM layers directly into the encoder part to learn embeddings, while its decoder is composed of fully connected neural network layers that reconstruct the inputs.
There are several advantages of recurrent graph neural networks compared to shallow learning techniques:
  • Diffusion pattern and multiple relations: RGNNs show superior learning ability when dealing with diffuse information, and they can handle multi-relational graphs where a single node has many relations. This feature is achieved due to the ability to update the states of each node in each hidden layer.
  • Parameter sharing: RGNNs share parameters across different positions in a sequence, which allows them to capture sequential node inputs. This reduces computational complexity during training through fewer parameters and increases the performance of the models.
However, one disadvantage of RGNNs is that they reuse recurrent layers with the same weights during the weight-update process. This leads to inefficiencies in representing different relationship constraints between neighbor and target nodes. To overcome this limitation, convolutional GNNs have shown remarkable ability in recent years by using different weights in each hidden layer.

3.4.3. Convolutional Graph Neural Networks

CNNs have achieved remarkable success in the image processing area. Since image data can be considered to be a special case of graph data, convolution operators can be defined and applied to graph mining. There are two strategies to implement when applying convolution operators to the graph domain. The first strategy is based on graph spectrum theory which transforms graph entities from the spatial domain to the spectral domain and applies convolution filters on the spectral domain. The other strategy directly employs the convolution operators in the graph domain (spatial domain). Table 10 summarizes spectral CGNN models.
When computing power is insufficient for implementing convolution operators directly on the graph domain, several studies focus on transforming graph data to the spectral domain and applying filtering operators to reduce computational time [18,55,213]. The signal filtering process acts as the feature extraction on the Laplacian matrix. Most models adopted single and undirected graphs and presented graph data as a Laplacian matrix:
$L = I_n - D^{-\frac{1}{2}} A D^{-\frac{1}{2}},$
where $D$ denotes the diagonal degree matrix and $A$ is the adjacency matrix. The matrix $L$ is a symmetric positive semi-definite matrix describing the graph structure. Considering a matrix $U$ as a graph Fourier basis, the Laplacian matrix could be decomposed as $L = U \Lambda U^{\top}$, where $\Lambda$ is the diagonal matrix of eigenvalues denoting the spectral representation of the graph topology and $U = [u_0, u_1, \ldots, u_{n-1}]$ is the matrix of eigenvectors. The filter function $g_\theta$ resembles a $k$-order polynomial, and the spectral convolution acts as a diffusion convolution in the graph domain. The spectral graph convolution of an input $x$ with a filter $g_\theta$ is defined as:
$g_\theta * x = U g_\theta U^{\top} x,$
where ∗ is the convolution operation. Bruna et al. [56] transformed the graph data to the spectral domain and applied filter operators on a Fourier basis. The hidden state at the layer l could be defined as:
$H_i^{(l)} = \sigma\left(V \sum_{j=1}^{c_{l-1}} D_{ij}^{(l)} V^{\top} H_j^{(l-1)}\right),$
where $D_{ij}^{(l)}$ is a diagonal matrix at layer $l$, $c_{l-1}$ denotes the number of filters at layer $l-1$, and $V$ denotes the eigenvectors of the matrix $L$. Typically, most of the energy of the $D$ matrix is concentrated in the first $d$ elements. Therefore, we can keep only the first $d$ columns of the matrix $V$, and the number of parameters that should be trained is $c_{l-1} \cdot c_l \cdot d$.
Several studies focused on improving spectral filters to reduce computational time and capture more graph structure in the spectral domain [210,216]. For instance, Defferrard et al. [216] presented a strategy to re-design convolutional filters for graphs. Since the spectral filter $g_\theta(\Lambda)$ indeed generates a kernel on graphs, the key idea is to consider $g_\theta(\Lambda)$ as a polynomial that yields a $k$-localized kernel:
$g_\theta(\Lambda) = \sum_{k=0}^{K-1} \theta_k \Lambda^{k},$
where θ is a vector of polynomial coefficients. This k-localized kernel provides a circular distribution of weights in the kernel from a target node to k-hop nodes in graphs.
Unlike the above models, Zhuang and Ma [211] tried to capture the local and global graph structures by introducing two convolutional filters. The first convolutional operator, local consistency convolution, captures the local graph structure. The output of a hidden layer Z l , then, could be defined as:
$Z^{(l)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} Z^{(l-1)} W^{(l)}\right),$
where $\tilde{A} = A + I$ denotes the adjacency matrix with self-loops, and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$ is the diagonal matrix presenting the degree information of nodes. In addition to the first filter, the second filter aims to capture the global structure of graphs, which could be defined as:
$Z^{(l)} = \sigma\left(D_P^{-\frac{1}{2}} P D_P^{-\frac{1}{2}} Z^{(l-1)} W^{(l)}\right),$
where $P$ denotes the PPMI matrix, with $D_P$ its corresponding diagonal degree matrix; the PPMI matrix can be calculated from a frequency matrix obtained via random-walk sampling.
Most of the above models learn node embeddings by transforming graph data to the spectral domain and applying convolutional filters, which leads to increased computational complexity. In 2016, Kipf and Welling [18] introduced graph convolutional networks (GCNs), which were considered to be a bridge between spectral and spatial approaches. The spectral filter $g_\theta(\Lambda)$ and the hidden layers of the GCN model, which follow the layer-wise propagation rule, can be defined as follows:
$g_\theta(\Lambda) \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{\Lambda}),$
$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right),$
where $\tilde{\Lambda} = \frac{2}{\lambda_{\max}} \Lambda - I_N$ and $\lambda_{\max}$ is the largest eigenvalue of the Laplacian matrix $L$, $\theta \in \mathbb{R}^{K}$ is the vector of Chebyshev coefficients, and $T_k(x)$ denotes the Chebyshev polynomials defined as:
$T_k(x) = 2x\, T_{k-1}(x) - T_{k-2}(x),$
where $T_0(x) = 1$ and $T_1(x) = x$. Consequently, the convolution filter applied to an input $x$ is defined as:
$g_\theta * x \approx \sum_{k=0}^{K} \theta_k T_k\left(\tilde{L}\right) x, \quad \tilde{L} = \frac{2}{\lambda_{\max}} L - I_N.$
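For reference, a minimal NumPy sketch of the GCN layer-wise propagation rule above is given below; the toy graph, feature sizes, and random weights are illustrative only, and a practical implementation would use sparse matrices and a deep-learning framework.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: symmetric normalization with self-loops, then ReLU."""
    A_tilde = A + np.eye(A.shape[0])                 # add self-loops
    D_inv_sqrt = np.diag(A_tilde.sum(axis=1) ** -0.5)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt        # normalized adjacency
    return np.maximum(A_hat @ H @ W, 0.0)            # ReLU activation

# toy usage: a 3-node graph with 5-dimensional features and a two-layer GCN
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
X = rng.normal(size=(3, 5))       # node features
W1 = rng.normal(size=(5, 4))
W2 = rng.normal(size=(4, 2))
H1 = gcn_layer(A, X, W1)          # first hidden layer
Z = gcn_layer(A, H1, W2)          # output node embeddings
```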
Although spectral CGNNs are effective in applying convolution filters on the spectral domain, they have several limitations as follows:
  • Computational complexity: The spectral decomposition of the Laplacian matrix into eigenvector matrices is time-consuming. During the training process, the products of the $U$, $\Lambda$, and $U^{\top}$ matrices also increase the training time.
  • Difficulties in handling large-scale graphs: The number of parameters of the kernels corresponds to the number of nodes in the graph; therefore, spectral models are not suitable for large-scale graphs.
  • Difficulties in considering graph dynamicity: To apply convolution filters to graphs and train the model, the graph data must be transformed to the spectral domain in the form of a Laplacian matrix. Therefore, when the graph data change, as in the case of dynamic graphs, the model cannot capture those changes.
Motivated by the limitations of spectral domain-based CGNNs, spatial models apply convolution operators directly to the graph domain and learn node embeddings in an effective way. Recently, various spatial CGNNs have been proposed showing remarkable results in handling different graph structures compared to spectral models [52,95]. Based on the mechanism of aggregation from graphs and how to apply the convolution operators, we divide CGNN models into the following main groups: (i) Aggregation mechanism improvement, (ii) Training efficiency improvement, (iii) Attention-based models, and (iv) Autoencoder-CGNN models. Table 11 and Table 12 present a summary of spatial CGNN models for all types of graphs ranging from homogeneous to heterogeneous graphs.
Gilmer et al. [222] presented the MPNN (Message-Passing Neural Network) model to employ the concept of messages passing over nodes in graphs. Given a pair of nodes ( v i , v j ) , a message from v j to v i could be calculated by a message function M i j . During the message-passing phase, a hidden state at layer l of a node v i could be calculated based on the message-passing from its neighbors, which could be defined as:
$m_i^{(l+1)} = \sum_{v_j \in N(v_i)} M^{(l)}\left(h_i^{(l)}, h_j^{(l)}, e_{ij}\right),$
$h_i^{(l+1)} = \sigma\left(h_i^{(l)}, m_i^{(l+1)}\right),$
where $M^{(l)}$ denotes the message function at layer $l$, which could be an MLP, $\sigma$ is an activation function, and $N(v_i)$ denotes the set of neighbors of node $v_i$.
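The message-passing scheme can be sketched in plain Python as follows; the message and update functions here are simple hand-written stand-ins for the learned functions of the MPNN framework, and the edge list and features are toy assumptions.

```python
import numpy as np

def mpnn_layer(H, E, edges, message_fn, update_fn):
    """One message-passing step: sum messages from neighbors, then update each node.
    `edges` is a list of (i, j) pairs with edge features E[(i, j)]."""
    N = H.shape[0]
    M = np.zeros_like(H)
    for (i, j) in edges:                       # message sent from v_j to v_i
        M[i] += message_fn(H[i], H[j], E[(i, j)])
    return np.array([update_fn(H[i], M[i]) for i in range(N)])

# illustrative message/update functions (a real MPNN would learn these)
message_fn = lambda hi, hj, eij: np.tanh(hj + eij)
update_fn = lambda hi, mi: np.tanh(hi + mi)

H = np.ones((3, 4))                            # 3 nodes, 4-dimensional states
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
E = {e: np.full(4, 0.1) for e in edges}        # constant edge features for the toy graph
H_next = mpnn_layer(H, E, edges, message_fn, update_fn)
```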
Most previous graph embedding models work in transductive learning which cannot handle unseen nodes. In 2017, Hamilton et al. [22] introduced the GraphSAGE model (SAmple and aggreGatE) to generate inductive node embeddings in an unsupervised manner. The hidden state at layer l + 1 of a node v i could be defined as:
$h_i^{(l+1)} = \mathrm{AGG}^{(l+1)}\left(\left\{h_j^{(l)}, \forall v_j \in N(v_i)\right\}\right),$
where $N(v_i)$ denotes the set of neighbors of node $v_i$ and $h_j^{(l)}$ is the hidden state of node $v_j$ at layer $l$. The function $\mathrm{AGG}(\cdot)$ is a differentiable aggregator function. There are three aggregators (e.g., Mean, LSTM, and Pooling) to aggregate information from neighboring nodes and separate nodes into mini-batches. Algorithm 1 presents the algorithm of the GraphSAGE model.
Algorithm 1: GraphSAGE algorithm. The model first takes the node features as inputs. For each layer, the model aggregates the information from neighbors and then updates the hidden state of each node v i .
Input: $G = (V, E)$: the graph $G$ with node set $V$ and edge set $E$;
             $x_i$: the input features of node $v_i$;
             $L$: the depth of the hidden layers, $l \in \{1, \ldots, L\}$;
             $\mathrm{AGG}_k$: differentiable aggregator functions;
             $N(v_i)$: the set of neighbors of node $v_i$.
Output: $Z_i$: vector representations for $v_i$.
$h_i^{(0)} \leftarrow x_i, \; \forall v_i \in V$
(The loop body of Algorithm 1 was rendered as an image in the original: for each layer $l \in \{1, \ldots, L\}$ and each node $v_i \in V$, the neighbor states $\{h_j^{(l-1)} : v_j \in N(v_i)\}$ are aggregated and the hidden state $h_i^{(l)}$ is updated.)
$Z_i \leftarrow h_i^{(L)}, \; \forall v_i \in V$
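A compact NumPy sketch of one GraphSAGE layer with the mean aggregator is shown below; the neighbor lists, weights, and dimensions are toy assumptions, and neighbor sampling and mini-batching are omitted for brevity.

```python
import numpy as np

def graphsage_mean_layer(H, neighbors, W_self, W_neigh):
    """One GraphSAGE layer with the mean aggregator: average neighbor states,
    combine with the node's own state, apply a nonlinearity, and L2-normalize."""
    out = []
    for i, nbrs in enumerate(neighbors):
        h_neigh = H[nbrs].mean(axis=0) if nbrs else np.zeros(H.shape[1])
        h = np.maximum(H[i] @ W_self + h_neigh @ W_neigh, 0.0)   # ReLU
        out.append(h / (np.linalg.norm(h) + 1e-8))               # normalize embedding
    return np.stack(out)

# toy usage: 4 nodes, adjacency given as neighbor lists
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 6))
neighbors = [[1, 3], [0, 2], [1], [0]]
W_self = rng.normal(size=(6, 4))
W_neigh = rng.normal(size=(6, 4))
Z = graphsage_mean_layer(H, neighbors, W_self, W_neigh)
```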
Lo et al. [231] aimed to apply the GraphSAGE model to detect computer attackers in computer network systems, named E-graphSAGE. The main difference between the two models is that E-graphSAGE used the edges of graphs as aggregation information for learning embeddings. The edge information between two nodes is the data flow between two source IP addresses (Clients) and destination IP addresses (Servers).
By evaluating the contribution of neighboring nodes to target nodes, Tran et al. [229] proposed convolutional filters with different parameters. The key idea of this model is to rank the contributions of different distances from the set of neighbor nodes to target nodes using short path sampling. Formally, the hidden state of a node at layer l + 1 could be defined as multiple graph convolutional filters:
$h^{r, l+1} = \Big\Vert_{j=0}^{r} \left(D_j\right)^{-1} SP_j\, h^{l}\, W^{j,l},$
where $\Vert$ denotes concatenation, and $r$ and $SP_j$ denote the $r$-hop distance and the shortest-path distance $j$, respectively. Ying et al. [225] considered random-walk sampling as the aggregation information that can be aggregated into the hidden state of CGNNs. To collect the neighbors of a node $v$, the idea of the model is to gather a set of random-walk paths from node $v$ and then select the top $k$ nodes with the highest visit probability.
For hypergraphs, several GNN models have been proposed to learn high-order graph structure [27,44,234]. Feng et al. [27] proposed the HGNN (Hypergraph Neural Networks) model to learn hypergraph structure based on spectral convolution. They first learn each hyperedge feature by aggregating all the nodes connected by the hyperedge. Then, each node's attribute is updated with a vector embedding based on all the hyperedges connected to the node. By contrast, Yadati [234] presented the HyperGCN model to learn hypergraphs based on spectral theory. Since each hyperedge could connect several nodes, the idea of this model is to filter out nodes that are far apart. Therefore, they first adopt the Laplacian operator to learn node embeddings and filter out edges that connect two distant nodes; GCNs can then be used to learn the node embeddings.
One limitation of GNN models is that they treat the set of neighbors as permutation invariant, which makes the models unable to distinguish between isomorphic subgraphs. Taking the message-passing set from the neighbors of nodes to be permutation invariant, several works aimed to improve the message-passing mechanism with simple aggregation functions. Xu et al. [24] proposed the GIN (Graph Isomorphism Network) model, which aims to learn vector embeddings as powerful as the 1-dimensional WL isomorphism test. Formally, the hidden state of node $v_i$ at layer $l$ could be defined as:
$h_i^{(l)} = \mathrm{MLP}^{(l)}\left(\left(1 + \varepsilon^{(l)}\right) \cdot h_i^{(l-1)} + \sum_{v_j \in N(v_i)} h_j^{(l-1)}\right),$
where MLP denotes a multilayer perceptron and $\varepsilon$ is a parameter that could be learnable or a fixed scalar. Another problem of GNNs is the over-smoothing problem that arises when stacking more layers in the models. DeeperGCN [98] is a related approach that aims to solve the over-smoothing problem with generalized aggregations and skip connections. The DeeperGCN model defined a simple normalized message-passing, which could be defined as:
$m_{ij}^{(l)} = \mathrm{ReLU}\left(h_i^{(l)} + \mathbb{1}\left(h_{e_{ij}}^{(l)}\right) \cdot h_{e_{ij}}^{(l)}\right) + \varepsilon,$
$h_i^{(l+1)} = \mathrm{MLP}\left(h_i^{(l)} + s \cdot \left\Vert h_i^{(l)}\right\Vert_2 \cdot \frac{m_i^{(l)}}{\left\Vert m_i^{(l)}\right\Vert_2}\right),$
where $m_{ij}$ denotes the message passed from node $v_j$ to node $v_i$, $h_{e_{ij}}$ is the feature of edge $e_{ij}$, and $\mathbb{1}(\cdot)$ is an indicator function that equals 1 if the two nodes $v_i$ and $v_j$ are connected. Le et al. [233] presented the PHC-GNN model, which improves the message-passing compared to the GIN model. The main difference between the PHC-GNN and GIN models is that PHC-GNN adds edge embeddings and a residual connection after the message-passing. Formally, the message-passing and hidden state of a node $v_i$ at layer $l+1$ could be defined as:
$m_i^{(l+1)} = \sum_{v_j \in N(v_i)} \alpha_{ij}\left(h_j^{(l)} + h_{e_{ij}}^{(l)}\right),$
$\tilde{h}_i^{(l+1)} = \mathrm{MLP}^{(l+1)}\left(h_i^{(l)} + m_i^{(l+1)}\right),$
$h_i^{(l+1)} = h_i^{(l)} + \tilde{h}_i^{(l+1)}.$
A few studies focused on building pre-trained GNN models that could be used to initialize other tasks [209,246,247]. These pre-trained models are also beneficial when few node labels are available. For example, the main objective of the GPT-GNN model [247] is to reconstruct the graph structure and the node features by masking attributes and edges. Given a permuted order, the model maximizes the node attributes based on the observed edges and then generates the remaining edges. Formally, the conditional probability could be defined as:
$p\left(X_i, E_i \mid X_{<i}, E_{<i}\right) = \sum_{m} p\left(X_i, E_{i, \neg m} \mid E_{i, m}, X_{<i}, E_{<i}\right) \cdot p\left(E_{i, m} \mid X_{<i}, E_{<i}\right),$
where $E_{i,m}$ and $E_{i,\neg m}$ depict the observed and masked edges, respectively.
Since learning node embeddings on whole graphs is time-consuming, several approaches apply standard clustering algorithms (e.g., METIS, K-means, etc.) to cluster nodes into different subgraphs and then use GCNs to learn node embeddings. Chiang et al. [95] proposed the Cluster-GCN model to increase computational efficiency during the training of CGNNs. Given a graph $G$, the model first separates $G$ into $c$ clusters $G = \{G_1, G_2, \ldots, G_c\}$, where $G_i = \{V_i, E_i\}$, using the METIS clustering algorithm [248]. The model then aggregates information within each cluster. The GraphSAINT model [53] has a structure similar to Cluster-GCN and the model in [249]. GraphSAINT aggregates neighbor information and samples nodes directly on a subgraph at each hidden layer; the probability of keeping a connection from a node $u$ at layer $l$ to a node $v$ at layer $l+1$ is based on the node degree. Figure 18 presents an example of the aggregation strategy of the GraphSAINT model. By contrast, Jiang et al. [54] presented the hi-GCN (hierarchical GCN) model, which could effectively model brain networks with two-level GCNs. Since individual brain networks have multiple functions, the first-level GCN aims to capture the graph structure, while the objective of the second-level GCN is to capture the correlation between network structure and contextual information to improve the semantic information. The work of Huang et al. [250] is similar to the GraphSAGE and FastGCN models. However, instead of using node-wise sampling at each hidden layer, the model provides two strategies: a layer-wise sampling strategy and a skip-connection strategy that directly shares the aggregation information between hidden layers and improves message-passing. The main idea of the skip-connection strategy is to reuse the information from previous layers that would usually be forgotten in dense graphs.
One limitation of CGNNs is that, at each hidden layer, the model updates the states of all neighboring nodes, which can lead to slow training and updating because of inactive nodes. Some models aimed to enhance CGNNs by improving the sampling strategy [52,223,224]. For example, Chen et al. [52] presented the FastGCN model to improve the training time and performance compared to CGNNs. One of the problems with existing GNN models is scalability, since the recursively expanding neighborhood increases computational complexity. FastGCN learns neighborhood sampling at each convolution layer, focusing mainly on essential neighbor nodes; therefore, the model could learn the essential neighbor nodes for every batch.
By considering each hidden layer as an embedding layer of independent nodes, FastGCN subsamples the receptive field at each hidden layer. For each layer $l$, it chooses $k$ i.i.d. nodes $u_1^{(l)}, u_2^{(l)}, \ldots, u_k^{(l)}$ and computes the hidden state, which could be defined as:
$\tilde{h}_{k+1}^{(l+1)}(v) = \frac{1}{k} \sum_{j=1}^{k} \tilde{A}\left(v, u_j^{(l)}\right) h_k^{(l)}\left(u_j^{(l)}\right) W^{(l)},$
$h_{k+1}^{(l+1)}(v) = \sigma\left(\tilde{h}_{k+1}^{(l+1)}(v)\right),$
where $\tilde{A}(v, u_j^{(l)})$ denotes the kernel and $\sigma$ denotes the activation function. Wu et al. [214] introduced the SGC (Simple Graph Convolution) model, a simplification of the GCN model. The model removes the nonlinear activation functions at each hidden layer and instead uses a final SoftMax function at the last layer to obtain probabilistic outputs. Chen et al. [224] presented a model to improve the updating of node states. Instead of collecting all the information from the neighbors of each node, the model keeps track of the activation history states of the nodes to reduce the receptive scope; it maintains a history state $\bar{h}_v^{(l)}$ for each state $h_v^{(l)}$ of each node $v$.
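The SGC simplification mentioned above can be sketched as follows; the precomputed propagation, weight matrix, and softmax classifier are illustrative assumptions rather than the exact training setup of [214].

```python
import numpy as np

def sgc_probs(A, X, W, K=2):
    """SGC: drop per-layer nonlinearities, precompute A_hat^K X once, then apply a
    single linear map followed by a softmax classifier."""
    A_tilde = A + np.eye(A.shape[0])
    D_inv_sqrt = np.diag(A_tilde.sum(axis=1) ** -0.5)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt
    S = np.linalg.matrix_power(A_hat, K) @ X        # K-step feature propagation
    logits = S @ W
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)     # per-node class probabilities

# toy usage: 3 nodes, 5-dimensional features, 2 classes
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
X = np.random.default_rng(0).normal(size=(3, 5))
W = np.random.default_rng(1).normal(size=(5, 2))
probs = sgc_probs(A, X, W, K=2)
```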
Similar to [250], Chen et al. [28] presented the GCNII model, which uses an initial residual connection and identity mapping to overcome the over-smoothing problem while maintaining the structural identity of target nodes. They introduced an initial residual connection to the first representation $H^{(0)}$ and an identity mapping $I_n$. Mathematically, the hidden state at layer $l+1$ could be defined as:
$H^{(l+1)} = \sigma\left(\left((1 - a_l)\, \tilde{P} H^{(l)} + a_l H^{(0)}\right)\left((1 - b_l)\, I_n + b_l W^{(l)}\right)\right),$
where $\tilde{P} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ denotes the normalized convolutional filter. The two terms $H^{(0)}$ and $I_n$ are added for the purpose of tackling the over-smoothing problem.
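A minimal sketch of this GCNII update is shown below, assuming the normalized filter $\tilde{P}$ has been precomputed as for a standard GCN layer; the coefficients and dimensions are illustrative.

```python
import numpy as np

def gcnii_layer(H, H0, P_tilde, W, a_l=0.1, b_l=0.5):
    """One GCNII layer: propagation with an initial residual connection to H0 and an
    identity mapping on the weights.  P_tilde is the symmetrically normalized adjacency
    (with self-loops), as in a standard GCN; W must be square (d x d)."""
    d = W.shape[0]
    prop = (1 - a_l) * (P_tilde @ H) + a_l * H0                          # initial residual
    return np.maximum(prop @ ((1 - b_l) * np.eye(d) + b_l * W), 0.0)     # identity mapping + ReLU
```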
Several models aim to maximize the node representation and graph structure by matching a prior distribution. There have been a few studies based on the idea of Deep Infomax [227] from image processing to learn graph embeddings [26,242]. For example, Velickovic et al. [26] introduced the Deep Graph Infomax (DGI) model, which could adopt the GCN as an encoder. The main idea of mutual information is that the model trains the GCN encoder to maximize the understanding of local and global graph structure in actual graphs and minimize that in fake graphs. There are four components in the DGI model, including:
  • A corruption function $\mathcal{C}$: This function aims to generate negative examples from an original graph with several changes in structure and properties.
  • An encoder $\phi: \mathbb{R}^{N \times M} \times \mathbb{R}^{N \times N} \rightarrow \mathbb{R}^{N \times D}$: The goal of $\phi$ is to encode nodes into a vector space so that $\phi(X, A) = H = \{h_1, h_2, \ldots, h_N\}$ represents the vector embeddings of all nodes in the graph.
  • A readout function $\mathcal{R}: \mathbb{R}^{N \times D} \rightarrow \mathbb{R}^{D}$: This function maps all node embeddings into a single vector (supernode).
  • A discriminator $\mathcal{D}: \mathbb{R}^{D} \times \mathbb{R}^{D} \rightarrow \mathbb{R}$: The discriminator compares node embeddings against the global summary vector of the graph by assigning a score between 0 and 1 to each embedding.
One of the limitations of the DGI model is that it only works with attributed graphs. Several studies have improved DGI to work with heterogeneous graphs with attention and semantic mechanisms [242,243]. Similar to the DGI model, Park et al. [243] presented the DMGI model (Deep Multiplex Graph Infomax) for attributed multiplex graphs. Given a specific node with relation type r, the hidden state could be defined as:
$H^{(r)} = \sigma\left(\hat{D}_r^{-\frac{1}{2}} \hat{A}^{(r)} \hat{D}_r^{-\frac{1}{2}} X W_r\right),$
where $\hat{A}^{(r)} = A^{(r)} + \alpha I_n$, $\hat{D}_{ii} = \sum_j \hat{A}_{ij}$, $W_r \in \mathbb{R}^{n \times d}$ is a matrix of trainable weights, and $\sigma$ is the activation function. Similar to the DGI model, the readout function and discriminator can be employed as:
$S^{(r)} = \mathrm{Readout}\left(H^{(r)}\right) = \sigma\left(\frac{1}{N} \sum_{i=1}^{N} h_i^{(r)}\right),$
$\mathcal{D}\left(h_i^{(r)}, S^{(r)}\right) = \sigma\left(h_i^{(r)\top} M^{(r)} s^{(r)}\right),$
where $h_i^{(r)}$ is the $i$-th row vector of the matrix $H^{(r)}$, $M^{(r)}$ denotes a trainable scoring matrix, and $S^{(r)}$ is the summary defined above. The attention mechanism is adopted from [251], which could capture the importance of each node type to generate the vector embeddings at the last layer. Similarly, Jing et al. [242] proposed the HDMI (High-order Deep Multiplex Infomax) model, which is conceptually similar to the DGI model. The HDMI model could optimize high-order mutual information to process different relation types.
Increasing the number of hidden layers to aggregate more structural information of graphs can lead to an over-smoothing problem [97,252]. Previous models have considered the weights of messages to be the same role in aggregating information from neighbors of nodes. In recent years, various studies have focused on attention mechanisms to extract valuable information from neighborhoods of nodes [19,253,254]. Table 13 presents a summary of attentive GNN models.
Velickovi et al. [19] presented the GATs (graph attention networks) model, one of the first models in applying attention mechanism to graph representation learning. The purpose of the attention mechanism is to compute a weighted message for each neighbor node during the message-passing of GNNs. Formally, there are three steps for GATs which can be explained as follows:
  • Attention score: At layer $l$, the model takes as input a set of node features $h = \{h_i \in \mathbb{R}^{d} \mid v_i \in V\}$ and produces the output $h' = \{h_i' \in \mathbb{R}^{d'} \mid v_i \in V\}$. An attention score measuring the importance of a neighbor node $v_j$ to the target node $v_i$ could be computed as:
    $s_{ij} = \sigma\left(a^{\top}\left[W h_i \,\Vert\, W h_j\right]\right),$
    where $a \in \mathbb{R}^{2d'}$ and $W \in \mathbb{R}^{d' \times d}$ are trainable weights, and $\Vert$ denotes concatenation.
  • Normalization: The score is then normalized to be comparable across all neighbors of node $v_i$ using the SoftMax function:
    $\alpha_{ij} = \mathrm{SoftMax}(s_{ij}) = \frac{\exp(s_{ij})}{\sum_{v_k \in N(v_i)} \exp(s_{ik})}.$
  • Aggregation: After normalization, the embedding of node $v_i$ could be computed by aggregating the states of its neighbor nodes:
    $h_i' = \sigma\left(\sum_{v_j \in N(v_i)} \alpha_{ij} \cdot W h_j\right).$
Furthermore, the GAT model used multi-head attention to enhance the model power and stabilize the learning strategy. Since the GAT model takes the attention coefficient between nodes as inputs and ranks the attention unconditionally, this results in a limited capacity to summarize the global graph structure.
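The three steps above can be sketched as a single-head GAT layer in NumPy; the LeakyReLU slope, the tanh output activation, and the assumption that the adjacency matrix includes self-loops are illustrative choices rather than the reference implementation.

```python
import numpy as np

def gat_layer(H, adj, W, a, slope=0.2):
    """Single-head GAT layer: score each neighbor, softmax over the neighborhood, aggregate.
    Assumes `adj` already contains self-loops so every node has at least one neighbor."""
    Wh = H @ W                                       # projected features, shape (N, d_out)
    out = np.zeros_like(Wh)
    for i in range(H.shape[0]):
        nbrs = np.where(adj[i] > 0)[0]
        pairs = np.concatenate([np.repeat(Wh[i][None, :], len(nbrs), axis=0),
                                Wh[nbrs]], axis=1)   # [W h_i || W h_j] for each neighbor j
        s = pairs @ a                                # raw attention scores
        s = np.where(s > 0, s, slope * s)            # LeakyReLU
        alpha = np.exp(s - s.max())
        alpha /= alpha.sum()                         # softmax over the neighborhood
        out[i] = np.tanh((alpha[:, None] * Wh[nbrs]).sum(axis=0))
    return out

# toy usage: 4 nodes with self-loops, 6-dimensional inputs, 3-dimensional outputs
rng = np.random.default_rng(0)
adj = np.eye(4) + np.array([[0, 1, 0, 1],
                            [1, 0, 1, 0],
                            [0, 1, 0, 0],
                            [1, 0, 0, 0]])
H = rng.normal(size=(4, 6))
W = rng.normal(size=(6, 3))
a = rng.normal(size=6)                               # attention vector of length 2 * d_out
Z = gat_layer(H, adj, W, a)
```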
In recent years, various models have been proposed based on the GAT idea. Most of them aimed to improve the ability of the self-attention mechanism to capture more global graph structures [253,254]. Zhang et al. [253] presented GaAN (Gated Attention Networks) model to control the importance of neighbor nodes by controlling the amount of attention score. The main idea of GaAN is to measure the different weights that come to different heads in target nodes. Formally, the gated attention aggregator could be defined as follows:
$h_i = \mathrm{MLP}_{\theta}\left(x_i \,\Big\Vert\, \Big\Vert_{m=1}^{M_{head}} g_i^{(m)} \sum_{j \in N(v_i)} w_{ij}^{(m)}\, \mathrm{MLP}_{\theta}^{(m)}\left(h_j\right)\right),$
$g_i = \left[g_i^{(1)}, g_i^{(2)}, \ldots, g_i^{(M_{head})}\right],$
where $\mathrm{MLP}(\cdot)$ denotes a simple linear transformation, and $g_i^{(m)}$ is the gate value of the $m$-th head of node $v_i$.
To capture a coarser graph structure, Kim and Oh [258] considered attention based on the importance of nodes to each other. The importance of nodes is based on whether the two nodes are directly connected. By defining the different attention from target nodes to context nodes, the model could solve the permutation equivalent and capture more global graph structure. Based on this idea, they proposed the SuperGAT model with two variants, scaled dot product (SD) and mixed GO and DP (MX), to enhance the attention span of the original model. The attention score s i j between two nodes v i and v j can be defined as follows:
$s_{ij, SD}^{(l+1)} = \frac{\left(W^{(l+1)} h_i^{(l)}\right)^{\top} \left(W^{(l+1)} h_j^{(l)}\right)}{\sqrt{d}},$
$s_{ij, MX}^{(l+1)} = a^{(l+1)\top}\left[W^{(l+1)} h_i^{(l)} \,\Vert\, W^{(l+1)} h_j^{(l)}\right] \cdot \sigma\left(\left(W^{(l+1)} h_i^{(l)}\right)^{\top} \left(W^{(l+1)} h_j^{(l)}\right)\right),$
where $d$ denotes the number of features at layer $l+1$. The two attention scores can softly suppress the attention paid to nodes that are not connected to the target node $v_i$.
Wang et al. [259] aimed to introduce a margin-based constraint to control over-fitting and over-smoothing problems. By assigning the attention weight of each neighbor to target nodes across all nodes in graphs, the proposed model can adjust the influence of the smoothing problem and drop unimportant edges.
Extending the GAT model to capture more global structural information using attention, Haonan et al. [256] introduced the GraphStar model, which uses a virtual node (a virtual star) to maintain global information at each hidden layer. The main difference between GraphStar and GAT is that GraphStar introduces three different types of relations: node-to-node (self-attention), node-to-star (global attention), and node-to-neighbors (local attention). Using these different types of relations, GraphStar could alleviate the over-smoothing problem when stacking more neural network layers. Formally, the attention coefficients could be defined as:
$h_i^{(t+1)} = \Big\Vert_{m=1}^{M_{head}} \sigma\left(\sum_{r \in R} \sum_{j \in N_i^r} \alpha_{ij}^{rm} W_1^{m(t)} h_j^{t} + \alpha_{is, r=s}^{m} W_2^{m(t)} S^{t} + \alpha_{i0, r=0}^{m} W_3^{m(t)} h_i^{t}\right),$
where $W_1^{m(t)}$, $W_2^{m(t)}$, and $W_3^{m(t)}$ denote the node-to-node, node-to-star, and node-to-neighbors relations at the $m$-th head of node $v_i$, respectively.
One of the problems with the GAT model is that the model only provides static attention which mainly focuses the high-weight attention on several neighbor nodes. As a result, GAT cannot learn universal attention for all nodes in graphs. Motivated by the limitations of the GAT model, Brody et al. [58] proposed the GATv2 model using dynamic attention which could learn graph structure more efficiently from a target node v i to neighbor node v j . The attention score can be computed with a slight modification:
$s_{ij} = a^{\top} \sigma\left(W \cdot \left[h_i \,\Vert\, h_j\right]\right).$
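The contrast between the two scoring functions can be seen in the following sketch, where the weight shapes are hypothetical and differ between the two variants, as noted in the comments.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_score(a, W, h_i, h_j):
    """Static attention (GAT): the nonlinearity comes after the a^T projection, so the
    ranking of neighbors is largely fixed by a and W.  Here W has shape (d_out, d)."""
    return leaky_relu(a @ np.concatenate([W @ h_i, W @ h_j]))

def gatv2_score(a, W, h_i, h_j):
    """Dynamic attention (GATv2): the nonlinearity comes before the a^T projection, so the
    ranking of neighbors can depend on the query node h_i.  Here W has shape (d_out, 2*d)."""
    return a @ leaky_relu(W @ np.concatenate([h_i, h_j]))

# toy usage with hypothetical dimensions d = 4 and d_out = 3
rng = np.random.default_rng(0)
h_i, h_j = rng.normal(size=4), rng.normal(size=4)
print(gat_score(rng.normal(size=6), rng.normal(size=(3, 4)), h_i, h_j))
print(gatv2_score(rng.normal(size=3), rng.normal(size=(3, 8)), h_i, h_j))
```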
Similar to Wang et al. [259], Zhang et al. [260] presented the ADSF (ADaptive Structural Fingerprint) model, which could monitor the attention weights from each neighbor of the target node. However, the difference from [259] is that the ADSF model introduced two attention scores $s_{ij}$ and $e_{ij}$ for each node $v_i$, which capture the graph structure and context, respectively.
Besides the GAT-based models applied to homogeneous graphs, several models tried to apply attention mechanism to heterogeneous and knowledge graphs [25,261,262]. For example, Wang et al. [25] presented hierarchical attention to learn the importance of nodes in graphs. One of the advantages of this model is to handle heterogeneous graphs with different types of nodes and edges by deploying local and global level attention. The model proposed two levels of attention: node and semantic-level attention. The node-level attention aims to capture the attention between two nodes in meta-paths. Given a node pair ( v i , v j ) in a meta-path P, the attention score of P could be defined as:
$s_{ij}^{P} = \mathrm{Att}_{node}\left(h_i, h_j; P\right),$
where $h_i$ and $h_j$ denote the features of nodes $v_i$ and $v_j$ projected via a projection function $M_\phi$, and $\mathrm{Att}_{node}$ is a function that scores the node-level attention. To normalize the coefficients across the other nodes in a meta-path $P$, which contains a set of neighbors $N_i^P$ of a target node $v_i$, the attention score $\alpha_{ij}^{P}$ and the node embedding with $K$ multi-head attention can be defined as:
$\alpha_{ij}^{P} = \frac{\exp\left(\sigma\left(s_{ij}^{\top} \cdot \left[h_i \,\Vert\, h_j\right]\right)\right)}{\sum_{k \in N_i^P} \exp\left(\sigma\left(s_{ik}^{\top} \cdot \left[h_i \,\Vert\, h_k\right]\right)\right)},$
$z_i^{P} = \Big\Vert_{k=1}^{K} \sigma\left(\sum_{j \in N_i^P} \alpha_{ij}^{P} h_j\right).$
The embedding $z_i^{P}$ indicates how much the set of neighbors defined by meta-path $P$ contributes to node $v_i$. Furthermore, the semantic-level aggregation aims to score the importance of meta-paths. Given the meta-path-specific embedding $z_i^{P}$, the importance $w_P$ of meta-path $P$ and its normalization could be defined as:
$w_P = \frac{1}{|V|} \sum_{i \in V} q^{\top} \cdot \tanh\left(W \cdot z_i^{P} + b\right),$
$\bar{w}_P = \frac{\exp\left(w_P\right)}{\sum_{p=1}^{l} \exp\left(w_p\right)}.$
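A small NumPy sketch of this semantic-level attention, with hypothetical meta-path embeddings and parameters, is given below.

```python
import numpy as np

def semantic_attention(Z, q, W, b):
    """Semantic-level attention: score each meta-path embedding matrix Z[p] of shape
    (N, d), normalize the scores with a softmax, and fuse the meta-path embeddings."""
    w = np.array([np.mean(np.tanh(Z_p @ W + b) @ q) for Z_p in Z])   # one w_P per meta-path
    beta = np.exp(w - w.max())
    beta /= beta.sum()                                               # normalized importance
    return sum(b_p * Z_p for b_p, Z_p in zip(beta, Z))               # fused node embeddings

# toy usage: 3 meta-paths, 5 nodes, 4-dimensional node embeddings per meta-path
rng = np.random.default_rng(0)
Z = [rng.normal(size=(5, 4)) for _ in range(3)]
q, W, b = rng.normal(size=8), rng.normal(size=(4, 8)), rng.normal(size=8)
Z_fused = semantic_attention(Z, q, W, b)   # shape (5, 4)
```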
In addition to applying CGNNs to homogeneous graphs, several studies focused on applying CGNNs to heterogeneous and knowledge graphs [224,241,243,263,264,266]. Since heterogeneous graphs have different types of edges and nodes, the main problem when applying CGNN models is how to aggregate messages based on different edge types. Schlichtkrull et al. [241] introduced the R-GCN (Relational Graph Convolutional Network) model to model relational entities in knowledge graphs. R-GCN was the first model applied to learning node embeddings in heterogeneous graphs for several downstream tasks, such as link prediction and node classification. In addition, it uses parameter sharing to learn the node embeddings efficiently. Formally, given a node $v_i$ under relation $r \in R$, the hidden state at layer $l+1$ could be defined as:
$h_i^{(l+1)} = \sigma\left(\sum_{r \in R} \sum_{j \in N_i^r} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)}\right),$
where $c_{i,r}$ is a normalization constant and $N_i^r$ denotes the set of neighbors of node $v_i$ under relation $r$. Wang et al. [265] introduced the HANE (Heterogeneous Attributed Network Embedding) model to learn embeddings for heterogeneous graphs. The key idea of the HANE model is to measure attention scores for the different types of nodes in heterogeneous graphs. Formally, given a node $v_i$, the attention coefficient $s_{ij}^{(l)}$, the attention score $\alpha_{ij}^{(l)}$, and the hidden state $h_i^{(l+1)}$ at layer $l+1$ could be defined as:
$z_i^{(l)} = W_i^{(l)} x_i^{(l)}, \quad s_{ij}^{(l)} = \left(z_i^{(l)} \,\Vert\, z_j^{(l)}\right), \quad \alpha_{ij}^{(l)} = \frac{\exp\left(s_{ij}^{(l)}\right)}{\sum_{v_k \in N(v_i)} \exp\left(s_{ik}^{(l)}\right)},$
$h_i^{(l+1)} = \sigma\left(z_i^{(l)} \,\Big\Vert \sum_{v_k \in N(v_i)} \alpha_{ik}^{(l)} z_k^{(l)}\right),$
where $N(v_i)$ denotes the set of neighbors of node $v_i$, $x_i$ denotes the features of $v_i$, and $W_i^{(l)}$ is the weight matrix of each node type.
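For illustration, the relational aggregation of the R-GCN update above can be sketched as follows; the per-relation adjacency matrices, weights, and mean normalization are toy assumptions, and the basis decomposition used for parameter sharing is omitted.

```python
import numpy as np

def rgcn_layer(H, A_r, W_r, W_0):
    """One R-GCN layer: a separate weight matrix per relation type plus a self-connection
    term, with mean normalization over each relation's neighborhood."""
    out = H @ W_0                                          # self-connection
    for A, W in zip(A_r, W_r):                             # one adjacency matrix per relation
        deg = np.maximum(A.sum(axis=1, keepdims=True), 1)  # c_{i,r} normalization constant
        out += (A / deg) @ H @ W
    return np.maximum(out, 0.0)                            # ReLU

# toy usage: 4 nodes, 2 relation types, 6-dimensional states
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 6))
A_r = [rng.integers(0, 2, size=(4, 4)).astype(float) for _ in range(2)]
W_r = [rng.normal(size=(6, 6)) for _ in range(2)]
W_0 = rng.normal(size=(6, 6))
H_next = rgcn_layer(H, A_r, W_r, W_0)
```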
Several studies focused on applying CGNNs for recommendation systems [228,267,268,269]. For instance, Wang et al. [267] presented KGCN (Knowledge Graph Convolutional Networks) model to extract the user preferences in the recommendation systems. Since most existing models suffer from the cold start problem and sparsity of user–item interactions, the proposed model can capture users’ side information (attributes) on knowledge graphs. The users’ preferences, therefore, could be captured by a multilayer receptive field in GCN. Formally, given a user u, item v, N v denotes the set of items connected to u, the user–item interaction score could be computed as:
$\tilde{\pi}_{r_{v,e}}^{u} = \frac{\exp\left(\pi_{r_{v,e}}^{u}\right)}{\sum_{e \in N(v)} \exp\left(\pi_{r_{v,e}}^{u}\right)}, \quad v_{N(v)}^{u} = \sum_{e \in N(v)} \tilde{\pi}_{r_{v,e}}^{u}\, e,$
where $\pi_{r_{v,e}}^{u}$ denotes an inner-product score between user $u$ and relation $r$, and $e$ is the representation of item $v$.
Since the power of the autoencoder architecture is to learn a low-dimensional node representation in an unsupervised manner, several studies focused on integrating the convolutional GNNs into autoencoder architecture to leverage the power of the autoencoder architecture [72,270]. Table 14 summarizes graph convolutional autoencoder models for static and dynamic graphs.
Most graph autoencoder models were designed based on VAE (variational autoencoders) architecture to learn embeddings [274]. Kipf and Welling [72] introduced the GAE model, one of the first studies on applying autoencoder architecture to graph representation learning. GAE model [72] aimed to reconstruct the adjacency matrix A and feature matrix X from original graphs by adopting the CGNNs as an encoder and an inner product as the decoder part. Figure 19 presents the detail of the GAE model. Formally, the output embedding Z and the reconstruction process of the adjacency matrix input could be defined as:
$Z = \mathrm{GCN}(X, A), \quad \hat{A} = \sigma\left(Z Z^{\top}\right),$
where the $\mathrm{GCN}(\cdot, \cdot)$ function could be defined by Equation (65), and $\sigma$ is the activation function $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$. The model aims to reconstruct the adjacency matrix $A$ with an inner-product decoder:
$p(A \mid Z) = \prod_{i=1}^{N} \prod_{j=1}^{N} p\left(A_{ij} \mid Z_i, Z_j\right), \quad p\left(A_{ij} = 1 \mid Z_i, Z_j\right) = \sigma\left(Z_i^{\top} Z_j\right),$
where $\sigma$ is the sigmoid function and $A_{ij}$ is the value at row $i$ and column $j$ of the adjacency matrix $A$. In the training process, the model minimizes the loss function by gradient descent:
$\mathcal{L}(\theta) = \mathbb{E}_{q(Z \mid X, A)}\left[\log p(A \mid Z; \theta)\right] - \mathrm{KL}\left[q(Z \mid X, A)\,\Vert\, p(Z)\right],$
where $\mathrm{KL}\left[q(Z \mid X, A)\,\Vert\, p(Z)\right]$ is the Kullback–Leibler divergence between the two distributions $q$ and $p$.
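A minimal, non-variational sketch of the GAE idea (GCN encoder plus inner-product decoder with a reconstruction loss) is shown below; the normalized adjacency passed to the encoder is assumed to be precomputed as in a standard GCN, and the KL term of the variational version is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gcn_encoder(A_hat, X, W1, W2):
    """Two-layer GCN encoder producing node embeddings Z (non-variational GAE)."""
    return A_hat @ np.maximum(A_hat @ X @ W1, 0.0) @ W2

def reconstruction_loss(A, Z):
    """Inner-product decoder A_rec = sigmoid(Z Z^T), scored with binary cross-entropy."""
    A_rec = sigmoid(Z @ Z.T)
    eps = 1e-8
    return -np.mean(A * np.log(A_rec + eps) + (1 - A) * np.log(1 - A_rec + eps))

# toy usage: A_hat is the symmetrically normalized adjacency with self-loops
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
A_tilde = A + np.eye(3)
D_inv_sqrt = np.diag(A_tilde.sum(axis=1) ** -0.5)
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt
X = rng.normal(size=(3, 5))
Z = gcn_encoder(A_hat, X, rng.normal(size=(5, 4)), rng.normal(size=(4, 2)))
loss = reconstruction_loss(A, Z)
```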
Several models attempted to incorporate the autoencoder architecture into the GNN model to reconstruct graphs. For example, the MGAE model [270] combined the message-passing mechanism from GNNs and GAE architecture for graph clustering. The primary purpose of MGAE is to capture information about the features of the nodes by randomly removing several noise pieces of information from the feature matrix to train the GAE model.
GNNs have shown outstanding performance in learning from complex structural graphs that shallow models could not handle [245,275,276]. There are several main advantages of deep neural network models:
  • Parameter sharing: Deep neural network models share weights during the training phase to reduce training time and training parameters while increasing the performance of the models. In addition, the parameter-sharing mechanism allows the model to learn multi-tasks.
  • Inductive learning: The outstanding advantage of deep models over shallow models is that deep models can support inductive learning. This makes deep-learning models capable of generalizing to unseen nodes and having practical applicability.
However, even though CGNNs are considered the most advantageous line of GNNs, they still have limitations in graph representation learning.
  • Over-smoothing problem: When capturing the graph structure and entity relationships, CGNNs rely on an aggregation mechanism that captures information from neighboring nodes for target nodes. This results in stacking multiple graph convolutional layers to capture higher-order graph structure. However, increasing the depth of convolution layers could lead to over-smoothing problems [252]. To overcome this drawback, models based on transformer architecture have shown several improvements compared to CGNNs using self-attention.
  • The ability on disassortative graphs: Disassortative graphs are graphs where nodes with different labels tend to be linked together. However, the aggregation mechanism in GNN samples all the features of the neighboring nodes even though they have different labels. Therefore, the aggregation mechanism is the limitation and challenge of GNNs for disassortative graphs in classification tasks.

3.4.4. Graph Transformer Models

Transformers [277] have gained tremendous success for many tasks in natural language processing [278,279] and image processing areas [280,281]. In documents, the transformer models could tokenize sentences into a set of tokens and represent them as one-hot encodings. With image processing, the transformer models could adopt image patches and use two-dimensional encoding to tokenize the image data. However, the tokenization of graph entities is non-trivial since graphs have irregular structures and disordered nodes. Therefore, applying transformers to graphs is still an open question of whether the graph transformer models are suitable for graph representation learning.
The transformer architecture consists of two main parts: a self-attention module and a position-wise feedforward network. Mathematically, the input of the self-attention module at layer $l$ could be formulated as $H = \left[h_1^{(l)}, h_2^{(l)}, \ldots, h_N^{(l)}\right]$, where $h_i^{(l)}$ denotes the hidden state of node $v_i$. The self-attention could then be formulated as:
$Q = H W_Q, \quad K = H W_K, \quad V = H W_V,$
$S = \frac{Q K^{\top}}{\sqrt{d_K}}, \quad \mathrm{Attn}(H) = \mathrm{Softmax}(S)\, V,$
where $Q$, $K$, and $V$ depict the query, key, and value matrices, respectively, and $d_K$ is the hidden embedding dimension. The matrix $S$ measures the similarity between the queries and keys.
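A single attention head over node states can be sketched in a few lines of NumPy; the weight matrices are illustrative, and multi-head attention, masking, and the feedforward sublayer are omitted.

```python
import numpy as np

def self_attention(H, W_Q, W_K, W_V):
    """Scaled dot-product self-attention over all node states H (one head)."""
    Q, K, V = H @ W_Q, H @ W_K, H @ W_V
    S = Q @ K.T / np.sqrt(K.shape[1])                 # pairwise similarity scores
    S = np.exp(S - S.max(axis=1, keepdims=True))
    S = S / S.sum(axis=1, keepdims=True)              # row-wise softmax
    return S @ V                                      # attention-weighted values

# toy usage: 5 nodes, 8-dimensional states, 4-dimensional head
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(H, W_Q, W_K, W_V)                # shape (5, 4)
```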
The architecture of graph transformer models differs from GNNs. GNNs use message-passing to aggregate the information from neighbor nodes to target nodes. However, graph transformer models use a self-attention mechanism to capture the context of target nodes in graphs, which usually denotes the similarity between nodes in graphs. The self-attention mechanism could help capture the amount of information aggregated between two nodes in a specific context. In addition, the models use a multi-head self-attention that allows various information channels to pass to the target nodes. Transformer models then learn the correct aggregation patterns during training without pre-defining the graph structure sampling. Table 15 lists a summary of graph transformer models.
In this section, we divide graph transformer models for graph representation learning into three main groups based on the strategy of applying graph transformer models.
  • Structural encoding-based graph transformer: These models focus on various positional encoding schemes to capture absolute and relative information about entity relationships and graph structure. Structural encoding strategies are mainly suitable for tree-like graphs since the models should capture the hierarchical relations between the target nodes and their parents as well as the interaction with other nodes of the same level.
  • GNNs as an auxiliary module: GNNs bring a powerful mechanism in terms of aggregating local structural information. Therefore, several studies try integrating message-passing and GNN modules with a graph transformer encoder as an auxiliary.
  • Edge channel-based attention: The graph structure could be viewed as the combination of the node and edge features and the ordered/unordered connection between them. From this perspective, we do not need GNNs as an auxiliary module. Recently, several models have been proposed to capture graph structure in depth as well as apply graph transformer architecture based on the self-attention mechanism.
Several models tried to apply vanilla transformers to tree-like graphs to capture node positions [64,65,277,288]. Preserving the tree structure means preserving a node's relative and absolute structural positions in the tree. The absolute structural position describes the positional relationship of the current node to the parent (root) nodes, whereas the relative structural position describes the positional relationship of the current node to its neighbors.
Shiv and Quirk [64] proposed a positional encoding (PE) strategy for programming language translation tasks. The significant advantage of tree-based models is that they can explore nonlinear dependencies. By customizing the positional encodings of nodes in the graph in a hierarchical manner, the model could strengthen the transformer's capacity to capture the relationship between node pairs in the tree. The key idea is to represent programming language data in the form of a binary tree and to encode the target nodes based on the location of the parent nodes and the relationship with neighboring nodes at the same level. Specifically, they used binary matrices to encode the relationship of target nodes with their parents and neighbors.
Similarly, Wang et al. [65] introduced structural position representations for tree-like graphs. However, they combine sequential and structural positional encoding to enrich the contextual and structural language data. The absolute and relative position encodings for each word $w_i$ could be defined as:
$\mathrm{PE}_i = f\left(\frac{\mathrm{Abs}(v_i)}{10{,}000^{2i/d}}\right),$
$\mathrm{PE}_{ij} = \frac{x_i W^{Q} \left(x_j W^{K}\right)^{\top} + x_i W^{Q} \left(a_{ij}^{K}\right)^{\top}}{\sqrt{d}},$
where $\mathrm{Abs}(v_i)$ is the absolute position of the word in the sentence, $d$ denotes the hidden size of the $K$ and $Q$ matrices, $f(\cdot)$ is the $\sin/\cos$ function depending on whether the dimension is even or odd, and $a_{ij}^{K}$ is the relative position representation.
The sentences are also represented as a dependency tree, which captures the structural relations between words. For the structural position encoding, the absolute and relative structural positions of a node $v_i$ could be encoded as:
$\mathrm{PE}_i = d\left(v_i, \mathrm{root}\right),$
$\mathrm{PE}_{ij} = \begin{cases} \mathrm{PE}_i - \mathrm{PE}_j & \text{if } (v_i, v_j) \in E, \\ \mathrm{PE}_i + \mathrm{PE}_j & \text{if } (v_i, v_j) \notin E,\ i > j, \\ -\left(\mathrm{PE}_i + \mathrm{PE}_j\right) & \text{if } (v_i, v_j) \notin E,\ i < j, \\ 0 & \text{otherwise}, \end{cases}$
where $d(\cdot, \cdot)$ denotes the distance between the root node and the target node. They then use a linear function to combine the sequential PE and the structural PE as inputs to the transformer encoder.
To capture more global structural information in tree-like graphs, Cai and Lam [282] also proposed an absolute position encoding to capture the relation between target and root nodes. Regarding the relative positional encoding, they use an attention score to measure the relationship between nodes on the same shortest path sampled from the graph. The power of using the shortest path is that it can capture the hierarchical proximity and the global structure of the graph. Given two nodes $v_i$ and $v_j$, the attention score between them can be calculated as:
$S_{ij} = H_i W_q W_k^{\top} H_j^{\top},$
where $W_q$ and $W_k$ are trainable projection matrices, and $H_i$ and $H_j$ depict the representations of nodes $v_i$ and $v_j$, respectively. To define the relationship $r_{ij}$ between two nodes $v_i$ and $v_j$, they adopt a bi-directional GRU model, which could be defined as follows:
$\overrightarrow{s_i} = \overrightarrow{\mathrm{GRU}}\left(\overrightarrow{s}_{i-1}, SPD_{ij}\right),$
$\overleftarrow{s_i} = \overleftarrow{\mathrm{GRU}}\left(\overleftarrow{s}_{i+1}, SPD_{ij}\right),$
where $SPD$ denotes the shortest path from node $v_i$ to node $v_j$, and $\overrightarrow{s_i}$ and $\overleftarrow{s_i}$ are the states of the forward and backward GRUs, respectively.
Several models tried to encode positional information of nodes based on subgraph sampling [63,283]. Zhang et al. [63] proposed a Graph-Bert model, which samples the subgraph structure using absolute and relative positional encoding layers. In terms of subgraph sampling, they adopt a top-k intimacy sampling strategy to capture subgraphs as inputs for positional encoding layers. Four layers in the model are responsible for positional encoding. Since several strategies were implemented to capture the structural information in graphs, the advantage of Graph-Bert is that it can be trainable with various types of subgraphs. In addition, Graph-Bert could be further fine-tuned to learn various downstream tasks. For each node v i in a subgraph G i = ( V i , E i ) , they first embed raw feature x i using a linear function. They then adopt three layers to encode the positional information of a node, including absolute role embedding, relative positional embedding, and hop-based relative distance embedding. Formally, the output of three embedding layers of the node v i from subgraph G i could be defined as follows:
$\mathrm{PE}_i^{(1)} = f\left(\mathrm{WL}(v_i)\right),$
$\mathrm{PE}_i^{(2)} = f\left(P(v_i)\right),$
$\mathrm{PE}_i^{(3)} = f\left(H(v_j, v_i)\right),$
$f(x_i) = \left[\sin\left(\frac{x_i}{10{,}000^{2l/d}}\right), \cos\left(\frac{x_i}{10{,}000^{(2l+1)/d}}\right)\right]_{l=0}^{\lfloor d/2 \rfloor},$
where $\mathrm{WL}(v_i)$ denotes the WL code labeling node $v_i$, which can be calculated from the whole graph, $l$ and $d$ are the number of iterations throughout all the nodes and the vector dimension of the nodes, respectively, $P(\cdot)$ is a position metric, $H(\cdot, \cdot)$ denotes the distance metric between two nodes, and $\mathrm{PE}_i^{(1)}$, $\mathrm{PE}_i^{(2)}$, and $\mathrm{PE}_i^{(3)}$ denote the absolute role, relative structure-intimacy, and relative structure-hop positional encodings, respectively. They then aggregate all the vector embeddings together as initial embedding vectors for the graph transformer encoder. Mathematically, the transformer architecture could be explained as follows:
$h_i^{(0)} = \mathrm{PE}_i^{(1)} + \mathrm{PE}_i^{(2)} + \mathrm{PE}_i^{(3)} + X_i,$
$H^{(l)} = \mathrm{Transformer}\left(H^{(l-1)}\right),$
$Z_i = \mathrm{Fusion}\left(H^{(l)}\right).$
Similar to Graph-Bert, Jeon et al. [283] represented subgraphs of a paper citation network to capture the citation context of each paper. Each paper is considered a subgraph whose nodes are the referenced papers. To extract the citation context, they encode the order of the referenced papers in the target paper based on the position and order of the references. In addition, they use the WL label to capture the structural role of the references. The approach of Liu et al. [289] is conceptually similar to [283]; however, there is a significant difference between them. Liu et al. proposed an MCN sampling strategy to capture the contextual neighbors of a subgraph, where neighbors are selected according to the importance of the target node, measured by its frequency of occurrence during sampling.
In several types of graphs, such as molecular networks, the edges could bring features presenting the chemical connections between atoms. Several models adopted Laplacian eigenvectors to encode the positional node information with edge features [29,284]. Dwivedi and Bresson [29] proposed the positional encoding strategy using node position and edge channel as inputs to the transformer model. The idea of this model is to use Laplacian eigenvectors to encode the node position information from graphs and then define edge channels to capture the global graph structures. The advantage of using the Laplacian eigenvector is that it can help the transformer model learn the proximity of neighbor nodes by maximizing the dot product operator between Q and K matrix. They first pre-computed Laplacian eigenvectors from the Laplacian matrix that could be calculated as:
$\Delta = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}} = U \Lambda U^{\top},$
where $\Delta$ is the Laplacian matrix, and $\Lambda$ and $U$ denote the eigenvalues and eigenvectors, respectively. The Laplacian eigenvector $\lambda_i$ then serves as the positional encoding for node $v_i$. Given a node $v_i$ with feature $x_i$ and the edge feature $e_{ij}$, the first hidden layer and edge channel could be defined as:
$h_i^{(0)} = A^{(0)} x_i + \lambda_i^{(0)} + a^{(0)},$
$e_{ij}^{(0)} = B^{(0)} e_{ij} + b^{(0)}.$
The hidden state $\hat{h}_i^{(l+1)}$ of node $v_i$ and the edge channel $\hat{e}_{ij}^{(l+1)}$ at layer $l+1$ could be defined as follows:
$\hat{h}_i^{(l+1)} = O_h^{(l)} \Big\Vert_{k=1}^{H} \left(\sum_{j \in N_i} A_{ij}^{k,l}\, V^{k,l} h_j^{l}\right),$
$\hat{e}_{ij}^{(l+1)} = O_e^{(l)} \Big\Vert_{k=1}^{H} \left(A_{ij}^{k,l}\right),$
$S_{ij}^{k,l} = \left(\frac{Q^{k,l} h_i^{l} \cdot K^{k,l} h_j^{l}}{\sqrt{d_k}}\right) \cdot E^{k,l} e_{ij}^{l},$
where $Q$, $K$, $V$, and $E$ are learned projection matrices, and $H$ denotes the number of attention heads.
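The Laplacian positional encoding used as node input above can be sketched as follows; the choice of the $k$ smallest non-trivial eigenvectors and the toy cycle graph are illustrative assumptions.

```python
import numpy as np

def laplacian_positional_encoding(A, k):
    """Use the eigenvectors of the normalized Laplacian associated with the k smallest
    non-trivial eigenvalues as positional encodings for each node."""
    D_inv_sqrt = np.diag(A.sum(axis=1) ** -0.5)
    L = np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L)              # eigh: L is symmetric
    return eigvecs[:, 1:k + 1]                        # skip the trivial constant eigenvector

# toy usage: a 5-node cycle graph with 2-dimensional positional encodings
A = np.zeros((5, 5))
for i in range(5):
    A[i, (i + 1) % 5] = A[(i + 1) % 5, i] = 1.0
pe = laplacian_positional_encoding(A, k=2)            # added to the node inputs h_i^(0)
```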
Similar to [29], Kreuzer et al. [284] aimed to add edge channels for all pairs of nodes in an input graph. However, the critical difference is that they combine full-graph attention with sparse attention. One advantage of the model is that it could capture more global structural information, since self-attention is applied to the nodes of the sparse graph. Therefore, they use two different similarity matrices to guide the transformer model to distinguish the local and global connections between nodes in graphs. Formally, they re-define the similarity matrix for connected and disconnected node pairs as follows:
$\hat{S}_{ij}^{k,l} = \begin{cases} \dfrac{Q^{1,k,l} h_i^{l} \cdot K^{1,k,l} h_j^{l} \odot E^{1,k,l} e_{ij}}{\sqrt{d}} & \text{if } (v_i, v_j) \in E, \\ \dfrac{Q^{2,k,l} h_i^{l} \cdot K^{2,k,l} h_j^{l} \odot E^{2,k,l} e_{ij}}{\sqrt{d}} & \text{otherwise}, \end{cases}$
where $\hat{S}_{ij}^{k,l}$ denotes the similarity between two nodes $v_i$ and $v_j$, and $(Q^1, K^1, E^1)$ and $(Q^2, K^2, E^2)$ are the queries, keys, and edge projections of connected and disconnected node pairs, respectively.
In some specific cases where graphs are sparse, small, or fully connected, the self-attention mechanism could lead to the over-smoothing problem and structure loss since it cannot learn the graph structure. To overcome these limitations, several models adopt GNNs as an auxiliary model to maintain the local structure of the target nodes [99,100,285]. Rong et al. [100] proposed the Grover model, which integrates the message-passing mechanism into the transformer encoder for self-supervised tasks. They used the dynamic message-passing mechanism to capture the number of hops compatible with different graph structures. To avoid the over-smoothing problem, they used a long-range residual connection to strengthen the awareness of local structures.
Several models attempted to integrate GNNs on top of the multi-attention sublayers to preserve local structure between nodes neighbors [63,99,290]. For instance, Lin et al. [99] presented Mesh Graphormer model to capture the global and local information from 3D human mesh. Unlike the Grover model, they inserted a sublayer graph residual block with two GCN layers on top of the multi-head attention layer to capture more local connections between connected pair nodes. Hu et al. [285] integrated message-passing with a transformer model for heterogeneous graphs. Since heterogeneous graphs have different types of node and edge relations, they proposed an attention score, which could capture the importance of nodes. Given a source node v i and a target node v j with the edge e i j , the attention score could be defined as:
$S\left(v_i, e_{ij}, v_j\right) = \mathrm{Softmax}\left(\Big\Vert_{m=1}^{M_{head}} \alpha^{m}\left(v_i, e_{ij}, v_j\right)\right),$
$\alpha^{m}\left(v_i, e_{ij}, v_j\right) = \left(K^{m}(v_i)\, W_{\tau(e_{ij})}\, Q^{m}(v_j)^{\top}\right) \cdot \frac{\mu}{\sqrt{d}},$
where $\alpha^{m}(\cdot, \cdot, \cdot)$ denotes the $m$-th attention head, $W_{\tau(e_{ij})}$ is the attentive trainable weight matrix for each edge type, $K$ and $Q$ are linear projections of the source node $v_i$ and target node $v_j$ for every node type, respectively, and $\mu$ is the importance of each relationship.
Nguyen et al. [61] introduced the UGformer model, which uses a convolution layer on top of the transformer layer to work with sparse and small graphs. Applying only self-attention could result in structure loss in several small-sized and sparse graphs. A GNN layer is stacked after the output of the transformer encoder to maintain local structures in graphs. One of the advantages of the GNN layer is that it can help the transformer model retain the local structure information since all the nodes in the input graph are fully connected.
In graphs, nodes are arranged irregularly and without a natural order, unlike words in sentences or pixels in images. They can lie in a multidimensional space and interact with each other through connections. Therefore, the structural information around a node can be extracted from the centrality of the node and its edges without the need for a positional encoding strategy. Recently, several proposed studies have shown remarkable results in understanding graph structure in this way.
Several graph transformer models have been proposed to capture the structural relations in the natural language processing area. Zhu et al. [62] presented a transformer model to encode abstract meaning representation (AMR) graphs to word sequences. This is the first transformer model that aims to integrate structural knowledge in AMR graphs. The model aims to add a sequence of edge features to the similarity matrix and attention score to capture the graph structure. Formally, the attention score and the vector embedding could be defined as:
S i j = x i W Q