3.4.2. Recurrent Graph Neural Networks
One of the first models applying deep neural networks to graph representation learning was the graph neural network (GNN). The main idea of GNNs is that messages are exchanged between target nodes and their neighbors until a stable equilibrium is reached.
Table 9 summarizes graph recurrent autoencoder models.
Scarselli et al. [44,45] proposed a GNN model which could learn embeddings directly for different kinds of graphs, such as acyclic/cyclic and directed/undirected graphs. These models assumed that if nodes are directly connected in a graph, the distance between them should be minimized in the latent space. The GNN models used a data diffusion mechanism to aggregate signals from neighbor nodes (units) to target nodes. Therefore, the state of a node describes the context of its neighbors and can be used to learn embeddings. Mathematically, given a node $v$ in a graph, the state of $v$ and its output can be defined as:
\[ h_v = f_w\!\left(x_v, x_{co[v]}, h_{ne[v]}, x_{ne[v]}\right), \qquad o_v = g_w\!\left(h_v, x_v\right), \]
where $f_w$ and $g_w$ are transition functions, and $x_v$, $x_{co[v]}$ denote the labels of node $v$ and of its incident edges, respectively. By considering the state $h_v^{(l)}$ that is revised by the iterative propagation process, the state of $v$ and its output at layer $l$ could be defined as:
\[ h_v^{(l)} = f_w\!\left(x_v, x_{co[v]}, h_{ne[v]}^{(l-1)}, x_{ne[v]}\right), \qquad o_v^{(l)} = g_w\!\left(h_v^{(l)}, x_v\right). \]
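The fixed-point propagation behind this recurrent GNN can be illustrated with a minimal sketch: node states are repeatedly updated from the node labels and the neighbor states until they stop changing. This is a toy illustration under assumed shapes and a tanh transition, not the authors' implementation.

```python
import numpy as np

def recurrent_gnn_states(adj, x, W_self, W_neigh, tol=1e-5, max_iter=100):
    """Iterate h_v <- tanh(W_self x_v + W_neigh * sum of neighbor states)
    until a (near) fixed point is reached, as in recurrent GNNs."""
    n, d = x.shape[0], W_self.shape[0]
    h = np.zeros((n, d))
    for _ in range(max_iter):
        neigh = adj @ h                                   # sum of neighbor states (adj is n x n)
        h_new = np.tanh(x @ W_self.T + neigh @ W_neigh.T)
        if np.max(np.abs(h_new - h)) < tol:               # steady state reached
            return h_new
        h = h_new
    return h

# toy usage: a 4-node cycle graph with 3-dimensional node labels
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], float)
x = rng.normal(size=(4, 3))
# small weights keep the update a contraction, so the iteration converges as the GNN theory requires
W_self, W_neigh = 0.1 * rng.normal(size=(8, 3)), 0.1 * rng.normal(size=(8, 8))
h = recurrent_gnn_states(adj, x, W_self, W_neigh)
```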
However, one of the limitations of GNNs is that the model learns node embeddings as a single output, which causes problems when sequence outputs are required. Several studies tried to improve GNNs using recurrent graph neural networks [17,48,49]. Unlike GNNs, which produce a single output for each entity in a graph, Li et al. [17] attempted to output sequences by applying gated recurrent units. The model used two gated graph neural networks, one to predict the output and one to predict the following hidden states. Therefore, the output of node $v$ at layer $l$ could be computed as:
\[ h_v^{(l)} = \mathrm{GRU}\!\left( h_v^{(l-1)}, \sum_{u \in \mathcal{N}(v)} W h_u^{(l-1)} \right), \]
where $\mathcal{N}(v)$ denotes the set of neighbors of node $v$.
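A gated update of this kind can be sketched with a GRU cell. The snippet below is a simplified illustration (single edge type, hypothetical dimensions), not the exact architecture of [17]:

```python
import torch
import torch.nn as nn

class GatedGraphLayer(nn.Module):
    """One GGNN-style propagation step: aggregate neighbor states,
    then update each node state with a GRU cell."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.msg = nn.Linear(hidden_dim, hidden_dim, bias=False)  # W in the update rule
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)

    def forward(self, adj, h):
        # adj: (n, n) adjacency matrix, h: (n, hidden_dim) node states
        m = adj @ self.msg(h)          # summed messages from neighbors
        return self.gru(m, h)          # GRU treats messages as input, states as hidden

# toy usage: a few propagation steps on a random 5-node graph
n, d = 5, 16
adj = (torch.rand(n, n) < 0.4).float()
h = torch.randn(n, d)
layer = GatedGraphLayer(d)
for _ in range(3):
    h = layer(adj, h)
```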
Wang et al. [49] proposed the Topo-LSTM model to capture the diffusion structure by representing graphs as diffusion cascades that distinguish active and inactive nodes. Given a cascade sequence $s$, the hidden state of a node $v$ at time $t$ is computed from two input aggregations, $p$ and $q$, over the active nodes connected with $v$ and not connected with $v$, respectively, where the connected nodes are drawn from the precedent set of active nodes at time $t$ and the disconnected ones from the set of nodes activated before time $t$.
Figure 15 presents an example of the Topo-LSTM model. However, these models could not capture the global graph structure since they only capture the structure within a $k$-hop distance. Several models have been proposed that combine recurrent graph neural network architectures with random-walk sampling to capture higher-order structural information [48,93]. Huang et al. [93] introduced the GraphRNA model, which combines a joint random-walk strategy on attributed graphs with recurrent graph networks. One strength of the random-walk sampling strategy is that it captures the global structure. By treating the node attributes as a bipartite network, the model can perform joint random walks on the bipartite attribute matrix to capture the global structure of graphs. After sampling the node attributes and graph structure through joint random walks, the model uses recurrent graph neural networks to learn embeddings. Similar to the GraphRNA model, Zhang et al. [48] presented the SHNE model to analyze attribute semantics and global structure in attributed graphs. The SHNE model also used a random-walk strategy to capture the global structure of graphs. However, the main difference between SHNE and GraphRNA is that the SHNE model first applies a GRU (gated recurrent unit) model to learn the attributes and then combines them with the graph structure via random-walk sampling.
Since the strength of the autoencoder architecture is to learn compressed representations, several studies [57,205] aimed to combine RGNNs and autoencoders to learn node embeddings in weighted graphs. For instance, Seo and Lee [57] adopted an LSTM autoencoder to learn node embeddings for weighted graphs. They used the BFS algorithm to traverse the nodes of a graph and extract node-weight sequences as inputs for the LSTM autoencoder. The model could then leverage graph structure reconstruction through the autoencoder architecture and the node attributes through the LSTM model.
Figure 16 presents the sampling strategy of this model, which lists the nodes and their respective weighted edges. To capture local and global graph structure, Aynaz et al. [205] proposed a sequence-to-sequence autoencoder model, which can represent inputs of arbitrary length. The LSTM-based autoencoder architecture consists of two main parts: the encoder layer $f_{\mathrm{enc}}$ and the decoder layer $f_{\mathrm{dec}}$. For the sequence-to-sequence autoencoder, at each time step $t$, the hidden vectors in the encoder and decoder layers can be defined as:
\[ h_t^{\mathrm{enc}} = f_{\mathrm{enc}}\!\left( x_t, h_{t-1}^{\mathrm{enc}} \right), \qquad h_t^{\mathrm{dec}} = f_{\mathrm{dec}}\!\left( \hat{x}_{t-1}, h_{t-1}^{\mathrm{dec}} \right), \]
where $h_t^{\mathrm{enc}}$ and $h_t^{\mathrm{dec}}$ are the hidden states at step $t$ in the encoder and decoder layers, respectively. To generate the sequences of nodes, the model implemented different sampling strategies, including random walks, shortest paths, and breadth-first search with the WL algorithm to encode the information of node labels.
Since the aforementioned models learn node embeddings for static graphs, Shima et al. [203] presented the LSTM-Node2Vec model, which combines an LSTM-based autoencoder architecture with the Node2Vec model to learn embeddings for dynamic graphs. The idea of LSTM-Node2Vec is to use an LSTM autoencoder to preserve the history of node evolution via temporal random-walk sampling, and then adopt the Node2Vec model to generate the vector embeddings for the new graphs. Figure 17 presents a temporal random-walk sampling strategy for traversing a dynamic graph.
Jinyin et al. [204] presented the E-LSTM-D (Encoder-LSTM-Decoder) model to learn embeddings for dynamic graphs by combining the autoencoder architecture and LSTM layers. Given a set of graph snapshots, the objective of the model is to learn a mapping function from past snapshots to the next snapshot. The model takes the adjacency matrix of the $i$-th graph in the series of snapshots as the input of the autoencoder, and the encoder layer compresses it through fully connected layers with a nonlinear activation function. For the decoder layer, the model tries to reconstruct the original adjacency matrix from the vector embeddings produced by the stacked LSTM, which captures the structure of the current graph. Similar to the E-LSTM-D model, Palash et al. [201] proposed a variant of the Dyngraph2Vec model, named Dyngraph2VecAERNN (Dynamic Graph to Vector Autoencoder Recurrent Neural Network), which also takes the adjacency matrix as the model input. However, the critical difference between the E-LSTM-D model and the Dyngraph2VecAERNN model is that the latter feeds the LSTM layers directly into the encoder part to learn embeddings, and its decoder is composed of fully connected neural network layers that reconstruct the inputs.
There are several advantages of recurrent graph neural networks compared to shallow learning techniques:
Diffusion pattern and multiple relations: RGNNs show superior learning ability when dealing with diffused information, and they can handle multi-relational graphs in which a single node has many relations. This is achieved through the ability to update the state of each node in each hidden layer.
Parameter sharing: RGNNs share parameters across different locations, which enables them to handle sequential node inputs. This reduces computational complexity during training by using fewer parameters and increases the performance of the models.
However, one of the disadvantages of RGNNs is that these models use recurrent layers with the same weights during the weight-update process. This leads to inefficiencies in representing different relationship constraints between neighbor and target nodes. To overcome this limitation, convolutional GNNs, which use different weights in each hidden layer, have shown remarkable ability in recent years.
3.4.3. Convolutional Graph Neural Networks
CNNs have achieved remarkable success in the image processing area. Since image data can be considered a special case of graph data, convolution operators can be defined and applied to graph mining. There are two strategies for applying convolution operators to the graph domain. The first strategy is based on graph spectral theory, which transforms graph entities from the spatial domain to the spectral domain and applies convolution filters in the spectral domain. The other strategy directly employs the convolution operators in the graph (spatial) domain.
Table 10 summarizes spectral CGNN models.
When computing power is insufficient to implement convolution operators directly on the graph domain, several studies transform the graph data to the spectral domain and apply filtering operators there to reduce computational time [18,55,213]. The signal filtering process acts as feature extraction on the Laplacian matrix. Most models consider simple, undirected graphs and represent the graph data as a (normalized) Laplacian matrix:
\[ L = I_N - D^{-1/2} A D^{-1/2}, \]
where $D$ denotes the diagonal matrix of node degrees and $A$ is the adjacency matrix. The matrix $L$ is a symmetric positive semidefinite matrix describing the graph structure. Considering a matrix $U$ as the graph Fourier basis, the Laplacian matrix can then be decomposed into three components:
\[ L = U \Lambda U^{T}, \]
where $\Lambda$ is the diagonal matrix of eigenvalues, which denotes the spectral representation of the graph topology, and $U$ is the eigenvector matrix. The filter function $g_\theta(\Lambda)$ resembles a $k$-order polynomial, and the spectral convolution acts as a diffusion convolution in the graph domain. The spectral graph convolution of an input $x$ with a filter $g_\theta$ is defined as:
\[ g_\theta * x = U\, g_\theta(\Lambda)\, U^{T} x, \]
where $*$ is the convolution operation. Bruna et al. [56] transformed the graph data to the spectral domain and applied filter operators on the Fourier basis. The hidden state at layer $l$ could be defined as:
\[ h_j^{(l+1)} = \sigma\!\left( V \sum_{i=1}^{f_l} F_{i,j}^{(l)} V^{T} h_i^{(l)} \right), \qquad j = 1, \ldots, f_{l+1}, \]
where $F_{i,j}^{(l)}$ is a diagonal matrix of learnable filter parameters at layer $l$, $f_l$ denotes the number of filters at layer $l$, and $V$ denotes the eigenvectors of the matrix $L$. Typically, most of the useful spectral information is concentrated in the first $d$ eigenvectors. Therefore, we can keep only the first $d$ columns of the matrix $V$, and the number of parameters that has to be trained per filter is reduced to $d$.
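The spectral filtering pipeline (eigendecomposition, filtering in the Fourier domain, transforming back) can be sketched as follows. This is an illustrative example with a freely chosen low-pass filter, not a specific published model:

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for an undirected adjacency matrix A."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    return np.eye(len(A)) - (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

def spectral_filter(A, x, g):
    """Apply a spectral filter g(lambda) to the graph signal x: U g(Lambda) U^T x."""
    L = normalized_laplacian(A)
    lam, U = np.linalg.eigh(L)              # eigenvalues and graph Fourier basis
    x_hat = U.T @ x                         # graph Fourier transform of the signal
    return U @ (g(lam)[:, None] * x_hat)    # filter in the spectral domain, transform back

# toy usage: a low-pass filter that attenuates high graph frequencies
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], float)
x = np.random.randn(4, 2)                   # a 2-channel signal on 4 nodes
x_smooth = spectral_filter(A, x, g=lambda lam: np.exp(-2.0 * lam))
```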
Several studies focused on improving the spectral filters to reduce computational time and capture more graph structure in the spectral domain [210,216]. For instance, Defferrard et al. [216] presented a strategy to re-design the convolutional filters for graphs. Since the spectral filter $g_\theta$ indeed generates a kernel on graphs, the key idea is to consider $g_\theta$ as a polynomial that yields a $k$-localized kernel:
\[ g_\theta(\Lambda) = \sum_{i=0}^{k-1} \theta_i \Lambda^{i}, \]
where $\theta \in \mathbb{R}^{k}$ is a vector of polynomial coefficients. This $k$-localized kernel provides a circular distribution of weights in the kernel from a target node to its $k$-hop neighbors in the graph.
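A polynomial (Chebyshev-style) filter avoids the explicit eigendecomposition, because it only needs repeated multiplications with the Laplacian. A minimal sketch, using the recurrence $T_i(x) = 2xT_{i-1}(x) - T_{i-2}(x)$ on a rescaled Laplacian, is shown below; the `normalized_laplacian` helper from the previous sketch can supply $L$:

```python
import numpy as np

def chebyshev_filter(L, x, theta, lmax=2.0):
    """Apply sum_i theta_i T_i(L_scaled) x, where T_i are Chebyshev polynomials
    and L_scaled = 2 L / lmax - I keeps the spectrum inside [-1, 1].
    Assumes len(theta) >= 2."""
    n = L.shape[0]
    L_s = 2.0 * L / lmax - np.eye(n)
    t_prev, t_curr = x, L_s @ x                   # T_0 x = x, T_1 x = L_s x
    out = theta[0] * t_prev + theta[1] * t_curr
    for i in range(2, len(theta)):
        t_next = 2.0 * L_s @ t_curr - t_prev      # Chebyshev recurrence
        out += theta[i] * t_next
        t_prev, t_curr = t_curr, t_next
    return out
```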
Unlike the above models, Zhuang and Ma [211] tried to capture the local and global graph structure by introducing two convolutional filters. The first convolutional operator, a local-consistency convolution, captures the local graph structure. The output of a hidden layer could then be defined as:
\[ H^{(l+1)} = \sigma\!\left( \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{(l)} W^{(l)} \right), \]
where $\hat{A}$ denotes the adjacency matrix with self-loops and $\hat{D}$ is the diagonal matrix presenting the degree information of the nodes. In addition to the first filter, the second filter aims to capture the global structure of graphs, which could be defined as:
\[ H^{(l+1)} = \sigma\!\left( D_P^{-1/2} P D_P^{-1/2} H^{(l)} W^{(l)} \right), \]
where $P$ denotes the PPMI matrix, which can be calculated from a frequency matrix obtained via random-walk sampling.
Most of the above models learn node embeddings by transforming the graph data to the spectral domain and applying convolutional filters, which increases computational complexity. In 2016, Kipf and Welling [18] introduced graph convolutional networks (GCNs), which are considered a bridge between the spectral and spatial approaches. The spectral filter $g_\theta$ and the hidden layers of the GCN model, following the layer-wise propagation rule, can be defined as follows:
\[ g_\theta * x \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{L})\, x, \qquad H^{(l+1)} = \sigma\!\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right), \]
where $\tilde{L} = \frac{2}{\lambda_{\max}} L - I_N$ and $\lambda_{\max}$ is the largest eigenvalue of the Laplacian matrix $L$, $\tilde{A} = A + I_N$ and $\tilde{D}$ is its degree matrix, $\theta$ is the vector of Chebyshev coefficients, and $T_k$ are the Chebyshev polynomials, which could be defined as:
\[ T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x), \]
where $T_0(x) = 1$ and $T_1(x) = x$. Consequently, with the first-order approximation ($K = 1$, $\lambda_{\max} \approx 2$), the convolution filter of an input $x$ is defined as:
\[ g_\theta * x \approx \theta\, \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}\, x. \]
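The propagation rule above amounts to multiplying the feature matrix by the symmetrically normalized adjacency matrix with self-loops. A minimal sketch (random weights, not a trained model):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_tilde = A + np.eye(A.shape[0])                      # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    A_hat = (A_tilde * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ H @ W, 0.0)                 # ReLU activation

# toy usage: two stacked layers on a 4-node graph with 3-dimensional features
rng = np.random.default_rng(42)
A = np.array([[0, 1, 0, 0], [1, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]], float)
X = rng.normal(size=(4, 3))
H1 = gcn_layer(A, X, rng.normal(size=(3, 8)))
H2 = gcn_layer(A, H1, rng.normal(size=(8, 2)))            # 2-dimensional node embeddings
```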
Although spectral CGNNs are effective in applying convolution filters on the spectral domain, they have several limitations as follows:
Computational complexity: The spectral decomposition of the Laplacian matrix into matrices containing eigenvectors is time-consuming. During the training process, the products of the $U$, $g_\theta(\Lambda)$, and $U^{T}$ matrices also increase the training time.
Difficulties in handling large-scale graphs: The number of parameters of the kernels scales with the number of nodes in the graph; therefore, spectral models are not suitable for large-scale graphs.
Difficulties in handling graph dynamicity: To apply convolution filters to graphs and train the model, the graph data must be transformed into the spectral domain in the form of a Laplacian matrix. Therefore, when the graph data change, as in dynamic graphs, the model cannot capture the changes.
Motivated by the limitations of spectral domain-based CGNNs, spatial models apply convolution operators directly to the graph domain and learn node embeddings effectively. Recently, various spatial CGNNs have been proposed, showing remarkable results in handling different graph structures compared to spectral models [52,95]. Based on the aggregation mechanism and on how the convolution operators are applied, we divide CGNN models into the following main groups: (i) aggregation mechanism improvement, (ii) training efficiency improvement, (iii) attention-based models, and (iv) autoencoder-CGNN models. Table 11 and Table 12 present a summary of spatial CGNN models for all types of graphs, ranging from homogeneous to heterogeneous graphs.
Gilmer et al. [222] presented the MPNN (Message-Passing Neural Network) model to employ the concept of messages passed between nodes in a graph. Given a pair of nodes $(u, v)$, a message from $u$ to $v$ is calculated by a message function $M_l$. During the message-passing phase, the hidden state of a node $v$ at layer $l$ is calculated from the messages passed by its neighbors, which could be defined as:
\[ h_v^{(l+1)} = \sigma\!\left( h_v^{(l)}, \sum_{u \in \mathcal{N}(v)} M_l\!\left( h_v^{(l)}, h_u^{(l)}, e_{uv} \right) \right), \]
where $M_l$ denotes the message function at layer $l$, which could be an MLP, $\sigma$ is an update (activation) function, and $\mathcal{N}(v)$ denotes the set of neighbors of node $v$.
Most previous graph embedding models work in a transductive setting, which cannot handle unseen nodes. In 2017, Hamilton et al. [22] introduced the GraphSAGE model (SAmple and aggreGatE) to generate inductive node embeddings in an unsupervised manner. The hidden state at layer $l$ of a node $v$ could be defined as:
\[ h_v^{(l)} = \sigma\!\left( W^{(l)} \cdot \mathrm{CONCAT}\!\left( h_v^{(l-1)}, \mathrm{AGG}_l\!\left( \left\{ h_u^{(l-1)} : u \in \mathcal{N}(v) \right\} \right) \right) \right), \]
where $\mathcal{N}(v)$ denotes the set of neighbors of node $v$ and $h_u^{(l-1)}$ is the hidden state of node $u$ at layer $l-1$. The function $\mathrm{AGG}_l$ is a differentiable aggregator function. There are three aggregators (Mean, LSTM, and Pooling) to aggregate information from neighboring nodes, and the nodes are separated into mini-batches. Algorithm 1 presents the algorithm of the GraphSAGE model.
Algorithm 1: GraphSAGE algorithm. The model first takes the node features as inputs. For each layer, the model aggregates the information from the neighbors of each node and then updates its hidden state.
Input: the graph $G = (V, E)$; the input features $x_v$ of each node $v \in V$; the depth $L$ of hidden layers; differentiable aggregator functions $\mathrm{AGG}_l$, $l \in \{1, \ldots, L\}$; the set of neighbors $\mathcal{N}(v)$ of each node $v$. Output: vector representations $z_v$ for all $v \in V$.
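A mean-aggregator variant of this procedure can be sketched as follows. This is illustrative only; the published model additionally normalizes embeddings and samples a fixed number of neighbors per node:

```python
import numpy as np

def graphsage_mean_layer(neighbors, H, W_self, W_neigh):
    """One GraphSAGE-style layer with the mean aggregator:
    h_v = ReLU(W_self h_v + W_neigh mean({h_u : u in N(v)}))."""
    out = np.zeros((H.shape[0], W_self.shape[0]))
    for v, nbrs in neighbors.items():
        agg = H[nbrs].mean(axis=0) if nbrs else np.zeros(H.shape[1])
        out[v] = np.maximum(W_self @ H[v] + W_neigh @ agg, 0.0)
    return out

# toy usage: adjacency lists for a 4-node graph, 3-dimensional input features
neighbors = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}
rng = np.random.default_rng(1)
H = rng.normal(size=(4, 3))
H = graphsage_mean_layer(neighbors, H, rng.normal(size=(8, 3)), rng.normal(size=(8, 3)))
```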
Lo et al. [231] applied the GraphSAGE model to detect attackers in computer network systems with a model named E-GraphSAGE. The main difference between the two models is that E-GraphSAGE uses the edges of the graph as the aggregated information for learning embeddings; the edge information between two nodes is the data flow between a source IP address (client) and a destination IP address (server).
By evaluating the contribution of neighboring nodes to target nodes, Tran et al. [229] proposed convolutional filters with different parameters. The key idea of this model is to rank the contributions of neighbors at different distances from the target node using shortest-path sampling. Formally, the hidden state of a node at layer $l$ is the concatenation ($\|$) of multiple graph convolutional filters, one for each $r$-hop (shortest-path) distance $j$. Ying et al. [225] considered random-walk sampling as the aggregation information that can be fed into the hidden state of CGNNs. To collect the neighbors of node $v$, the idea of the model is to gather a set of random-walk paths starting from node $v$ and then select the top $k$ nodes with the highest visit probability.
For hypergraphs, several GNN models have been proposed to learn the high-order graph structure [27,44,234]. Feng et al. [27] proposed the HGNN (Hypergraph Neural Network) model to learn the hypergraph structure based on spectral convolution. They first learn each hyperedge feature by aggregating all the nodes connected by the hyperedge. Then, each node's attribute is updated with a vector embedding based on all the hyperedges connected to the node. By contrast, Yadati [234] presented the HyperGCN model to learn hypergraphs based on spectral theory. Since each hyperedge can connect several nodes, the idea of this model is to filter out far-apart nodes. Therefore, they first adopt the Laplacian operator to learn node embeddings and filter the edges that connect two nodes at a large distance; GCNs can then be used to learn the node embeddings.
One of the limitations of GNN models is that they consider the set of neighbors as permutation invariant, which means the models cannot distinguish between certain isomorphic subgraphs. Considering the message-passing set from the neighbors of a node as permutation invariant, several works aimed to improve the message-passing mechanism with simple aggregation functions. Xu et al. [24] proposed the GIN (Graph Isomorphism Network) model, which aims to learn vector embeddings that are as powerful as the 1-dimensional WL isomorphism test. Formally, the hidden state of node $v$ at layer $l$ could be defined as:
\[ h_v^{(l)} = \mathrm{MLP}^{(l)}\!\left( \left( 1 + \epsilon^{(l)} \right) h_v^{(l-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(l-1)} \right), \]
where MLP denotes a multilayer perceptron and $\epsilon^{(l)}$ is a parameter that can be learnable or a fixed scalar. Another problem of GNNs is the over-smoothing problem that appears when stacking more layers in the models. DeeperGCN [98] was a similar approach that aims to solve the over-smoothing problem via generalized aggregations and skip connections. The DeeperGCN model defined a simple normalized message-passing, which could be defined as:
\[ m_{vu}^{(l)} = \mathrm{ReLU}\!\left( h_u^{(l)} + \mathbb{1}(e_{vu}) \cdot e_{vu}^{(l)} \right) + \epsilon, \]
where $m_{vu}^{(l)}$ denotes the message passed from node $u$ to node $v$, $e_{vu}^{(l)}$ is the feature of edge $(v, u)$, and $\mathbb{1}(\cdot)$ is an indicator function that equals 1 if the two nodes $v$ and $u$ are connected.
Le et al. [233] presented the PHC-GNN model, which improves the message-passing compared to the GIN model. The main difference between the PHC-GNN and GIN models is that PHC-GNN adds edge embeddings to the messages and applies a residual connection after the message-passing step to form the hidden state of a node $v$ at layer $l$.
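The sum-aggregation update that GIN and its descendants rely on can be sketched as follows (a minimal illustration with a two-layer MLP and a fixed epsilon; not the authors' code):

```python
import numpy as np

def gin_layer(A, H, W1, W2, eps=0.0):
    """GIN update: h_v = MLP((1 + eps) * h_v + sum of neighbor states)."""
    agg = (1.0 + eps) * H + A @ H                 # (1 + eps) self term + sum over neighbors
    return np.maximum(agg @ W1, 0.0) @ W2         # two-layer MLP with ReLU in between

# toy usage on a 4-node path graph
rng = np.random.default_rng(7)
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
H = rng.normal(size=(4, 3))
H = gin_layer(A, H, rng.normal(size=(3, 16)), rng.normal(size=(16, 16)))
```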
A few studies focused on building pre-trained GNN models, which can be used to initialize other tasks [209,246,247]. These pre-trained models are also beneficial when few node labels are available. For example, the main objective of the GPT-GNN model [247] is to reconstruct the graph structure and the node features by masking attributes and edges. Given a permutation order, the model maximizes the likelihood of the node attributes based on the observed edges and then generates the remaining masked edges.
Since learning node embeddings over whole graphs is time-consuming, several approaches apply standard clustering algorithms (e.g., METIS, K-means) to group nodes into different subgraphs and then use GCNs to learn node embeddings. Chiang et al. [95] proposed the Cluster-GCN model to increase the computational efficiency of training CGNNs. Given a graph $G$, the model first separates $G$ into $c$ clusters using the METIS clustering algorithm [248]. The model then aggregates information within each cluster. The GraphSAINT model [53] has a structure similar to Cluster-GCN and the model of [249]. GraphSAINT aggregates neighbor information and samples nodes directly on a subgraph at each hidden layer; the probability of keeping a connection from a node $u$ at one layer to a node $v$ at the next layer is based on the node degrees. Figure 18 presents an example of the aggregation strategy of the GraphSAINT model. By contrast, Jiang et al. [54] presented the hi-GCN (hierarchical GCN) model, which can effectively model brain networks with two levels of GCNs. Since individual brain networks have multiple functions, the first GCN level aims to capture the graph structure, while the second GCN level provides the correlation between network structure and contextual information to improve the semantic information. The work of Huang et al. [250] is similar to the GraphSAGE and FastGCN models. However, instead of using node-wise sampling at each hidden layer, the model provides two strategies: a layer-wise sampling strategy and a skip-connection strategy that directly shares aggregation information between hidden layers and improves the message-passing. The main idea of the skip-connection strategy is to reuse information from previous layers that would usually be forgotten in dense graphs.
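The cluster-then-convolve idea can be sketched as follows: partition the nodes and run GCN layers inside each induced subgraph. K-means on the node features is used here only as a stand-in for METIS, which is an assumption made for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

def gcn_layer(A, H, W):
    """Symmetrically normalized GCN layer with self-loops and ReLU."""
    A_tilde = A + np.eye(len(A))
    d = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return np.maximum(((A_tilde * d[:, None]) * d[None, :]) @ H @ W, 0.0)

def cluster_gcn_embeddings(A, X, W, n_clusters=2):
    """Partition nodes into clusters and apply a GCN layer per induced subgraph."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    Z = np.zeros((len(A), W.shape[1]))
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        Z[idx] = gcn_layer(A[np.ix_(idx, idx)], X[idx], W)  # convolution restricted to the cluster
    return Z

# toy usage on a random undirected 8-node graph
rng = np.random.default_rng(3)
A = np.triu((rng.random((8, 8)) < 0.3).astype(float), 1)
A = A + A.T
X = rng.normal(size=(8, 5))
Z = cluster_gcn_embeddings(A, X, rng.normal(size=(5, 4)))
```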
One of the limitations of CGNNs is that, at each hidden layer, the model updates the states of all neighboring nodes. This can lead to slow training and updating because of inactive nodes. Some models aimed to enhance CGNNs by improving the sampling strategy [52,223,224]. For example, Chen et al. [52] presented the FastGCN model to improve the training time and performance compared with CGNNs. One of the problems of existing GNN models is scalability, since the expanding neighborhood increases the computational complexity. FastGCN learns a neighborhood sampling at each convolution layer that focuses mainly on essential neighbor nodes; therefore, the model only needs to process the essential neighbor nodes in every batch.
By considering each hidden layer as an embedding layer of independent nodes, FastGCN subsamples the receptive field at each hidden layer. For each layer $l$, it chooses $t_l$ i.i.d. nodes $u_1^{(l)}, \ldots, u_{t_l}^{(l)}$ according to a sampling distribution $q$ and computes the hidden state, which could be defined as:
\[ h^{(l+1)}(v) = \sigma\!\left( \frac{1}{t_l} \sum_{j=1}^{t_l} \frac{\hat{A}\!\left(v, u_j^{(l)}\right) h^{(l)}\!\left(u_j^{(l)}\right) W^{(l)}}{q\!\left(u_j^{(l)}\right)} \right), \]
where $\hat{A}$ denotes the kernel and $\sigma$ denotes the activation function. Wu et al. [214] introduced the SGC (Simple Graph Convolution) model, which improves on first-order proximity in the GCN model. The model removes the nonlinear activation functions at each hidden layer; instead, a final SoftMax function at the last layer produces probabilistic outputs. Chen et al. [224] presented a model to improve the updating of node states. Instead of collecting all the information from the neighbors of each node, the model keeps track of the activation history of node states to reduce the receptive scope; it maintains a history state $\bar{h}_v^{(l)}$ for each state $h_v^{(l)}$ of each node $v$.
Similar to [250], Chen et al. [28] presented the GCNII model, which uses an initial residual connection and identity mapping to overcome the over-smoothing problem while maintaining the structural identity of target nodes. They introduced an initial residual connection to the first convolution layer, $H^{(0)}$, and an identity mapping $I_n$. Mathematically, the hidden state at layer $l$ could be defined as:
\[ H^{(l+1)} = \sigma\!\left( \left( (1 - \alpha_l)\, \tilde{P} H^{(l)} + \alpha_l H^{(0)} \right) \left( (1 - \beta_l) I_n + \beta_l W^{(l)} \right) \right), \]
where $\tilde{P} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ denotes the convolutional filter with normalization. The two parameters $\alpha_l$ and $\beta_l$ are added for the purpose of tackling the over-smoothing problem.
Several models aim to maximize the mutual information between node representations and the graph structure by matching a prior distribution. A few studies have adapted the idea of Deep InfoMax [227] from image processing to learn graph embeddings [26,242]. For example, Velickovic et al. [26] introduced the Deep Graph Infomax (DGI) model, which adopts a GCN as the encoder. The main idea of the mutual-information objective is that the model trains the GCN encoder to maximize the agreement between local and global graph structure in real graphs and to minimize it in corrupted (fake) graphs. The DGI model has four components (a minimal sketch is given after this list), including:
A corruption function $\mathcal{C}$: this function generates negative examples from an original graph by changing parts of its structure and properties.
An encoder $\mathcal{E}$: the goal of this function is to encode nodes into the vector space, so that $\mathcal{E}(X, A)$ gives the vector embeddings of all nodes in the graph.
A readout function $\mathcal{R}$: this function maps all node embeddings into a single summary vector (supernode).
A discriminator $\mathcal{D}$: it compares node embeddings against the global summary vector of the graph by assigning a score between 0 and 1 to each node embedding.
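A compact sketch of these four components is given below, reusing a GCN layer as the encoder, feature shuffling as the corruption, and a bilinear discriminator; all of these concrete choices are illustrative assumptions rather than the published implementation:

```python
import numpy as np

rng = np.random.default_rng(5)

def gcn_encoder(A, X, W):
    """Encoder E: one normalized GCN layer producing node embeddings."""
    A_tilde = A + np.eye(len(A))
    d = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return np.maximum(((A_tilde * d[:, None]) * d[None, :]) @ X @ W, 0.0)

def corruption(X):
    """Corruption C: shuffle rows of the feature matrix to build a negative graph."""
    return X[rng.permutation(len(X))]

def readout(H):
    """Readout R: mean of node embeddings followed by a sigmoid (graph summary vector)."""
    return 1.0 / (1.0 + np.exp(-H.mean(axis=0)))

def discriminator(H, s, W_d):
    """Discriminator D: bilinear score in (0, 1) between each node embedding and the summary."""
    return 1.0 / (1.0 + np.exp(-(H @ W_d @ s)))

# positive scores should be driven up, scores on the corrupted graph driven down
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], float)
X = rng.normal(size=(3, 4))
W, W_d = rng.normal(size=(4, 8)), rng.normal(size=(8, 8))
H_pos, H_neg = gcn_encoder(A, X, W), gcn_encoder(A, corruption(X), W)
s = readout(H_pos)
pos_scores, neg_scores = discriminator(H_pos, s, W_d), discriminator(H_neg, s, W_d)
```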
One of the limitations of the DGI model is that it only works with attributed homogeneous graphs. Several studies have extended DGI to heterogeneous graphs with attention and semantic mechanisms [242,243]. Similar to the DGI model, Park et al. [243] presented the DMGI model (Deep Multiplex Graph Infomax) for attributed multiplex graphs. For a specific relation type $r$, the hidden state of a node is computed by a relation-specific GCN encoder with trainable weights and an activation function; as in DGI, a readout function summarizes the node embeddings of each relation, and a discriminator with a trainable scoring matrix contrasts node embeddings against this summary. The attention mechanism is adopted from [251] to capture the importance of each node type when generating the vector embeddings at the last layer. Similarly, Jing et al. [242] proposed the HDMI (High-order Deep Multiplex Infomax) model, which is conceptually similar to DGI; HDMI optimizes high-order mutual information to process different relation types.
Increasing the number of hidden layers to aggregate more structural information can lead to the over-smoothing problem [97,252]. Previous models treated the weights of messages as playing the same role when aggregating information from the neighbors of a node. In recent years, various studies have focused on attention mechanisms to extract the valuable information from the neighborhoods of nodes [19,253,254]. Table 13 presents a summary of attentive GNN models.
Velickovic et al. [19] presented the GAT (graph attention network) model, one of the first models to apply an attention mechanism to graph representation learning. The purpose of the attention mechanism is to compute a weighted message for each neighbor node during the message-passing of GNNs. Formally, GAT proceeds in three steps, which can be explained as follows (a minimal sketch is given after these steps):
Attention score: At layer $l$, the model takes a set of node features as inputs, $\{h_1^{(l-1)}, \ldots, h_N^{(l-1)}\}$, and produces the outputs $\{h_1^{(l)}, \ldots, h_N^{(l)}\}$. An attention score measuring the importance of a neighbor node $u$ to the target node $v$ could be computed as:
\[ e_{vu} = \mathrm{LeakyReLU}\!\left( a^{T} \left[ W h_v \,\|\, W h_u \right] \right), \]
where $a$ and $W$ are trainable weights and $\|$ denotes the concatenation.
Normalization: The score is then normalized to be comparable across all neighbors of node $v$ using the SoftMax function:
\[ \alpha_{vu} = \frac{\exp\!\left(e_{vu}\right)}{\sum_{k \in \mathcal{N}(v)} \exp\!\left(e_{vk}\right)}. \]
Aggregation: After normalization, the embedding of node $v$ could be computed by aggregating the states of its neighbor nodes:
\[ h_v^{(l)} = \sigma\!\left( \sum_{u \in \mathcal{N}(v)} \alpha_{vu} W h_u^{(l-1)} \right). \]
Furthermore, the GAT model uses multi-head attention to enhance the model's power and stabilize learning. Since the GAT model takes the attention coefficients between nodes as inputs and ranks attention unconditionally, its capacity to summarize the global graph structure is limited.
In recent years, various models have been proposed based on the GAT idea. Most of them aim to improve the self-attention mechanism's ability to capture more of the global graph structure [253,254]. Zhang et al. [253] presented the GaAN (Gated Attention Networks) model, which controls the importance of neighbor nodes by regulating the amount of attention assigned to each head. The main idea of GaAN is to measure the different weights carried by the different heads of a target node: the gated attention aggregator applies a simple linear transformation to compute a gate value for the $m$-th head of each node, which scales that head's contribution.
To capture a coarser graph structure, Kim and Oh [258] considered attention based on the importance of nodes to each other, where the importance is based on whether two nodes are directly connected. By defining different attention from target nodes to context nodes, the model can solve the permutation-equivariance issue and capture more of the global graph structure. Based on this idea, they proposed the SuperGAT model with two variants, scaled dot-product (SD) and mixed GO and DP (MX), to enhance the attention of the original model. The attention scores $e_{vu}$ between two nodes $v$ and $u$ can be defined as follows:
\[ e_{vu}^{\mathrm{SD}} = \frac{(W h_v)^{T} (W h_u)}{\sqrt{d}}, \qquad e_{vu}^{\mathrm{MX}} = e_{vu}^{\mathrm{GAT}} \cdot \sigma\!\left( (W h_v)^{T} (W h_u) \right), \]
where $d$ denotes the number of features at layer $l$. The two attention scores softly down-weight the nodes that are not connected to the target node $v$.
Wang et al. [259] introduced a margin-based constraint to control the over-fitting and over-smoothing problems. By constraining the attention weight that each neighbor assigns to target nodes across all nodes in the graph, the proposed model can reduce the influence of the smoothing problem and drop unimportant edges.
Extending the GAT model to capture more global structural information using attention, Haonan et al. [256] introduced the GraphStar model, which uses a virtual node (a virtual "star") to maintain global information at each hidden layer. The main difference between GraphStar and GAT is that GraphStar introduces three different types of relations: node-to-node (self-attention), node-to-star (global attention), and node-to-neighbors (local attention). Using these relation types, GraphStar can mitigate the over-smoothing problem when stacking more neural network layers. Formally, the attention coefficients of the $m$-th head of a node are computed separately for the node-to-node, node-to-star, and node-to-neighbor relations and then combined.
One of the problems of the GAT model is that it only provides static attention, which concentrates the high attention weights on a few neighbor nodes; as a result, GAT cannot learn universal attention for all nodes in a graph. Motivated by this limitation, Brody et al. [58] proposed the GATv2 model, which uses dynamic attention to learn the graph structure more effectively from a target node $v$ to a neighbor node $u$. The attention score is computed with a slight modification:
\[ e_{vu} = a^{T}\, \mathrm{LeakyReLU}\!\left( W \left[ h_v \,\|\, h_u \right] \right). \]
Similar to Wang et al. [259], Zhang et al. [260] presented the ADSF (ADaptive Structural Fingerprint) model, which can moderate the attention weights from each neighbor of the target node. However, the difference between the model of Wang et al. [259] and the ADSF model is that ADSF introduces two attention scores for each node, which capture the graph structure and the context, respectively.
Besides the GAT-based models applied to homogeneous graphs, several models have applied attention mechanisms to heterogeneous and knowledge graphs [25,261,262]. For example, Wang et al. [25] presented a hierarchical attention model to learn the importance of nodes in graphs. One advantage of this model is that it handles heterogeneous graphs with different types of nodes and edges by deploying attention at both the local and global levels; the model proposes two levels of attention, node-level and semantic-level. The node-level attention captures the attention between two nodes within a meta-path. Given a node pair $(v, u)$ in a meta-path $P$, the attention score under $P$ could be defined as:
\[ e_{vu}^{P} = \mathrm{att}_{\mathrm{node}}\!\left( h'_v, h'_u; P \right), \]
where $h'_v$ and $h'_u$ denote the projected features of nodes $v$ and $u$, obtained from their original features via a projection function, and $\mathrm{att}_{\mathrm{node}}$ is a function that scores the node-level attention. To make the coefficients comparable across the other nodes in a meta-path $P$, which contains a set of neighbors $\mathcal{N}^{P}(v)$ of a target node $v$, the attention score $\alpha_{vu}^{P}$ and the node embedding with $k$ multi-head attention can be defined as:
\[ \alpha_{vu}^{P} = \frac{\exp\!\left(e_{vu}^{P}\right)}{\sum_{j \in \mathcal{N}^{P}(v)} \exp\!\left(e_{vj}^{P}\right)}, \qquad z_v^{P} = \Big\Vert_{i=1}^{k} \sigma\!\left( \sum_{u \in \mathcal{N}^{P}(v)} \alpha_{vu}^{P} h'_u \right). \]
The score $\alpha_{vu}^{P}$ indicates how much the set of neighbors under meta-path $P$ contributes to node $v$. Furthermore, the semantic-level aggregation scores the importance of the meta-paths: given an attention coefficient for each meta-path, the importance of meta-path $P$ is computed and then normalized with a SoftMax function across all meta-paths.
In addition to applying CGNNs to homogeneous graphs, several studies have applied CGNNs to heterogeneous and knowledge graphs [224,241,243,263,264,266]. Since heterogeneous graphs have different types of edges and nodes, the main problem when applying CGNN models is the aggregation of messages over different edge types. Schlichtkrull et al. [241] introduced the R-GCN model (Relational Graph Convolutional Networks) to model relational entities in knowledge graphs. R-GCN was the first model applied to learn node embeddings in heterogeneous graphs for several downstream tasks, such as link prediction and node classification; it also uses parameter sharing to learn the node embeddings efficiently. Formally, for a node $v$ under relation $r$, the hidden state at layer $l$ could be defined as:
\[ h_v^{(l+1)} = \sigma\!\left( \sum_{r \in \mathcal{R}} \sum_{u \in \mathcal{N}_r(v)} \frac{1}{c_{v,r}} W_r^{(l)} h_u^{(l)} + W_0^{(l)} h_v^{(l)} \right), \]
where $c_{v,r}$ is a normalization constant and $\mathcal{N}_r(v)$ denotes the set of neighbors of node $v$ under relation $r$. Wang et al. [265] introduced the HANE (Heterogeneous Attributed Network Embedding) model to learn embeddings for heterogeneous graphs. The key idea of the HANE model is to measure attention scores for the different types of nodes in heterogeneous graphs. Formally, given a node $v$, the attention coefficients, the attention scores, and the hidden state at layer $l$ are computed over the set of neighbors of $v$, the features of $v$, and a weight matrix for each node type.
Several studies have focused on applying CGNNs to recommendation systems [228,267,268,269]. For instance, Wang et al. [267] presented the KGCN (Knowledge Graph Convolutional Network) model to extract user preferences in recommendation systems. Since most existing models suffer from the cold-start problem and the sparsity of user–item interactions, the proposed model captures users' side information (attributes) on knowledge graphs, and the users' preferences are captured by a multilayer receptive field in the GCN. Formally, given a user $u$, an item $v$, and the set of items connected to $u$, the user–item interaction score is computed from an inner product between the user and relation representations together with the representation $e$ of item $v$.
Since the power of the autoencoder architecture is to learn low-dimensional node representations in an unsupervised manner, several studies have integrated convolutional GNNs into the autoencoder architecture to leverage this power [72,270]. Table 14 summarizes graph convolutional autoencoder models for static and dynamic graphs.
Most graph autoencoder models are designed based on the VAE (variational autoencoder) architecture to learn embeddings [274]. Kipf and Welling [72] introduced the GAE model, one of the first studies applying the autoencoder architecture to graph representation learning. The GAE model [72] aims to reconstruct the adjacency matrix $A$ and feature matrix $X$ of the original graph by adopting CGNNs as the encoder and an inner product as the decoder. Figure 19 presents the details of the GAE model. Formally, the output embedding $Z$ and the reconstruction of the adjacency matrix input could be defined as:
\[ Z = \mathrm{GCN}(X, A), \qquad \hat{A} = \sigma\!\left( Z Z^{T} \right), \]
where the $\mathrm{GCN}$ function could be defined by Equation (65) and $\sigma$ is an activation function. The model aims to reconstruct the adjacency matrix $A$ through the inner-product decoder:
\[ \hat{A}_{ij} = \sigma\!\left( z_i^{T} z_j \right), \]
where $\sigma$ is the sigmoid function and $\hat{A}_{ij}$ is the value at row $i$ and column $j$ of the reconstructed adjacency matrix $A$. In the training process, the (variational) model tries to minimize the following loss function by gradient descent:
\[ \mathcal{L} = -\mathbb{E}_{q(Z \mid X, A)}\!\left[ \log p(A \mid Z) \right] + \mathrm{KL}\!\left[ q(Z \mid X, A) \,\|\, p(Z) \right], \]
where $\mathrm{KL}[q \,\|\, p]$ is the Kullback–Leibler divergence between the two distributions $q$ and $p$.
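A non-variational GAE forward pass can be sketched in a few lines (GCN encoder, inner-product decoder; random weights and no training loop, for illustration only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gae_forward(A, X, W1, W2):
    """GAE: two-layer GCN encoder producing Z, inner-product decoder sigma(Z Z^T)."""
    A_tilde = A + np.eye(len(A))
    d = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    A_hat = (A_tilde * d[:, None]) * d[None, :]
    Z = A_hat @ np.maximum(A_hat @ X @ W1, 0.0) @ W2       # node embeddings
    return Z, sigmoid(Z @ Z.T)                             # reconstructed adjacency probabilities

# toy usage: reconstruction probabilities for a 4-node graph
rng = np.random.default_rng(11)
A = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]], float)
X = rng.normal(size=(4, 5))
Z, A_rec = gae_forward(A, X, rng.normal(size=(5, 16)), rng.normal(size=(16, 8)))
```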
Several models have attempted to incorporate the autoencoder architecture into GNN models to reconstruct graphs. For example, the MGAE model [270] combined the message-passing mechanism from GNNs with the GAE architecture for graph clustering. The primary purpose of MGAE is to capture information about the node features by randomly corrupting parts of the feature matrix with noise when training the GAE model.
GNNs have shown outstanding performance in learning complex graph structures that shallow models could not handle [245,275,276]. The main advantages of deep neural network models are:
Parameter sharing: Deep neural network models share weights during the training phase, which reduces training time and the number of parameters while increasing the performance of the models. In addition, the parameter-sharing mechanism allows the models to learn multiple tasks.
Inductive learning: The outstanding advantage of deep models over shallow models is that deep models support inductive learning. This makes deep-learning models capable of generalizing to unseen nodes, which gives them practical applicability.
However, although CGNNs are considered the most advantageous in the line of GNNs, they still have limitations in graph representation learning:
Over-smoothing problem: When capturing the graph structure and entity relationships, CGNNs rely on an aggregation mechanism that gathers information from neighboring nodes into target nodes. Capturing higher-order graph structure therefore requires stacking multiple graph convolutional layers, but increasing the depth of the convolution layers can lead to the over-smoothing problem [252]. To overcome this drawback, models based on the transformer architecture, using self-attention, have shown several improvements over CGNNs.
Ability on disassortative graphs: Disassortative graphs are graphs in which nodes with different labels tend to be linked together. The aggregation mechanism in GNNs, however, aggregates the features of all neighboring nodes even when they have different labels. Therefore, the aggregation mechanism is a limitation and a challenge of GNNs for classification tasks on disassortative graphs.
3.4.4. Graph Transformer Models
Transformers [277] have achieved tremendous success in many natural language processing [278,279] and image processing tasks [280,281]. For documents, transformer models tokenize sentences into sets of tokens and represent them as one-hot encodings. For image processing, transformer models adopt image patches and use two-dimensional encodings to tokenize the image data. However, the tokenization of graph entities is non-trivial since graphs have irregular structures and unordered nodes. Therefore, whether graph transformer models are suitable for graph representation learning is still an open question.
The transformer architecture consists of two main parts: a self-attention module and a position-wise feedforward network. Mathematically, the input of the self-attention module at layer $l$ can be written as $H^{(l)} = [h_1^{(l)}, \ldots, h_n^{(l)}]^{T}$, where $h_i^{(l)}$ denotes the hidden state of the position of node $i$. Then, the self-attention could be formulated as:
\[ Q = H^{(l)} W_Q, \quad K = H^{(l)} W_K, \quad V = H^{(l)} W_V, \qquad \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{T}}{\sqrt{d}} \right) V, \]
where $Q$, $K$, and $V$ depict the query matrix, key matrix, and value matrix, respectively, and $d$ is the hidden embedding dimension. The matrix $S = \frac{Q K^{T}}{\sqrt{d}}$ measures the similarity between the queries and keys.
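A minimal single-head self-attention over node states (random projection matrices, row-wise softmax) might look like this:

```python
import numpy as np

def self_attention(H, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over node states H (n x d_in)."""
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    S = Q @ K.T / np.sqrt(K.shape[1])             # similarity between queries and keys
    S = S - S.max(axis=1, keepdims=True)          # numerical stability before softmax
    A = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)
    return A @ V                                  # each node attends to every other node

# toy usage: 5 nodes with 8-dimensional states, 16-dimensional heads
rng = np.random.default_rng(13)
H = rng.normal(size=(5, 8))
out = self_attention(H, rng.normal(size=(8, 16)), rng.normal(size=(8, 16)), rng.normal(size=(8, 16)))
```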
The architecture of graph transformer models differs from that of GNNs. GNNs use message-passing to aggregate information from neighbor nodes into target nodes, whereas graph transformer models use a self-attention mechanism to capture the context of target nodes in graphs, which usually reflects the similarity between nodes. The self-attention mechanism helps control how much information is aggregated between two nodes in a specific context. In addition, the models use multi-head self-attention, which allows several information channels to flow into the target nodes. Transformer models then learn the correct aggregation patterns during training without a pre-defined graph structure sampling.
Table 15 lists a summary of graph transformer models.
In this section, we divide graph transformer models for graph representation learning into three main groups based on the strategy of applying graph transformer models.
Structural encoding-based graph transformers: These models focus on various positional encoding schemes to capture absolute and relative information about entity relationships and graph structure. Structural encoding strategies are mainly suitable for tree-like graphs, since the models must capture the hierarchical relations between target nodes and their parents as well as the interactions with other nodes at the same level.
GNNs as an auxiliary module: GNNs provide a powerful mechanism for aggregating local structural information. Therefore, several studies integrate message-passing and GNN modules with a graph transformer encoder as auxiliary components.
Edge channel-based attention: The graph structure can be viewed as the combination of node and edge features and the ordered/unordered connections between them. From this perspective, GNNs are not needed as an auxiliary module. Recently, several models have been proposed that capture the graph structure in depth while applying the graph transformer architecture based on the self-attention mechanism.
Several models have tried to apply vanilla transformers to tree-like graphs to capture node positions [64,65,277,288]. Preserving the tree structure means preserving a node's relative and absolute structural positions in the tree. The absolute structural position describes the positional relationship of the current node to its parent (root) nodes, whereas the relative structural position describes the positional relationship of the current node to its neighbors.
Shiv and Quirk [64] proposed a positional encoding (PE) strategy for programming-language translation tasks. The significant advantage of tree-based models is that they can explore nonlinear dependencies. By custom positional encodings of the nodes in the graph in a hierarchical manner, the model strengthens the transformer's ability to capture the relationship between node pairs in the tree. The key idea is to represent programming-language data as a binary tree and encode target nodes based on the location of their parent nodes and the relationship with neighboring nodes at the same level; specifically, binary matrices encode the relationship of target nodes with their parents and neighbors.
Similarly, Wang et al. [65] introduced structural position representations for tree-like graphs. However, they combine sequential and structural positional encodings to enrich the contextual and structural language data. The absolute and relative position encodings of each word $w_i$ could be defined as:
\[ \mathrm{PE}(w_i, 2k) = \sin\!\left( \frac{\mathrm{Abs}(w_i)}{10000^{2k/d}} \right), \qquad \mathrm{PE}(w_i, 2k+1) = \cos\!\left( \frac{\mathrm{Abs}(w_i)}{10000^{2k/d}} \right), \]
where Abs is the absolute position of the word in the sentence, $d$ denotes the hidden size of the $K$ and $Q$ matrices, the sine/cosine functions are applied to the even/odd dimensions, respectively, and a matrix $R$ presents the relative position representation.
The sentences are also represented as a dependency tree, which captures the structural relations between words. For the structural position encoding, the absolute and relative structural positions of a word are encoded from the distance between the root node and the target node; a linear function then combines the sequential PE and the structural PE as inputs to the transformer encoder.
To capture more global structural information in tree-like graphs, Cai and Lam [282] also proposed an absolute position encoding to capture the relation between target and root nodes. For the relative positional encoding, they use an attention score to measure the relationship between nodes on the same shortest path sampled from the graph. The power of using the shortest path is that it captures both the hierarchical proximity and the global structure of the graph. Given two nodes $v_i$ and $v_j$, the attention score between them can be calculated as:
\[ s_{ij} = \frac{\left( W_Q x_i \right)^{T} \left( W_K x_j \right)}{\sqrt{d}}, \]
where $W_Q$ and $W_K$ are trainable projection matrices and $x_i$ and $x_j$ depict the representations of nodes $v_i$ and $v_j$, respectively. To define the relationship $r_{i \to j}$ between two nodes $v_i$ and $v_j$, they adopt a bi-directional GRU model, which could be defined as follows:
\[ r_{i \to j} = \left[ \overrightarrow{\mathrm{GRU}}\!\left( sp_{i \to j} \right) ;\ \overleftarrow{\mathrm{GRU}}\!\left( sp_{i \to j} \right) \right], \]
where $sp_{i \to j}$ denotes the shortest path from node $v_i$ to node $v_j$, and $\overrightarrow{\mathrm{GRU}}$ and $\overleftarrow{\mathrm{GRU}}$ are the states of the forward and backward GRU, respectively.
Several models have tried to encode the positional information of nodes based on subgraph sampling [63,283]. Zhang et al. [63] proposed the Graph-Bert model, which samples the subgraph structure and feeds it through absolute and relative positional encoding layers. For subgraph sampling, they adopt a top-$k$ intimacy sampling strategy to extract subgraphs as inputs for the positional encoding layers. Four layers in the model are responsible for positional encoding. Since several strategies are implemented to capture the structural information in graphs, the advantage of Graph-Bert is that it can be trained with various types of subgraphs; in addition, Graph-Bert can be further fine-tuned for various downstream tasks. For each node $v_j$ in a sampled subgraph, the raw features are first embedded with a linear function. Three further layers then encode the positional information of the node: an absolute role embedding, computed from the WL code that labels node $v_j$ and can be calculated from the whole graph; a relative positional embedding, computed from the intimacy-based position metric used for sampling; and a hop-based relative distance embedding, computed from the distance metric between the node and the target node of the subgraph. All of these vector embeddings are aggregated together as the initial embedding vectors for the graph transformer encoder, which then updates the node representations following the standard transformer architecture described above.
Similar to Graph-Bert, Jeon et al. [283] tried to represent subgraphs for paper citation networks and capture the citation context of each paper. Each paper is considered a subgraph whose nodes are the referenced papers. To extract the citation context, they encode the order of the referenced papers in the target paper based on their position and order, and they use WL labels to capture the structural role of the references. The approach by Liu et al. [289] is conceptually similar to [283], with one significant difference: they proposed an MCN sampling strategy to capture the contextual neighbors from a subgraph, where the importance of the target node is based on its frequency of occurrence during sampling.
In several types of graphs, such as molecular networks, the edges carry features representing, for instance, the chemical bonds between atoms. Several models adopt Laplacian eigenvectors to encode the positional node information together with edge features [29,284]. Dwivedi and Bresson [29] proposed a positional encoding strategy that uses node positions and edge channels as inputs to the transformer model. The idea of this model is to use Laplacian eigenvectors to encode the positional information of nodes and then define edge channels to capture the global graph structure. The advantage of using the Laplacian eigenvectors is that they help the transformer model learn the proximity of neighbor nodes by maximizing the dot product between the $Q$ and $K$ matrices. They first pre-compute the Laplacian eigenvectors from the factorization of the Laplacian matrix:
\[ L = I_N - D^{-1/2} A D^{-1/2} = U^{T} \Lambda U, \]
where $L$ is the Laplacian matrix and $\Lambda$ and $U$ denote the eigenvalues and eigenvectors, respectively. The smallest non-trivial eigenvectors then provide the positional encoding of each node $v_i$. Given a node $v_i$ with feature $x_i$ and an edge feature $e_{ij}$, the first hidden layer adds the (linearly projected) positional encoding to the projected node feature, and the edge channel is initialized from the projected edge feature. The hidden state of node $v_i$ and the edge channel at layer $l$ are then updated with multi-head attention, where $Q$, $K$, $V$, and $E$ are learned projection matrices and $H$ denotes the number of attention heads.
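Computing Laplacian-eigenvector positional encodings can be sketched with a few lines of numpy, taking the $k$ eigenvectors with the smallest non-zero eigenvalues; the sign ambiguity of eigenvectors is usually resolved by random flipping during training, which is omitted here:

```python
import numpy as np

def laplacian_positional_encoding(A, k):
    """Return the k eigenvectors of the normalized Laplacian with the smallest
    non-zero eigenvalues; row i is the positional encoding of node i."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    L = np.eye(len(A)) - (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    eigval, eigvec = np.linalg.eigh(L)            # eigenvalues in ascending order
    return eigvec[:, 1:k + 1]                     # skip the trivial constant eigenvector

# toy usage: 2-dimensional positional encodings for a 5-node ring graph
A = np.zeros((5, 5))
for i in range(5):
    A[i, (i + 1) % 5] = A[(i + 1) % 5, i] = 1.0
pe = laplacian_positional_encoding(A, k=2)        # concatenated with node features downstream
```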
Similar to [29], Kreuzer et al. [284] also add edge channels, but for all pairs of nodes in the input graph. The critical difference between the two models is that Kreuzer et al. combine full-graph attention with sparse attention. One advantage of the model is that it can capture more global structural information, since self-attention is applied to the nodes of the sparse graph. Therefore, they use two different similarity matrices to guide the transformer in distinguishing local and global connections between nodes, re-defining the similarity between each pair of nodes with separate key, query, and edge projections for connected and disconnected pairs.
In some specific cases where graphs are sparse, small, or fully connected, the self-attention mechanism can lead to over-smoothing and structure loss, since it cannot learn the graph structure. To overcome these limitations, several models adopt GNNs as an auxiliary module to maintain the local structure around target nodes [99,100,285]. Rong et al. [100] proposed the Grover model, which integrates the message-passing mechanism into the transformer encoder for self-supervised tasks. They used a dynamic message-passing mechanism to adapt the number of hops to different graph structures, and a long-range residual connection to strengthen the awareness of local structures and avoid over-smoothing.
Several models have attempted to integrate GNNs on top of the multi-head attention sublayers to preserve the local structure between neighboring nodes [63,99,290]. For instance, Lin et al. [99] presented the Mesh Graphormer model to capture the global and local information of 3D human meshes. Unlike the Grover model, they insert a graph residual sublayer with two GCN layers on top of the multi-head attention layer to capture more local connections between connected node pairs. Hu et al. [285] integrated message-passing with a transformer model for heterogeneous graphs. Since heterogeneous graphs have different types of node and edge relations, they proposed an attention score that captures the importance of nodes. Given a source node $s$ and a target node $t$ connected by an edge $e$, the attention score of the $m$-th attention head is computed from linear key and query projections of the (typed) source and target nodes, an attentive trainable weight matrix for each edge type, and a scalar encoding the importance of each relationship.
Nguyen et al. [61] introduced the UGformer model, which uses a convolution layer on top of the transformer layer to work with sparse and small graphs. Applying only self-attention can result in structure loss in small-sized and sparse graphs, so a GNN layer is stacked after the output of the transformer encoder to maintain the local structure. One advantage of this GNN layer is that it helps the transformer model retain the local structure information, since all the nodes of the input graph are otherwise treated as fully connected.
In graphs, the nodes are arranged chaotically and without order compared to sentences in documents and pixels in images; they can live in a multidimensional space and interact with each other through their connections. Therefore, the structural information around a node can be extracted from the centrality of the node and its edges without the need for a positional encoding strategy. Recently, several studies have shown remarkable results in understanding graph structure in this way.
Several graph transformer models have been proposed to capture structural relations in the natural language processing area. Zhu et al. [62] presented a transformer model to encode abstract meaning representation (AMR) graphs into word sequences. This was the first transformer model aiming to integrate structural knowledge of AMR graphs; the model adds a sequence of edge features to the similarity matrix and the attention score to capture the graph structure. Formally, the attention score and the vector embedding could be defined as: