Article

Directed Knowledge Graph Embedding Using a Hybrid Architecture of Spatial and Spectral GNNs

1 College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin 150001, China
2 College of Intelligent Manufacturing, Chongqing Vocational and Technical College of Industry and Trade, Chongqing 401120, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(23), 3689; https://doi.org/10.3390/math12233689
Submission received: 25 October 2024 / Revised: 15 November 2024 / Accepted: 16 November 2024 / Published: 25 November 2024

Abstract

Knowledge graph embedding has been identified as an effective method for node-level classification tasks in directed graphs, the objective of which is to ensure that nodes of different categories are embedded as far apart as possible in the feature space. The directed graph is a general representation of unstructured knowledge graphs. However, existing methods lack the ability to simultaneously approximate high-order filters and globally attend to the task-related connectivity between distant nodes in directed graphs. To address this limitation, a directed spectral graph transformer (DSGT), a hybrid architecture model, is constructed by integrating the graph transformer and directed spectral graph convolution networks. The graph transformer leverages multi-head attention mechanisms to capture the global connectivity of the feature graph from different perspectives in the spatial domain, which bridges the gap between frequency responses and naturally couples the graph transformer with directed graph convolutional neural networks (GCNs). In addition to the inherent hard inductive bias of DSGT, we introduce directed node positional and structure-aware edge embeddings to provide topological prior knowledge. Extensive experiments demonstrate that DSGT exhibits state-of-the-art (SOTA) or competitive node-level representation capabilities across datasets of varying attributes and scales. Furthermore, the experimental results indicate that the homophily and degree correlation of the nodes significantly influence the classification performance of the model. This finding opens significant avenues for future research.

1. Introduction

A knowledge graph is composed of an underlying topological structure and nodes, which together materialize complex relationships between entities. The graph topology is a set of edges that describes the relative layout of the nodes, regardless of their specific spatial positions. Node representation learning abstracts node-level semantic knowledge from a node's context and local neighborhood structure by continuously aggregating information along the edges of its one-hop neighborhood. For instance, in social networks, nodes and edges represent users and interactions between users, respectively, providing crucial insights for analyzing user behavior and designing recommendation systems [1]. In e-commerce, users, items, and user–item interaction behaviors can be regarded as a graph network that is used to analyze a user's preference for an item and recommend interesting products to the user [2]. Efficient representation learning that respects the graph's topological structure is crucial for graph mining. Permutation invariance and scalability overcome the inductive biases imposed by canonical ordering and a finite number of grid units, respectively. Graph neural networks have garnered significant interest due to their ability to handle unstructured data of arbitrary size by constructing K-hop neighborhoods and performing scale-invariant information aggregation.
Spectral graph theory underlies classical graph convolution networks (GCNs; see Appendix B for abbreviations and corresponding full names) [3,4,5], relating spectral polynomial filters to spatial convolution for a self-adjoint graph shift operator (GSO). In practice, one-way interactions between entities are more common, such as a user collecting an item but not vice versa, resulting in a directed graph without a self-adjoint GSO. The absence of a self-adjoint GSO, such as a positive semi-definite graph Laplacian matrix (GLM), makes it impossible to capture frequency responses in the eigenspace. Many researchers redefine a symmetric GLM for directed graphs to generalize spectral graph theory; these approaches can be roughly grouped into three categories: structural decomposition [6], attribute decomposition [7], and random walks [8,9]. These methods mainly consider biased spatial information diffusion in directed graphs, which leads to geometrically meaningless interpretations in the frequency domain. Others focus on the characteristics of the frequency response in a generalized eigenspace and on directed spectral graph convolution networks [10], which is beneficial for finding signal diffusion patterns in directed graphs. In the GCN framework, step-wise spatial convolution propagates as many messages along the edges as possible, but it faces the over-smoothing and over-squashing dilemma [11,12,13]. A graph transformer can alleviate both issues because it can attend to dependencies between nodes that are far apart.
Neglecting directional edges, graph transformers (GTs) work in the spatial [14,15,16,17,18,19] or spectral domain [20]. Node positional embedding (node PE) and the multi-head attention mechanism are the keys to spatial GTs. Instead of an ordered identification or an absolute position in a spatial coordinate system, the role a node plays in the frequency responses is unique to it. Node PE is a soft inductive bias that compensates for the lack of structure awareness in the global attention mechanism [14]. Node PE suffers from sign and basis ambiguity because eigendecomposition is invariant to sign flips and rotations. On the one hand, a canonicalization trick or a permutation-invariant layer can mathematically unify the eigenspace. On the other hand, considering all the spectral features (eigenvalues and eigenvectors) together gives the model the corresponding mathematical invariances. The aim of a spectral graph transformer is to adaptively generate a data-dependent and structure-aware GLM [10]. Existing directed GTs mainly work in the spatial domain. Node PE is designed to represent the strength and directional distortion experienced by a single node under different graph patterns. Graph patterns are discovered using various methods for directed graphs [21,22]. Additionally, the expressiveness of the attention map has a critical impact on the computational efficiency of GTs. A greater information capacity in the node features [18] and a comprehensive treatment of asymmetry in the attention kernel lead to more reasonable attention scores. However, GTs have been shown to only approximate low-order filters within a finite error [15]. Researchers have employed hybrid architectures of GCNs and graph transformers to simultaneously achieve high-order filtering and bypass the information explosion of a high-order GCN. However, such hybrid architectures have rarely been applied to directed graphs to date.
As shown in Table 1, current methods for node representation learning in directed graphs account for either the approximation of a high-order filter or a global receptive field, but not both. Our aim is to make the directed GCN and the GT complement each other more fruitfully. The multi-head attention mechanism of the GT can provide abundant global connectivity from different perspectives, reflecting an irregular support set of graph frequencies [23]. In spectral graph theory, the polynomial filter coefficients of a GCN dominate the smoothing of frequency components. Therefore, the polynomial filter coefficients should be mutually influential and data-dependent, such that the task-related filter performs better and is more flexible.
Inspired by the latest research in deep GNNs and spectral graph theory, we propose a powerful hybrid architecture model for directed graphs, called the directed spectral graph transformer (DSGT). The algorithm is a two-stage model with the ability to approximate the frequency responses of directed graphs of arbitrary order; it consists of data preprocessing and forward inference. In the preprocessing stage, we initialize the structure-aware embedding of the node position and edge. In the forward inference stage, an asymmetric attention kernel and a high-order directed GNN can approximate arbitrary-order graph filters while comprehensively capturing global node dependencies. Specifically, a complete set of graph Fourier bases is derived from the feasible solution of constraint-preserved optimization. Then, the relative significance between nodes is regarded as a discrete gradient flow along the edge in different graph patterns. Subsequently, directed GT is responsible for global aggregation and providing a multi-head attention map. Directed GNNs extract the multi-dimensional support of polynomial filter coefficients and aggregate information inside a K-hop neighborhood. Further, node-level representation is obtained by fusing local and global aggregation. Finally, we decouple the updates of edge attributes and node PE. In node classification tasks, DSGT consistently outperforms other models, with over 10% higher accuracy on some datasets (see Section 4 for details). It also shows lower standard deviations across multiple datasets, indicating strong stability and robustness.
The contributions of our paper are as follows:
  • Novel mixed architecture: This is a new hybrid architecture model for directed graphs that can approximate an arbitrary-order filter without over-squashing and over-smoothing issues.
  • New directed PE: Concepts from continuous signal processing (SP) are generalized to a discrete directed graph. A new method is adopted to evenly search for a complete set of graph Fourier bases across as wide a frequency band as possible and to further obtain a directed node PE from them.
  • Benchmarking: Extensive experiments demonstrate the SOTA and competitive performance of our DSGT compared to baselines and the effectiveness of its modules. We empirically analyzed the experimental results.
The rest of this article is organized as follows. In Section 2, the previous work related to graph representation learning models is briefly reviewed and analyzed. The details of the model architecture are mathematically formulated in Section 3. The experimental results and corresponding discussions for several public datasets are elaborated in Section 4. Finally, we summarize our work based on the experimental results in Section 5. In addition, full descriptions of the abbreviations and symbols are provided in Table A1 and Table A2.

2. Related Work

For node representation learning, we briefly outline the development of GNNs and state the design motivations behind these models, laying the foundation for our current research.

2.1. Undirected Graph Neural Network

The spectral theory of undirected graphs was the first to be applied to the field of graph representation learning, as formulated in Appendix C.1. SGCNN is an early spectral graph convolutional model with high computational complexity [3]. In SGCNN, the polynomial filter is equivalent to a power series of the GLM under the condition of a self-adjoint undirected GLM. ChebConv reduces the computational complexity from the exponential to the polynomial level via Chebyshev polynomials, which approximate the power series of an undirected GLM normalized by its spectral radius [4]. To further simplify the model, GCN [5] only takes the first-order form of ChebConv, and the neighborhood continuously expands through step-wise iteration.

2.2. Directed Graph Neural Network

Major efforts have been made to define symmetric directed GLMs, for example via random walks [8,9], structural decomposition [6], and attribute decomposition [7], enabling the application of traditional spectral GCNs to directed graphs. DGCN decomposes the adjacency matrix of a directed graph into first-order proximal, in-neighbor, and out-neighbor matrices, which represent the basic connection, in-degree, and out-degree distributions, respectively [6]. Nodes sharing the same out-/in-neighbors exhibit similar diffusion/aggregation behaviors; hence, the number of shared in-/out-neighbors reflects the in-/out-degree similarity between nodes. When the directionality of information propagation in the graph is ignored, multiple non-isomorphic directed graphs can correspond to the same undirected graph. As a result, applying an undirected GNN to a directed graph inevitably leads to information loss. To address this, the directionality of an edge is uniquely decomposed into the strength of the connection between nodes and the direction of information propagation. Specifically, the directed GLM is regarded as a Hermitian matrix constructed from the underlying undirected graph and a phase. The underlying undirected graph topology describes the connectivity relationships between nodes, while the phase is applied to capture cyclic subgraphs. In the complex domain, MagNet carries out a spatial convolution operation via this directed GLM [7]. The graph adjacency matrix, normalized by the out-degree, represents the random diffusion probability of the node signal and is referred to as the random walk matrix. In strongly connected directed graphs, the adjacency matrix is irreducible and non-negative. Based on Perron–Frobenius theory and the stationary distribution of Markov chains, the Perron vector describes the steady-state distribution of nodes. A symmetric directed graph Laplacian matrix is then constructed by normalizing the random walk matrix with the Perron vector [8]. However, the requirement of strong connectivity limits the model's applicability. To address this, a state transition matrix for the Markov chain is constructed by incorporating long-distance random jumps into the PageRank algorithm, from which a symmetric approximate directed GLM is derived [9]. Another research direction explores spectral graph theory for directed graphs. In this case, the behavior of the directed GLM is analyzed within the generalized eigenvalue space. Additionally, linear operator perturbation theory demonstrates that the frequency response of a directed graph can be fully understood through holomorphic functions in the complex domain. Similar to Chebyshev polynomials, Faber polynomials are used in the complex domain to approximate arbitrary holomorphic functions [10].
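To make the Hermitian construction above concrete, the following is a minimal PyTorch sketch of a magnetic Laplacian of the kind MagNet builds on, assuming an unweighted adjacency matrix A and a charge parameter q; the normalization details of [7] are omitted, so this is an illustration of the idea rather than the exact operator used there.

```python
import torch

def magnetic_laplacian(A: torch.Tensor, q: float = 0.25) -> torch.Tensor:
    """Hermitian (magnetic) Laplacian of a directed graph: the symmetrized adjacency
    carries the connection strength, while the phase exp(i*Theta) encodes direction."""
    A_s = 0.5 * (A + A.T)                          # underlying undirected connectivity
    Theta = 2.0 * torch.pi * q * (A - A.T)         # antisymmetric phase matrix
    H = A_s * torch.exp(1j * Theta)                # Hermitian adjacency
    D_s = torch.diag(A_s.sum(dim=1)).to(H.dtype)   # degree matrix of the symmetrized graph
    return D_s - H                                 # unnormalized magnetic Laplacian

# Toy example: a 3-node directed cycle.
A = torch.tensor([[0., 1., 0.],
                  [0., 0., 1.],
                  [1., 0., 0.]])
L = magnetic_laplacian(A)
print(torch.allclose(L, L.conj().T))  # Hermitian check: True
```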

2.3. Graph Transformer

The first GT adaptively fused multi-scale neighborhood aggregation through a channel attention mechanism, but lacked the global attention capability of the transformer architecture [24]. Dwivedi V.P. et al. first truly generalized the transformer to the discrete graph domain via node PE and feature update rules [14]. Node positional encoding, serving as a soft inductive bias, makes the GT aware of the diffusion distance between nodes. However, eigendecomposition is insensitive to sign flips and orthogonal transformations, which makes the sign and basis ambiguous. To overcome both issues, Lim et al. leveraged IGN [25] to achieve sign- and basis-invariant networks [26], ensuring the uniqueness of isomorphic graph node embeddings. However, the output of such a parametric network violates the constraints of a Fourier basis; Ma et al. therefore designed a non-parametric method for basis canonicalization [27]. Due to the computational overhead, Dwivedi et al. employed random walks to obtain unique representations of node positions and decoupled the updates of edge and position embeddings [17]. Nevertheless, spatially structure-aware embeddings from random walks are less expressive than their spectral counterparts. Kreuzer et al. incorporated spectral features from eigendecomposition to balance computational complexity against model performance degradation [16]. Given that the equivalent complete sets of graph Fourier bases from eigendecomposition are not enumerable, it is challenging for a model to fully resolve both ambiguities. Additionally, Bo et al. leveraged a transformer encoder–decoder architecture to adaptively extract a task-related GLM from spectral features, known as the spectral GT [20].
Improving the effectiveness of the attention map can also observably improve the expressiveness of GTs. Traditional attention kernels, such as additive and dot-product kernels [28], have been proven to inadequately reveal the asymmetry between nodes from a single perspective [29,30]. Inspired by KSVD [31], Chen Y. et al. proposed an attention kernel that incorporates bidirectional asymmetry between nodes, thereby further enhancing the asymmetric expression of the transformer [30]. Experiments have shown that the similarity of local neighborhood structures is crucial for evaluating the relative dependence between nodes, suggesting that the local neighborhood structure affects the attention score [18]. Research on directed GTs remains relatively sparse, with a key bottleneck being the lack of an effective and universal directed node PE. Geisler et al. introduced the magnetic GLM to describe the biased connections between nodes and performed aggregation on the imaginary and real parts separately through GTs [32]. However, their method of feature processing lacks geometric justification, resulting in suboptimal performance.

3. Proposed Model

For node-level graph representation learning, DSGT fruitfully integrates GT and directed GNN with the ability to approximate an arbitrary-order filter. As shown in Figure 1, the DSGT is a two-stage model, including preprocessing and forward inference.

3.1. Graph Preprocessing

This section primarily elaborates on node PE and the initialization of structure-aware edge features. Graph data consist of graph topology and node features, which fall within the domain of irregular data. The graph topology describes the complex and flexible relationships between nodes, meaning that reordering the graph nodes results in a series of equivalent isomorphic graphs. Unlike grid data, the importance of a node within the graph topology under different graph patterns is unique to it. Additionally, to respect the dominant direction of signal propagation under different graph frequencies, the graph connectivity under various graph frequencies is used as the initial feature of the edges. To some extent, the initial edge features and the global positional encoding of nodes can help to improve their expressiveness in specific tasks.

3.1.1. Global Node Positional Encoding

To diversify the directed graph patterns, the graph frequencies are expected to form an arithmetic sequence across as wide a frequency band as possible. Specifically, a two-stage optimization strategy is employed: first, the spectral radius and the corresponding graph Fourier basis are determined by maximizing the directed total variation; then, a complete set of Fourier bases is identified whose corresponding graph frequencies are distributed as evenly as possible across the entire frequency band. A node PE of fixed dimension is derived from this complete set of graph Fourier bases.
$$\hat{p}_i = \mathrm{MLP}\big(\mathrm{Concat}(\lambda, u_i)\big), \qquad P^{(0)} = \mathrm{GP}_1\big(\mathrm{Transformer}(\hat{P})\big) \tag{1}$$
In Equation (1), $P^{(0)} \in \mathbb{R}^{N \times d_p}$ is the first-layer $d_p$-dimensional node PE, $\lambda = \{\lambda_k, k = 1, \ldots, N\} \in \mathbb{R}^{N}$ is the frequency spectrum of the directed graph, and $u_i = \{u_{ki}, k = 1, \ldots, N\} \in \mathbb{R}^{N}$ is the importance of the $i$th node under each graph frequency. An $N \times 2$-dimensional tensor is obtained via $\mathrm{Concat}(\cdot)$ for each node. $\mathrm{GP}_i(\cdot)$ is the global pooling operation over the $i$th dimension of the tensor; $\mathrm{MLP}: \mathbb{R}^{2} \to \mathbb{R}^{d_p}$ and $\mathrm{Transformer}: \mathbb{R}^{d_p} \to \mathbb{R}^{d_p}$. The total variation measures the smoothness of the node signal over the graph topology by quantifying the expected value of the discrete derivatives of graph signals, and it can be used to define graph frequencies. The directed and undirected total variations are formulated as $\mathrm{DV}(x) = \sum_{i,j=1}^{N} A_{ij}\,[x_i - x_j]_{+}^{2}$ and $\mathrm{TV}(x) = x^{T} L x = \sum_{i=1, j>i}^{N} A_{ij}\,(x_i - x_j)^{2}$, respectively, where $[x]_{+}^{2} = (\max(0, x))^{2}$. For a self-adjoint adjacency matrix $A$, either $A_{ij}[x_i - x_j]_{+}^{2}$ or $A_{ji}[x_j - x_i]_{+}^{2}$ is zero if $A_{ij} = A_{ji} \neq 0$; in other words, $\mathrm{TV}(x) = \mathrm{DV}(x)$, suggesting that this method for node PE is also applicable to undirected graphs. Notably, the total variation is insensitive to sign and permutation, which implies that there are no ambiguities in the definition of the directed node PE. The expected frequencies follow an arithmetic progression, namely $f_k = \mathrm{DV}(u_k) = \frac{k-1}{N-1} f_{\max},\ k = 1, \ldots, N$, where $f_{\max}$ is the spectral radius (largest frequency) of the directed graph. The uniqueness of the frequencies avoids basis ambiguity. First, the graph Fourier basis of the largest frequency is obtained by maximizing the directed total variation, which fixes the frequency band. Second, a group of approximately evenly distributed spectral features is obtained as the solution of a minimum spectral dispersion problem over the frequency band.
$$\begin{aligned}
&\min_{u}\ -\mathrm{DV}(u) \quad \text{s.t.}\ u^{T}u = 1\\
&\min_{U}\ \phi(U) = \delta(U) + \frac{\lambda}{2}\Big(\big\|u_{1}-u_{\min}\big\|^{2} + \big\|u_{N}-u_{\max}\big\|^{2}\Big) \quad \text{s.t.}\ U^{T}U = I_{N},\ \ \delta(U) = \sum_{i=1}^{N-1}\big(\mathrm{DV}(u_{i+1}) - \mathrm{DV}(u_{i})\big)^{2}
\end{aligned} \tag{2}$$
In Equation (2), $u_{\min} = \mathbf{1}_N/\sqrt{N}$ is the pattern of uniform diffusion and is orthogonal to $u_{\max}$ (Proposition 4 of [22]). The orthogonality constraint makes the optimization non-convex. Wen et al. proposed a feasible iterative strategy for orthogonality-preserving optimization problems of the form $\min_{U \in \mathbb{R}^{N \times M}} \Phi(U)$ s.t. $U^{T}U = I_M$. The iterative methods for Equation (2) are shown in Algorithms 1 and 2. At each iteration, the step size is chosen by Algorithm 3 to satisfy the strong Armijo–Wolfe conditions. Ultimately, abundant graph patterns are obtained by optimizing the objectives in Equation (2).
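As a concrete illustration of the directed total variation that Algorithms 1 and 2 optimize, the following is a minimal PyTorch sketch assuming a dense adjacency matrix; the function names are ours and the three-node cycle is a toy example.

```python
import torch

def directed_variation(A: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """DV(x) = sum_ij A_ij * max(0, x_i - x_j)^2 over a directed adjacency matrix A."""
    diff = x.unsqueeze(1) - x.unsqueeze(0)          # diff[i, j] = x_i - x_j
    return (A * torch.clamp(diff, min=0.0) ** 2).sum()

def undirected_variation(A: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """TV(x) = sum_{i<j} A_ij * (x_i - x_j)^2, assuming a symmetric A."""
    diff = x.unsqueeze(1) - x.unsqueeze(0)
    return 0.5 * (A * diff ** 2).sum()

A = torch.tensor([[0., 1., 0.],
                  [0., 0., 1.],
                  [1., 0., 0.]])                    # directed 3-cycle
x = torch.randn(3)
x = x / x.norm()                                    # unit-norm graph signal
print(directed_variation(A, x).item())
```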
Algorithm 1 Directed variation maximization.
Input: Adjacency matrix $A \in \mathbb{R}^{N \times N}$ and an arbitrary positive tolerance $\epsilon > 0$
Output: Maximum graph frequency $f_{\max}$ and the corresponding graph Fourier basis $u_{\max}$
1: Initialize $k = 0$ and a unit-norm $u_0 \in \mathbb{R}^{N}$ at random
2: while $\|u_k - u_{k-1}\| \geq \epsilon$ do
3:     Evaluate the objective $\phi(u_k) = -\mathrm{DV}(u_k) = -\sum_{i,j=1}^{N} A_{ij}\,[u_{k,i} - u_{k,j}]_{+}^{2}$
4:     Compute the gradient of the objective $\bar{g} = -\nabla \mathrm{DV}(u) \in \mathbb{R}^{N}$, whose $i$th element is $\bar{g}_i = 2\big(A_{:,i}^{T}\,[u - u_i \mathbf{1}_N]_{+} - A_{i,:}\,[u_i \mathbf{1}_N - u]_{+}\big)$, $1 \leq i \leq N$
5:     Compute the skew-symmetric matrix $B_k = \bar{g}_k u_k^{T} - u_k \bar{g}_k^{T} \in \mathbb{R}^{N \times N}$
6:     Select the step length $\tau_k$ of the $k$th iteration by Algorithm 3
7:     Update $u_{k+1}(\tau_k) = \big(I_N + \frac{\tau_k}{2} B_k\big)^{-1}\big(I_N - \frac{\tau_k}{2} B_k\big)\,u_k$
8:     Update the iteration index $k \leftarrow k + 1$
9: end while
10: Return $u_{\max} = u_k$ and $f_{\max} = \mathrm{DV}(u_{\max})$
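Steps 5–7 of Algorithm 1 can be summarized in a few lines of PyTorch; the sketch below assumes the gradient vector has already been computed and solves a linear system instead of forming the matrix inverse explicitly.

```python
import torch

def cayley_update(u: torch.Tensor, g_bar: torch.Tensor, tau: float) -> torch.Tensor:
    """One norm-preserving update u <- (I + tau/2 B)^(-1) (I - tau/2 B) u,
    where B = g_bar u^T - u g_bar^T is skew-symmetric (steps 5-7 of Algorithm 1)."""
    N = u.shape[0]
    B = torch.outer(g_bar, u) - torch.outer(u, g_bar)
    I = torch.eye(N, dtype=u.dtype)
    rhs = (I - 0.5 * tau * B) @ u
    return torch.linalg.solve(I + 0.5 * tau * B, rhs)

u = torch.randn(6)
u = u / u.norm()
g_bar = torch.randn(6)
u_next = cayley_update(u, g_bar, tau=0.1)
print(u_next.norm())  # stays (numerically) equal to 1, since the Cayley map is orthogonal
```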
Algorithm 2 Spectral dispersion minimization.
Input: Adjacency matrix $A \in \mathbb{R}^{N \times N}$, an arbitrary positive tolerance $\epsilon > 0$, the minimum and maximum graph Fourier bases $u_{\min} = \frac{1}{\sqrt{N}}\mathbf{1}_N$ and $u_{\max}$ (from the directed variation maximization algorithm), and the regularization coefficient of the objective function $\lambda \in (0, 1)$
Output: A set of graph Fourier bases $\{u_i, i = 2, \ldots, N-1\}$ whose corresponding graph frequencies approximate an arithmetic sequence
1: Initialize $k = 0$ and an orthonormal $U_0 \in \mathbb{R}^{N \times N}$ at random
2: while $\|U_k - U_{k-1}\|_F \geq \epsilon$ do
3:     Evaluate the objective with a measure of the constraint violations, $\phi(U) = \delta(U) + \frac{\lambda}{2}\big(\|u_1 - u_{\min}\|_2^2 + \|u_N - u_{\max}\|_2^2\big)$, where $\delta(U) = \sum_{i=1}^{N-1}\big(\mathrm{DV}(u_{i+1}) - \mathrm{DV}(u_i)\big)^{2}$
4:     Compute the gradient of the objective function $G = \nabla\phi(U) \in \mathbb{R}^{N \times N}$, whose rows are
$$\begin{aligned} G_{1,:} &= \big(\mathrm{DV}(u_1) - \mathrm{DV}(u_2)\big)\,\bar{g}(u_1) + \lambda\,(u_1 - u_{\min}) \\ G_{i,:} &= \big(2\,\mathrm{DV}(u_i) - \mathrm{DV}(u_{i+1}) - \mathrm{DV}(u_{i-1})\big)\,\bar{g}(u_i), \quad 1 < i < N \\ G_{N,:} &= \big(\mathrm{DV}(u_{N-1}) - \mathrm{DV}(u_N)\big)\,\bar{g}(u_N) + \lambda\,(u_N - u_{\max}) \end{aligned}$$
5:     Compute the skew-symmetric matrix $B_k = G_k U_k^{T} - U_k G_k^{T} \in \mathbb{R}^{N \times N}$
6:     Select the step length $\tau_k$ of the $k$th iteration by Algorithm 3
7:     Update $U_{k+1}(\tau_k) = \big(I_N + \frac{\tau_k}{2} B_k\big)^{-1}\big(I_N - \frac{\tau_k}{2} B_k\big)\,U_k$
8:     Update the iteration index $k \leftarrow k + 1$
9: end while
10: Return $\hat{U} = U_k$ and $f_i = \mathrm{DV}(U_{:,i})$, $1 < i < N$
Algorithm 3 Non-monotone curvilinear search algorithm.
Input: Maximum step length $\tau_{\max}$; minimum step length $\tau_{\min}$; hyperparameters $0 < c_1 < c_2 < 1$, generally $c_1 = 10^{-4}$, $c_2 = 0.9$; the objective function $F(X)$ and its gradient $\nabla_X F(X)$; $\delta = 0.1$, $\eta = 0.85$, $\epsilon = 10^{-5}$; a chosen initial step length $\tau > 0$ with $\tau_{\max} = 10^{4}$, $\tau_{\min} = 10^{-4}$, $\tau_{\mathrm{init}} = 10^{-3}$; maximum number of loops $L_{\max} = 25$; $Q_0 = 1$, $C_0 = F(X_0)$; initial point $X_0$ on the Stiefel manifold
Output: A step length $\tau^{*}$ that satisfies the search condition
1: function VariableUpdate($X_k$, $F$, $\nabla_X F$, $\tau$)
2:     $N, K = X_k.\mathrm{shape}$
3:     if $N > 2K$ then
4:         $U = [\nabla F(X_k), X_k]$, $V = [X_k, -\nabla F(X_k)]$
5:         $X_{k+1} = X_k - \tau\,U\big(I_{2K} + \frac{\tau}{2} V^{T} U\big)^{-1} V^{T} X_k$
6:     else
7:         $A = \nabla F(X_k) X_k^{T} - X_k \nabla F(X_k)^{T}$
8:         $X_{k+1} = \big(I_N + \frac{\tau}{2} A\big)^{-1}\big(I_N - \frac{\tau}{2} A\big) X_k$
9:     end if
10:     return $X_{k+1}$
11: end function
12: while $\|\nabla F(X_k)\|_2^2 > \epsilon$ do
13:     while $F(Y_k(\tau)) \geq C_k + c_1\,\tau\,F'(Y_k(0))$ do
14:         Scale the step length $\tau$ to satisfy the condition, namely $\tau \leftarrow \delta\,\tau$
15:         if the current loop index $\geq L_{\max}$ and $\|\nabla F(X_k)\|_2^2 > \epsilon$ then
16:             Sample $\tau \sim \mathcal{N}(0, 1)$
17:             break
18:         end if
19:     end while
20:     $X_{k+1} = \mathrm{VariableUpdate}(X_k, F, \nabla F, \tau)$, $Q_{k+1} = \eta\,Q_k + 1$
21:     $C_{k+1} = \big(\eta\,Q_k C_k + F(X_{k+1})\big)/Q_{k+1}$
22:     Update $\tau_{k+1}$ by alternating between the following two step sizes:
$$(1)\ \ \tau_{k,1} = \frac{\mathrm{tr}\big(S_{k-1}^{T} S_{k-1}\big)}{\mathrm{tr}\big(S_{k-1}^{T} Y_{k-1}\big)} \qquad (2)\ \ \tau_{k,2} = \frac{\mathrm{tr}\big(S_{k-1}^{T} Y_{k-1}\big)}{\mathrm{tr}\big(Y_{k-1}^{T} Y_{k-1}\big)}, \quad \text{where}\ S_{k-1} = X_k - X_{k-1},\ Y_{k-1} = \nabla F(X_k) - \nabla F(X_{k-1})$$
23:     Set $\tau = \max\big(\min(\tau_{k+1}, \tau_{\max}), \tau_{\min}\big)$
24:     if $k \geq L_{\max}$ then
25:         break
26:     else
27:         $k \leftarrow k + 1$
28:     end if
29: end while
30: Return a feasible step length $\tau^{*} = \tau_{k+1}$

3.1.2. Structure-Aware Edge Embedding

The directional derivative and smoothing operators represent the global pattern of information flow and are derived from differences of the graph Fourier bases. Graph frequencies, the derivative operator, and the smoothing operator are then stacked along a new dimension to obtain a 2D tensor that encapsulates the trend of information flow for each edge. Afterwards, the 2D tensor is transformed into a one-dimensional initial edge feature through an MLP and a transformer in succession. The derivative of a continuous signal provides information about the direction and intensity of its propagation. In the discrete graph domain, the graph Fourier basis behaves similarly to the sine and cosine waves of the continuous domain, and the derivative of a graph Fourier basis reflects the directional flow under the corresponding graph pattern. If we wish to further smooth the discrete gradient of the graph, the arcsine function can be used to postprocess the derivative.
$$F_k = \nabla u_k \quad \text{or} \quad F_k = \nabla \arcsin\!\left(\frac{u_k}{\max(u_k)}\right) \tag{3}$$
In Equation (3), $u_k \in \mathbb{R}^{N}$ is a graph Fourier basis, $F_k \in \mathbb{R}^{N \times N}$ is the corresponding directional field, and $\nabla$ is the discrete derivative operator. In Appendix A and Equation (11) of [33], the authors provide three aggregators (soft and hard softmax aggregators and a center-balanced aggregator) and three scalers (degree attenuation, degree amplification, and identity), so that $B_{\mathrm{av}}^{k}$ and $B_{\mathrm{dx}}^{k}$ are nine-channel tensors for the $k$th graph Fourier basis. When the number of graph Fourier bases is $K$, the dimensions of the directional smoothing tensor $B_{\mathrm{av}}$ and the directional derivative tensor $B_{\mathrm{dx}}$ are $(9 \times K) \times N \times N$. Considering either the graph frequencies or the graph Fourier bases alone is insufficient to distinguish non-isomorphic graphs. The method for structure-aware edge embedding is formulated as follows.
$$\begin{aligned}
\tilde{e}_{ij} &= \mathrm{Concat}\big(B_{\mathrm{av}}[:,i,j],\ B_{\mathrm{dx}}[:,i,j],\ \Lambda\big)\\
\bar{e}_{ij} &= \mathrm{GP}_1\big(\mathrm{Transformer}\big(\tilde{e}_{ij}^{\,T} W_0\big)\big)\\
e_{ij}^{0} &= \mathrm{Concat}\big(\bar{e}_{ij},\ e_{ij}\big)
\end{aligned} \tag{4}$$
In Equation (4), each element of $\{\lambda_i, i = 1, \ldots, K\}$ is replicated nine times to form a new $9 \times K$-dimensional vector $\Lambda$, and $W_0 \in \mathbb{R}^{3 \times d_e}$ is a linear transformation matrix. The transformer encoder $\mathrm{Transformer}: \mathbb{R}^{K \times d_e} \to \mathbb{R}^{K \times d_e}$ captures the dependence among the $K$ features to encode context information, and then the global maximum pooling $\mathrm{GP}_2: \mathbb{R}^{K \times d_e} \to \mathbb{R}^{d_e}$ performs dimensional reduction. Here, $\bar{e}_{ij} \in \mathbb{R}^{d_e}$ is the structure-aware edge feature between the $i$th and $j$th nodes. The structure-aware edge feature $\bar{e}_{ij} \in \mathbb{R}^{d_e}$ and the edge attribute $e_{ij} \in \mathbb{R}^{d_a}$ are stacked to obtain the initial edge feature $e_{ij}^{0} \in \mathbb{R}^{d_e + d_a}$.
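The following rough PyTorch sketch shows how the per-edge tensor of Equation (4) might be assembled, assuming precomputed directional tensors B_av and B_dx of shape (9K, N, N) and K graph frequencies; the linear layer and transformer encoder are generic stand-ins for the learned components, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

K, N, d_e = 6, 128, 18            # illustrative sizes
B_av = torch.randn(9 * K, N, N)   # directional smoothing tensor (assumed precomputed)
B_dx = torch.randn(9 * K, N, N)   # directional derivative tensor (assumed precomputed)
lam = torch.randn(K)              # K graph frequencies
Lam = lam.repeat_interleave(9)    # each frequency replicated nine times -> (9K,)

W0 = nn.Linear(3, d_e, bias=False)   # plays the role of W_0 in Equation (4)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_e, nhead=2, batch_first=True), num_layers=1)

def edge_embedding(i: int, j: int) -> torch.Tensor:
    # Stack the three (9K,)-dimensional descriptors along a new last dimension.
    e_tilde = torch.stack([B_av[:, i, j], B_dx[:, i, j], Lam], dim=-1)   # (9K, 3)
    h = W0(e_tilde).unsqueeze(0)            # (1, 9K, d_e)
    h = encoder(h)                          # context across the stacked channels
    return h.max(dim=1).values.squeeze(0)   # global max pooling -> (d_e,)

e_bar = edge_embedding(0, 1)
print(e_bar.shape)  # torch.Size([18])
```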

3.2. Directed Spectral Graph Transformer

The directed spectral graph transformer (DSGT) consists of a global asymmetric graph transformer (GT), a K-order directed graph convolution network (GCN), and a feature decoupling update. Specifically, the global asymmetric GT computes an attention map and then performs global aggregation from which the coefficients of the K-order GCN are extracted. Subsequently, the K-order directed GCN aggregates local information. Finally, inspired by the research of Dwivedi V. et al. [17], the model decouples the update processes for structure (node and edge attributes) and position (global node PE).

3.2.1. Global Asymmetric Graph Transformer

As shown in Appendix A.3, traditional attention kernels have been proven to underexploit the asymmetric relationships between nodes [29]. The global asymmetric graph transformer includes structure encoding, PrimalAttention, and global aggregation. Specifically, a first-order DGN [33] is applied to embed structural information into the node attributes. Then, the asymmetric attention map is computed from PrimalAttention and the edge attributes. Finally, global aggregation is performed to obtain global node-level representations. Directional convolution is performed via the DGN as follows:
$$\begin{aligned}
\hat{X}^{(l)} &= \mathrm{MLP}\Big(\mathrm{Concat}\Big(D_{\mathrm{in}}^{-1} A X^{(l)},\ \big\{\tilde{X}_k^{(l)}\big\}_{k=1,\ldots,K}\Big)\Big)\\
\bar{X}_{\alpha,i,k}^{(l)} &= \hat{S}(D_{\mathrm{in}},\alpha)\cdot B_{\mathrm{av}}^{k}[:,:,i]\,A X^{(l)}\ \big\Vert\ \hat{S}(D_{\mathrm{in}},\alpha)\cdot B_{\mathrm{dx}}^{k}[:,:,i]\,A X^{(l)}\\
\tilde{X}_k^{(l)} &= \mathrm{Concat}\big(\bar{X}_{\alpha,i,k},\ \alpha = -1, 0, +1,\ i = 1, 2, 3\big)\\
\hat{S}(D,\alpha) &= \mathrm{diag}\Big(\big\{S(D_{ii},\alpha)\big\}_{i=1,\ldots,N}\Big),\quad \text{where}\ S(d,\alpha) = \left(\frac{\log(d+1)}{\delta}\right)^{\alpha},\ \ \delta = \frac{1}{N}\sum_{i=1}^{N}\ln\big(D_{ii}+1\big),\ \ \alpha \in [-1, +1]
\end{aligned} \tag{5}$$
In Equation (5), $\odot$ is the element-wise product and $\Vert$ denotes the concatenation of node features. The aggregators and scalers are set as in Section 3.1.2, $X^{(l)}, \hat{X}^{(l)} \in \mathbb{R}^{N \times d_f}$, $d_f$ is the dimension of the node features, and $N$ is the number of nodes in the directed graph. For each graph Fourier basis, the aggregation tensors of the three aggregators are $B_{\mathrm{av}}^{k}, B_{\mathrm{dx}}^{k} \in \mathbb{R}^{N \times N \times 3}$, and the scalers include degree attenuation $\hat{S}(D,-1) \in \mathbb{R}^{N \times N}$, degree amplification $\hat{S}(D,+1)$, and the identity $\hat{S}(D,0) = I_N$; the concatenation of node features after aggregation is $\tilde{X}_k^{(l)} \in \mathbb{R}^{N \times (9 \times d_f)}$. We stack the node features and pass them through $\mathrm{MLP}: \mathbb{R}^{N \times ((9 \times K + 1) \times d_f)} \to \mathbb{R}^{N \times d_f}$ to obtain the $l$th-layer node representation in $\mathbb{R}^{N \times d_f}$. The evaluation of attention scores for the structure-aware node features is as follows; the mathematical formulation of the asymmetric attention kernel is given in Appendix C.2.
$$e\big(X_i^{(l)}\big) = \big(f(X)^{T}W_e\big)^{T} g_q\big(q\big(X_i^{(l)}\big)\big),\qquad r\big(X_j^{(l)}\big) = \big(f(X)^{T}W_r\big)^{T} g_k\big(k\big(X_j^{(l)}\big)\big) \tag{6}$$
$$\mathrm{ATT}_{i,:} = \mathrm{Softmax}\big(\widehat{\mathrm{ATT}}_{i,:}\big),\qquad \widehat{\mathrm{ATT}}_{i,j} = \mathrm{Concat}\big(e\big(X_i^{(l)}\big),\ r\big(X_j^{(l)}\big)\big) \tag{7}$$
In Equation (6), $q(X_i^{(l)}) = W_q X_i^{(l)} \in \mathbb{R}^{d_q}$ and $k(X_j^{(l)}) = W_k X_j^{(l)} \in \mathbb{R}^{d_k}$ are samples of the query and key spaces, respectively, with $g_q(\cdot): \mathbb{R}^{d_q} \to \mathbb{R}^{d}$ and $g_k(\cdot): \mathbb{R}^{d_k} \to \mathbb{R}^{d}$ ensuring dimension compatibility. The injection of data dependence transforms the projection matrices $W_e, W_r \in \mathbb{R}^{N \times s}$ into their revised counterparts $f(X^{(l)})^{T}W_e = W_{e|x}$ and $f(X^{(l)})^{T}W_r = W_{r|x} \in \mathbb{R}^{p \times s}$, where $f(X^{(l)}) = \hat{X}^{(l)}R \in \mathbb{R}^{N \times p}$ is derived from a random projection of $X^{(l)}$ according to the Johnson–Lindenstrauss lemma; each element of the random projection matrix $R \in \mathbb{R}^{d_f \times p}$ follows a standard normal distribution, namely $R_{ij} \sim \mathcal{N}(0,1)$. Here, $e(X_i^{(l)}), r(X_j^{(l)}) \in \mathbb{R}^{s}$ are, respectively, the $s$ projection scores of the $i$th query and the $j$th key sample in the projected feature space. The edge feature is integrated into the attention map $\mathrm{ATT} \in \mathbb{R}^{N \times N}$ through the linear transformation matrices $W_1 \in \mathbb{R}^{2s}$ and $W_2 \in \mathbb{R}^{d_e + d_a}$. Global aggregation with the in-degree scaler [18] is performed as the weighted sum of neighborhood features.
$$X_g^{(l+1)} = \sigma\big(D^{1/2}\cdot\mathrm{ATT}\cdot X^{(l)}\big) \tag{8}$$
In Equation (8), the global node representation at the $(l+1)$th layer, $X_g^{(l+1)} \in \mathbb{R}^{N \times d_f}$, is obtained by aggregating $X^{(l)} \in \mathbb{R}^{N \times d_f}$ with the global asymmetric GT. In the global asymmetric GT, the optimal projection matrices $W_{e|x}, W_{r|x}$, along which the mutual information is greatest, are obtained by minimizing a regularization term in the total loss. The optimization objective for the self-attention kernel can be formulated as follows [31]:
$$\begin{aligned}
\max_{W_e, W_r, e_i, r_j}\ J &= \frac{1}{2}\sum_{i=1}^{N} e_i^{T}\Lambda e_i + \frac{1}{2}\sum_{j=1}^{N} r_j^{T}\Lambda r_j - \mathrm{Tr}\big(W_{e|x}^{T}W_{r|x}\big)\\
\text{s.t.}\quad e_i &= e\big(X_i^{(l)}\big) = W_{e|x}^{T}\,\varphi\big(q(X_i)\big),\qquad r_j = r\big(X_j^{(l)}\big) = W_{r|x}^{T}\,\varphi\big(k(X_j)\big)
\end{aligned} \tag{9}$$
In Equation (9), $\Lambda \in \mathbb{R}^{s \times s}$ is a diagonal matrix of Lagrange multipliers, which represents the penalty intensity for terms that violate the constraints. Generally, the number of principal components $s$ is not larger than the dimension $p$, namely $s \leq p$. The feasible solution of the optimization objective in Equation (9) takes the form of a least-squares support vector machine (LSSVM) (Appendix A.2). We demonstrate that the objective in Equation (9) is identically equal to 0 whenever the stationarity condition in Equation (A6) holds (Appendix A.2). To satisfy the stationarity condition, the objective of the primal problem is formulated as a regularization term in the loss function:
$$\begin{aligned}
J_a\big(W_e, W_r, \Lambda\big) &= \frac{1}{2}\sum_{i=1}^{N} e_i^{T}\Lambda e_i + \frac{1}{2}\sum_{j=1}^{N} r_j^{T}\Lambda r_j - \mathrm{Tr}\big(W_e^{T}W_r\big)\\
&= \frac{1}{2}\sum_{i=1}^{N}\Big\|\big(W_{e|X}\Lambda^{1/2}\big)^{T}\varphi\big(q(X_i)\big)\Big\|_2^2 + \frac{1}{2}\sum_{j=1}^{N}\Big\|\big(W_{r|X}\Lambda^{1/2}\big)^{T}\varphi\big(k(X_j)\big)\Big\|_2^2 - \mathrm{Tr}\big(W_{e|X}^{T}W_{r|X}\big)
\end{aligned} \tag{10}$$
In Equation (10), the regularization term $J_a$ is expected to approach 0; this is implemented using automatic differentiation libraries (e.g., PyTorch [34] and TensorFlow [35]). Although GTs cannot unbiasedly approximate the frequency responses of high-order filters, K-order directed GCNs and GTs are complementary to each other, which is why a hybrid architecture is constructed.
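One plausible reading of Equations (6) and (7) is sketched below in PyTorch, with the random projection f(X) and the data-dependent projections W_{e|x}, W_{r|x}; the scoring rule that combines e(X_i) and r(X_j) into a scalar is our assumption, and all dimensions and weights are illustrative rather than trained DSGT parameters.

```python
import torch
import torch.nn as nn

N, d_f, d_q, p, s = 64, 32, 16, 8, 4   # illustrative sizes; p is the random-projection width

X_hat = torch.randn(N, d_f)            # structure-encoded node features from the first-order DGN
W_q = nn.Linear(d_f, d_q, bias=False)
W_k = nn.Linear(d_f, d_q, bias=False)
g_q = nn.Linear(d_q, p, bias=False)    # maps the query space into the projected space
g_k = nn.Linear(d_q, p, bias=False)
W_e = torch.randn(N, s)                # primal projection directions (learnable in the model)
W_r = torch.randn(N, s)

# Random projection f(X) = X_hat R with R_ij ~ N(0, 1) (Johnson-Lindenstrauss).
R = torch.randn(d_f, p)
f_X = X_hat @ R                        # (N, p)
W_e_x = f_X.T @ W_e                    # data-dependent counterpart W_{e|x}, (p, s)
W_r_x = f_X.T @ W_r                    # data-dependent counterpart W_{r|x}, (p, s)

e = g_q(W_q(X_hat)) @ W_e_x            # e(X_i) for all nodes i, (N, s)
r = g_k(W_k(X_hat)) @ W_r_x            # r(X_j) for all nodes j, (N, s)

# Assumed scoring rule: a learned linear combination of the concatenated scores.
w1 = torch.randn(2 * s)
scores = torch.cat([e.unsqueeze(1).expand(N, N, s),
                    r.unsqueeze(0).expand(N, N, s)], dim=-1) @ w1   # (N, N), asymmetric
ATT = torch.softmax(scores, dim=-1)    # row-wise softmax attention map
```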

3.2.2. K-Order Directed Graph Convolution Network

The assumption of a self-adjoint GSO in traditional spectral graph theory is not applicable to directed graphs. We therefore elaborate a spectral graph theory completely defined on the directed graph and then construct the directed spectral graph convolution based on it. Specifically, the directed GLM is defined via graph diffusion as $T = I_N - D_{\mathrm{in}}^{-1}A$ [10]. The directed GLM $T$ is not a self-adjoint matrix, so its frequency response cannot be described in the eigenspace of $T$; however, it can be fully represented in the generalized eigenspace of $T$. The generalized spectral features of the directed graph are obtained via the Jordan decomposition.
In Equation (11), $m_\lambda$ is the algebraic multiplicity of the eigenvalue $\lambda$ of the directed GLM $T$, satisfying $(T - \lambda I_N)^{m_\lambda}P_\lambda = 0$. $P^{-1}TP = J$ is the Jordanization of $T$, where the dyad of $T$ satisfies $P_\lambda = [v_\lambda^1,\ldots,v_\lambda^{m_\lambda}][v_\lambda^1,\ldots,v_\lambda^{m_\lambda}]^{T} \in \mathbb{R}^{N \times N}$, with $v_\lambda^i, i = 1,\ldots,m_\lambda$ a set of bases of the generalized eigensubspace associated with the eigenvalue $\lambda$ [36]. In the generalized eigenspace, the polynomial of the GLM $T$ is formulated as follows:
$$g(T)\,P_\lambda = g(\lambda)\,P_\lambda + \sum_{n=1}^{m_\lambda - 1}\frac{g^{(n)}(\lambda)}{n!}\,\big(T - \lambda I_N\big)^{n}P_\lambda \tag{11}$$
$$g(T) = \sum_{\lambda} g(\lambda)\,P_\lambda + \sum_{\lambda}\sum_{n=1}^{m_\lambda - 1}\frac{g^{(n)}(\lambda)}{n!}\,\big(T - \lambda I_N\big)^{n}P_\lambda \tag{12}$$
In Equation (12), the second term of $g(T)$ represents the difference in frequency response between a self-adjoint and a non-self-adjoint GSO. The sum of the dimensions of the generalized eigensubspaces equals the number of nodes $N$, namely $\sum_\lambda m_\lambda = N$. The polynomial of the GLM $T$, $g(T)$, can also be formulated as a holomorphic functional calculus $g(T) = \frac{1}{2\pi i}\oint_{\Gamma} g(z)\,(T - zI_N)^{-1}\,dz$, where $\Gamma$ is a curve enclosing all the eigenvalues of $T$ and $g(z)$ is a polynomial in $z$. Faber polynomials are the best choice for approximating holomorphic functions in the complex plane. Fortunately, if all the eigenvalues are located inside the unit hypersphere, the Faber polynomials are equivalent to the Chebyshev polynomials [37]. Therefore, the directed GLM $T$ is normalized by the modulus of its spectral radius, namely $\tilde{T} = T/|\lambda_{\max}|$. The Faber polynomial form of $g(T)$ is formulated as follows:
$$\begin{aligned}
g_\theta(\tilde{T}) &= \sum_{k=0}^{K}\theta_k\,\tilde{T}^{k} \approx \sum_{k=0}^{K}\theta_k\,T_k(\tilde{T}),\qquad \text{where}\ \ \theta = \mathrm{MLP}\!\left(\frac{1}{N}\sum_{i=1}^{N}X_g^{(l+1)}[i,:]\right)\\
T_0(X) &= I_N,\quad T_1(X) = X,\quad T_k(X) = 2X\,T_{k-1}(X) - T_{k-2}(X),\ k \geq 2
\end{aligned} \tag{13}$$
In Equation (13), the polynomial coefficients $\theta = [\theta_0,\ldots,\theta_K] \in \mathbb{R}^{K+1}$ are obtained from the global aggregation; the readout function is the composition of mean pooling and an MLP. $T_k(\tilde{T})$ approximates the basis of the polynomial space $\tilde{T}^{k}$. By Liouville's theorem, an ideal filter that fully suppresses high-frequency noise does not exist; namely, $P_\lambda\cdot g^{(n)}(\lambda)/n! \to 0$ is not attainable as $\lambda \to 0$. To theoretically achieve this goal, we redefine the GLM $T$ in the punctured complex plane $\mathbb{C}\setminus\{y\}$ as $g_\theta(\tilde{T}) = \sum_{k=0}^{K}\theta_k\,\tilde{T}^{k} \approx \sum_{k=0}^{K}\theta_k\,T_k(\tilde{T})$, where $\tilde{T} = \frac{1}{|\lambda_{\max}|}\big(T - y\,I_N\big)$ and $y \in (0,1)$ is a singular point inside the unit hypersphere that can adjust the behavior of the holomorphic functional calculus. Finally, local aggregation is performed through the K-order directed GCN.
$$X_l^{(l+1)} = \mathrm{FFN}\!\left(D_{\mathrm{in}}^{-1}\cdot\sum_{k=0}^{K}\theta_k\,T_k(\tilde{T})\,\hat{X}^{(l)}\right),\qquad \hat{X}^{(l)} = \mathrm{Concat}\big(X^{(l)},\,P^{(l)}\big) \tag{14}$$
In Equation (14), the $l$th-layer feed-forward network $\mathrm{FFN}: \mathbb{R}^{d_f+d_p} \to \mathbb{R}^{d_f}$ ensures dimension compatibility, and $\hat{X}^{(l)} \in \mathbb{R}^{N \times (d_f+d_p)}$ is the concatenation of the $l$th-layer node features $X^{(l)} \in \mathbb{R}^{N \times d_f}$ and the node PE $P^{(l)} \in \mathbb{R}^{N \times d_p}$. Finally, we design rules for decoupling the updates of the edge attributes, node features, and node PE, which are key to improving training stability and inference efficiency.
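A compact sketch of the K-order propagation in Equations (13) and (14) is shown below, assuming a dense normalized operator and coefficients theta already produced by the readout; the FFN and the singular-point shift y are omitted, and the in-degree convention used to build D_in is an assumption.

```python
import torch

def chebyshev_propagate(T_tilde: torch.Tensor, X: torch.Tensor,
                        theta: torch.Tensor) -> torch.Tensor:
    """sum_k theta_k * T_k(T_tilde) @ X with T_0 = I, T_1 = T_tilde,
    T_k = 2 T_tilde T_{k-1} - T_{k-2} (Chebyshev recurrence of Equation (13))."""
    Tk_prev, Tk = torch.eye(T_tilde.shape[0], dtype=T_tilde.dtype), T_tilde
    out = theta[0] * X + theta[1] * (Tk @ X)
    for k in range(2, theta.shape[0]):
        Tk_prev, Tk = Tk, 2.0 * T_tilde @ Tk - Tk_prev
        out = out + theta[k] * (Tk @ X)
    return out

# Toy usage: T = I - D_in^{-1} A, normalized by the modulus of its spectral radius.
N, d = 5, 8
A = (torch.rand(N, N) < 0.4).float()
D_in_inv = torch.diag(1.0 / A.sum(dim=0).clamp(min=1.0))   # in-degree taken as column sums
T = torch.eye(N) - D_in_inv @ A
T_tilde = T / torch.linalg.eigvals(T).abs().max()
theta = torch.randn(5)                                     # K = 4, so theta has K + 1 entries
X_local = chebyshev_propagate(T_tilde, torch.randn(N, d), theta)
```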

3.2.3. Feature Decoupling Update

Decoupling the updates of the structural features and the positional coordinates helps to enhance the interpretability of the model's inference process. Specifically, the fusion of the $l$th-layer local and global node representations is formulated as $\breve{X}^{(l)} = \mathrm{MLP}\big(\mathrm{Concat}\big(X_g^{(l)}, X_l^{(l)}\big)\big)$, where the multi-layer perceptron $\mathrm{MLP}: \mathbb{R}^{2d_f} \to \mathbb{R}^{d_f}$ ensures dimension compatibility. The edge acts as the channel through which information flows in the graph, so its attribute is updated based on the features of the neighboring nodes, namely $\breve{e}_{kj}^{(l)} = \mathrm{MLP}\big(\mathrm{Concat}\big(x_k^{(l)}, x_j^{(l)}\big)\big)$. Unlike the structural updates, the node PE needs to reflect the $l$th-layer diffusion distance between nodes. Because the attention map reflects the global connectivity of the graph, the update of the $l$th-layer node PE is formulated as $\breve{P}^{(l)} = \mathrm{ATT}\cdot P^{(l-1)}$. Additionally, GraphNorm [38] and residual connections are leveraged to achieve a deeper structure and more abstract representations.
$$H^{(l)} = \mathrm{Norm}\Big(\mathrm{Dropout}\big(\breve{H}^{(l)}\big) + H^{(l-1)}\Big),\quad \text{where}\ \breve{H}^{(l)}\in\big\{\breve{X}^{(l)}, \breve{P}^{(l)}, \breve{e}_{kj}^{(l)}\big\},\ H^{(l-1)}\in\big\{X^{(l-1)}, P^{(l-1)}, e_{kj}^{(l-1)}\big\} \tag{15}$$
In Equation (15), randomly deactivating some units via $\mathrm{Dropout}(\cdot)$ helps to avoid overfitting and improves the robustness of DSGT; the dropout ratio is generally set to 0.1.
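The update rule of Equation (15) amounts to a dropout, residual, and normalization block applied independently to the node features, the node PE, and the edge attributes; a minimal sketch is shown below, with LayerNorm standing in for GraphNorm [38].

```python
import torch
import torch.nn as nn

class DecoupledUpdate(nn.Module):
    """Residual update of Equation (15): H = Norm(Dropout(H_new) + H_prev).
    The same block is instantiated separately for node features, node PE, and edges."""
    def __init__(self, dim: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)   # stand-in for GraphNorm
        self.drop = nn.Dropout(dropout)

    def forward(self, h_new: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        return self.norm(self.drop(h_new) + h_prev)

update_x = DecoupledUpdate(dim=100)
x_prev, x_new = torch.randn(64, 100), torch.randn(64, 100)
x = update_x(x_new, x_prev)
```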

3.3. End-to-End Optimization for DSGT

For node-level classification tasks, the multi-class joint probability distribution is obtained through an affine transformation and softmax operation.
$$\hat{Y} = \mathrm{softmax}\big(X^{(L)}W_o^{T}\big) \tag{16}$$
In Equation (16), $W_o \in \mathbb{R}^{d_c \times d_f}$ is the linear transformation matrix and $\mathrm{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$. The number of layers of DSGT is $L$, and $\hat{Y} \in \mathbb{R}^{N \times d_c}$ represents the class probabilities of the nodes, where $d_c$ is the number of classes. Our goal is to satisfy the stationarity condition, force the node PE to form a coordinate system constrained by the graph topology, and obtain robust performance at the node level. To achieve these goals, the loss function is formulated as follows:
$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\mathcal{L}_{\mathrm{cls}}\big(\hat{Y}_i, Y_i\big) + \eta\Big(J_a\big(W_{e|X}, W_{r|X}, \Lambda\big) + J_e\big(\hat{Y}\big) + J_p\big(P^{(L)}\big)\Big) \tag{17}$$
In Equation (17), $\hat{Y} \in \mathbb{R}^{N \times d_c}$ and $Y_i \in \{0,\ldots,d_c-1\}$ are the predicted joint probability distribution over node classes and the ground-truth node label, respectively. The entropy regularization $J_e(\hat{Y}) = \sum_{i=1}^{N}\sum_{l=1}^{d_c}\hat{Y}_{il}\ln\big(\hat{Y}_{il}+\epsilon\big)$ encourages a certain degree of uncertainty in the classification. The presence of $J_a(W_{e|X}, W_{r|X}, \Lambda)$ in Equation (17) enforces the stationarity condition of the optimization objective in Equation (A6). The regularization term for the node PE, $J_p(P^{(L)}) = \frac{1}{d_p}\mathrm{Tr}\big(P^{(L)T}\,T\,P^{(L)}\big) + \frac{\lambda}{d_p}\big\|P^{(L)T}P^{(L)} - I_{d_p}\big\|_2^2$, constrains the node PE to form an orthogonal coordinate system, where $T = I_N - D_{\mathrm{in}}^{-1}A \in \mathbb{R}^{N \times N}$ is the directed GLM and $P^{(L)} \in \mathbb{R}^{N \times d_p}$ is the final ($L$th-layer) node PE. $\mathcal{L}_{\mathrm{cls}}(\cdot)$ is the multi-class cross-entropy loss function, and $\eta$ is a regularization coefficient, generally $\eta = 0.1$. The model was built with the PyTorch framework, and the parameters were updated using the Adam optimizer. The inference and optimization processes of the DSGT model are illustrated in Algorithm 4.
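A sketch of the loss in Equation (17) is given below, assuming the attention regularizer J_a is returned by the attention module and that T_glm holds the directed GLM T = I_N - D_in^{-1}A; the helper name and default coefficients are illustrative.

```python
import torch
import torch.nn.functional as F

def total_loss(logits, y, P, T_glm, J_a, eta=0.1, lam=1.0, eps=1e-8):
    """Equation (17): cross-entropy plus the attention (J_a), entropy (J_e),
    and positional-encoding (J_p) regularizers, weighted by eta."""
    N, d_p = P.shape
    L_cls = F.cross_entropy(logits, y)                    # averaged multi-class cross-entropy
    Y_hat = F.softmax(logits, dim=-1)
    J_e = (Y_hat * torch.log(Y_hat + eps)).sum()          # entropy regularization
    J_p = torch.trace(P.T @ T_glm @ P) / d_p \
          + lam / d_p * torch.linalg.matrix_norm(P.T @ P - torch.eye(d_p)) ** 2
    return L_cls + eta * (J_a + J_e + J_p)
```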
Algorithm 4 Supervised node-level embedding algorithm.
Input: Adjacency matrix $A \in \mathbb{R}^{N \times N}$ and node feature matrix $X \in \mathbb{R}^{N \times d_f}$; number of layers $L$; learning rate $\mathrm{lr} = 10^{-3}$; true node labels $Y \in \mathbb{R}^{N}$, $y_i \in \{0, \ldots, d_c - 1\}$; number of classes $d_c$; total number of epochs $N_e$
Output: Node feature matrix $X^{(L)}$, edge feature matrix $E^{(L)}$, and node relative positions $P^{(L)}$ in the $L$th (final) layer
1: function FeatureEngineering(adjacency matrix $A$)
2:     Compute the directed graph frequencies $\{\lambda_k, k = 1, \ldots, N\}$ and a complete set of Fourier bases $\{u_k, k = 1, \ldots, N\}$ from the directed graph shift operator $A$, and obtain the global positional encoding $P^{(0)}$ from them
3:     Compute the directional smoothing tensor $B_{\mathrm{av}}$ and derivative tensor $B_{\mathrm{dx}}$ from the Fourier bases $\{u_k, k = 1, \ldots, N\}$, and then initialize the edge features $E^{(0)}$ from them and the graph spectrum $\{\lambda_i, i = 1, \ldots, N\}$
4:     return edge feature matrix $E^{(0)}$ and positional embedding $P^{(0)}$
5: end function
6: function DSGT(node feature matrix $X^{(0)}$, edge feature matrix $E^{(0)}$, node positional embedding $P^{(0)}$)
7:     for $k \in \{1, \ldots, K\}$ do
8:         Compute the global node feature matrix of the $k$th layer $X_g^{(k+1)}$ using the global asymmetric graph transformer, together with the average attention-kernel loss $J_a(W_e, W_r, \Lambda)$ over the heads
9:         Compute the local node feature matrix of the $k$th layer $X_l^{(k+1)}$ by the K-order directed graph convolution
10:         Update the edge features $E^{(k+1)}$, node features $X^{(k+1)}$, and relative positional encoding $P^{(k+1)}$ with $X_g^{(k+1)}$, $X_l^{(k+1)}$, $E^{(k)}$, and $P^{(k)}$
11:         $k \leftarrow k + 1$
12:     end for
13:     return the output node feature matrix $X^{(L)}$
14: end function
15: for current epoch $i \leq N_e$ do
16:     $X^{(0)}, A \leftarrow$ the $i$th subgraph of the training dataset
17:     $E^{(0)}, P^{(0)} = \mathrm{FeatureEngineering}(A)$
18:     $X^{(L)} = \mathrm{DSGT}(X^{(0)}, E^{(0)}, P^{(0)})$
19:     Compute the joint probability distribution of the node labels $\hat{Y}$ by an affine transformation and softmax of $X^{(L)}$, together with its entropy regularization $J_e(\hat{Y})$
20:     Compute the average loss $\mathcal{L} = \frac{1}{N}\mathcal{L}_{\mathrm{cls}}(\hat{Y}, Y) + \eta\,\mathcal{L}_{\mathrm{reg}} = \frac{1}{N}\mathcal{L}_{\mathrm{cls}}(\hat{Y}, Y) + \eta\big(J_a(W_e, W_r, \Lambda) + J_e(\hat{Y})\big)$
21:     Carry out the backpropagation algorithm for the total loss $\mathcal{L}$ and update the learnable parameters $\Theta$ with Adam($\mathrm{lr} = \mathrm{lr}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$)
22: end for
23: Return the predicted node labels $\hat{y} = \arg\max(\mathrm{softmax}(z))$

4. Results and Discussion

In this section, we evaluate the effectiveness of DSGT compared to various baseline models and the rationality of modules across multi-scale datasets. In our experiments, we employed the uniform splitting and neighbor sampling strategies for datasets with diverse attributes, in order to test the scalability of the models from a unified perspective.

4.1. Experimental Setup

In this section, we first describe the settings of the hyperparameters in our model, which are justified in Section 4.4, and then introduce the dataset settings, the details of the baselines, and the hardware and software configurations used in the experiments.

4.1.1. Dataset Settings

Homogeneous datasets include bibliographic reference networks (Cora, squirrel, ogbn-arxiv, and arxiv-year) and social networks (genius and Tolokers). In bibliographic reference networks, nodes represent academic papers and edges denote directed citation relationships between papers; the research topics of the papers provide the categorical information for the nodes. Similarly, in social networks, nodes represent users on internet platforms, and edges indicate interactions such as sharing a post. Node labels are typically derived from the users' attributes and account statuses. The computational complexity is influenced by the sparsity of the edges; by scale, the datasets are classified into small- (Cora and squirrel), medium- (genius and Tolokers), and large-scale (ogbn-arxiv and arxiv-year) ones. The distribution of node attributes and the number of nodes within neighborhoods are key factors influencing the task-related performance of the models [39]. The dataset attributes are presented in Table 2, where the degree correlation coefficient reflects the correlation between the degrees of adjacent nodes, and the node homophily indicates the similarity of node attributes within community structures. These datasets enable a comprehensive evaluation of DSGT. To optimize hardware resource usage, small subgraph datasets of similar scale were constructed by sampling nodes within a bounded neighborhood, with six sampled neighbors per node and a neighborhood radius of 10.
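The sampling setup described above can be reproduced approximately with PyG's NeighborLoader, as in the sketch below; the dataset, batch size, and split mask are illustrative.

```python
from torch_geometric.datasets import Planetoid
from torch_geometric.loader import NeighborLoader

dataset = Planetoid(root="data/Planetoid", name="Cora")   # Cora ships with PyG
data = dataset[0]

loader = NeighborLoader(
    data,
    num_neighbors=[6] * 10,        # six sampled neighbors per hop over a 10-hop radius
    batch_size=128,                # number of seed nodes per subgraph (illustrative)
    input_nodes=data.train_mask,   # seed nodes drawn from the training split
    shuffle=True,
)

for batch in loader:
    # Each batch is a sampled subgraph; the seed nodes come first in the node ordering.
    print(batch.num_nodes, batch.edge_index.shape)
    break
```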
To split the training/validation/test sets, we followed the methods described by the authors who proposed the datasets [40,41,42]. Specifically, for Cora, we used the split (50/30/20, train/val/test) provided in PyG [43]. For the Tolokers and squirrel datasets, Platonov et al. [42] provide 10 different splits, and we utilized the first five of them in our experiments. For ogbn-arxiv and arxiv-year, we used the dataset-splitting strategy described by the authors of [40] and initialized the model parameters with a different random seed for each run. For genius, Lim et al. [41] propose running each method on the same five random 50/25/25 train/val/test splits for each dataset, and we followed this protocol.

4.1.2. Baselines

DSGT integrates directed GNN and GT architectures to approximate arbitrary-order filters. To test the strength of the hybrid architecture in directed graph node classification, GTs (MagLapNet-II [18]) and directed GNNs (DGCN [6], DiGCN [9], Holonet [10], MagNet [7], and MagLapNet-I [32]) were used as baselines. MagLapNet employs the magnetic GLM both to embed the node positions and to aggregate messages in the complex domain. The convolutions of MagLapNet-I and MagLapNet-II over the imaginary and real parts are, respectively, a GCN [5] and a structure-aware transformer [18], while the node PE of both is derived from the eigendecomposition of the magnetic GLM. The dependence of the WL test on in-neighbors suggests that WL PE is applicable to both directed and undirected graphs; therefore, the GT with WL PE [14] is naturally applicable to directed graph node classification.

4.1.3. Experimental Protocol

Neighborhood sampling extracts a subgraph consisting of mini-batch seed nodes and a subset of the edges between nodes within their K-hop neighborhoods from a large graph. This approach circumvents the out-of-memory (OOM) issue caused by the quadratic computational complexity of global attention mechanisms. How the dataset is constructed and the batch size both have an impact on classification performance, leading to differences from the results reported in the original papers. Subsequently, data preprocessing initializes the structure-aware edge embeddings and node PE. The initial embeddings and node attributes are then fed into a single-layer DSGT to infer the multi-class joint probability distributions of the seed nodes.
For each run, the weight matrices are initialized using Xavier normal initialization, while all biases are set to zero. Macro accuracy is adopted as the evaluation metric for model performance, and the model parameters are updated using the Adam optimizer with a learning rate of $10^{-3}$ and beta parameters of 0.9 and 0.99. The loss function is a weighted sum of the multi-class cross-entropy loss and the PrimalAttention optimization objective. The dropout rate is 0.1 during model training. Additionally, the hyperparameters for DSGT and the baselines are a directed graph filter order of $K = 4$, a node positional encoding dimension of $D_p = 18$, and a structure-aware edge embedding dimension of $D_e = 108$. The dimension of the hidden layers is 100, while the output dimension equals the number of node classes.
Our pipelines for data loading, model training, and deployment were built using the PyTorch Lightning framework. The EarlyStopping callback of PyTorch Lightning, with a threshold of $10^{-4}$, a minimum of 100 epochs, and a maximum of 200 epochs, monitors the prediction accuracy of the model on the validation set and stops training when no improvement is observed. Meanwhile, we used the GradientAccumulationScheduler callback for gradient accumulation and scheduled the learning rate with ReduceLROnPlateau, with a maximum learning rate of $10^{-2}$ and a minimum of $10^{-4}$; the learning rate was adjusted based on the monitored metric to prevent overfitting during training.
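A configuration sketch of the callbacks described above is shown below, using standard PyTorch Lightning APIs; the monitored metric name, patience, and accumulation schedule are assumptions rather than the exact values used in our experiments.

```python
import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, GradientAccumulationScheduler

early_stop = EarlyStopping(monitor="val_acc", mode="max", min_delta=1e-4, patience=20)
grad_accum = GradientAccumulationScheduler(scheduling={0: 4})   # accumulate 4 batches (illustrative)

trainer = pl.Trainer(
    min_epochs=100,
    max_epochs=200,
    accelerator="gpu",
    devices=2,
    callbacks=[early_stop, grad_accum],
)

# Inside the LightningModule, the learning-rate schedule would be wired up roughly as:
# def configure_optimizers(self):
#     opt = torch.optim.Adam(self.parameters(), lr=1e-3)
#     sch = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode="max", min_lr=1e-4)
#     return {"optimizer": opt,
#             "lr_scheduler": {"scheduler": sch, "monitor": "val_acc"}}
```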

4.1.4. Experimental Environment

The DSGT model was implemented based on PyTorch, and we employed the MPNN paradigm of the PyG library to design its convolution module, whose implementation can be seen in Appendix D. End-to-end training on a multi-GPU setup was achieved via PyTorch Lightning. All the experiments were repeated five times on two 12 GB NVIDIA GeForce RTX 3080 Ti GPUs and a 40-core Intel Xeon Gold 6248 CPU, with no fewer than 100 epochs each time.

4.2. Comparative Experiments

In this section, we present the advantages of our model architecture over the aforementioned baselines for the node classification task and demonstrate the effectiveness of our approach for node positional embedding compared to that of other methods.

4.2.1. Supervised Node Classification

The experimental results for DSGT and the baselines are shown in Table 3. DSGT achieved state-of-the-art performance on the Cora, genius, Tolokers, and ogbn-arxiv datasets, improving the node classification accuracy by 2.96%, 6.63%, 6.60%, and 0.41%, respectively. DGCN exhibited the poorest performance in the node classification task. Among the baselines, DGCN utilizes first-order and second-order proximal matrices for aggregation. For a network with strong degree correlations, second-order proximal aggregation may introduce many additional edges, increasing the likelihood of overfitting. The average node classification accuracy of DGCN is positively correlated with the node homophily, suggesting that more homogeneous neighbors lead to more similar node representations through second-order proximal aggregation. In DiGCN, the PageRank algorithm approximates an irreducible and aperiodic graph Laplacian matrix (GLM), ultimately enabling a stationary Markov distribution through step-wise convolution. However, DiGCN inherits the same drawbacks as DGCN due to its Inception module, as message aggregation may amplify the noise caused by neglecting the information diffusion direction, particularly when node degree distributions are positively correlated.
Holonet outperforms both DGCN and DiGCN by respecting the directed graph topology. Compared to MagNet, MagLapNet-I/II utilizes a magnetic GLM based on the directed adjacency matrix, allowing it to capture directed cyclic subgraphs through phase shifts, which is the key to the superior performance of MagLapNet-I/II over MagNet. In contrast to the aforementioned GCN architectures, both MagLapNet-II and DSGT are GT architectures, which typically leverage global attention mechanisms and the ability to incorporate edge attributes to gain an advantage in node classification tasks. However, the arxiv-year dataset exhibits low node homophily and degree correlation, which prevents the GT from gaining additional expressiveness over GCN architectures through node PE and global adaptive aggregation. Among the GTs, GT-WL underperforms because the layer-wise hashing of a node's neighborhood information fails to capture the varying importance of nodes under different graph patterns. As shown in Table 3, effectively balancing the inductive bias with the scope of the receptive field can significantly improve the expressiveness of the model.

4.2.2. Effectiveness of Node Positional Embeddings

The comparative experiment on the effectiveness of node PE is presented in Table 4. DSGT with directed PE (ours) consistently outperformed the other methods across most of the datasets, achieving average improvements of 1.94%, 3.4%, 7.79%, 2.8%, and 0.41%, respectively. Despite the varying optimality of the feasible solutions obtained in each iteration, DSGT with directed PE (ours) remains robust by maintaining the relative importance of nodes under the graph patterns [22]. Magnetic LapPE-ABS represents the node information gain and is critical for node classification on the genius dataset, in which both the degree correlation and node homophily are strong, leading to its good performance. Similar to a GCN, WL PE iteratively integrates neighborhood structural information, progressively refining the position representation. Although a global WL PE theoretically exists, in practice its discriminability diminishes as the neighborhood radius expands, causing DSGT with WL PE to underperform among the GTs, primarily due to its failure to incorporate a directional inductive bias. Whether a node PE plays a task-relevant role in directed graphs determines its utility. Notably, the top 5% of singular values and the corresponding singular vectors capture about 95% of the matrix's characteristics, explaining why SVD PE outperforms Magnetic LapPE-ABS/REAL.

4.3. Ablation Experiment

The results of the ablation experiment are presented in Table 5. It can be observed that PrimalAttention, structural encoding, node PE, initial edge embedding, and the adaptive generation of filter coefficients led to performance improvements of 2.23%, 3.56%, 11.67%, 3.25%, and 3.79%, respectively. Among these, node PE had the most significant impact on DSGT's performance, while initial edge encoding had the least. In the arxiv-year dataset, the absence of structure encoding led to a 0.78% improvement in model performance. It is evident that node attributes play an overwhelming role in node classification here because the dataset does not include edge attributes that reflect semantic connections between nodes. Node PE serves as a soft inductive bias, providing position-aware information to the GT. When node PE is removed, the GT behaves similarly to a GAT on a fully connected graph, with over-smoothing and over-squashing. Additionally, the asymmetric attention kernels [30] in DSGT helped to limit the performance degradation to within 10%. The adaptive generation of filter coefficients unified the behavior of the polynomial filters in the spectral space, yielding an average performance improvement of 3.79% across all the datasets. Although the initial edge embedding and structural encoding did not significantly enhance the classification performance across the five datasets, they remain crucial for node-level representation learning on structurally informative datasets, such as traffic networks and protein–protein interaction (PPI) networks.

4.4. Hyperparameter Tuning

The hyperparameters tuned across all the datasets were the order of the polynomial filters ($K = 4$), the dimension of the node PE ($D_p = 18$), and the dimension of the initial edge embedding ($D_e = 108$).

4.4.1. The Order of Polynomial Filters

The effect of the order of the polynomial filters on model performance for the Cora dataset, under different node PEs, is presented in Table 6. DSGT with directed PE (ours) achieved its best average classification accuracy, 79.96%, when the filter order was set to four. SVD PE ranked second only to directed PE (ours), indicating that the asymmetry of the directed GLM can reflect the spatial distances between nodes under graph patterns. Because the underlying undirected graph does not uniquely characterize a directed graph, DSGT with undirected LapPE yielded inconsistent inference results when dealing with isomorphic graphs, resulting in the worst performance in the classification task. Most of the node PEs also achieved their best accuracy–computation trade-off when the filter order was set to four.

4.4.2. Dimension of Node Positional and Initial Edge Embedding

The effect of increasing the dimension of the node PE on the total number of parameters $\Theta$ (M) and the average classification accuracy $\Lambda$ (%) is shown in Table 7. Each increment in the node PE dimension increased the total parameter count by roughly 0.43M. To describe the impact of dimension changes on computational efficiency, a utility measure was defined as $P = \Lambda / \Theta$. For most of the PE methods, the computational efficiency tended to decline as the dimension increased beyond 18; consequently, the dimension of the node PE was set to 18. The relationship between the dimension of the initial edge embedding and the same two metrics is presented in Table 8. Each increment of 18 in the initial edge embedding dimension increased the parameter count by approximately 0.21M. Similarly, once the number of graph Fourier bases exceeded six (i.e., $D_e = 108$), the computational efficiency began to decrease; therefore, the initial edge embedding dimension was set to 108.

5. Conclusions

In the context of directed graph node representation learning, we proposed a hybrid architecture model, called DSGT, which integrates both GT and GCN architectures. Our approach introduces innovative methods for directed node PE and structure-aware edge embedding, specifically designed for the directed GT. Through comparative experiments, we not only demonstrated the SOTA performance of DSGT against baselines, but also highlighted its scalability across multi-scale datasets. Additionally, an ablation experiment confirmed the effectiveness of DSGT’s functional modules. We aim to pave the way for future advancements in node-level representation learning by further coupling graph transformer and GCN architectures.

Author Contributions

Conceptualization, G.H., Q.Y., F.C. and G.C.; conceiving and designing the experiments, and analyzing the data and writing the paper, G.H.; performing the experiments and contributing to data analysis, G.H. and Q.Y.; contributing analysis tools and reviewing the manuscript, F.C. and G.C.; supervising the project and providing critical feedback on the manuscript, G.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available in the PyG datasets. These data were derived from the following resources available in the public domain (PyG datasets website: https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html, accessed on 10 October 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. First-Order Extremum Condition for Dual Problems

The primal problem of self-attention is formulated as follows:
$$\max_{W_e, W_r, e_i, r_j}\; \min_{h_{e_i}, h_{r_j}}\; \mathcal{L} = \frac{1}{2}\sum_{i=1}^{N} e_i^{T}\Lambda e_i + \frac{1}{2}\sum_{j=1}^{N} r_j^{T}\Lambda r_j - \operatorname{Tr}\!\left(W_e^{T} W_r\right) \tag{A1}$$
$$\text{KKT conditions:}\quad
\begin{cases}
h_{e_i}^{T}\left(e_i - W_e^{T} f(X)\,\varpi(X_i)\right) = 0, & h_{e_i} > 0\\[2pt]
h_{r_j}^{T}\left(r_j - W_r^{T} f(X)\,\varpi(X_j)\right) = 0, & h_{r_j} > 0\\[2pt]
e_i - W_e^{T} f(X)\,\varpi(X_i) = r_j - W_r^{T} f(X)\,\varpi(X_j) = 0
\end{cases}\tag{A2}$$
In Equation (A2), the KKT conditions consist, successively, of the complementary slackness conditions, the dual feasibility conditions, and the primal feasibility conditions. The corresponding dual problem under the KKT conditions is as follows:
$$\max_{W_e, W_r, e_i, r_j}\; \min_{h_{e_i}, h_{r_j}}\; \mathcal{L} = \frac{1}{2}\sum_{i=1}^{N} e_i^{T}\Lambda e_i + \frac{1}{2}\sum_{j=1}^{N} r_j^{T}\Lambda r_j - \operatorname{Tr}\!\left(W_e^{T} W_r\right) - \sum_{i=1}^{N} h_{e_i}^{T}\left(e_i - W_e^{T} f(X)\,\varpi(X_i)\right) - \sum_{j=1}^{N} h_{r_j}^{T}\left(r_j - W_r^{T} f(X)\,\varpi(X_j)\right) \tag{A3}$$
The extremum conditions of the dual problem are as follows:
$$\begin{cases}
\dfrac{\partial \mathcal{L}}{\partial W_e} = 0 \;\Rightarrow\; W_r = \displaystyle\sum_{i=1}^{N} f(X)\,\varphi(X_i)\, h_{e_i}^{T} \in \mathbb{R}^{N\times s}, & \dfrac{\partial \mathcal{L}}{\partial e_i} = 0 \;\Rightarrow\; \Lambda e_i = h_{e_i}\\[8pt]
\dfrac{\partial \mathcal{L}}{\partial W_r} = 0 \;\Rightarrow\; W_e = \displaystyle\sum_{j=1}^{N} f(X)\,\varphi(X_j)\, h_{r_j}^{T} \in \mathbb{R}^{N\times s}, & \dfrac{\partial \mathcal{L}}{\partial r_j} = 0 \;\Rightarrow\; \Lambda r_j = h_{r_j}
\end{cases}\tag{A4}$$

Appendix A.2. Stationary Condition

The dual problem under the KKT conditions is defined in Equation (A3). We can formulate the extremum conditions of $\max_{W_e, W_r, e_i, r_j}\mathcal{L}$ together with the KKT conditions as the following equations:
$$\begin{aligned}
\Lambda^{-1} h_{e_i} &= \Big(\sum_{j=1}^{N} f(X)\,\varphi_k(x_j)\, h_{r_j}^{T}\Big)^{T} f(X)\,\varphi_q(x_i) = \sum_{j=1}^{N} h_{r_j}\, \underbrace{\varphi_k(x_j)^{T} f(X)^{T} f(X)\,\varphi_q(x_i)}_{K_{ij}}\\
\Lambda^{-1} h_{r_j} &= \Big(\sum_{i=1}^{N} f(X)\,\varphi_q(x_i)\, h_{e_i}^{T}\Big)^{T} f(X)\,\varphi_k(x_j) = \sum_{i=1}^{N} h_{e_i}\, \underbrace{\varphi_q(x_i)^{T} f(X)^{T} f(X)\,\varphi_k(x_j)}_{K_{ij}}
\end{aligned}\tag{A5}$$
In Equation (A5), $K \in \mathbb{R}^{N\times N}$ is the kernel matrix, and Equation (A5) takes the form of an LSSVM.
$$\begin{cases}
\Lambda^{-1} h_{e_i} = K_{i,:}\, H_r\\
\Lambda^{-1} h_{r_j} = \left(K^{T}\right)_{j,:} H_e
\end{cases}
\;\Rightarrow\;
\begin{cases}
\Lambda^{-1} H_e = K H_r\\
\Lambda^{-1} H_r = K^{T} H_e
\end{cases}
\;\Rightarrow\;
\begin{bmatrix} 0 & K\\ K^{T} & 0 \end{bmatrix}
\begin{bmatrix} H_e\\ H_r \end{bmatrix}
=
\begin{bmatrix} H_e\\ H_r \end{bmatrix}\Sigma
\tag{A6}$$
$$\begin{cases}
\Lambda^{-1} H_e = K H_r\\
\Lambda^{-1} H_r = K^{T} H_e
\end{cases}
\;\Rightarrow\;
\begin{cases}
\Lambda^{-1} K^{T} H_e = \Lambda^{-1}\Lambda^{-1} H_r = K^{T} K H_r\\
\Lambda^{-1} K H_r = \Lambda^{-1}\Lambda^{-1} H_e = K K^{T} H_e
\end{cases}
\;\Rightarrow\;
\begin{cases}
\Lambda^{-2} H_r = K^{T} K H_r\\
\Lambda^{-2} H_e = K K^{T} H_e
\end{cases}\tag{A7}$$
In Equation (A6), $H_r = [h_{r_1}, \ldots, h_{r_N}]^{T} \in \mathbb{R}^{N\times s}$, $H_e = [h_{e_1}, \ldots, h_{e_N}]^{T} \in \mathbb{R}^{N\times s}$, and $\Lambda^{-1} = \Sigma$. The kernel matrix is $K_{ij} = \left\langle f(X)\varphi_q(X_i),\, f(X)\varphi_k(X_j)\right\rangle = \left\langle \varphi_q(X_i),\, \varphi_k(X_j)\right\rangle$.
Equation (A6) is the stationary condition under which the objective value of the dual problem is identically equal to zero.
Proof. Substituting Equation (A6) and Equation (A4) into the dual objective in Equation (A3), we have the following:
$$\begin{aligned}
\min_{h_{e_i}, h_{r_j}}\max_{W_e, W_r, e_i, r_j}\mathcal{L}
&= \frac{1}{2}\sum_{i=1}^{N}\big(\Lambda^{-1}h_{e_i}\big)^{T}\Lambda\big(\Lambda^{-1}h_{e_i}\big) + \frac{1}{2}\sum_{j=1}^{N}\big(\Lambda^{-1}h_{r_j}\big)^{T}\Lambda\big(\Lambda^{-1}h_{r_j}\big)
- \operatorname{Tr}\!\Big[\Big(\sum_{j=1}^{N} f(X)\varphi_k(X_j)h_{r_j}^{T}\Big)^{T}\Big(\sum_{i=1}^{N} f(X)\varphi_q(X_i)h_{e_i}^{T}\Big)\Big]\\
&\quad - \sum_{i=1}^{N} h_{e_i}^{T}\Big(e_i - \Big(\sum_{j=1}^{N} f(X)\varphi_k(X_j)h_{r_j}^{T}\Big)^{T} f(X)\varphi_q(X_i)\Big)
- \sum_{j=1}^{N} h_{r_j}^{T}\Big(r_j - \Big(\sum_{i=1}^{N} f(X)\varphi_q(X_i)h_{e_i}^{T}\Big)^{T} f(X)\varphi_k(X_j)\Big)\\
&= \frac{1}{2}\sum_{i=1}^{N} h_{e_i}^{T}\Lambda^{-1}h_{e_i} + \frac{1}{2}\sum_{j=1}^{N} h_{r_j}^{T}\Lambda^{-1}h_{r_j}
- \operatorname{Tr}\!\Big[\sum_{i=1}^{N}\sum_{j=1}^{N} h_{r_j}\,\varphi_k(X_j)^{T} f(X)^{T} f(X)\varphi_q(X_i)\, h_{e_i}^{T}\Big]\\
&= \frac{1}{2}\operatorname{Tr}\!\big(H_r^{T}\Lambda^{-1}H_r\big) + \frac{1}{2}\operatorname{Tr}\!\big(H_e^{T}\Lambda^{-1}H_e\big) - \operatorname{Tr}\!\big(H_e^{T} K H_r\big)\\
&= \frac{1}{2}\operatorname{Tr}\!\big(H_r^{T}\Lambda^{-1}H_r\big) + \frac{1}{2}\operatorname{Tr}\!\big(H_e^{T}\Lambda^{-1}H_e\big) - \operatorname{Tr}\!\big(H_e^{T}\Lambda^{-1}H_e\big)\\
&= \frac{1}{2}\operatorname{Tr}\!\big(H_r^{T}\Lambda^{-1}H_r\big) - \frac{1}{2}\operatorname{Tr}\!\big(H_e^{T}\Lambda^{-1}H_e\big)
= \frac{1}{2}\operatorname{Tr}\!\big(\Lambda^{-1}\big) - \frac{1}{2}\operatorname{Tr}\!\big(\Lambda^{-1}\big) = 0,
\end{aligned}\tag{A8}$$
where $H_e, H_r \in \mathbb{R}^{N\times s}$ and $K \in \mathbb{R}^{N\times N}$. □

Appendix A.3. Advantages over Traditional Attention Kernel

The projection scores e ( X i ) and r ( X j ) are formulated in the primal and dual problem as follows:
$$\text{Primal:}\;
\begin{cases}
e_i = W_e^{T}\big|_{X}\, \varphi_q(x_i)\\[2pt]
r_j = W_r^{T}\big|_{X}\, \varphi_k(x_j)
\end{cases}
\qquad
\text{Dual:}\;
\begin{cases}
e_i = \displaystyle\sum_{j=1}^{N} h_{r_j} K_{ij}\\[2pt]
r_j = \displaystyle\sum_{i=1}^{N} h_{e_i} K_{ij}
\end{cases}\tag{A9}$$
In Equation (A9), traditional self-attention is formulated as $e_i = \sum_{j=1}^{N} h_{r_j} K_{ij}$, where $K_{ij}$ is the attention score between the $i$th and $j$th nodes and $h_{r_j}$ is a sample from the value subspace. In our work, $e_i$ and $r_j$ are considered to describe asymmetric dependence from mutual perspectives.
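As a concrete illustration of Equation (A9), the snippet below contrasts the dual view, in which each score is an attention-weighted sum through the kernel matrix, with the primal view, in which the two projection scores $e_i$ and $r_j$ are computed directly from learned weights. All tensor names, the identity feature maps, and the random dual variables are illustrative assumptions rather than the exact DSGT implementation.

```python
import torch

torch.manual_seed(0)
N, d, s = 6, 16, 4            # nodes, feature dimension, rank-space dimension
X = torch.randn(N, d)         # node features; phi_q / phi_k are taken as identity maps here

W_e = torch.randn(d, s)       # primal projection weights (illustrative)
W_r = torch.randn(d, s)

# Primal view (Equation (A9), left): two asymmetric projection scores per node.
E = X @ W_e                   # e_i, shape [N, s]
R = X @ W_r                   # r_j, shape [N, s]

# Dual view (Equation (A9), right): scores expressed through the kernel matrix
# K_ij = <phi_q(x_i), phi_k(x_j)> and dual variables h_e, h_r.
K = X @ X.T                   # an asymmetric kernel would use distinct maps for rows and columns
H_e = torch.randn(N, s)       # dual variables (illustrative stand-ins)
H_r = torch.randn(N, s)
E_dual = K @ H_r              # e_i = sum_j K_ij h_{r_j}
R_dual = K.T @ H_e            # r_j = sum_i K_ij h_{e_i}

print(E.shape, R.shape, E_dual.shape, R_dual.shape)
```

The primal form avoids materializing the full $N \times N$ kernel matrix, which is the practical motivation for computing the two projection scores directly.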

Appendix B

The abbreviations and notations employed throughout this paper are detailed in Table A1 and Table A2, ensuring clarity and ease of reference.
Table A1. Abbreviations and Descriptions.
Abbreviation | Description
DSGT | Directed Spectral Graph Transformer
Node PE | Node Positional Embedding
SP | Signal Processing
KSVD | Kernel Singular Value Decomposition
TV | (Undirected) Total Variation
DV | Directed Total Variation
LSSVM | Least-Squares Support Vector Machine
GLM | Graph Laplacian Matrix
GSO | Graph Shift Operator (e.g., graph Laplacian matrix, graph adjacency matrix)
GT | Graph Transformer
GCN | Graph Convolution Network
GNN | Graph Neural Network
GFT | Graph Fourier Transform
IGFT | Inverse Graph Fourier Transform
RKBS | Reproducing Kernel Banach Spaces
MLP | Multi-Layer Perceptron
GP | Global Maximum Pooling
FFN | Feed-Forward Network
Table A2. Symbols and Descriptions.
Symbol | Description
$N$ | The total number of nodes in a graph
$A \in \mathbb{R}^{N \times N}$ | Binary graph adjacency matrix, where $A_{ij} \in \{0, 1\}$
$D^{\mathrm{in}} \in \mathbb{R}^{N \times N}$ | Graph in-degree diagonal matrix, where $D^{\mathrm{in}}_{ii} = \sum_{j=1}^{N} A_{j,i}$
$d_f$ | The dimension of the node feature
$d_p$ | The dimension of the node positional embedding
$d_e$ | The dimension of the initial edge embedding
$d_a$ | The dimension of the edge attribute
$L \in \mathbb{R}^{N \times N}$ | Undirected graph Laplacian matrix
$T \in \mathbb{R}^{N \times N}$ | Directed graph Laplacian matrix
Concat | Stack operation in the feature dimension
Transformer | Transformer encoder
$\mathrm{GP}_i$ | Global maximum pooling for the $i$th dimension
$\sigma(\cdot)$ | Nonlinear activation function
$s$ | The dimension of the rank space, i.e., the number of principal components
$p$ | The dimension of the basis vectors in the rank space
$\cdot$ | Matrix multiplication
$\times$ | Scalar multiplication

Appendix C

In this section, we introduce the theoretical foundations of the algorithm design, including undirected spectral graph convolution theory and the optimal decomposition of the nonlinear kernel SVD. When the edges are directed, undirected spectral graph theory no longer applies directly, but it still provides an effective paradigm for constructing graph filters. In the global asymmetric GT, the attention kernel captures mutually asymmetric dependence from both perspectives, which aligns with the optimization objective of the nonlinear kernel SVD.

Appendix C.1. Spectral Graph Convolution Theory

In the spectral domain, node features can be reconstructed using the graph Fourier basis.
$$f_{out}(i) = \sum_{l=0}^{N-1} \hat{f}_{in}(\lambda_l)\,\hat{h}(\lambda_l)\, u_l(i) = \sum_{j=1}^{N} f_{in}(j) \sum_{k=0}^{K} a_k \left(L^{k}\right)_{i,j} \tag{A10}$$
In Equation (A10), the node signal is reconstructed by the GFT and IGFT. $u_l \in \mathbb{R}^{N}$ is the $l$th graph Fourier basis vector, and $(\cdot)^{*}$ is the conjugate transpose operator. The GLM $L = \sum_{l=1}^{N}\lambda_l\, u_l u_l^{*} = U\Lambda U^{*} \in \mathbb{R}^{N\times N}$ is formulated as the weighted sum of $N$ dyads $u_l u_l^{*}$ [36]. $\hat{f}_{in}(\lambda_l) = \sum_{j=1}^{N} f_{in}(j)\, u_l^{*}(j)$ is the frequency spectrum obtained by the GFT, where $f_{in}(j)$ is the $j$th node signal and $\hat{h}(\lambda_l)$ is the $K$-order polynomial filter response at the $l$th frequency, with $K+1$ coefficients $a = \{a_0, \ldots, a_K\}$. Equation (A10) can also be written compactly as follows:
$$G_{\theta} \star X = U \cdot g_{\theta}(\Lambda)\cdot\left(U^{*}X\right) = U\Big(\sum_{i=0}^{k}\theta_i \Lambda^{i}\Big)U^{*}X = \sum_{i=0}^{k}\theta_i \left(U\Lambda U^{*}\right)^{i}X = \sum_{i=0}^{k}\theta_i L^{i}X \tag{A11}$$
In Equation (A11), $X \in \mathbb{R}^{N\times M}$ is the $M$-dimensional feature matrix of the $N$ nodes, and the eigendecomposition of the GLM is $U\Lambda U^{T}$ (in the real domain, $U^{*} = U^{T}$). The frequency response is $g_{\theta}(\Lambda)\cdot(U^{T}X)$, with the scalar-to-scalar function $g_{\theta}(\Lambda) = \sum_{i=0}^{k}\theta_i\Lambda^{i}$ in polynomial space; $U^{T}X$ is the frequency spectrum obtained via the GFT, and, conversely, $U\cdot g_{\theta}(\Lambda)\cdot(U^{T}X)$ is the reconstruction of the node signal via the IGFT. However, the computational overhead of taking powers of the GLM is prohibitive. Chebyshev polynomials can approximate arbitrary functions of a matrix whose spectrum lies within the unit interval; hence, the GLM is first normalized by its spectral radius, and the Chebyshev polynomials are then employed as a basis of the polynomial space.
$$y = \sigma\Big(\sum_{k=0}^{K}\theta_k T_k(\hat{L})X\Big) \approx \sigma\Big(\sum_{i=0}^{k}\theta_i L^{i}X\Big)\tag{A12}$$
Equation (A12) uses the Chebyshev polynomials $\{T_k(\hat{L}),\, k = 0, \ldots, K\}$, where $T_0(X) = I$, $T_1(X) = X$, and $T_k(X) = 2XT_{k-1}(X) - T_{k-2}(X)$, together with the normalized GLM $\hat{L} = L_{sym} - I_N$, where $L_{sym} = I_N - D^{-1/2}AD^{-1/2}$ is the symmetrically normalized GLM. $\theta = \{\theta_k,\, k = 0, \ldots, K\}$ are the $K+1$ coefficients of the $K$-order polynomial filter. The first-order polynomial filter can be approximated as follows:
$$X^{l+1} = \sigma\!\left(\left(\theta_0 + \theta_1\left(L_{sym} - I_N\right)\right) X^{l}\right) = \sigma\!\left(\theta_0 X^{l} - \theta_1 D^{-1/2} A D^{-1/2} X^{l}\right)\tag{A13}$$
In Equation (A13), $X^{l} \in \mathbb{R}^{N\times M}$ denotes the node features at the $l$th layer. If $\theta_0 = \theta_1 = \theta$, the simplest version of the first-order filter is as follows:
$$X^{l+1} = \sigma\!\left(\theta\left(I_N - D^{-1/2} A D^{-1/2}\right) X^{l}\right) \;\Rightarrow\; X^{l+1} = \sigma\!\left(\left(I_N - D^{-1/2} A D^{-1/2}\right) X^{l} W^{l}\right)\tag{A14}$$
In Equation (A14), $\theta \in \mathbb{R}$ is a scalar. Up to this point, the equivalence between spatial convolution and the spectral filter has been shown for a self-adjoint GSO. Furthermore, the presence of the FFN allows the GCN to identify complex inter-dimensional correlations. Finally, the spectral GCN is formulated as in Equation (A14), where $W^{l} \in \mathbb{R}^{M\times M}$ is a linear transformation matrix.
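The following sketch numerically mirrors Equations (A12)–(A14) on a small undirected graph: it builds the symmetrically normalized GLM, evaluates a $K$-order Chebyshev filter, and applies the simplified first-order propagation. The function names and the toy cycle graph are assumptions for illustration only, not the DSGT code.

```python
import torch

def sym_normalized_laplacian(adj: torch.Tensor) -> torch.Tensor:
    """L_sym = I - D^{-1/2} A D^{-1/2} for an undirected adjacency matrix."""
    deg = adj.sum(dim=1)
    d_inv_sqrt = deg.clamp(min=1e-12).pow(-0.5)
    return torch.eye(adj.size(0)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]

def chebyshev_filter(L_sym: torch.Tensor, X: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """Evaluate sum_k theta_k T_k(L_hat) X with L_hat = L_sym - I (Equation (A12))."""
    L_hat = L_sym - torch.eye(L_sym.size(0))
    T_prev, T_curr = X, L_hat @ X                     # T_0(L_hat)X and T_1(L_hat)X
    out = theta[0] * T_prev + theta[1] * T_curr
    for k in range(2, theta.numel()):
        T_next = 2 * L_hat @ T_curr - T_prev          # Chebyshev recurrence
        out = out + theta[k] * T_next
        T_prev, T_curr = T_curr, T_next
    return out

# Toy graph: 4-node undirected cycle with 3-dimensional node features.
adj = torch.tensor([[0., 1., 0., 1.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.],
                    [1., 0., 1., 0.]])
X = torch.randn(4, 3)
L_sym = sym_normalized_laplacian(adj)

y_cheb = chebyshev_filter(L_sym, X, theta=torch.randn(5))     # K = 4 filter, as tuned in Section 4.4
W = torch.randn(3, 3)
y_gcn = torch.relu(L_sym @ X @ W)                             # simplified first-order form (Equation (A14))
print(y_cheb.shape, y_gcn.shape)
```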

Appendix C.2. Nonlinear Kernel SVD

The optimization objective is to find, for each sample space, $r$ projection vectors along which the similarity of one space to the other is greatest, where $r$ is the rank of the matrix $A$. The left and right singular vectors of the SVD $A = U\Lambda V^{T}$ reflect the asymmetric similarity of the row and column spaces of $A$, respectively. The linear dependence between the two sample spaces is usually not evident in the data space. Kernel SVD builds on SVD and identifies complex correlations in reproducing kernel Banach spaces (RKBS). Consider the rows and columns of a matrix $A \in \mathbb{R}^{N\times M}$ as two sample spaces: $N$ row samples $X = \{A_{i,:} \equiv x_i\}_{i=1}^{N} \subset \mathbb{R}^{M}$ and $M$ column samples $Z = \{A_{:,j} \equiv z_j\}_{j=1}^{M} \subset \mathbb{R}^{N}$. In the RKBS, $\phi(x_i) \in \mathbb{R}^{p}$ and $\psi(z_j) \in \mathbb{R}^{p}$ are samples in the row and column feature spaces, where $\phi(\cdot): \mathbb{R}^{M} \to \mathbb{R}^{p}$ and $\psi(\cdot): \mathbb{R}^{N} \to \mathbb{R}^{p}$ perform the nonlinear mapping from the data space to the RKBS and ensure dimension compatibility. The $r$ $p$-dimensional projection vectors are formulated as follows, under the constraint $p \geq r$.
$$\begin{cases}
A_{\phi} = [a_{\phi_1}, \ldots, a_{\phi_r}], & A_{\phi} = \Phi^{T} B_{\phi}\\[2pt]
A_{\psi} = [a_{\psi_1}, \ldots, a_{\psi_r}], & A_{\psi} = \Psi^{T} B_{\psi}
\end{cases}\tag{A15}$$
In Equation (A15), according to the generic concept of an RKBS, any vector can be expressed as a linear combination of samples in that space, so $a_{\phi_l} = \sum_{i=1}^{N} b_{\phi_{l,i}}\,\phi(x_i)$ and $a_{\psi_l} = \sum_{j=1}^{M} b_{\psi_{l,j}}\,\psi(z_j)$. The feature subspaces $A_{\phi}$ and $A_{\psi}$ consist of $r$ linearly independent vectors from the row and column spaces, respectively. $\Phi = [\phi(x_1), \ldots, \phi(x_N)] \in \mathbb{R}^{N\times p}$ and $\Psi = [\psi(z_1), \ldots, \psi(z_M)] \in \mathbb{R}^{M\times p}$ collect the samples from the row and column feature spaces, respectively. Maximizing the variance of a multi-distributional sample yields principal components, i.e., the directions along which the characteristics of the sample distribution are most evident. Meanwhile, comprehensive asymmetry requires the two feature subspaces to be mutually orthogonal. The optimization objective of KSVD is formulated as follows:
$$\max_{B_{\phi}, B_{\psi}}\; \frac{1}{2}\Big(\operatorname{Tr}\hat{\Sigma}_{\phi} + \operatorname{Tr}\hat{\Sigma}_{\psi}\Big) = \frac{1}{2}\left\| G^{T} B_{\phi}\right\|_{2}^{2} + \frac{1}{2}\left\| G B_{\psi}\right\|_{2}^{2} \qquad \text{s.t.}\;\; A_{\phi}^{T} A_{\psi} = B_{\phi}^{T} G B_{\psi} = I_r \tag{A16}$$
In Equation (A16), $\Sigma_{\phi} = \left(\Psi A_{\phi}\right)^{T}\left(\Psi A_{\phi}\right)$ is the variance of the samples from the row space of matrix $A$ along the principal components of $A$'s column space. Similarly, $\Sigma_{\psi} = \left(\Phi A_{\psi}\right)^{T}\left(\Phi A_{\psi}\right)$ is the variance of the samples from $A$'s column space along the principal components of $A$'s row space. The kernel matrix $G \in \mathbb{R}^{N\times M}$ reflects the dependence between the row and column samples, and each element $G_{ij} = \kappa(x_i, z_j) = \phi(x_i)^{T}\psi(z_j)$ is computed by either the SNE or T kernel between the $i$th row sample and the $j$th column sample. The orthogonality-preserving constraint highlights the independence of the two feature subspaces in which the correlation of the row and column samples is evaluated in the RKBS. The above is also called the shifted eigenvalue problem, whose feasible solution is as follows:
$$\begin{cases}
G^{T} G B_{\psi} = G^{T} B_{\phi}\Lambda\\[2pt]
G G^{T} B_{\phi} = G B_{\psi}\Lambda
\end{cases}
\;\Rightarrow\;
\begin{cases}
G G^{T}\left(G B_{\psi}\right) = \left(G B_{\psi}\right)\Lambda\Lambda\\[2pt]
G^{T} G\left(G^{T} B_{\phi}\right) = \left(G^{T} B_{\phi}\right)\Lambda\Lambda
\end{cases}\tag{A17}$$
In Equation (A17), $GB_{\psi}$ and $G^{T}B_{\phi}$ are the left and right singular matrices of $G$, and the Lagrange multipliers $\Lambda = \operatorname{diag}([\lambda_1, \ldots, \lambda_r])$ are the singular values of $G$. Compared to SVD, KSVD is able to explore asymmetric nonlinear dependence via the kernel trick. The SVD of the kernel matrix $G$ is formulated as $G = B_{\psi}\Lambda B_{\phi}^{T}$, where $B_{\psi} \in \mathbb{R}^{N\times r}$, $B_{\phi} \in \mathbb{R}^{M\times r}$, and $\Lambda \in \mathbb{R}^{r\times r}$.
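To make the shifted eigenvalue problem in Equation (A17) tangible, the snippet below constructs a stand-in kernel matrix, builds $B_{\phi}$ and $B_{\psi}$ from its truncated SVD, and checks both the orthogonality constraint of Equation (A16) and the shifted eigenvalue relations of Equation (A17). The random matrix $G$, the specific scaling by $\lambda^{-1/2}$, and the variable names are illustrative assumptions rather than the SNE/T kernels and code used in the paper.

```python
import torch

torch.manual_seed(0)
N, M, r = 8, 5, 3

# G stands in for the kernel matrix G_ij = kappa(x_i, z_j) between the N row
# samples and the M column samples of a data matrix A (illustrative random values).
G = torch.randn(N, M, dtype=torch.float64)

# KSVD solution sketch: build B_phi, B_psi from the truncated SVD of G so that
# the orthogonality constraint B_phi^T G B_psi = I_r of Equation (A16) holds.
U, S, Vh = torch.linalg.svd(G, full_matrices=False)
Lam = torch.diag(S[:r])                               # Lagrange multipliers / singular values
B_phi = U[:, :r] / S[:r].sqrt()                       # coefficients in the row-sample space
B_psi = Vh[:r].T / S[:r].sqrt()                       # coefficients in the column-sample space

print(torch.allclose(B_phi.T @ G @ B_psi, torch.eye(r, dtype=torch.float64)))  # constraint of (A16)

# Shifted eigenvalue checks of Equation (A17): G B_psi and G^T B_phi are (scaled)
# left and right singular matrices of G, with eigenvalues Lam^2 of G G^T and G^T G.
left, right = G @ B_psi, G.T @ B_phi
print(torch.allclose(G @ G.T @ left, left @ Lam**2))
print(torch.allclose(G.T @ G @ right, right @ Lam**2))
```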

Appendix D

PyTorch Geometric (PyG) efficiently creates graph data pipelines and accelerates graph convolutional network (GCN) training and inference using sparse matrix operations. The nn.MessagePassing class in PyG serves as the base class for GCNs built on the message passing neural network (MPNN) paradigm. Its member functions (i.e., message, edge_update, update, and forward) can be customized to implement message passing, edge feature updates, node feature updates, and graph convolution operations.
The functional connectivity of all the modules is shown in Figure A1. The structure-aware GCN and the adaptive ChebConv, both based on the MPNN paradigm, involve message passing, aggregation, and update operations for each node in the graph. Similar to GAT, the node PE update performs a biased aggregation of the PEs in the one-hop neighborhood to update each node's PE. In layers with $l > 0$, the update of each edge attribute considers the current edge attribute as well as the target and source node features. Because these operations affect every element of the graph, their base class is required to be PyG's nn.MessagePassing. In contrast, the global asymmetric attention kernel, the node feature update, and the filter coefficient generation are implemented through parallel tensor computation, so their base class is torch.nn.Module. Because the minted package in LaTeX does not allow listings to span across pages, we mainly focus on demonstrating the implementation of the modules using the PyTorch and PyG libraries in the forward inference.
Figure A1. The forward inference of the directed spectral graph transformer (DSGT) layer is implemented through the PyG and PyTorch libraries.
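The following is a minimal, self-contained sketch of an MPNN-style layer of the kind described above, built on PyG's MessagePassing base class. The class name StructureAwareConv, the specific message function, and the residual-style update are illustrative assumptions and do not reproduce the authors' exact DSGT modules.

```python
import torch
from torch import nn
from torch_geometric.nn import MessagePassing

class StructureAwareConv(MessagePassing):
    """Illustrative MPNN-style layer: messages are modulated by edge features,
    aggregated over the one-hop neighborhood, and combined with a residual term."""

    def __init__(self, in_dim: int, edge_dim: int, out_dim: int):
        super().__init__(aggr="add", flow="source_to_target")
        self.lin_node = nn.Linear(in_dim, out_dim)
        self.lin_edge = nn.Linear(edge_dim, out_dim)
        self.act = nn.ReLU()

    def forward(self, x, edge_index, edge_attr):
        # propagate() dispatches to message(), the chosen aggregation, and update().
        return self.propagate(edge_index, x=x, edge_attr=edge_attr)

    def message(self, x_j, edge_attr):
        # x_j: features of source nodes; each message is conditioned on its edge attribute.
        return self.act(self.lin_node(x_j) + self.lin_edge(edge_attr))

    def update(self, aggr_out, x):
        # Residual-style node update after neighborhood aggregation.
        return aggr_out + self.lin_node(x)

# Toy usage: 4 nodes, 5 directed edges with 3-dimensional edge attributes.
x = torch.randn(4, 8)
edge_index = torch.tensor([[0, 1, 2, 3, 0],
                           [1, 2, 3, 0, 2]])
edge_attr = torch.randn(5, 3)
layer = StructureAwareConv(in_dim=8, edge_dim=3, out_dim=8)
print(layer(x, edge_index, edge_attr).shape)  # torch.Size([4, 8])
```

Modules that operate on every node and edge follow this MessagePassing pattern, whereas globally acting components (e.g., the asymmetric attention kernel) can remain plain torch.nn.Module subclasses operating on dense tensors.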

References

  1. Zhang, G.; Li, D.; Gu, H.; Lu, T.; Shang, L.; Gu, N. Simulating News Recommendation Ecosystems for Insights and Implications. IEEE Trans. Comput. Soc. Syst. 2024, 11, 5699–5713. [Google Scholar] [CrossRef]
  2. Jain, S.; Hegade, P. E-commerce Product Recommendation Based on Product Specification and Similarity. In Proceedings of the 2021 International Conference on Innovation and Intelligence for Informatics, Computing, and Technologies (3ICT), Zallaq, Bahrain, 29–30 September 2021; pp. 620–625. [Google Scholar] [CrossRef]
  3. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral Networks and Locally Connected Networks on Graphs. arXiv 2013, arXiv:1312.6203. [Google Scholar] [CrossRef]
  4. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the NIPS’16, 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 3844–3852. [Google Scholar]
  5. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2016, arXiv:1609.02907. [Google Scholar] [CrossRef]
  6. Tong, Z.; Liang, Y.; Sun, C.; Rosenblum, D.S.; Lim, A. Directed Graph Convolutional Network. arXiv 2020, arXiv:2004.13970. [Google Scholar] [CrossRef]
  7. Zhang, X.; He, Y.; Brugnone, N.; Perlmutter, M.; Hirn, M. MagNet: A Neural Network for Directed Graphs. Adv. Neural Inf. Process. Syst. 2021, 34, 27003–27015. [Google Scholar]
  8. Ma, Y.; Hao, J.; Yang, Y.; Li, H.; Jin, J.; Chen, G. Spectral-based Graph Convolutional Network for Directed Graphs. arXiv 2019, arXiv:1907.08990. [Google Scholar] [CrossRef]
  9. Tong, Z.; Liang, Y.; Sun, C.; Li, X.; Rosenblum, D.; Lim, A. Digraph Inception Convolutional Networks. Adv. Neural Inf. Process. Syst. 2020, 33, 17907–17918. [Google Scholar]
  10. Koke, C.; Cremer, D. HoloNets: Spectral Convolutions do extend to Directed Graphs. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024; Available online: https://openreview.net/forum?id=EhmEwfavOW (accessed on 10 October 2024).
  11. Nguyen, K.; Nong, H.; Nguyen, V.; Ho, N.; Osher, S.; Nguyen, T. Revisiting over-smoothing and over-squashing using ollivier-ricci curvature. In Proceedings of the ICML’23, 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
  12. Akansha, S. Over-Squashing in Graph Neural Networks: A Comprehensive survey. arXiv 2023, arXiv:2308.15568. [Google Scholar]
  13. Topping, J.; Di Giovanni, F.; Chamberlain, B.P.; Dong, X.; Bronstein, M.M. Understanding over-squashing and bottlenecks on graphs via curvature. arXiv 2021, arXiv:2111.14522. [Google Scholar] [CrossRef]
  14. Dwivedi, V.P.; Bresson, X. A Generalization of Transformer Networks to Graphs. arXiv 2020, arXiv:2012.09699. [Google Scholar]
  15. Bastos, A.; Nadgeri, A.; Singh, K.; Kanezashi, H.; Suzumura, T.; Mulang’, I.O. Investigating Expressiveness of Transformer in Spectral Domain for Graphs. arXiv 2022, arXiv:2201.09332. [Google Scholar] [CrossRef]
  16. Kreuzer, D.; Beaini, D.; Hamilton, W.; Létourneau, V.; Tossou, P. Rethinking Graph Transformers with Spectral Attention. Adv. Neural Inf. Process. Syst. 2021, 34, 21618–21629. [Google Scholar]
  17. Dwivedi, V.P.; Luu, A.T.; Laurent, T.; Bengio, Y.; Bresson, X. Graph Neural Networks with Learnable Structural and Positional Representations. arXiv 2021, arXiv:2110.07875. [Google Scholar] [CrossRef]
  18. Chen, D.; O’Bray, L.; Borgwardt, K. Structure-Aware Transformer for Graph Representation Learning. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; Volume 162, pp. 3469–3489. [Google Scholar]
  19. Rampášek, L.; Galkin, M.; Dwivedi, V.P.; Luu, A.T.; Wolf, G.; Beaini, D. Recipe for a general, powerful, scalable graph transformer. In Proceedings of the NIPS ’22, 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  20. Bo, D.; Shi, C.; Wang, L.; Liao, R. Specformer: Spectral Graph Neural Networks Meet Transformers. arXiv 2023, arXiv:2303.01028. [Google Scholar] [CrossRef]
  21. Singh, R.; Chakraborty, A.; Manoj, B.S. Graph Fourier transform based on directed Laplacian. In Proceedings of the 2016 International Conference on Signal Processing and Communications (SPCOM), Bangalore, India, 12–15 June 2016; pp. 1–5. [Google Scholar] [CrossRef]
  22. Shafipour, R.; Khodabakhsh, A.; Mateos, G.; Nikolova, E. A Directed Graph Fourier Transform With Spread Frequency Components. IEEE Trans. Signal Process. 2019, 67, 946–960. [Google Scholar] [CrossRef]
  23. Leus, G.; Segarra, S.; Ribeiro, A.; Marques, A.G. The Dual Graph Shift Operator: Identifying the Support of the Frequency Domain. J. Fourier Anal. Appl. 2021, 27, 49. [Google Scholar] [CrossRef]
  24. Yun, S.; Jeong, M.; Kim, R.; Kang, J.; Kim, H.J. Graph transformer networks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  25. Maron, H.; Ben-Hamu, H.; Shamir, N.; Lipman, Y. Invariant and Equivariant Graph Networks. arXiv 2018, arXiv:1812.09902. [Google Scholar]
  26. Lim, D.; Robinson, J.; Zhao, L.; Smidt, T.E.; Sra, S.; Maron, H.; Jegelka, S. Sign and Basis Invariant Networks for Spectral Graph Representation Learning. arXiv 2022, arXiv:2202.13013. [Google Scholar]
  27. Ma, G.; Wang, Y.; Wang, Y. Laplacian Canonization: A Minimalist Approach to Sign and Basis Invariant Spectral Embedding. Adv. Neural Inf. Process. Syst. 2023, 36, 11296–11337. [Google Scholar]
  28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the NIPS’17, 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
  29. Nguyen, T.M.; Nguyen, T.; Ho, N.; Bertozzi, A.L.; Baraniuk, R.G.; Osher, S.J. A Primal-Dual Framework for Transformers and Neural Networks. In Proceedings of the Eleventh International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  30. Chen, Y.; Tao, Q.; Tonin, F.; Suykens, J.A. Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal Representation. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  31. Suykens, J.A. SVD revisited: A new variational principle, compatible feature maps and nonlinear extensions. Appl. Comput. Harmon. Anal. 2016, 40, 600–609. [Google Scholar] [CrossRef]
  32. Geisler, S.; Li, Y.; Mankowitz, D.; Cemgil, A.T.; Günnemann, S.; Paduraru, C. Transformers meet directed graphs. In Proceedings of the ICML’23, 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
  33. Beaini, D.; Passaro, S.; Letourneau, V.; Hamilton, W.L.; Corso, G.; Liò, P. Directional graph networks. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]
  34. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  35. Abadi, M. TensorFlow: Learning functions at scale. In Proceedings of the 21st ACM SIGPLAN International Conference on Functional Programming, Nara, Japan, 18–24 September 2016. [Google Scholar]
  36. Leon, S.J.; De Pillis, L.; De Pillis, L.G. Linear Algebra with Applications; Pearson Prentice Hall: Upper Saddle River, NJ, USA, 2006. [Google Scholar]
  37. Moret, I.; Novati, P. The computation of functions of matrices by truncated Faber series. Numer. Funct. Anal. Optim. 2001, 22, 1–18. [Google Scholar] [CrossRef]
  38. Cai, T.; Luo, S.; Xu, K.; He, D.; Liu, T.y.; Wang, L. Graphnorm: A principled approach to accelerating graph neural network training. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual, 18–24 July 2021; pp. 1204–1215. [Google Scholar]
  39. Maekawa, S.; Sasaki, Y.; Onizuka, M. A Simple and Scalable Graph Neural Network for Large Directed Graphs. arXiv 2023, arXiv:2306.08274. [Google Scholar]
  40. Hu, W.; Fey, M.; Zitnik, M.; Dong, Y.; Ren, H.; Liu, B.; Catasta, M.; Leskovec, J. Open Graph Benchmark: Datasets for Machine Learning on Graphs. Adv. Neural Inf. Process. Syst. 2020, 33, 22118–22133. [Google Scholar]
  41. Lim, D.; Hohne, F.; Li, X.; Huang, S.L.; Gupta, V.; Bhalerao, O.; Lim, S.N. Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods. Adv. Neural Inf. Process. Syst. 2021, 34, 20887–20902. [Google Scholar]
  42. Platonov, O.; Kuznedelev, D.; Diskin, M.; Babenko, A.; Prokhorenkova, L. A critical look at the evaluation of GNNs under heterophily: Are we really making progress? In Proceedings of the Eleventh International Conference on Learning Representations, Virtual Event, 25–29 April 2022.
  43. Fey, M.; Lenssen, J.E. Fast Graph Representation Learning with PyTorch Geometric. arXiv 2019, arXiv:1903.02428. [Google Scholar]
  44. Hussain, M.S.; Zaki, M.J.; Subramanian, D. Global Self-Attention as a Replacement for Graph Convolution. In Proceedings of the KDD ’22, 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 655–665. [Google Scholar] [CrossRef]
Figure 1. The architecture of DSGT with a priori knowledge. The preprocessing obtains a directed node PE and an initial edge embedding, and the forward inference outputs a multi-class joint probability distribution.
Table 1. Summary of existing methods for node-level representation learning in graphs. The table compares three mainstream architectures and our work, i.e., Directed GNNs, Directed GTs, hybrid architectures of GNN and GT, and DSGT (ours), for node-level representation learning in directed graphs across three key capacities: Directed Graph, Global Asymmetry Attention, and Approximate High-Order Filter. The symbol ✓ indicates that the model possesses the capability, whereas ✗ denotes the absence of this function.
Methods | Directed Graph | Global Asymmetry Attention | Approximate High-Order Filter
Directed GNNs
Directed GTs
Hybrid Architectures of GNN and GT
DSGT (our)
Table 2. Summary of datasets. The degree correlation coefficients for the Tolokers, ogbn-arxiv, and arxiv-year datasets are close to 0, indicating that structure-aware information contributes less to node classification than node attributes. The arxiv-year and squirrel datasets exhibit low node homophily, introducing significant noise during aggregation, which is a key factor in poor model performance. In contrast, the higher degree correlation coefficient of the genius dataset suggests that nodes of the same type may play more similar roles, compensating for this dataset's low node homophily.
Dataset | Cora | Squirrel | Genius | Tolokers | Ogbn-Arxiv | Arxiv-Year
Nodes | 19,793 | 5200 | 421,961 | 11,758 | 169,343 | 169,343
Edges | 126,842 | 217,065 | 984,979 | 1,038,000 | 1,166,243 | 1,166,243
Features | 8710 | 2089 | 12 | 10 | 128 | 128
Classes | 70 | 10 | 2 | 2 | 40 | 35
Directed Edge Ratio | 0.942 | 0.828 | 0.874 | 0.000 | 0.986 | 0.986
Degree Assortativity Coefficient | 0.011 | 0.374 | 0.477 | −0.080 | 0.014 | 0.014
Node Homophily | 0.363 | 0.089 | −0.107 | 0.634 | 0.428 | 0.145
Table 3. Mean and standard deviation of macro accuracy for test set. Bold: the best performing model for each dataset.
Dataset (Unit: %) | Cora | squirrel | genius | Tolokers | ogbn-arxiv | arxiv-year
DGCN [6] | 62.87 ± 1.36 | 41.64 ± 0.96 | 67.04 ± 2.26 | 70.79 ± 0.49 | 63.01 ± 1.08 | 44.51 ± 0.16
DiGCN [9] | 61.19 ± 0.97 | 47.74 ± 1.54 | 69.56 ± 3.14 | 71.63 ± 0.36 | 63.36 ± 2.06 | 49.09 ± 0.10
Holonet [10] | 76.46 ± 1.07 | 53.71 ± 1.92 | 72.10 ± 3.15 | 74.69 ± 0.36 | 73.09 ± 0.64 | 46.28 ± 0.29
MagNet [7] | 71.77 ± 6.18 | 41.01 ± 1.93 | 80.30 ± 2.07 | 74.33 ± 0.38 | 76.63 ± 0.06 | 51.78 ± 0.26
MagLapNet-I [32] | 77.00 ± 4.96 | 46.06 ± 2.64 | 83.42 ± 1.37 | 75.09 ± 0.45 | 79.92 ± 0.56 | 56.06 ± 1.34
MagLapNet-II [18] | 73.14 ± 3.31 | 49.14 ± 3.66 | 77.94 ± 0.49 | 78.47 ± 0.27 | 76.35 ± 0.19 | 52.17 ± 0.26
GT-WL [14] | 64.38 ± 2.22 | 50.31 ± 1.19 | 68.81 ± 3.27 | 70.77 ± 1.35 | 66.81 ± 1.47 | 47.15 ± 2.39
DSGT (our) | 79.96 ± 3.67 | 50.36 ± 0.25 | 90.05 ± 0.31 | 84.53 ± 0.55 | 80.33 ± 1.19 | 54.62 ± 1.01
Table 4. Mean and standard deviation of the macro accuracy for the test set (unit: %). The dimension of the node PE is 18. Specifically, the SVD PE of a node is composed of the first $D_p/2$ left and right singular vectors. The hashing encoding is an integer and leverages a trainable 1D embedding vector to obtain the $D_p$-dimensional WL PE. Magnetic LapPE-ABS/REAL and Undirected LapPE come from the top $D_p$ spectral features of the directed GLM. Directed PE (ours) consists of the first $D_p$ column vectors of the feasible solution to the constraint-preserved optimization problem. The bold represents the highest average classification accuracy under each dataset.
Dataset (Unit: %) | Cora | Squirrel | Genius | Tolokers | Ogbn-Arxiv | Arxiv-Year
Magnetic LapPE-ABS [32] | 74.67 ± 0.73 | 46.53 ± 3.63 | 82.08 ± 1.26 | 77.70 ± 5.09 | 74.58 ± 4.01 | 55.13 ± 1.37
Magnetic LapPE-REAL [32] | 73.62 ± 3.01 | 44.07 ± 2.26 | 79.22 ± 1.11 | 77.70 ± 5.09 | 70.44 ± 2.01 | 49.04 ± 0.99
WL PE [14] | 66.59 ± 1.72 | 42.11 ± 1.47 | 77.05 ± 2.06 | 71.55 ± 1.47 | 66.03 ± 3.09 | 44.16 ± 3.25
SVD PE [44] | 78.02 ± 2.35 | 46.96 ± 3.93 | 80.90 ± 5.49 | 81.73 ± 1.59 | 83.34 ± 2.76 | 54.21 ± 2.11
Undirected LapPE [14] | 73.91 ± 1.43 | 43.06 ± 4.14 | 79.85 ± 3.29 | 80.53 ± 0.87 | 75.93 ± 1.26 | 47.14 ± 4.27
Directed PE (ours) [22] | 79.96 ± 3.67 | 50.36 ± 0.25 | 90.05 ± 0.31 | 84.53 ± 0.55 | 80.33 ± 1.19 | 54.62 ± 1.01
Table 5. Mean and standard deviation of the macro accuracy for the test set. DSGT-I: PrimalAttention [30] is replaced with Full GT [14]; DSGT-II: there is no structural encoding [18] before the attention mechanism; DSGT-III: there is no node PE [22], and DGN [33] is replaced with ChebConv [43]; DSGT-V: there is no initial edge encoding; DSGT-IV: adaptive filter coefficient generation [15] is replaced with trainable parameters in directed GCN [10]. The bold represents the highest average classification accuracy under this dataset.
Dataset (Unit: %) | Cora | Squirrel | Genius | Tolokers | Ogbn-Arxiv | Arxiv-Year
DSGT-I | 76.93 ± 2.16 | 49.91 ± 0.63 | 86.15 ± 1.93 | 83.55 ± 0.53 | 78.81 ± 2.08 | 53.33 ± 2.10
DSGT-II | 77.87 ± 2.15 | 48.07 ± 3.21 | 83.38 ± 0.76 | 79.96 ± 6.33 | 77.38 ± 0.94 | 55.40 ± 0.96
DSGT-III | 73.00 ± 3.01 | 40.61 ± 1.99 | 77.01 ± 1.24 | 78.18 ± 3.92 | 66.71 ± 1.08 | 46.06 ± 3.87
DSGT-V | 78.67 ± 1.19 | 49.12 ± 1.21 | 85.74 ± 0.75 | 81.96 ± 0.67 | 75.11 ± 1.64 | 53.00 ± 3.33
DSGT-IV | 75.14 ± 1.68 | 47.65 ± 1.01 | 87.41 ± 2.36 | 80.52 ± 0.31 | 78.89 ± 2.21 | 51.31 ± 4.07
DSGT | 79.96 ± 3.67 | 50.36 ± 0.25 | 90.05 ± 0.31 | 84.53 ± 0.55 | 80.33 ± 1.19 | 54.62 ± 1.01
Table 6. The impact of filter order on model classification performance was studied using the Cora dataset. The dimensions of node PE and initial edge embedding were D p = 18 and D e = 108 , respectively. The bold represents the best average classification accuracy achieved by the model under different hyperparameter configurations.
Filter Order $K$ | 1 | 2 | 3 | 4 | 5 | 6
Magnetic LapPE-ABS | 66.15 ± 4.86 | 71.78 ± 2.33 | 75.51 ± 2.31 | 74.67 ± 1.73 | 74.07 ± 2.69 | 72.91 ± 1.03
Magnetic LapPE-REAL | 66.05 ± 1.19 | 63.37 ± 5.52 | 70.39 ± 4.17 | 70.96 ± 3.38 | 73.62 ± 3.01 | 72.91 ± 3.43
SVD PE | 68.55 ± 3.50 | 72.18 ± 4.04 | 76.01 ± 2.54 | 78.02 ± 2.35 | 77.96 ± 2.01 | 77.48 ± 2.58
Undirected LapPE | 54.18 ± 3.73 | 60.26 ± 2.96 | 67.87 ± 2.24 | 70.71 ± 2.09 | 70.35 ± 2.05 | 70.91 ± 2.06
Directed PE (our) | 69.09 ± 2.93 | 74.91 ± 2.77 | 76.60 ± 2.58 | 79.96 ± 3.67 | 78.81 ± 1.34 | 77.70 ± 1.35
Table 7. Mean and standard deviation of the macro accuracy and computational complexity for the test set. The effect of the dimension of the node PE on the average classification accuracy for the Cora dataset when the order of the filter and the dimension of the initial edge embedding were $K = 4$ and $D_e = 108$, respectively. Considering that SVD PE derives from both left and right singular vectors, the dimension of the node PE was kept even and was increased in steps of two singular-vector pairs. The bold represents the best average classification accuracy achieved by the model under different hyperparameter configurations.
Dimension of Node PE $D_p$ | 10 | 14 | 18 | 22 | 26 | 30
Total Parameter Number (Unit: M) | 4.9 | 5.2 | 5.4 | 5.8 | 6.3 | 6.7
Magnetic LapPE-ABS | 73.43 ± 1.11 | 73.73 ± 2.09 | 74.67 ± 1.73 | 74.16 ± 1.27 | 75.44 ± 1.94 | 75.18 ± 1.68
Magnetic LapPE-REAL | 68.99 ± 1.91 | 69.69 ± 2.73 | 70.96 ± 3.38 | 71.02 ± 2.06 | 70.03 ± 1.03 | 71.22 ± 2.57
SVD PE | 74.44 ± 2.11 | 75.68 ± 1.52 | 78.02 ± 2.35 | 78.00 ± 1.78 | 78.64 ± 1.91 | 78.03 ± 2.08
Undirected LapPE | 62.11 ± 1.77 | 66.63 ± 1.73 | 70.71 ± 2.09 | 71.96 ± 2.23 | 72.02 ± 1.99 | 73.36 ± 1.64
Directed PE (our) | 77.21 ± 1.62 | 77.59 ± 2.37 | 79.96 ± 3.67 | 79.93 ± 2.45 | 80.00 ± 1.86 | 80.32 ± 3.69
Table 8. Mean and standard deviation of the macro accuracy and computational complexity for the test set. The difference in the graph Fourier basis can yield a directional field matrix. By employing three different aggregators and scalers (Appendix A of [33]), directional smoothing and derivative matrices with a feature dimension of 9 can be obtained; therefore, each graph Fourier basis generates an 18-dimensional feature vector. Here, $N$ is the number of graph Fourier bases and $D_e$ denotes the dimension of the initial edge embedding, which is a multiple of 18. The bold represents the best average classification accuracy achieved by the model under different hyperparameter configurations.
Dimension of Initial Edge Embedding $D_e$ | 54 | 72 | 90 | 108 | 126 | 144
Number of Graph Fourier Bases $N$ | 3 | 4 | 5 | 6 | 7 | 8
Total Parameter Number (Unit: M) | 4.7 | 5.0 | 5.2 | 5.4 | 5.6 | 5.8
Cora | 79.24 ± 2.44 | 79.31 ± 3.99 | 79.22 ± 2.63 | 79.96 ± 3.67 | 80.00 ± 2.76 | 79.01 ± 3.03
squirrel | 49.77 ± 1.62 | 49.86 ± 1.23 | 50.11 ± 0.97 | 50.36 ± 0.25 | 50.54 ± 0.61 | 50.44 ± 1.08
genius | 87.53 ± 1.19 | 88.74 ± 0.71 | 89.86 ± 1.10 | 90.05 ± 0.31 | 90.23 ± 0.11 | 90.34 ± 0.23
Tolokers | 83.21 ± 1.00 | 83.00 ± 0.82 | 84.86 ± 1.30 | 84.53 ± 0.55 | 84.07 ± 1.24 | 84.41 ± 0.83
ogbn-arxiv | 77.11 ± 1.20 | 79.23 ± 0.99 | 79.51 ± 1.15 | 80.33 ± 1.19 | 80.34 ± 1.34 | 80.96 ± 0.74
arxiv-year | 53.87 ± 1.37 | 54.16 ± 1.18 | 54.01 ± 1.71 | 54.62 ± 1.01 | 54.44 ± 2.09 | 53.37 ± 1.63
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
