Article

Directed Knowledge Graph Embedding Using a Hybrid Architecture of Spatial and Spectral GNNs

1 College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin 150001, China
2 College of Intelligent Manufacturing, Chongqing Vocational and Technical College of Industry and Trade, Chongqing 401120, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(23), 3689; https://doi.org/10.3390/math12233689
Submission received: 25 October 2024 / Revised: 15 November 2024 / Accepted: 16 November 2024 / Published: 25 November 2024

Abstract

Knowledge graph embedding has been identified as an effective method for node-level classification tasks in directed graphs, the objective of which is to ensure that nodes of different categories are embedded as far apart as possible in the feature space. The directed graph is a general representation of unstructured knowledge graphs. However, existing methods lack the ability to simultaneously approximate high-order filters and globally attend to the task-related connectivity between distant nodes in directed graphs. To address this limitation, a directed spectral graph transformer (DSGT), a hybrid architecture model, is constructed by integrating the graph transformer and directed spectral graph convolution networks. The graph transformer leverages multi-head attention mechanisms to capture the global connectivity of the feature graph from different perspectives in the spatial domain, which bridges the gap between frequency responses and naturally couples the graph transformer with directed graph convolutional neural networks (GCNs). In addition to the inherent hard inductive bias of DSGT, we introduce directed node positional and structure-aware edge embeddings to provide topological prior knowledge. Extensive experiments demonstrate that DSGT exhibits state-of-the-art (SOTA) or competitive node-level representation capabilities across datasets of varying attributes and scales. Furthermore, the experimental results indicate that the homophily and degree correlation of the nodes significantly influence the classification performance of the model. This finding opens significant avenues for future research.

1. Introduction

A knowledge graph is composed of an underlying topological structure and nodes, which together materialize complex relationships between entities. The graph topology is a set of edges that describes the relative layout of the nodes, regardless of their specific spatial positions. Node representation learning abstracts node-level semantic knowledge from a node's context and local neighborhood structure by continuously aggregating information along the edges of its one-hop neighborhood. For instance, in social networks, nodes and edges represent users and interactions between users, respectively, providing crucial insights for analyzing user behavior and designing recommendation systems [1]. In e-commerce, users, items, and user–item interaction behaviors can be regarded as a graph network that is used to analyze a user's preference for an item and recommend interesting products to the user [2]. Efficient representation learning that respects the graph's topological structure is crucial for graph mining. Permutation invariance and scalability overcome the inductive biases imposed by canonical ordering and a finite number of grid units, respectively. Graph neural networks have garnered significant interest due to their ability to handle unstructured data of arbitrary size by constructing K-hop neighborhoods and performing scale-invariant information aggregation.
Spectral graph theory underlies classical graph convolution networks (GCNs; see Appendix B for abbreviations and corresponding full names) [3,4,5], relating spectral polynomial filters to spatial convolution for a self-adjoint graph shift operator (GSO). In practice, one-way interactions between entities are more common, such as a user collecting an item but not vice versa, resulting in a directed graph without a self-adjoint GSO. The absence of a self-adjoint GSO, such as a positive semi-definite graph Laplacian matrix (GLM), makes it impossible to capture frequency responses in the eigenspace. Many researchers redefine a symmetric GLM for directed graphs to generalize spectral graph theory; these approaches can be roughly grouped into three categories: structural decomposition [6], attribute decomposition [7], and random walks [8,9]. These methods mainly consider biased spatial information diffusion in directed graphs, which leads to geometrically meaningless interpretations in the frequency domain. Others focus on the characteristics of the frequency response in a generalized eigenspace and on directed spectral graph convolution networks [10], which is beneficial for finding signal diffusion patterns in directed graphs. In the GCN framework, step-wise spatial convolution propagates as many messages along the edges as possible, but it faces the over-smoothing and over-squashing dilemma [11,12,13]. A graph transformer can alleviate both issues because it can attend to dependencies between nodes that are far apart.
Neglecting directional edges, graph transformers (GTs) work in the spatial [14,15,16,17,18,19] or spectral domain [20]. Node positional embedding (node PE) and the multi-head attention mechanism are the keys to spatial GTs. Instead of an ordered identification or an absolute position in a spatial coordinate system, the role a node plays in the frequency responses is unique to it. Node PE is a soft inductive bias that compensates for the lack of structure awareness in the global attention mechanism [14]. Node PE suffers from sign and basis ambiguity because eigendecomposition is invariant to sign flips and rotations. On the one hand, a canonicalization trick or a permutation-invariant layer can mathematically unify the eigenspace. On the other hand, considering all the spectral features (eigenvalues and eigenvectors) together gives the model the corresponding mathematical invariances. The aim of a spectral graph transformer is to adaptively generate a data-dependent and structure-aware GLM [10]. Existing directed GTs mainly work in the spatial domain. Node PE is designed to represent the strength and directional distortion experienced by a single node under different graph patterns. Graph patterns are discovered using various methods for directed graphs [21,22]. Additionally, the expressiveness of the attention map has a critical impact on the computational efficiency of GTs. A greater information capacity in the node features [18] and a comprehensive treatment of asymmetry in the attention kernel lead to more reasonable attention scores. However, GTs have been shown to only approximate low-order filters within a finite error [15]. Researchers have employed hybrid architectures of GCNs and graph transformers to simultaneously achieve high-order filtering and bypass the information explosion of a high-order GCN. However, such hybrid architectures have rarely been applied to directed graphs to date.
As shown in Table 1, current methods for node representation learning in directed graphs account for either the approximation of a high-order filter or a global receptive field, but not both. Our aim is to make the directed GCN and the GT complement each other more fruitfully. The multi-head attention mechanism of the GT can provide abundant global connectivity from different perspectives, reflecting an irregular support set of graph frequencies [23]. In spectral graph theory, the polynomial filter coefficients of a GCN dominate the smoothing of frequency components. Therefore, the polynomial filter coefficients should be mutually influential and data-dependent, such that the task-related filter performs better and is more flexible.
Inspired by the latest research in deep GNNs and spectral graph theory, we propose a powerful hybrid architecture model for directed graphs, called the directed spectral graph transformer (DSGT). The algorithm is a two-stage model with the ability to approximate the frequency responses of directed graphs of arbitrary order; it consists of data preprocessing and forward inference. In the preprocessing stage, we initialize the structure-aware embedding of the node position and edge. In the forward inference stage, an asymmetric attention kernel and a high-order directed GNN can approximate arbitrary-order graph filters while comprehensively capturing global node dependencies. Specifically, a complete set of graph Fourier bases is derived from the feasible solution of constraint-preserved optimization. Then, the relative significance between nodes is regarded as a discrete gradient flow along the edge in different graph patterns. Subsequently, directed GT is responsible for global aggregation and providing a multi-head attention map. Directed GNNs extract the multi-dimensional support of polynomial filter coefficients and aggregate information inside a K-hop neighborhood. Further, node-level representation is obtained by fusing local and global aggregation. Finally, we decouple the updates of edge attributes and node PE. In node classification tasks, DSGT consistently outperforms other models, with over 10% higher accuracy on some datasets (see Section 4 for details). It also shows lower standard deviations across multiple datasets, indicating strong stability and robustness.
The contributions of our paper are as follows:
  • Novel mixed architecture: This is a new hybrid architecture model for directed graphs that can approximate an arbitrary-order filter without over-squashing and over-smoothing issues.
  • New directed PE: Concepts from continuous signal processing (SP) are generalized to a discrete directed graph. A new method is adopted to evenly search for a complete set of graph Fourier bases across as wide a frequency band as possible and to further obtain a directed node PE from them.
  • Benchmarking: Extensive experiments demonstrate the SOTA and competitive performance of our DSGT compared to baselines and the effectiveness of its modules. We empirically analyzed the experimental results.
The rest of this article is organized as follows. In Section 2, the previous work related to graph representation learning models is briefly reviewed and analyzed. The details of the model architecture are mathematically formulated in Section 3. The experimental results and corresponding discussions for several public datasets are elaborated in Section 4. Finally, we summarize our work based on the experimental results in Section 5. In addition, full descriptions of the abbreviations and symbols are provided in Table A1 and Table A2.

2. Related Work

For node representation learning, we briefly outline the development of GNNs and state the design motivations behind these models, laying the foundation for our current research.

2.1. Undirected Graph Neural Network

The spectral theory of undirected graphs was the first to be applied to the field of graph representation learning, as formulated in Appendix C.1. SGCNN is an early spectral graph convolutional model with high computational complexity [3]. In SGCNN, the polynomial filter is equivalent to a power series of the GLM under the condition of a self-adjoint undirected GLM. ChebConv reduces the computational complexity from the exponential to the polynomial level via Chebyshev polynomials, which approximate the power series of an undirected GLM normalized by its spectral radius [4]. To further simplify the model, GCN [5] only takes the first-order form of ChebConv, and the neighborhood continuously expands through step-wise iteration.

2.2. Directed Graph Neural Network

Major efforts have been made to define symmetric directed GLMs, for example via random walks [8,9], structural decomposition [6], and attribute decomposition [7], enabling the application of traditional spectral GCNs to directed graphs. DGCN decomposes the adjacency matrix of a directed graph into first-order proximal, in-neighbor, and out-neighbor matrices, which represent the basic connection, in-degree, and out-degree distributions, respectively [6]. Nodes sharing the same out-/in-neighbors exhibit similar diffusion/aggregation behaviors; hence, the number of shared in-/out-neighbors reflects the in-/out-degree similarity between nodes. When the directionality of information propagation in the graph is ignored, multiple non-isomorphic directed graphs can correspond to the same undirected graph. As a result, applying an undirected GNN to a directed graph inevitably leads to information loss. To address this, the directionality of an edge is uniquely decomposed into the strength of the connection between nodes and the direction of information propagation. Specifically, the directed GLM is regarded as a Hermitian matrix constructed from the underlying undirected graph and a phase. The underlying undirected graph topology describes the connectivity relationships between nodes, while the phase is applied to capture cyclic subgraphs. In the complex domain, MagNet carries out a spatial convolution operation via this directed GLM [7]. The graph adjacency matrix, normalized by the out-degree, represents the random diffusion probability of the node signal and is referred to as the random walk matrix. In strongly connected directed graphs, the adjacency matrix is irreducible and non-negative. Based on Perron–Frobenius theory and the stationary distribution of Markov chains, the Perron vector describes the steady-state distribution of nodes. A symmetric directed graph Laplacian matrix is then constructed by normalizing the random walk matrix with the Perron vector [8]. However, the requirement of strong connectivity limits the model's applicability. To address this, a state transition matrix for the Markov chain is constructed by incorporating long-distance random jumps into the PageRank algorithm, from which a symmetric approximate directed GLM is derived [9]. Another research direction explores spectral graph theory for directed graphs. In this case, the behavior of the directed GLM is analyzed within the generalized eigenvalue space. Additionally, linear operator perturbation theory demonstrates that the frequency response of a directed graph can be fully understood through holomorphic functions in the complex domain. Similar to Chebyshev polynomials, Faber polynomials are used in the complex domain to approximate arbitrary holomorphic functions [10].
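To make the Hermitian construction above concrete, the following is a minimal PyTorch sketch of a magnetic Laplacian of the kind MagNet builds on, assuming an unweighted adjacency matrix A and a charge parameter q; the normalization details of [7] are omitted, so this is an illustration of the idea rather than the exact operator used there.

```python
import torch

def magnetic_laplacian(A: torch.Tensor, q: float = 0.25) -> torch.Tensor:
    """Hermitian (magnetic) Laplacian of a directed graph: the symmetrized adjacency
    carries the connection strength, while the phase exp(i*Theta) encodes direction."""
    A_s = 0.5 * (A + A.T)                          # underlying undirected connectivity
    Theta = 2.0 * torch.pi * q * (A - A.T)         # antisymmetric phase matrix
    H = A_s * torch.exp(1j * Theta)                # Hermitian adjacency
    D_s = torch.diag(A_s.sum(dim=1)).to(H.dtype)   # degree matrix of the symmetrized graph
    return D_s - H                                 # unnormalized magnetic Laplacian

# Toy example: a 3-node directed cycle.
A = torch.tensor([[0., 1., 0.],
                  [0., 0., 1.],
                  [1., 0., 0.]])
L = magnetic_laplacian(A)
print(torch.allclose(L, L.conj().T))  # Hermitian check: True
```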

2.3. Graph Transformer

The first GT adaptively fused multi-scale neighborhood aggregation through a channel attention mechanism, but lacked the global attention capability of the transformer architecture [24]. Dwivedi V.P. et al. first truly generalized the transformer to the discrete graph domain via node PE and feature update rules [14]. Node positional encoding, serving as a soft inductive bias, makes the GT aware of the diffusion distance between nodes. However, eigendecomposition is insensitive to sign flips and orthogonal transformations, which makes the sign and basis ambiguous. To overcome both issues, Lim et al. leveraged IGN [25] to achieve sign- and basis-invariant networks [26], ensuring the uniqueness of isomorphic graph node embeddings. However, the output of such a parametric network violates the constraints of a Fourier basis; Ma et al. therefore designed a non-parametric method for basis canonicalization [27]. Due to the computational overhead, Dwivedi et al. employed random walks to obtain unique representations of node positions and decoupled the updates of edge and position embeddings [17]. Nevertheless, spatially structure-aware embeddings from random walks are less expressive than their spectral counterparts. Kreuzer et al. incorporated spectral features from eigendecomposition to balance computational complexity against model performance degradation [16]. Given that the equivalent complete sets of graph Fourier bases from eigendecomposition are not enumerable, it is challenging for a model to fully resolve both ambiguities. Additionally, Bo et al. leveraged a transformer encoder–decoder architecture to adaptively extract a task-related GLM from spectral features, known as the spectral GT [20].
Improving the effectiveness of the attention map can also observably improve the expressiveness of GTs. Traditional attention kernels, such as additive and dot-product kernels [28], have been proven to inadequately reveal the asymmetry between nodes from a single perspective [29,30]. Inspired by KSVD [31], Chen Y. et al. proposed an attention kernel that incorporates bidirectional asymmetry between nodes, thereby further enhancing the asymmetric expression of the transformer [30]. Experiments have shown that the similarity of local neighborhood structures is crucial for evaluating the relative dependence between nodes, suggesting that the local neighborhood structure affects the attention score [18]. Research on directed GTs remains relatively sparse, with a key bottleneck being the lack of an effective and universal directed node PE. Geisler et al. introduced the magnetic GLM to describe the biased connections between nodes and performed aggregation on the imaginary and real parts separately through GTs [32]. However, their method of feature processing lacks geometric justification, resulting in suboptimal performance.

3. Proposed Model

For node-level graph representation learning, DSGT fruitfully integrates GT and directed GNN with the ability to approximate an arbitrary-order filter. As shown in Figure 1, the DSGT is a two-stage model, including preprocessing and forward inference.

3.1. Graph Preprocessing

This section primarily elaborates on node PE and the initialization of structure-aware edge features. Graph data consist of graph topology and node features, which fall within the domain of irregular data. The graph topology describes the complex and flexible relationships between nodes, meaning that reordering the graph nodes results in a series of equivalent isomorphic graphs. Unlike grid data, the importance of a node within the graph topology under different graph patterns is unique to it. Additionally, to respect the dominant direction of signal propagation under different graph frequencies, the graph connectivity under various graph frequencies is used as the initial feature of the edges. To some extent, the initial edge features and the global positional encoding of nodes can help to improve their expressiveness in specific tasks.

3.1.1. Global Node Positional Encoding

To diversify the directed graph patterns, the graph frequencies are expected to form an arithmetic sequence across as wide a frequency band as possible. Specifically, a two-stage optimization strategy is employed: first, the spectral radius and the corresponding graph Fourier basis are determined by maximizing the directed total variation; then, a complete set of Fourier bases is identified whose corresponding graph frequencies are distributed as evenly as possible across the entire frequency band. A node PE of fixed dimension is derived from this complete set of graph Fourier bases.
$$\hat{p}_i = \mathrm{MLP}\big(\mathrm{Concat}(\lambda, u_i)\big), \qquad P^{(0)} = \mathrm{GP}_1\big(\mathrm{Transformer}(\hat{P})\big) \tag{1}$$
In Equation (1), $P^{(0)} \in \mathbb{R}^{N \times d_p}$ is the first-layer $d_p$-dimensional node PE, $\lambda = \{\lambda_k, k = 1, \ldots, N\} \in \mathbb{R}^{N}$ is the frequency spectrum of the directed graph, and $u_i = \{u_{ki}, k = 1, \ldots, N\} \in \mathbb{R}^{N}$ is the importance of the $i$th node under each graph frequency. An $N \times 2$-dimensional tensor is obtained via $\mathrm{Concat}(\cdot)$ for each node. $\mathrm{GP}_i(\cdot)$ is the global pooling operation over the $i$th dimension of the tensor; $\mathrm{MLP}: \mathbb{R}^{2} \to \mathbb{R}^{d_p}$ and $\mathrm{Transformer}: \mathbb{R}^{d_p} \to \mathbb{R}^{d_p}$. The total variation measures the smoothness of the node signal over the graph topology by quantifying the expected value of the discrete derivatives of graph signals, and it can be used to define graph frequencies. The directed and undirected total variations are formulated as $\mathrm{DV}(x) = \sum_{i,j=1}^{N} A_{ij}\,[x_i - x_j]_{+}^{2}$ and $\mathrm{TV}(x) = x^{T} L x = \sum_{i=1, j>i}^{N} A_{ij}\,(x_i - x_j)^{2}$, respectively, where $[x]_{+}^{2} = (\max(0, x))^{2}$. For a self-adjoint adjacency matrix $A$, either $A_{ij}[x_i - x_j]_{+}^{2}$ or $A_{ji}[x_j - x_i]_{+}^{2}$ is zero if $A_{ij} = A_{ji} \neq 0$; in other words, $\mathrm{TV}(x) = \mathrm{DV}(x)$, suggesting that this method for node PE is also applicable to undirected graphs. Notably, the total variation is insensitive to sign and permutation, which implies that there are no ambiguities in the definition of the directed node PE. The expected frequencies follow an arithmetic progression, namely $f_k = \mathrm{DV}(u_k) = \frac{k-1}{N-1} f_{\max},\ k = 1, \ldots, N$, where $f_{\max}$ is the spectral radius (largest frequency) of the directed graph. The uniqueness of the frequencies avoids basis ambiguity. First, the graph Fourier basis of the largest frequency is obtained by maximizing the directed total variation, which fixes the frequency band. Second, a group of approximately evenly distributed spectral features is obtained as the solution of a minimum spectral dispersion problem over the frequency band.
$$\begin{aligned}
&\min_{u}\ -\mathrm{DV}(u) \quad \text{s.t.}\ u^{T}u = 1\\
&\min_{U}\ \phi(U) = \delta(U) + \frac{\lambda}{2}\Big(\big\|u_{1}-u_{\min}\big\|^{2} + \big\|u_{N}-u_{\max}\big\|^{2}\Big) \quad \text{s.t.}\ U^{T}U = I_{N},\ \ \delta(U) = \sum_{i=1}^{N-1}\big(\mathrm{DV}(u_{i+1}) - \mathrm{DV}(u_{i})\big)^{2}
\end{aligned} \tag{2}$$
In Equation (2), $u_{\min} = \mathbf{1}_N/\sqrt{N}$ is the pattern of uniform diffusion and is orthogonal to $u_{\max}$ (Proposition 4 of [22]). The orthogonality constraint makes the optimization non-convex. Wen et al. proposed a feasible iterative strategy for orthogonality-preserving optimization problems of the form $\min_{U \in \mathbb{R}^{N \times M}} \Phi(U)$ s.t. $U^{T}U = I_M$. The iterative methods for Equation (2) are shown in Algorithms 1 and 2. At each iteration, the step size is chosen by Algorithm 3 to satisfy the strong Armijo–Wolfe conditions. Ultimately, abundant graph patterns are obtained by optimizing the objectives in Equation (2).
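As a concrete illustration of the directed total variation that Algorithms 1 and 2 optimize, the following is a minimal PyTorch sketch assuming a dense adjacency matrix; the function names are ours and the three-node cycle is a toy example.

```python
import torch

def directed_variation(A: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """DV(x) = sum_ij A_ij * max(0, x_i - x_j)^2 over a directed adjacency matrix A."""
    diff = x.unsqueeze(1) - x.unsqueeze(0)          # diff[i, j] = x_i - x_j
    return (A * torch.clamp(diff, min=0.0) ** 2).sum()

def undirected_variation(A: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """TV(x) = sum_{i<j} A_ij * (x_i - x_j)^2, assuming a symmetric A."""
    diff = x.unsqueeze(1) - x.unsqueeze(0)
    return 0.5 * (A * diff ** 2).sum()

A = torch.tensor([[0., 1., 0.],
                  [0., 0., 1.],
                  [1., 0., 0.]])                    # directed 3-cycle
x = torch.randn(3)
x = x / x.norm()                                    # unit-norm graph signal
print(directed_variation(A, x).item())
```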
Algorithm 1 Directed variation maximization.
Input: Adjacency matrix $A \in \mathbb{R}^{N \times N}$ and an arbitrary positive tolerance $\epsilon > 0$
Output: Maximum graph frequency $f_{\max}$ and the corresponding graph Fourier basis $u_{\max}$
1: Initialize $k = 0$ and a unit-norm $u_0 \in \mathbb{R}^{N}$ at random
2: while $\|u_k - u_{k-1}\| \geq \epsilon$ do
3:     Evaluate the objective $\phi(u_k) = -\mathrm{DV}(u_k) = -\sum_{i,j=1}^{N} A_{ij}\,[u_{k,i} - u_{k,j}]_{+}^{2}$
4:     Compute the gradient of the objective $\bar{g} = -\nabla \mathrm{DV}(u) \in \mathbb{R}^{N}$, whose $i$th element is $\bar{g}_i = 2\big(A_{:,i}^{T}\,[u - u_i \mathbf{1}_N]_{+} - A_{i,:}\,[u_i \mathbf{1}_N - u]_{+}\big)$, $1 \leq i \leq N$
5:     Compute the skew-symmetric matrix $B_k = \bar{g}_k u_k^{T} - u_k \bar{g}_k^{T} \in \mathbb{R}^{N \times N}$
6:     Select the step length $\tau_k$ of the $k$th iteration by Algorithm 3
7:     Update $u_{k+1}(\tau_k) = \big(I_N + \frac{\tau_k}{2} B_k\big)^{-1}\big(I_N - \frac{\tau_k}{2} B_k\big)\,u_k$
8:     Update the iteration index $k \leftarrow k + 1$
9: end while
10: Return $u_{\max} = u_k$ and $f_{\max} = \mathrm{DV}(u_{\max})$
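Steps 5–7 of Algorithm 1 can be summarized in a few lines of PyTorch; the sketch below assumes the gradient vector has already been computed and solves a linear system instead of forming the matrix inverse explicitly.

```python
import torch

def cayley_update(u: torch.Tensor, g_bar: torch.Tensor, tau: float) -> torch.Tensor:
    """One norm-preserving update u <- (I + tau/2 B)^(-1) (I - tau/2 B) u,
    where B = g_bar u^T - u g_bar^T is skew-symmetric (steps 5-7 of Algorithm 1)."""
    N = u.shape[0]
    B = torch.outer(g_bar, u) - torch.outer(u, g_bar)
    I = torch.eye(N, dtype=u.dtype)
    rhs = (I - 0.5 * tau * B) @ u
    return torch.linalg.solve(I + 0.5 * tau * B, rhs)

u = torch.randn(6)
u = u / u.norm()
g_bar = torch.randn(6)
u_next = cayley_update(u, g_bar, tau=0.1)
print(u_next.norm())  # stays (numerically) equal to 1, since the Cayley map is orthogonal
```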
Algorithm 2 Spectral dispersion minimization.
Input: Adjacency matrix $A \in \mathbb{R}^{N \times N}$, an arbitrary positive tolerance $\epsilon > 0$, the minimum and maximum graph Fourier bases $u_{\min} = \frac{1}{\sqrt{N}}\mathbf{1}_N$ and $u_{\max}$ (from the directed variation maximization algorithm), and the regularization coefficient of the objective function $\lambda \in (0, 1)$
Output: A set of graph Fourier bases $\{u_i, i = 2, \ldots, N-1\}$ whose corresponding graph frequencies approximate an arithmetic sequence
1: Initialize $k = 0$ and an orthonormal $U_0 \in \mathbb{R}^{N \times N}$ at random
2: while $\|U_k - U_{k-1}\|_F \geq \epsilon$ do
3:     Evaluate the objective with a measure of the constraint violations, $\phi(U) = \delta(U) + \frac{\lambda}{2}\big(\|u_1 - u_{\min}\|_2^2 + \|u_N - u_{\max}\|_2^2\big)$, where $\delta(U) = \sum_{i=1}^{N-1}\big(\mathrm{DV}(u_{i+1}) - \mathrm{DV}(u_i)\big)^{2}$
4:     Compute the gradient of the objective function $G = \nabla\phi(U) \in \mathbb{R}^{N \times N}$, whose rows are
$$\begin{aligned} G_{1,:} &= \big(\mathrm{DV}(u_1) - \mathrm{DV}(u_2)\big)\,\bar{g}(u_1) + \lambda\,(u_1 - u_{\min}) \\ G_{i,:} &= \big(2\,\mathrm{DV}(u_i) - \mathrm{DV}(u_{i+1}) - \mathrm{DV}(u_{i-1})\big)\,\bar{g}(u_i), \quad 1 < i < N \\ G_{N,:} &= \big(\mathrm{DV}(u_{N-1}) - \mathrm{DV}(u_N)\big)\,\bar{g}(u_N) + \lambda\,(u_N - u_{\max}) \end{aligned}$$
5:     Compute the skew-symmetric matrix $B_k = G_k U_k^{T} - U_k G_k^{T} \in \mathbb{R}^{N \times N}$
6:     Select the step length $\tau_k$ of the $k$th iteration by Algorithm 3
7:     Update $U_{k+1}(\tau_k) = \big(I_N + \frac{\tau_k}{2} B_k\big)^{-1}\big(I_N - \frac{\tau_k}{2} B_k\big)\,U_k$
8:     Update the iteration index $k \leftarrow k + 1$
9: end while
10: Return $\hat{U} = U_k$ and $f_i = \mathrm{DV}(U_{:,i})$, $1 < i < N$
Algorithm 3 Non-monotone curvilinear search algorithm.
Input: Maximum step length $\tau_{\max}$; minimum step length $\tau_{\min}$; hyperparameters $0 < c_1 < c_2 < 1$, generally $c_1 = 10^{-4}$, $c_2 = 0.9$; the objective function $F(X)$ and its gradient $\nabla_X F(X)$; $\delta = 0.1$, $\eta = 0.85$, $\epsilon = 10^{-5}$; a chosen initial step length $\tau > 0$ with $\tau_{\max} = 10^{4}$, $\tau_{\min} = 10^{-4}$, $\tau_{\mathrm{init}} = 10^{-3}$; maximum number of loops $L_{\max} = 25$; $Q_0 = 1$, $C_0 = F(X_0)$; initial point $X_0$ on the Stiefel manifold
Output: A step length $\tau^{*}$ that satisfies the search condition
1: function VariableUpdate($X_k$, $F$, $\nabla_X F$, $\tau$)
2:     $N, K = X_k.\mathrm{shape}$
3:     if $N > 2K$ then
4:         $U = [\nabla F(X_k), X_k]$, $V = [X_k, -\nabla F(X_k)]$
5:         $X_{k+1} = X_k - \tau\,U\big(I_{2K} + \frac{\tau}{2} V^{T} U\big)^{-1} V^{T} X_k$
6:     else
7:         $A = \nabla F(X_k) X_k^{T} - X_k \nabla F(X_k)^{T}$
8:         $X_{k+1} = \big(I_N + \frac{\tau}{2} A\big)^{-1}\big(I_N - \frac{\tau}{2} A\big) X_k$
9:     end if
10:     return $X_{k+1}$
11: end function
12: while $\|\nabla F(X_k)\|_2^2 > \epsilon$ do
13:     while $F(Y_k(\tau)) \geq C_k + c_1\,\tau\,F'(Y_k(0))$ do
14:         Scale the step length $\tau$ to satisfy the condition, namely $\tau \leftarrow \delta\,\tau$
15:         if the current loop index $\geq L_{\max}$ and $\|\nabla F(X_k)\|_2^2 > \epsilon$ then
16:             Sample $\tau \sim \mathcal{N}(0, 1)$
17:             break
18:         end if
19:     end while
20:     $X_{k+1} = \mathrm{VariableUpdate}(X_k, F, \nabla F, \tau)$, $Q_{k+1} = \eta\,Q_k + 1$
21:     $C_{k+1} = \big(\eta\,Q_k C_k + F(X_{k+1})\big)/Q_{k+1}$
22:     Update $\tau_{k+1}$ by alternating between the following two step sizes:
$$(1)\ \ \tau_{k,1} = \frac{\mathrm{tr}\big(S_{k-1}^{T} S_{k-1}\big)}{\mathrm{tr}\big(S_{k-1}^{T} Y_{k-1}\big)} \qquad (2)\ \ \tau_{k,2} = \frac{\mathrm{tr}\big(S_{k-1}^{T} Y_{k-1}\big)}{\mathrm{tr}\big(Y_{k-1}^{T} Y_{k-1}\big)}, \quad \text{where}\ S_{k-1} = X_k - X_{k-1},\ Y_{k-1} = \nabla F(X_k) - \nabla F(X_{k-1})$$
23:     Set $\tau = \max\big(\min(\tau_{k+1}, \tau_{\max}), \tau_{\min}\big)$
24:     if $k \geq L_{\max}$ then
25:         break
26:     else
27:         $k \leftarrow k + 1$
28:     end if
29: end while
30: Return a feasible step length $\tau^{*} = \tau_{k+1}$

3.1.2. Structure-Aware Edge Embedding

The directional derivative and smoothing operators represent the global pattern of information flow and are derived from differences of the graph Fourier bases. Graph frequencies, the derivative operator, and the smoothing operator are then stacked along a new dimension to obtain a 2D tensor that encapsulates the trend of information flow for each edge. Afterwards, the 2D tensor is transformed into a one-dimensional initial edge feature through an MLP and a transformer in succession. The derivative of a continuous signal provides information about the direction and intensity of its propagation. In the discrete graph domain, the graph Fourier basis behaves similarly to the sine and cosine waves of the continuous domain, and the derivative of a graph Fourier basis reflects the directional flow under the corresponding graph pattern. If we wish to further smooth the discrete gradient of the graph, the arcsine function can be used to postprocess the derivative.
$$F_k = \nabla u_k \quad \text{or} \quad F_k = \nabla \arcsin\!\left(\frac{u_k}{\max(u_k)}\right) \tag{3}$$
In Equation (3), $u_k \in \mathbb{R}^{N}$ is a graph Fourier basis, $F_k \in \mathbb{R}^{N \times N}$ is the corresponding directional field, and $\nabla$ is the discrete derivative operator. In Appendix A and Equation (11) of [33], the authors provide three aggregators (soft and hard softmax aggregators and a center-balanced aggregator) and three scalers (degree attenuation, degree amplification, and identity), so that $B_{\mathrm{av}}^{k}$ and $B_{\mathrm{dx}}^{k}$ are nine-channel tensors for the $k$th graph Fourier basis. When the number of graph Fourier bases is $K$, the dimensions of the directional smoothing tensor $B_{\mathrm{av}}$ and the directional derivative tensor $B_{\mathrm{dx}}$ are $(9 \times K) \times N \times N$. Considering either the graph frequencies or the graph Fourier bases alone is insufficient to distinguish non-isomorphic graphs. The method for structure-aware edge embedding is formulated as follows.
$$\begin{aligned}
\tilde{e}_{ij} &= \mathrm{Concat}\big(B_{\mathrm{av}}[:,i,j],\ B_{\mathrm{dx}}[:,i,j],\ \Lambda\big)\\
\bar{e}_{ij} &= \mathrm{GP}_1\big(\mathrm{Transformer}\big(\tilde{e}_{ij}^{\,T} W_0\big)\big)\\
e_{ij}^{0} &= \mathrm{Concat}\big(\bar{e}_{ij},\ e_{ij}\big)
\end{aligned} \tag{4}$$
In Equation (4), each element of $\{\lambda_i, i = 1, \ldots, K\}$ is replicated nine times to form a new $9 \times K$-dimensional vector $\Lambda$, and $W_0 \in \mathbb{R}^{3 \times d_e}$ is a linear transformation matrix. The transformer encoder $\mathrm{Transformer}: \mathbb{R}^{K \times d_e} \to \mathbb{R}^{K \times d_e}$ captures the dependence among the $K$ features to encode context information, and then the global maximum pooling $\mathrm{GP}_2: \mathbb{R}^{K \times d_e} \to \mathbb{R}^{d_e}$ performs dimensional reduction. Here, $\bar{e}_{ij} \in \mathbb{R}^{d_e}$ is the structure-aware edge feature between the $i$th and $j$th nodes. The structure-aware edge feature $\bar{e}_{ij} \in \mathbb{R}^{d_e}$ and the edge attribute $e_{ij} \in \mathbb{R}^{d_a}$ are stacked to obtain the initial edge feature $e_{ij}^{0} \in \mathbb{R}^{d_e + d_a}$.
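The following rough PyTorch sketch shows how the per-edge tensor of Equation (4) might be assembled, assuming precomputed directional tensors B_av and B_dx of shape (9K, N, N) and K graph frequencies; the linear layer and transformer encoder are generic stand-ins for the learned components, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

K, N, d_e = 6, 128, 18            # illustrative sizes
B_av = torch.randn(9 * K, N, N)   # directional smoothing tensor (assumed precomputed)
B_dx = torch.randn(9 * K, N, N)   # directional derivative tensor (assumed precomputed)
lam = torch.randn(K)              # K graph frequencies
Lam = lam.repeat_interleave(9)    # each frequency replicated nine times -> (9K,)

W0 = nn.Linear(3, d_e, bias=False)   # plays the role of W_0 in Equation (4)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_e, nhead=2, batch_first=True), num_layers=1)

def edge_embedding(i: int, j: int) -> torch.Tensor:
    # Stack the three (9K,)-dimensional descriptors along a new last dimension.
    e_tilde = torch.stack([B_av[:, i, j], B_dx[:, i, j], Lam], dim=-1)   # (9K, 3)
    h = W0(e_tilde).unsqueeze(0)            # (1, 9K, d_e)
    h = encoder(h)                          # context across the stacked channels
    return h.max(dim=1).values.squeeze(0)   # global max pooling -> (d_e,)

e_bar = edge_embedding(0, 1)
print(e_bar.shape)  # torch.Size([18])
```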

3.2. Directed Spectral Graph Transformer

The directed spectral graph transformer (DSGT) consists of a global asymmetric graph transformer (GT), a K-order directed graph convolution network (GCN), and a feature decoupling update. Specifically, the global asymmetric GT computes an attention map and then performs global aggregation from which the coefficients of the K-order GCN are extracted. Subsequently, the K-order directed GCN aggregates local information. Finally, inspired by the research of Dwivedi V. et al. [17], the model decouples the update processes for structure (node and edge attributes) and position (global node PE).

3.2.1. Global Asymmetric Graph Transformer

As shown in Appendix A.3, traditional attention kernels have been proven to underexploit the asymmetric relationships between nodes [29]. The global asymmetric graph transformer includes structure encoding, PrimalAttention, and global aggregation. Specifically, a first-order DGN [33] is applied to embed structural information into the node attributes. Then, the asymmetric attention map is computed from PrimalAttention and the edge attributes. Finally, global aggregation is performed to obtain global node-level representations. Directional convolution is performed via the DGN as follows:
$$\begin{aligned}
\hat{X}^{(l)} &= \mathrm{MLP}\Big(\mathrm{Concat}\Big(D_{\mathrm{in}}^{-1} A X^{(l)},\ \big\{\tilde{X}_k^{(l)}\big\}_{k=1,\ldots,K}\Big)\Big)\\
\bar{X}_{\alpha,i,k}^{(l)} &= \hat{S}(D_{\mathrm{in}},\alpha)\cdot B_{\mathrm{av}}^{k}[:,:,i]\,A X^{(l)}\ \big\Vert\ \hat{S}(D_{\mathrm{in}},\alpha)\cdot B_{\mathrm{dx}}^{k}[:,:,i]\,A X^{(l)}\\
\tilde{X}_k^{(l)} &= \mathrm{Concat}\big(\bar{X}_{\alpha,i,k},\ \alpha = -1, 0, +1,\ i = 1, 2, 3\big)\\
\hat{S}(D,\alpha) &= \mathrm{diag}\Big(\big\{S(D_{ii},\alpha)\big\}_{i=1,\ldots,N}\Big),\quad \text{where}\ S(d,\alpha) = \left(\frac{\log(d+1)}{\delta}\right)^{\alpha},\ \ \delta = \frac{1}{N}\sum_{i=1}^{N}\ln\big(D_{ii}+1\big),\ \ \alpha \in [-1, +1]
\end{aligned} \tag{5}$$
In Equation (5), $\odot$ is the element-wise product and $\Vert$ denotes the concatenation of node features. The aggregators and scalers are set as in Section 3.1.2, $X^{(l)}, \hat{X}^{(l)} \in \mathbb{R}^{N \times d_f}$, $d_f$ is the dimension of the node features, and $N$ is the number of nodes in the directed graph. For each graph Fourier basis, the aggregation tensors of the three aggregators are $B_{\mathrm{av}}^{k}, B_{\mathrm{dx}}^{k} \in \mathbb{R}^{N \times N \times 3}$, and the scalers include degree attenuation $\hat{S}(D,-1) \in \mathbb{R}^{N \times N}$, degree amplification $\hat{S}(D,+1)$, and the identity $\hat{S}(D,0) = I_N$; the concatenation of node features after aggregation is $\tilde{X}_k^{(l)} \in \mathbb{R}^{N \times (9 \times d_f)}$. We stack the node features and pass them through $\mathrm{MLP}: \mathbb{R}^{N \times ((9 \times K + 1) \times d_f)} \to \mathbb{R}^{N \times d_f}$ to obtain the $l$th-layer node representation in $\mathbb{R}^{N \times d_f}$. The evaluation of attention scores for the structure-aware node features is as follows; the mathematical formulation of the asymmetric attention kernel is given in Appendix C.2.
$$e\big(X_i^{(l)}\big) = \big(f(X)^{T}W_e\big)^{T} g_q\big(q\big(X_i^{(l)}\big)\big),\qquad r\big(X_j^{(l)}\big) = \big(f(X)^{T}W_r\big)^{T} g_k\big(k\big(X_j^{(l)}\big)\big) \tag{6}$$
$$\mathrm{ATT}_{i,:} = \mathrm{Softmax}\big(\widehat{\mathrm{ATT}}_{i,:}\big),\qquad \widehat{\mathrm{ATT}}_{i,j} = \mathrm{Concat}\big(e\big(X_i^{(l)}\big),\ r\big(X_j^{(l)}\big)\big) \tag{7}$$
In Equation (6), $q(X_i^{(l)}) = W_q X_i^{(l)} \in \mathbb{R}^{d_q}$ and $k(X_j^{(l)}) = W_k X_j^{(l)} \in \mathbb{R}^{d_k}$ are samples of the query and key spaces, respectively, with $g_q(\cdot): \mathbb{R}^{d_q} \to \mathbb{R}^{d}$ and $g_k(\cdot): \mathbb{R}^{d_k} \to \mathbb{R}^{d}$ ensuring dimension compatibility. The injection of data dependence transforms the projection matrices $W_e, W_r \in \mathbb{R}^{N \times s}$ into their revised counterparts $f(X^{(l)})^{T}W_e = W_{e|x}$ and $f(X^{(l)})^{T}W_r = W_{r|x} \in \mathbb{R}^{p \times s}$, where $f(X^{(l)}) = \hat{X}^{(l)}R \in \mathbb{R}^{N \times p}$ is derived from a random projection of $X^{(l)}$ according to the Johnson–Lindenstrauss lemma; each element of the random projection matrix $R \in \mathbb{R}^{d_f \times p}$ follows a standard normal distribution, namely $R_{ij} \sim \mathcal{N}(0,1)$. Here, $e(X_i^{(l)}), r(X_j^{(l)}) \in \mathbb{R}^{s}$ are, respectively, the $s$ projection scores of the $i$th query and the $j$th key sample in the projected feature space. The edge feature is integrated into the attention map $\mathrm{ATT} \in \mathbb{R}^{N \times N}$ through the linear transformation matrices $W_1 \in \mathbb{R}^{2s}$ and $W_2 \in \mathbb{R}^{d_e + d_a}$. Global aggregation with the in-degree scaler [18] is performed as the weighted sum of neighborhood features.
$$X_g^{(l+1)} = \sigma\big(D^{1/2}\cdot\mathrm{ATT}\cdot X^{(l)}\big) \tag{8}$$
In Equation (8), the global node representation at the $(l+1)$th layer, $X_g^{(l+1)} \in \mathbb{R}^{N \times d_f}$, is obtained by aggregating $X^{(l)} \in \mathbb{R}^{N \times d_f}$ with the global asymmetric GT. In the global asymmetric GT, the optimal projection matrices $W_{e|x}, W_{r|x}$, along which the mutual information is greatest, are obtained by minimizing a regularization term in the total loss. The optimization objective for the self-attention kernel can be formulated as follows [31]:
$$\begin{aligned}
\max_{W_e, W_r, e_i, r_j}\ J &= \frac{1}{2}\sum_{i=1}^{N} e_i^{T}\Lambda e_i + \frac{1}{2}\sum_{j=1}^{N} r_j^{T}\Lambda r_j - \mathrm{Tr}\big(W_{e|x}^{T}W_{r|x}\big)\\
\text{s.t.}\quad e_i &= e\big(X_i^{(l)}\big) = W_{e|x}^{T}\,\varphi\big(q(X_i)\big),\qquad r_j = r\big(X_j^{(l)}\big) = W_{r|x}^{T}\,\varphi\big(k(X_j)\big)
\end{aligned} \tag{9}$$
In Equation (9), $\Lambda \in \mathbb{R}^{s \times s}$ is a diagonal matrix of Lagrange multipliers, which represents the penalty intensity for terms that violate the constraints. Generally, the number of principal components $s$ is not larger than the dimension $p$, namely $s \leq p$. The feasible solution of the optimization objective in Equation (9) takes the form of a least-squares support vector machine (LSSVM) (Appendix A.2). We demonstrate that the objective in Equation (9) is identically equal to 0 whenever the stationarity condition in Equation (A6) holds (Appendix A.2). To satisfy the stationarity condition, the objective of the primal problem is formulated as a regularization term in the loss function:
$$\begin{aligned}
J_a\big(W_e, W_r, \Lambda\big) &= \frac{1}{2}\sum_{i=1}^{N} e_i^{T}\Lambda e_i + \frac{1}{2}\sum_{j=1}^{N} r_j^{T}\Lambda r_j - \mathrm{Tr}\big(W_e^{T}W_r\big)\\
&= \frac{1}{2}\sum_{i=1}^{N}\Big\|\big(W_{e|X}\Lambda^{1/2}\big)^{T}\varphi\big(q(X_i)\big)\Big\|_2^2 + \frac{1}{2}\sum_{j=1}^{N}\Big\|\big(W_{r|X}\Lambda^{1/2}\big)^{T}\varphi\big(k(X_j)\big)\Big\|_2^2 - \mathrm{Tr}\big(W_{e|X}^{T}W_{r|X}\big)
\end{aligned} \tag{10}$$
In Equation (10), the regularization term $J_a$ is expected to approach 0; this is implemented using automatic differentiation libraries (e.g., PyTorch [34] and TensorFlow [35]). Although GTs cannot unbiasedly approximate the frequency responses of high-order filters, K-order directed GCNs and GTs are complementary to each other, which is why a hybrid architecture is constructed.
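One plausible reading of Equations (6) and (7) is sketched below in PyTorch, with the random projection f(X) and the data-dependent projections W_{e|x}, W_{r|x}; the scoring rule that combines e(X_i) and r(X_j) into a scalar is our assumption, and all dimensions and weights are illustrative rather than trained DSGT parameters.

```python
import torch
import torch.nn as nn

N, d_f, d_q, p, s = 64, 32, 16, 8, 4   # illustrative sizes; p is the random-projection width

X_hat = torch.randn(N, d_f)            # structure-encoded node features from the first-order DGN
W_q = nn.Linear(d_f, d_q, bias=False)
W_k = nn.Linear(d_f, d_q, bias=False)
g_q = nn.Linear(d_q, p, bias=False)    # maps the query space into the projected space
g_k = nn.Linear(d_q, p, bias=False)
W_e = torch.randn(N, s)                # primal projection directions (learnable in the model)
W_r = torch.randn(N, s)

# Random projection f(X) = X_hat R with R_ij ~ N(0, 1) (Johnson-Lindenstrauss).
R = torch.randn(d_f, p)
f_X = X_hat @ R                        # (N, p)
W_e_x = f_X.T @ W_e                    # data-dependent counterpart W_{e|x}, (p, s)
W_r_x = f_X.T @ W_r                    # data-dependent counterpart W_{r|x}, (p, s)

e = g_q(W_q(X_hat)) @ W_e_x            # e(X_i) for all nodes i, (N, s)
r = g_k(W_k(X_hat)) @ W_r_x            # r(X_j) for all nodes j, (N, s)

# Assumed scoring rule: a learned linear combination of the concatenated scores.
w1 = torch.randn(2 * s)
scores = torch.cat([e.unsqueeze(1).expand(N, N, s),
                    r.unsqueeze(0).expand(N, N, s)], dim=-1) @ w1   # (N, N), asymmetric
ATT = torch.softmax(scores, dim=-1)    # row-wise softmax attention map
```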

3.2.2. K-Order Directed Graph Convolution Network

The assumption of a self-adjoint GSO in traditional spectral graph theory is not applicable to directed graphs. We therefore elaborate a spectral graph theory completely defined on the directed graph and then construct the directed spectral graph convolution based on it. Specifically, the directed GLM is defined via graph diffusion as $T = I_N - D_{\mathrm{in}}^{-1}A$ [10]. The directed GLM $T$ is not a self-adjoint matrix, so its frequency response cannot be described in the eigenspace of $T$; however, it can be fully represented in the generalized eigenspace of $T$. The generalized spectral features of the directed graph are obtained via the Jordan decomposition.
In Equation (11), $m_\lambda$ is the algebraic multiplicity of the eigenvalue $\lambda$ of the directed GLM $T$, satisfying $(T - \lambda I_N)^{m_\lambda}P_\lambda = 0$. $P^{-1}TP = J$ is the Jordanization of $T$, where the dyad of $T$ satisfies $P_\lambda = [v_\lambda^1,\ldots,v_\lambda^{m_\lambda}][v_\lambda^1,\ldots,v_\lambda^{m_\lambda}]^{T} \in \mathbb{R}^{N \times N}$, with $v_\lambda^i, i = 1,\ldots,m_\lambda$ a set of bases of the generalized eigensubspace associated with the eigenvalue $\lambda$ [36]. In the generalized eigenspace, the polynomial of the GLM $T$ is formulated as follows:
$$g(T)\,P_\lambda = g(\lambda)\,P_\lambda + \sum_{n=1}^{m_\lambda - 1}\frac{g^{(n)}(\lambda)}{n!}\,\big(T - \lambda I_N\big)^{n}P_\lambda \tag{11}$$
$$g(T) = \sum_{\lambda} g(\lambda)\,P_\lambda + \sum_{\lambda}\sum_{n=1}^{m_\lambda - 1}\frac{g^{(n)}(\lambda)}{n!}\,\big(T - \lambda I_N\big)^{n}P_\lambda \tag{12}$$
In Equation (12), the second term of $g(T)$ represents the difference in frequency response between a self-adjoint and a non-self-adjoint GSO. The sum of the dimensions of the generalized eigensubspaces equals the number of nodes $N$, namely $\sum_\lambda m_\lambda = N$. The polynomial of the GLM $T$, $g(T)$, can also be formulated as a holomorphic functional calculus $g(T) = \frac{1}{2\pi i}\oint_{\Gamma} g(z)\,(T - zI_N)^{-1}\,dz$, where $\Gamma$ is a curve enclosing all the eigenvalues of $T$ and $g(z)$ is a polynomial in $z$. Faber polynomials are the best choice for approximating holomorphic functions in the complex plane. Fortunately, if all the eigenvalues are located inside the unit hypersphere, the Faber polynomials are equivalent to the Chebyshev polynomials [37]. Therefore, the directed GLM $T$ is normalized by the modulus of its spectral radius, namely $\tilde{T} = T/|\lambda_{\max}|$. The Faber polynomial form of $g(T)$ is formulated as follows:
$$\begin{aligned}
g_\theta(\tilde{T}) &= \sum_{k=0}^{K}\theta_k\,\tilde{T}^{k} \approx \sum_{k=0}^{K}\theta_k\,T_k(\tilde{T}),\qquad \text{where}\ \ \theta = \mathrm{MLP}\!\left(\frac{1}{N}\sum_{i=1}^{N}X_g^{(l+1)}[i,:]\right)\\
T_0(X) &= I_N,\quad T_1(X) = X,\quad T_k(X) = 2X\,T_{k-1}(X) - T_{k-2}(X),\ k \geq 2
\end{aligned} \tag{13}$$
In Equation (13), the polynomial coefficients $\theta = [\theta_0,\ldots,\theta_K] \in \mathbb{R}^{K+1}$ are obtained from the global aggregation; the readout function is the composition of mean pooling and an MLP. $T_k(\tilde{T})$ approximates the basis of the polynomial space $\tilde{T}^{k}$. By Liouville's theorem, an ideal filter that fully suppresses high-frequency noise does not exist; namely, $P_\lambda\cdot g^{(n)}(\lambda)/n! \to 0$ is not attainable as $\lambda \to 0$. To theoretically achieve this goal, we redefine the GLM $T$ in the punctured complex plane $\mathbb{C}\setminus\{y\}$ as $g_\theta(\tilde{T}) = \sum_{k=0}^{K}\theta_k\,\tilde{T}^{k} \approx \sum_{k=0}^{K}\theta_k\,T_k(\tilde{T})$, where $\tilde{T} = \frac{1}{|\lambda_{\max}|}\big(T - y\,I_N\big)$ and $y \in (0,1)$ is a singular point inside the unit hypersphere that can adjust the behavior of the holomorphic functional calculus. Finally, local aggregation is performed through the K-order directed GCN.
$$X_l^{(l+1)} = \mathrm{FFN}\!\left(D_{\mathrm{in}}^{-1}\cdot\sum_{k=0}^{K}\theta_k\,T_k(\tilde{T})\,\hat{X}^{(l)}\right),\qquad \hat{X}^{(l)} = \mathrm{Concat}\big(X^{(l)},\,P^{(l)}\big) \tag{14}$$
In Equation (14), the $l$th-layer feed-forward network $\mathrm{FFN}: \mathbb{R}^{d_f+d_p} \to \mathbb{R}^{d_f}$ ensures dimension compatibility, and $\hat{X}^{(l)} \in \mathbb{R}^{N \times (d_f+d_p)}$ is the concatenation of the $l$th-layer node features $X^{(l)} \in \mathbb{R}^{N \times d_f}$ and the node PE $P^{(l)} \in \mathbb{R}^{N \times d_p}$. Finally, we design rules for decoupling the updates of the edge attributes, node features, and node PE, which are key to improving training stability and inference efficiency.
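A compact sketch of the K-order propagation in Equations (13) and (14) is shown below, assuming a dense normalized operator and coefficients theta already produced by the readout; the FFN and the singular-point shift y are omitted, and the in-degree convention used to build D_in is an assumption.

```python
import torch

def chebyshev_propagate(T_tilde: torch.Tensor, X: torch.Tensor,
                        theta: torch.Tensor) -> torch.Tensor:
    """sum_k theta_k * T_k(T_tilde) @ X with T_0 = I, T_1 = T_tilde,
    T_k = 2 T_tilde T_{k-1} - T_{k-2} (Chebyshev recurrence of Equation (13))."""
    Tk_prev, Tk = torch.eye(T_tilde.shape[0], dtype=T_tilde.dtype), T_tilde
    out = theta[0] * X + theta[1] * (Tk @ X)
    for k in range(2, theta.shape[0]):
        Tk_prev, Tk = Tk, 2.0 * T_tilde @ Tk - Tk_prev
        out = out + theta[k] * (Tk @ X)
    return out

# Toy usage: T = I - D_in^{-1} A, normalized by the modulus of its spectral radius.
N, d = 5, 8
A = (torch.rand(N, N) < 0.4).float()
D_in_inv = torch.diag(1.0 / A.sum(dim=0).clamp(min=1.0))   # in-degree taken as column sums
T = torch.eye(N) - D_in_inv @ A
T_tilde = T / torch.linalg.eigvals(T).abs().max()
theta = torch.randn(5)                                     # K = 4, so theta has K + 1 entries
X_local = chebyshev_propagate(T_tilde, torch.randn(N, d), theta)
```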

3.2.3. Feature Decoupling Update

Decoupling the updates of the structural features and the positional coordinates helps to enhance the interpretability of the model's inference process. Specifically, the fusion of the $l$th-layer local and global node representations is formulated as $\breve{X}^{(l)} = \mathrm{MLP}\big(\mathrm{Concat}\big(X_g^{(l)}, X_l^{(l)}\big)\big)$, where the multi-layer perceptron $\mathrm{MLP}: \mathbb{R}^{2d_f} \to \mathbb{R}^{d_f}$ ensures dimension compatibility. The edge acts as the channel through which information flows in the graph, so its attribute is updated based on the features of the neighboring nodes, namely $\breve{e}_{kj}^{(l)} = \mathrm{MLP}\big(\mathrm{Concat}\big(x_k^{(l)}, x_j^{(l)}\big)\big)$. Unlike the structural updates, the node PE needs to reflect the $l$th-layer diffusion distance between nodes. Because the attention map reflects the global connectivity of the graph, the update of the $l$th-layer node PE is formulated as $\breve{P}^{(l)} = \mathrm{ATT}\cdot P^{(l-1)}$. Additionally, GraphNorm [38] and residual connections are leveraged to achieve a deeper structure and more abstract representations.
$$H^{(l)} = \mathrm{Norm}\Big(\mathrm{Dropout}\big(\breve{H}^{(l)}\big) + H^{(l-1)}\Big),\quad \text{where}\ \breve{H}^{(l)}\in\big\{\breve{X}^{(l)}, \breve{P}^{(l)}, \breve{e}_{kj}^{(l)}\big\},\ H^{(l-1)}\in\big\{X^{(l-1)}, P^{(l-1)}, e_{kj}^{(l-1)}\big\} \tag{15}$$
In Equation (15), randomly deactivating some units via $\mathrm{Dropout}(\cdot)$ helps to avoid overfitting and improves the robustness of DSGT; the dropout ratio is generally set to 0.1.
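The update rule of Equation (15) amounts to a dropout, residual, and normalization block applied independently to the node features, the node PE, and the edge attributes; a minimal sketch is shown below, with LayerNorm standing in for GraphNorm [38].

```python
import torch
import torch.nn as nn

class DecoupledUpdate(nn.Module):
    """Residual update of Equation (15): H = Norm(Dropout(H_new) + H_prev).
    The same block is instantiated separately for node features, node PE, and edges."""
    def __init__(self, dim: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)   # stand-in for GraphNorm
        self.drop = nn.Dropout(dropout)

    def forward(self, h_new: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        return self.norm(self.drop(h_new) + h_prev)

update_x = DecoupledUpdate(dim=100)
x_prev, x_new = torch.randn(64, 100), torch.randn(64, 100)
x = update_x(x_new, x_prev)
```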

3.3. End-to-End Optimization for DSGT

For node-level classification tasks, the multi-class joint probability distribution is obtained through an affine transformation and softmax operation.
$$\hat{Y} = \mathrm{softmax}\big(X^{(L)}W_o^{T}\big) \tag{16}$$
In Equation (16), $W_o \in \mathbb{R}^{d_c \times d_f}$ is the linear transformation matrix and $\mathrm{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$. The number of layers of DSGT is $L$, and $\hat{Y} \in \mathbb{R}^{N \times d_c}$ represents the class probabilities of the nodes, where $d_c$ is the number of classes. Our goal is to satisfy the stationarity condition, force the node PE to form a coordinate system constrained by the graph topology, and obtain robust performance at the node level. To achieve these goals, the loss function is formulated as follows:
$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\mathcal{L}_{\mathrm{cls}}\big(\hat{Y}_i, Y_i\big) + \eta\Big(J_a\big(W_{e|X}, W_{r|X}, \Lambda\big) + J_e\big(\hat{Y}\big) + J_p\big(P^{(L)}\big)\Big) \tag{17}$$
In Equation (17), $\hat{Y} \in \mathbb{R}^{N \times d_c}$ and $Y_i \in \{0,\ldots,d_c-1\}$ are the predicted joint probability distribution over node classes and the ground-truth node label, respectively. The entropy regularization $J_e(\hat{Y}) = \sum_{i=1}^{N}\sum_{l=1}^{d_c}\hat{Y}_{il}\ln\big(\hat{Y}_{il}+\epsilon\big)$ encourages a certain degree of uncertainty in the classification. The presence of $J_a(W_{e|X}, W_{r|X}, \Lambda)$ in Equation (17) enforces the stationarity condition of the optimization objective in Equation (A6). The regularization term for the node PE, $J_p(P^{(L)}) = \frac{1}{d_p}\mathrm{Tr}\big(P^{(L)T}\,T\,P^{(L)}\big) + \frac{\lambda}{d_p}\big\|P^{(L)T}P^{(L)} - I_{d_p}\big\|_2^2$, constrains the node PE to form an orthogonal coordinate system, where $T = I_N - D_{\mathrm{in}}^{-1}A \in \mathbb{R}^{N \times N}$ is the directed GLM and $P^{(L)} \in \mathbb{R}^{N \times d_p}$ is the final ($L$th-layer) node PE. $\mathcal{L}_{\mathrm{cls}}(\cdot)$ is the multi-class cross-entropy loss function, and $\eta$ is a regularization coefficient, generally $\eta = 0.1$. The model was built with the PyTorch framework, and the parameters were updated using the Adam optimizer. The inference and optimization processes of the DSGT model are illustrated in Algorithm 4.
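A sketch of the loss in Equation (17) is given below, assuming the attention regularizer J_a is returned by the attention module and that T_glm holds the directed GLM T = I_N - D_in^{-1}A; the helper name and default coefficients are illustrative.

```python
import torch
import torch.nn.functional as F

def total_loss(logits, y, P, T_glm, J_a, eta=0.1, lam=1.0, eps=1e-8):
    """Equation (17): cross-entropy plus the attention (J_a), entropy (J_e),
    and positional-encoding (J_p) regularizers, weighted by eta."""
    N, d_p = P.shape
    L_cls = F.cross_entropy(logits, y)                    # averaged multi-class cross-entropy
    Y_hat = F.softmax(logits, dim=-1)
    J_e = (Y_hat * torch.log(Y_hat + eps)).sum()          # entropy regularization
    J_p = torch.trace(P.T @ T_glm @ P) / d_p \
          + lam / d_p * torch.linalg.matrix_norm(P.T @ P - torch.eye(d_p)) ** 2
    return L_cls + eta * (J_a + J_e + J_p)
```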
Algorithm 4 Supervised node-level embedding algorithm.
Input: Adjacency matrix $A \in \mathbb{R}^{N \times N}$ and node feature matrix $X \in \mathbb{R}^{N \times d_f}$; number of layers $L$; learning rate $\mathrm{lr} = 10^{-3}$; true node labels $Y \in \mathbb{R}^{N}$, $y_i \in \{0, \ldots, d_c - 1\}$; number of classes $d_c$; total number of epochs $N_e$
Output: Node feature matrix $X^{(L)}$, edge feature matrix $E^{(L)}$, and node relative positions $P^{(L)}$ in the $L$th (final) layer
1: function FeatureEngineering(adjacency matrix $A$)
2:     Compute the directed graph frequencies $\{\lambda_k, k = 1, \ldots, N\}$ and a complete set of Fourier bases $\{u_k, k = 1, \ldots, N\}$ from the directed graph shift operator $A$, and obtain the global positional encoding $P^{(0)}$ from them
3:     Compute the directional smoothing tensor $B_{\mathrm{av}}$ and derivative tensor $B_{\mathrm{dx}}$ from the Fourier bases $\{u_k, k = 1, \ldots, N\}$, and then initialize the edge features $E^{(0)}$ from them and the graph spectrum $\{\lambda_i, i = 1, \ldots, N\}$
4:     return edge feature matrix $E^{(0)}$ and positional embedding $P^{(0)}$
5: end function
6: function DSGT(node feature matrix $X^{(0)}$, edge feature matrix $E^{(0)}$, node positional embedding $P^{(0)}$)
7:     for $k \in \{1, \ldots, K\}$ do
8:         Compute the global node feature matrix of the $k$th layer $X_g^{(k+1)}$ using the global asymmetric graph transformer, together with the average attention-kernel loss $J_a(W_e, W_r, \Lambda)$ over the heads
9:         Compute the local node feature matrix of the $k$th layer $X_l^{(k+1)}$ by the K-order directed graph convolution
10:         Update the edge features $E^{(k+1)}$, node features $X^{(k+1)}$, and relative positional encoding $P^{(k+1)}$ with $X_g^{(k+1)}$, $X_l^{(k+1)}$, $E^{(k)}$, and $P^{(k)}$
11:         $k \leftarrow k + 1$
12:     end for
13:     return the output node feature matrix $X^{(L)}$
14: end function
15: for current epoch $i \leq N_e$ do
16:     $X^{(0)}, A \leftarrow$ the $i$th subgraph of the training dataset
17:     $E^{(0)}, P^{(0)} = \mathrm{FeatureEngineering}(A)$
18:     $X^{(L)} = \mathrm{DSGT}(X^{(0)}, E^{(0)}, P^{(0)})$
19:     Compute the joint probability distribution of the node labels $\hat{Y}$ by an affine transformation and softmax of $X^{(L)}$, together with its entropy regularization $J_e(\hat{Y})$
20:     Compute the average loss $\mathcal{L} = \frac{1}{N}\mathcal{L}_{\mathrm{cls}}(\hat{Y}, Y) + \eta\,\mathcal{L}_{\mathrm{reg}} = \frac{1}{N}\mathcal{L}_{\mathrm{cls}}(\hat{Y}, Y) + \eta\big(J_a(W_e, W_r, \Lambda) + J_e(\hat{Y})\big)$
21:     Carry out the backpropagation algorithm for the total loss $\mathcal{L}$ and update the learnable parameters $\Theta$ with Adam($\mathrm{lr} = \mathrm{lr}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$)
22: end for
23: Return the predicted node labels $\hat{y} = \arg\max(\mathrm{softmax}(z))$

4. Results and Discussion

In this section, we evaluate the effectiveness of DSGT compared to various baseline models and the rationality of modules across multi-scale datasets. In our experiments, we employed the uniform splitting and neighbor sampling strategies for datasets with diverse attributes, in order to test the scalability of the models from a unified perspective.

4.1. Experimental Setup

In this section, we first describe the settings of the hyperparameters in our model, which are justified in Section 4.4, and then introduce the dataset settings, the details of the baselines, and the hardware and software configurations used in the experiments.

4.1.1. Dataset Settings

Homogeneous datasets include bibliographic reference networks (Cora, squirrel, ogbn-arxiv, and arxiv-year) and social networks (genius and Tolokers). In bibliographic reference networks, nodes represent academic papers and edges denote directed citation relationships between papers; the research topics of the papers provide the categorical information for the nodes. Similarly, in social networks, nodes represent users on internet platforms, and edges indicate interactions such as sharing a post. Node labels are typically derived from the users' attributes and account statuses. The computational complexity is influenced by the sparsity of the edges; by scale, the datasets are classified into small- (Cora and squirrel), medium- (genius and Tolokers), and large-scale (ogbn-arxiv and arxiv-year) ones. The distribution of node attributes and the number of nodes within neighborhoods are key factors influencing the task-related performance of the models [39]. The dataset attributes are presented in Table 2, where the degree correlation coefficient reflects the correlation between the degrees of adjacent nodes, and the node homophily indicates the similarity of node attributes within community structures. These datasets enable a comprehensive evaluation of DSGT. To optimize hardware resource usage, small subgraph datasets of similar scale were constructed by sampling nodes within a bounded neighborhood, with six sampled neighbors per node and a neighborhood radius of 10.
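The sampling setup described above can be reproduced approximately with PyG's NeighborLoader, as in the sketch below; the dataset, batch size, and split mask are illustrative.

```python
from torch_geometric.datasets import Planetoid
from torch_geometric.loader import NeighborLoader

dataset = Planetoid(root="data/Planetoid", name="Cora")   # Cora ships with PyG
data = dataset[0]

loader = NeighborLoader(
    data,
    num_neighbors=[6] * 10,        # six sampled neighbors per hop over a 10-hop radius
    batch_size=128,                # number of seed nodes per subgraph (illustrative)
    input_nodes=data.train_mask,   # seed nodes drawn from the training split
    shuffle=True,
)

for batch in loader:
    # Each batch is a sampled subgraph; the seed nodes come first in the node ordering.
    print(batch.num_nodes, batch.edge_index.shape)
    break
```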
To split the training/validation/test sets, we followed the methods described by the authors who proposed the datasets [40,41,42]. Specifically, for Cora, we used the split (50/30/20, train/val/test) provided in PyG [43]. For the Tolokers and squirrel datasets, Platonov et al. [42] provide 10 different splits, and we utilized the first five of them in our experiments. For ogbn-arxiv and arxiv-year, we used the dataset-splitting strategy described by the authors of [40] and initialized the model parameters with a different random seed for each run. For genius, Lim et al. [41] propose running each method on the same five random 50/25/25 train/val/test splits for each dataset, and we followed this protocol.

4.1.2. Baselines

DSGT integrates directed GNN and GT architectures to approximate arbitrary-order filters. To test the strength of the hybrid architecture in directed graph node classification, GTs (MagLapNet-II [18]) and directed GNNs (DGCN [6], DiGCN [9], Holonet [10], MagNet [7], and MagLapNet-I [32]) were used as baselines. MagLapNet employs the magnetic GLM both to embed the node positions and to aggregate messages in the complex domain. The convolutions of MagLapNet-I and MagLapNet-II over the imaginary and real parts are, respectively, a GCN [5] and a structure-aware transformer [18], while the node PE of both is derived from the eigendecomposition of the magnetic GLM. The dependence of the WL test on in-neighbors suggests that WL PE is applicable to both directed and undirected graphs; therefore, the GT with WL PE [14] is naturally applicable to directed graph node classification.

4.1.3. Experimental Protocol

Neighborhood sampling extracts a subgraph consisting of mini-batch seed nodes and a subset of the edges between nodes within their K-hop neighborhoods from a large graph. This approach circumvents the out-of-memory (OOM) issue caused by the quadratic computational complexity of global attention mechanisms. How the dataset is constructed and the batch size both have an impact on classification performance, leading to differences from the results reported in the original papers. Subsequently, data preprocessing initializes the structure-aware edge embeddings and node PE. The initial embeddings and node attributes are then fed into a single-layer DSGT to infer the multi-class joint probability distributions of the seed nodes.
For each run, the weight matrices are initialized using Xavier normal initialization, while all biases are set to zero. Macro accuracy is adopted as the evaluation metric for model performance, and the model parameters are updated using the Adam optimizer with a learning rate of $10^{-3}$ and beta parameters of 0.9 and 0.99. The loss function is a weighted sum of the multi-class cross-entropy loss and the PrimalAttention optimization objective. The dropout rate is 0.1 during model training. Additionally, the hyperparameters for DSGT and the baselines are a directed graph filter order of $K = 4$, a node positional encoding dimension of $D_p = 18$, and a structure-aware edge embedding dimension of $D_e = 108$. The dimension of the hidden layers is 100, while the output dimension equals the number of node classes.
Our pipelines for data loading, model training, and deployment were built using the PyTorch Lightning framework. The EarlyStopping callback of PyTorch Lightning, with a threshold of $10^{-4}$, a minimum of 100 epochs, and a maximum of 200 epochs, monitors the prediction accuracy of the model on the validation set and stops training when no improvement is observed. Meanwhile, we used the GradientAccumulationScheduler callback for gradient accumulation and scheduled the learning rate with ReduceLROnPlateau, with a maximum learning rate of $10^{-2}$ and a minimum of $10^{-4}$; the learning rate was adjusted based on the monitored metric to prevent overfitting during training.
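A configuration sketch of the callbacks described above is shown below, using standard PyTorch Lightning APIs; the monitored metric name, patience, and accumulation schedule are assumptions rather than the exact values used in our experiments.

```python
import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, GradientAccumulationScheduler

early_stop = EarlyStopping(monitor="val_acc", mode="max", min_delta=1e-4, patience=20)
grad_accum = GradientAccumulationScheduler(scheduling={0: 4})   # accumulate 4 batches (illustrative)

trainer = pl.Trainer(
    min_epochs=100,
    max_epochs=200,
    accelerator="gpu",
    devices=2,
    callbacks=[early_stop, grad_accum],
)

# Inside the LightningModule, the learning-rate schedule would be wired up roughly as:
# def configure_optimizers(self):
#     opt = torch.optim.Adam(self.parameters(), lr=1e-3)
#     sch = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode="max", min_lr=1e-4)
#     return {"optimizer": opt,
#             "lr_scheduler": {"scheduler": sch, "monitor": "val_acc"}}
```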

4.1.4. Experimental Environment

The DSGT model was implemented based on PyTorch, and we employed the MPNN paradigm of the PyG library to design its convolution module, whose implementation can be seen in Appendix D. End-to-end training on a multi-GPU setup was achieved via PyTorch Lightning. All the experiments were repeated five times on two 12 GB NVIDIA GeForce RTX 3080 Ti GPUs and a 40-core Intel Xeon Gold 6248 CPU, with no fewer than 100 epochs each time.

4.2. Comparative Experiments

In this section, we present the advantages of our model architecture over the aforementioned baselines for the node classification task and demonstrate the effectiveness of our approach for node positional embedding compared to that of other methods.

4.2.1. Supervised Node Classification

The experimental results for DSGT and the baselines are shown in Table 3. DSGT achieved state-of-the-art performance on the Cora, genius, Tolokers, and ogbn-arxiv datasets, improving the node classification accuracy by 2.96%, 6.63%, 6.60%, and 0.41%, respectively. DGCN exhibited the poorest performance in the node classification task. Among the baselines, DGCN utilizes first-order and second-order proximal matrices for aggregation. For a network with strong degree correlations, second-order proximal aggregation may introduce many additional edges, increasing the likelihood of overfitting. The average node classification accuracy of DGCN is positively correlated with the node homophily, suggesting that more homogeneous neighbors lead to more similar node representations through second-order proximal aggregation. In DiGCN, the PageRank algorithm approximates an irreducible and aperiodic graph Laplacian matrix (GLM), ultimately enabling a stationary Markov distribution through step-wise convolution. However, DiGCN inherits the same drawbacks as DGCN due to its Inception module, as message aggregation may amplify the noise caused by neglecting the information diffusion direction, particularly when node degree distributions are positively correlated.
Holonet outperforms both DGCN and DiGCN by respecting the directed graph topology. Compared to MagNet, MagLapNet-I/II utilizes a magnetic GLM based on the directed adjacency matrix, allowing it to capture directed cyclic subgraphs through phase shifts, which is the key to the superior performance of MagLapNet-I/II over MagNet. In contrast to the aforementioned GCN architectures, both MagLapNet-II and DSGT are GT architectures, which typically leverage global attention mechanisms and the ability to incorporate edge attributes to gain an advantage in node classification tasks. However, the arxiv-year dataset exhibits low node homophily and degree correlation, which prevents the GT from gaining additional expressiveness over GCN architectures through node PE and global adaptive aggregation. Among the GTs, GT-WL underperforms because the layer-wise hashing of a node's neighborhood information fails to capture the varying importance of nodes under different graph patterns. As shown in Table 3, effectively balancing the inductive bias with the scope of the receptive field can significantly improve the expressiveness of the model.

4.2.2. Effectiveness of Node Positional Embeddings

The comparative experiment on the effectiveness of node PE is presented in Table 4. DSGT with directed PE (ours) consistently outperformed the other methods across most of the datasets, achieving average improvements of 1.94%, 3.4%, 7.79%, 2.8%, and 0.41%, respectively. Despite the varying optimality of the feasible solutions obtained in each iteration, DSGT with directed PE (ours) remains robust by maintaining the relative importance of nodes under the graph patterns [22]. Magnetic LapPE-ABS represents the node information gain and is critical for node classification on the genius dataset, in which both the degree correlation and node homophily are strong, leading to its good performance. Similar to a GCN, WL PE iteratively integrates neighborhood structural information, progressively refining the position representation. Although a global WL PE theoretically exists, in practice its discriminability diminishes as the neighborhood radius expands, causing DSGT with WL PE to underperform among the GTs, primarily due to its failure to incorporate a directional inductive bias. Whether a node PE plays a task-relevant role in directed graphs determines its utility. Notably, the top 5% of singular values and the corresponding singular vectors capture about 95% of the matrix's characteristics, explaining why SVD PE outperforms Magnetic LapPE-ABS/REAL.

4.3. Ablation Experiment

The results of the ablation experiment are presented in Table 5. It can be observed that PrimalAttention, structural encoding, node PE, initial edge embedding, and the adaptive generation of filter coefficients led to performance improvements of 2.23%, 3.56%, 11.67%, 3.25%, and 3.79%, respectively. Among these, node PE had the most significant impact on DSGT's performance, while initial edge encoding had the least. In the arxiv-year dataset, the absence of structure encoding led to a 0.78% improvement in model performance. It is evident that node attributes play an overwhelming role in node classification here because the dataset does not include edge attributes that reflect semantic connections between nodes. Node PE serves as a soft inductive bias, providing position-aware information to the GT. When node PE is removed, the GT behaves similarly to a GAT on a fully connected graph, with over-smoothing and over-squashing. Additionally, the asymmetric attention kernels [30] in DSGT helped to limit the performance degradation to within 10%. The adaptive generation of filter coefficients unified the behavior of the polynomial filters in the spectral space, yielding an average performance improvement of 3.79% across all the datasets. Although the initial edge embedding and structural encoding did not significantly enhance the classification performance across the five datasets, they remain crucial for node-level representation learning on structurally informative datasets, such as traffic networks and protein–protein interaction (PPI) networks.

4.4. Hyperparameter Tuning

The hyperparameters tuned across all the datasets were the order of the polynomial filters ($K = 4$), the dimension of the node PE ($D_p = 18$), and the dimension of the initial edge embedding ($D_e = 108$).

4.4.1. The Order of Polynomial Filters

The effect of the order of the polynomial filters on model performance for the Cora dataset, under different node PEs, is presented in Table 6. DSGT with directed PE (ours) achieved its best average classification accuracy, 79.96%, when the filter order was set to four. SVD PE ranked second only to directed PE (ours), indicating that the asymmetry of the directed GLM can reflect the spatial distances between nodes under graph patterns. Because the underlying undirected graph does not uniquely characterize a directed graph, DSGT with undirected LapPE yielded inconsistent inference results when dealing with isomorphic graphs, resulting in the worst performance in the classification task. Most of the node PEs also achieved their best accuracy–computation trade-off when the filter order was set to four.

4.4.2. Dimension of Node Positional and Initial Edge Embedding

The effect of increasing the dimension of the node PE on the total number of parameters $\Theta$ (M) and the average classification accuracy $\Lambda$ (%) is shown in Table 7. Each increment in the node PE dimension increased the total parameter count by roughly 0.43M. To describe the impact of dimension changes on computational efficiency, a utility measure was defined as $P = \Lambda / \Theta$. For most of the PE methods, the computational efficiency tended to decline as the dimension increased beyond 18; consequently, the dimension of the node PE was set to 18. The relationship between the dimension of the initial edge embedding and the same two metrics is presented in Table 8. Each increment of 18 in the initial edge embedding dimension increased the parameter count by approximately 0.21M. Similarly, once the number of graph Fourier bases exceeded six (i.e., $D_e = 108$), the computational efficiency began to decrease; therefore, the initial edge embedding dimension was set to 108.

5. Conclusions

In the context of directed graph node representation learning, we proposed a hybrid architecture model, called DSGT, which integrates both GT and GCN architectures. Our approach introduces innovative methods for directed node PE and structure-aware edge embedding, specifically designed for the directed GT. Through comparative experiments, we not only demonstrated the SOTA performance of DSGT against baselines, but also highlighted its scalability across multi-scale datasets. Additionally, an ablation experiment confirmed the effectiveness of DSGT’s functional modules. We aim to pave the way for future advancements in node-level representation learning by further coupling graph transformer and GCN architectures.

Author Contributions

Conceptualization, G.H., Q.Y., F.C. and G.C.; conceiving and designing the experiments, and analyzing the data and writing the paper, G.H.; performing the experiments and contributing to data analysis, G.H. and Q.Y.; contributing analysis tools and reviewing the manuscript, F.C. and G.C.; supervising the project and providing critical feedback on the manuscript, G.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available in the PyG datasets. These data were derived from the following resources available in the public domain (PyG datasets website: https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html, accessed on 10 October 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. First-Order Extremum Condition for Dual Problems

The primal problem of self-attention is formulated as follows:
$$\max_{W_e, W_r, e_i, r_j}\; \min_{h_{e_i}, h_{r_j}}\; \mathcal{L} = \frac{1}{2}\sum_{i=1}^{N} e_i^{T}\Lambda e_i + \frac{1}{2}\sum_{j=1}^{N} r_j^{T}\Lambda r_j - \operatorname{Tr}\!\left(W_e^{T} W_r\right) \tag{A1}$$
$$\text{KKT conditions:}\quad
\begin{cases}
h_{e_i}^{T}\left(e_i - W_e^{T} f(X)\,\varpi(X_i)\right) = 0, & h_{e_i} > 0\\[2pt]
h_{r_j}^{T}\left(r_j - W_r^{T} f(X)\,\varpi(X_j)\right) = 0, & h_{r_j} > 0\\[2pt]
e_i - W_e^{T} f(X)\,\varpi(X_i) = r_j - W_r^{T} f(X)\,\varpi(X_j) = 0
\end{cases}\tag{A2}$$
In Equation (A2), the KKT conditions consist, successively, of the complementary slackness conditions, the dual feasibility conditions, and the primal feasibility conditions. The corresponding dual problem under the KKT conditions is as follows:
$$\max_{W_e, W_r, e_i, r_j}\; \min_{h_{e_i}, h_{r_j}}\; \mathcal{L} = \frac{1}{2}\sum_{i=1}^{N} e_i^{T}\Lambda e_i + \frac{1}{2}\sum_{j=1}^{N} r_j^{T}\Lambda r_j - \operatorname{Tr}\!\left(W_e^{T} W_r\right) - \sum_{i=1}^{N} h_{e_i}^{T}\left(e_i - W_e^{T} f(X)\,\varpi(X_i)\right) - \sum_{j=1}^{N} h_{r_j}^{T}\left(r_j - W_r^{T} f(X)\,\varpi(X_j)\right) \tag{A3}$$
The extremum conditions of the dual problem are as follows:
$$\begin{cases}
\dfrac{\partial \mathcal{L}}{\partial W_e} = 0 \;\Rightarrow\; W_r = \displaystyle\sum_{i=1}^{N} f(X)\,\varphi(X_i)\, h_{e_i}^{T} \in \mathbb{R}^{N\times s}, & \dfrac{\partial \mathcal{L}}{\partial e_i} = 0 \;\Rightarrow\; \Lambda e_i = h_{e_i}\\[8pt]
\dfrac{\partial \mathcal{L}}{\partial W_r} = 0 \;\Rightarrow\; W_e = \displaystyle\sum_{j=1}^{N} f(X)\,\varphi(X_j)\, h_{r_j}^{T} \in \mathbb{R}^{N\times s}, & \dfrac{\partial \mathcal{L}}{\partial r_j} = 0 \;\Rightarrow\; \Lambda r_j = h_{r_j}
\end{cases}\tag{A4}$$

Appendix A.2. Stationary Condition

The dual problem under the KKT conditions is defined in Equation (A3). We can formulate the extremum conditions of $\max_{W_e, W_r, e_i, r_j}\mathcal{L}$ together with the KKT conditions as the following equations:
$$\begin{aligned}
\Lambda^{-1} h_{e_i} &= \Big(\sum_{j=1}^{N} f(X)\,\varphi_k(x_j)\, h_{r_j}^{T}\Big)^{T} f(X)\,\varphi_q(x_i) = \sum_{j=1}^{N} h_{r_j}\, \underbrace{\varphi_k(x_j)^{T} f(X)^{T} f(X)\,\varphi_q(x_i)}_{K_{ij}}\\
\Lambda^{-1} h_{r_j} &= \Big(\sum_{i=1}^{N} f(X)\,\varphi_q(x_i)\, h_{e_i}^{T}\Big)^{T} f(X)\,\varphi_k(x_j) = \sum_{i=1}^{N} h_{e_i}\, \underbrace{\varphi_q(x_i)^{T} f(X)^{T} f(X)\,\varphi_k(x_j)}_{K_{ij}}
\end{aligned}\tag{A5}$$
In Equation (A5), $K \in \mathbb{R}^{N\times N}$ is the kernel matrix, and Equation (A5) takes the form of an LSSVM.
$$\begin{cases}
\Lambda^{-1} h_{e_i} = K_{i,:}\, H_r\\
\Lambda^{-1} h_{r_j} = \left(K^{T}\right)_{j,:} H_e
\end{cases}
\;\Rightarrow\;
\begin{cases}
\Lambda^{-1} H_e = K H_r\\
\Lambda^{-1} H_r = K^{T} H_e
\end{cases}
\;\Rightarrow\;
\begin{bmatrix} 0 & K\\ K^{T} & 0 \end{bmatrix}
\begin{bmatrix} H_e\\ H_r \end{bmatrix}
=
\begin{bmatrix} H_e\\ H_r \end{bmatrix}\Sigma
\tag{A6}$$
$$\begin{cases}
\Lambda^{-1} H_e = K H_r\\
\Lambda^{-1} H_r = K^{T} H_e
\end{cases}
\;\Rightarrow\;
\begin{cases}
\Lambda^{-1} K^{T} H_e = \Lambda^{-1}\Lambda^{-1} H_r = K^{T} K H_r\\
\Lambda^{-1} K H_r = \Lambda^{-1}\Lambda^{-1} H_e = K K^{T} H_e
\end{cases}
\;\Rightarrow\;
\begin{cases}
\Lambda^{-2} H_r = K^{T} K H_r\\
\Lambda^{-2} H_e = K K^{T} H_e
\end{cases}\tag{A7}$$
In Equation (A6), $H_r = [h_{r_1}, \ldots, h_{r_N}]^{T} \in \mathbb{R}^{N\times s}$, $H_e = [h_{e_1}, \ldots, h_{e_N}]^{T} \in \mathbb{R}^{N\times s}$, and $\Lambda^{-1} = \Sigma$. The kernel matrix is $K_{ij} = \left\langle f(X)\varphi_q(X_i),\, f(X)\varphi_k(X_j)\right\rangle = \left\langle \varphi_q(X_i),\, \varphi_k(X_j)\right\rangle$.
Equation (A6) is the stationary condition under which the objective value of the dual problem is identically equal to zero.
Proof. Substituting Equation (A6) and Equation (A4) into the dual objective in Equation (A3), we have the following:
$$\begin{aligned}
\min_{h_{e_i}, h_{r_j}}\max_{W_e, W_r, e_i, r_j}\mathcal{L}
&= \frac{1}{2}\sum_{i=1}^{N}\big(\Lambda^{-1}h_{e_i}\big)^{T}\Lambda\big(\Lambda^{-1}h_{e_i}\big) + \frac{1}{2}\sum_{j=1}^{N}\big(\Lambda^{-1}h_{r_j}\big)^{T}\Lambda\big(\Lambda^{-1}h_{r_j}\big)
- \operatorname{Tr}\!\Big[\Big(\sum_{j=1}^{N} f(X)\varphi_k(X_j)h_{r_j}^{T}\Big)^{T}\Big(\sum_{i=1}^{N} f(X)\varphi_q(X_i)h_{e_i}^{T}\Big)\Big]\\
&\quad - \sum_{i=1}^{N} h_{e_i}^{T}\Big(e_i - \Big(\sum_{j=1}^{N} f(X)\varphi_k(X_j)h_{r_j}^{T}\Big)^{T} f(X)\varphi_q(X_i)\Big)
- \sum_{j=1}^{N} h_{r_j}^{T}\Big(r_j - \Big(\sum_{i=1}^{N} f(X)\varphi_q(X_i)h_{e_i}^{T}\Big)^{T} f(X)\varphi_k(X_j)\Big)\\
&= \frac{1}{2}\sum_{i=1}^{N} h_{e_i}^{T}\Lambda^{-1}h_{e_i} + \frac{1}{2}\sum_{j=1}^{N} h_{r_j}^{T}\Lambda^{-1}h_{r_j}
- \operatorname{Tr}\!\Big[\sum_{i=1}^{N}\sum_{j=1}^{N} h_{r_j}\,\varphi_k(X_j)^{T} f(X)^{T} f(X)\varphi_q(X_i)\, h_{e_i}^{T}\Big]\\
&= \frac{1}{2}\operatorname{Tr}\!\big(H_r^{T}\Lambda^{-1}H_r\big) + \frac{1}{2}\operatorname{Tr}\!\big(H_e^{T}\Lambda^{-1}H_e\big) - \operatorname{Tr}\!\big(H_e^{T} K H_r\big)\\
&= \frac{1}{2}\operatorname{Tr}\!\big(H_r^{T}\Lambda^{-1}H_r\big) + \frac{1}{2}\operatorname{Tr}\!\big(H_e^{T}\Lambda^{-1}H_e\big) - \operatorname{Tr}\!\big(H_e^{T}\Lambda^{-1}H_e\big)\\
&= \frac{1}{2}\operatorname{Tr}\!\big(H_r^{T}\Lambda^{-1}H_r\big) - \frac{1}{2}\operatorname{Tr}\!\big(H_e^{T}\Lambda^{-1}H_e\big)
= \frac{1}{2}\operatorname{Tr}\!\big(\Lambda^{-1}\big) - \frac{1}{2}\operatorname{Tr}\!\big(\Lambda^{-1}\big) = 0,
\end{aligned}\tag{A8}$$
where $H_e, H_r \in \mathbb{R}^{N\times s}$ and $K \in \mathbb{R}^{N\times N}$. □

Appendix A.3. Advantages over Traditional Attention Kernel

The projection scores e ( X i ) and r ( X j ) are formulated in the primal and dual problem as follows:
$$\text{Primal:}\;
\begin{cases}
e_i = W_e^{T}\big|_{X}\, \varphi_q(x_i)\\[2pt]
r_j = W_r^{T}\big|_{X}\, \varphi_k(x_j)
\end{cases}
\qquad
\text{Dual:}\;
\begin{cases}
e_i = \displaystyle\sum_{j=1}^{N} h_{r_j} K_{ij}\\[2pt]
r_j = \displaystyle\sum_{i=1}^{N} h_{e_i} K_{ij}
\end{cases}\tag{A9}$$
In Equation (A9), traditional self-attention is formulated as $e_i = \sum_{j=1}^{N} h_{r_j} K_{ij}$, where $K_{ij}$ is the attention score between the $i$th and $j$th nodes and $h_{r_j}$ is a sample from the value subspace. In our work, $e_i$ and $r_j$ are considered to describe asymmetric dependence from mutual perspectives.
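As a concrete illustration of Equation (A9), the snippet below contrasts the dual view, in which each score is an attention-weighted sum through the kernel matrix, with the primal view, in which the two projection scores $e_i$ and $r_j$ are computed directly from learned weights. All tensor names, the identity feature maps, and the random dual variables are illustrative assumptions rather than the exact DSGT implementation.

```python
import torch

torch.manual_seed(0)
N, d, s = 6, 16, 4            # nodes, feature dimension, rank-space dimension
X = torch.randn(N, d)         # node features; phi_q / phi_k are taken as identity maps here

W_e = torch.randn(d, s)       # primal projection weights (illustrative)
W_r = torch.randn(d, s)

# Primal view (Equation (A9), left): two asymmetric projection scores per node.
E = X @ W_e                   # e_i, shape [N, s]
R = X @ W_r                   # r_j, shape [N, s]

# Dual view (Equation (A9), right): scores expressed through the kernel matrix
# K_ij = <phi_q(x_i), phi_k(x_j)> and dual variables h_e, h_r.
K = X @ X.T                   # an asymmetric kernel would use distinct maps for rows and columns
H_e = torch.randn(N, s)       # dual variables (illustrative stand-ins)
H_r = torch.randn(N, s)
E_dual = K @ H_r              # e_i = sum_j K_ij h_{r_j}
R_dual = K.T @ H_e            # r_j = sum_i K_ij h_{e_i}

print(E.shape, R.shape, E_dual.shape, R_dual.shape)
```

The primal form avoids materializing the full $N \times N$ kernel matrix, which is the practical motivation for computing the two projection scores directly.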

Appendix B

The abbreviations and notations employed throughout this paper are detailed in Table A1 and Table A2, ensuring clarity and ease of reference.
Table A1. Abbreviations and Descriptions.
Abbreviation | Description
DSGT | Directed Spectral Graph Transformer
Node PE | Node Positional Embedding
SP | Signal Processing
KSVD | Kernel Singular Value Decomposition
TV | (Undirected) Total Variation
DV | Directed Total Variation
LSSVM | Least-Squares Support Vector Machine
GLM | Graph Laplacian Matrix
GSO | Graph Shift Operator (e.g., graph Laplacian matrix, graph adjacency matrix)
GT | Graph Transformer
GCN | Graph Convolution Network
GNN | Graph Neural Network
GFT | Graph Fourier Transform
IGFT | Inverse Graph Fourier Transform
RKBS | Reproducing Kernel Banach Spaces
MLP | Multi-Layer Perceptron
GP | Global Maximum Pooling
FFN | Feed-Forward Network
Table A2. Symbols and Descriptions.
Symbol | Description
$N$ | The total number of nodes in a graph
$A \in \mathbb{R}^{N \times N}$ | Binary graph adjacency matrix, where $A_{ij} \in \{0, 1\}$
$D^{\mathrm{in}} \in \mathbb{R}^{N \times N}$ | Graph in-degree diagonal matrix, where $D^{\mathrm{in}}_{ii} = \sum_{j=1}^{N} A_{j,i}$
$d_f$ | The dimension of the node feature
$d_p$ | The dimension of the node positional embedding
$d_e$ | The dimension of the initial edge embedding
$d_a$ | The dimension of the edge attribute
$L \in \mathbb{R}^{N \times N}$ | Undirected graph Laplacian matrix
$T \in \mathbb{R}^{N \times N}$ | Directed graph Laplacian matrix
Concat | Stack operation in the feature dimension
Transformer | Transformer encoder
$\mathrm{GP}_i$ | Global maximum pooling for the $i$th dimension
$\sigma(\cdot)$ | Nonlinear activation function
$s$ | The dimension of the rank space, i.e., the number of principal components
$p$ | The dimension of the basis vectors in the rank space
$\cdot$ | Matrix multiplication
$\times$ | Scalar multiplication

Appendix C

In this section, we introduce the theoretical foundations of the algorithm design, including undirected spectral graph convolution theory and the optimal decomposition of the nonlinear kernel SVD. When the edges are directed, undirected spectral graph theory no longer applies directly, but it still provides an effective paradigm for constructing graph filters. In the global asymmetric GT, the attention kernel captures mutually asymmetric dependence from both perspectives, which aligns with the optimization objective of the nonlinear kernel SVD.

Appendix C.1. Spectral Graph Convolution Theory

In the spectral domain, node features can be reconstructed using the graph Fourier basis.
$$f_{out}(i) = \sum_{l=0}^{N-1} \hat{f}_{in}(\lambda_l)\,\hat{h}(\lambda_l)\, u_l(i) = \sum_{j=1}^{N} f_{in}(j) \sum_{k=0}^{K} a_k \left(L^{k}\right)_{i,j} \tag{A10}$$
In Equation (A10), the node signal is reconstructed by the GFT and IGFT. $u_l \in \mathbb{R}^{N}$ is the $l$th graph Fourier basis vector, and $(\cdot)^{*}$ is the conjugate transpose operator. The GLM $L = \sum_{l=1}^{N}\lambda_l\, u_l u_l^{*} = U\Lambda U^{*} \in \mathbb{R}^{N\times N}$ is formulated as the weighted sum of $N$ dyads $u_l u_l^{*}$ [36]. $\hat{f}_{in}(\lambda_l) = \sum_{j=1}^{N} f_{in}(j)\, u_l^{*}(j)$ is the frequency spectrum obtained by the GFT, where $f_{in}(j)$ is the $j$th node signal and $\hat{h}(\lambda_l)$ is the $K$-order polynomial filter response at the $l$th frequency, with $K+1$ coefficients $a = \{a_0, \ldots, a_K\}$. Equation (A10) can also be written compactly as follows:
$$G_{\theta} \star X = U \cdot g_{\theta}(\Lambda)\cdot\left(U^{*}X\right) = U\Big(\sum_{i=0}^{k}\theta_i \Lambda^{i}\Big)U^{*}X = \sum_{i=0}^{k}\theta_i \left(U\Lambda U^{*}\right)^{i}X = \sum_{i=0}^{k}\theta_i L^{i}X \tag{A11}$$
In Equation (A11), $X \in \mathbb{R}^{N\times M}$ is the $M$-dimensional feature matrix of the $N$ nodes, and the eigendecomposition of the GLM is $U\Lambda U^{T}$ (in the real domain, $U^{*} = U^{T}$). The frequency response is $g_{\theta}(\Lambda)\cdot(U^{T}X)$, with the scalar-to-scalar function $g_{\theta}(\Lambda) = \sum_{i=0}^{k}\theta_i\Lambda^{i}$ in polynomial space; $U^{T}X$ is the frequency spectrum obtained via the GFT, and, conversely, $U\cdot g_{\theta}(\Lambda)\cdot(U^{T}X)$ is the reconstruction of the node signal via the IGFT. However, the computational overhead of taking powers of the GLM is prohibitive. Chebyshev polynomials can approximate arbitrary functions of a matrix whose spectrum lies within the unit interval; hence, the GLM is first normalized by its spectral radius, and the Chebyshev polynomials are then employed as a basis of the polynomial space.
$$y = \sigma\Big(\sum_{k=0}^{K}\theta_k T_k(\hat{L})X\Big) \approx \sigma\Big(\sum_{i=0}^{k}\theta_i L^{i}X\Big)\tag{A12}$$
Equation (A12) uses the Chebyshev polynomials $\{T_k(\hat{L}),\, k = 0, \ldots, K\}$, where $T_0(X) = I$, $T_1(X) = X$, and $T_k(X) = 2XT_{k-1}(X) - T_{k-2}(X)$, together with the normalized GLM $\hat{L} = L_{sym} - I_N$, where $L_{sym} = I_N - D^{-1/2}AD^{-1/2}$ is the symmetrically normalized GLM. $\theta = \{\theta_k,\, k = 0, \ldots, K\}$ are the $K+1$ coefficients of the $K$-order polynomial filter. The first-order polynomial filter can be approximated as follows:
$$X^{l+1} = \sigma\!\left(\left(\theta_0 + \theta_1\left(L_{sym} - I_N\right)\right) X^{l}\right) = \sigma\!\left(\theta_0 X^{l} - \theta_1 D^{-1/2} A D^{-1/2} X^{l}\right)\tag{A13}$$
In Equation (A13), $X^{l} \in \mathbb{R}^{N\times M}$ denotes the node features at the $l$th layer. If $\theta_0 = \theta_1 = \theta$, the simplest version of the first-order filter is as follows:
$$X^{l+1} = \sigma\!\left(\theta\left(I_N - D^{-1/2} A D^{-1/2}\right) X^{l}\right) \;\Rightarrow\; X^{l+1} = \sigma\!\left(\left(I_N - D^{-1/2} A D^{-1/2}\right) X^{l} W^{l}\right)\tag{A14}$$
In Equation (A14), $\theta \in \mathbb{R}$ is a scalar. Up to this point, the equivalence between spatial convolution and the spectral filter has been shown for a self-adjoint GSO. Furthermore, the presence of the FFN allows the GCN to identify complex inter-dimensional correlations. Finally, the spectral GCN is formulated as in Equation (A14), where $W^{l} \in \mathbb{R}^{M\times M}$ is a linear transformation matrix.
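The following sketch numerically mirrors Equations (A12)–(A14) on a small undirected graph: it builds the symmetrically normalized GLM, evaluates a $K$-order Chebyshev filter, and applies the simplified first-order propagation. The function names and the toy cycle graph are assumptions for illustration only, not the DSGT code.

```python
import torch

def sym_normalized_laplacian(adj: torch.Tensor) -> torch.Tensor:
    """L_sym = I - D^{-1/2} A D^{-1/2} for an undirected adjacency matrix."""
    deg = adj.sum(dim=1)
    d_inv_sqrt = deg.clamp(min=1e-12).pow(-0.5)
    return torch.eye(adj.size(0)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]

def chebyshev_filter(L_sym: torch.Tensor, X: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """Evaluate sum_k theta_k T_k(L_hat) X with L_hat = L_sym - I (Equation (A12))."""
    L_hat = L_sym - torch.eye(L_sym.size(0))
    T_prev, T_curr = X, L_hat @ X                     # T_0(L_hat)X and T_1(L_hat)X
    out = theta[0] * T_prev + theta[1] * T_curr
    for k in range(2, theta.numel()):
        T_next = 2 * L_hat @ T_curr - T_prev          # Chebyshev recurrence
        out = out + theta[k] * T_next
        T_prev, T_curr = T_curr, T_next
    return out

# Toy graph: 4-node undirected cycle with 3-dimensional node features.
adj = torch.tensor([[0., 1., 0., 1.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.],
                    [1., 0., 1., 0.]])
X = torch.randn(4, 3)
L_sym = sym_normalized_laplacian(adj)

y_cheb = chebyshev_filter(L_sym, X, theta=torch.randn(5))     # K = 4 filter, as tuned in Section 4.4
W = torch.randn(3, 3)
y_gcn = torch.relu(L_sym @ X @ W)                             # simplified first-order form (Equation (A14))
print(y_cheb.shape, y_gcn.shape)
```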

Appendix C.2. Nonlinear Kernel SVD

The optimization objective is to find, for each sample space, $r$ projection vectors along which the similarity of one space to the other is greatest, where $r$ is the rank of the matrix $A$. The left and right singular vectors of the SVD $A = U\Lambda V^{T}$ reflect the asymmetric similarity of the row and column spaces of $A$, respectively. The linear dependence between the two sample spaces is usually not evident in the data space. Kernel SVD builds on SVD and identifies complex correlations in reproducing kernel Banach spaces (RKBS). Consider the rows and columns of a matrix $A \in \mathbb{R}^{N\times M}$ as two sample spaces: $N$ row samples $X = \{A_{i,:} \equiv x_i\}_{i=1}^{N} \subset \mathbb{R}^{M}$ and $M$ column samples $Z = \{A_{:,j} \equiv z_j\}_{j=1}^{M} \subset \mathbb{R}^{N}$. In the RKBS, $\phi(x_i) \in \mathbb{R}^{p}$ and $\psi(z_j) \in \mathbb{R}^{p}$ are samples in the row and column feature spaces, where $\phi(\cdot): \mathbb{R}^{M} \to \mathbb{R}^{p}$ and $\psi(\cdot): \mathbb{R}^{N} \to \mathbb{R}^{p}$ perform the nonlinear mapping from the data space to the RKBS and ensure dimension compatibility. The $r$ $p$-dimensional projection vectors are formulated as follows, under the constraint $p \geq r$.
$$\begin{cases}
A_{\phi} = [a_{\phi_1}, \ldots, a_{\phi_r}], & A_{\phi} = \Phi^{T} B_{\phi}\\[2pt]
A_{\psi} = [a_{\psi_1}, \ldots, a_{\psi_r}], & A_{\psi} = \Psi^{T} B_{\psi}
\end{cases}\tag{A15}$$
In Equation (A15), according to the generic concept of an RKBS, any vector can be expressed as a linear combination of samples in that space, so $a_{\phi_l} = \sum_{i=1}^{N} b_{\phi_{l,i}}\,\phi(x_i)$ and $a_{\psi_l} = \sum_{j=1}^{M} b_{\psi_{l,j}}\,\psi(z_j)$. The feature subspaces $A_{\phi}$ and $A_{\psi}$ consist of $r$ linearly independent vectors from the row and column spaces, respectively. $\Phi = [\phi(x_1), \ldots, \phi(x_N)] \in \mathbb{R}^{N\times p}$ and $\Psi = [\psi(z_1), \ldots, \psi(z_M)] \in \mathbb{R}^{M\times p}$ collect the samples from the row and column feature spaces, respectively. Maximizing the variance of a multi-distributional sample yields principal components, i.e., the directions along which the characteristics of the sample distribution are most evident. Meanwhile, comprehensive asymmetry requires the two feature subspaces to be mutually orthogonal. The optimization objective of KSVD is formulated as follows:
$$\max_{B_{\phi}, B_{\psi}}\; \frac{1}{2}\Big(\operatorname{Tr}\hat{\Sigma}_{\phi} + \operatorname{Tr}\hat{\Sigma}_{\psi}\Big) = \frac{1}{2}\left\| G^{T} B_{\phi}\right\|_{2}^{2} + \frac{1}{2}\left\| G B_{\psi}\right\|_{2}^{2} \qquad \text{s.t.}\;\; A_{\phi}^{T} A_{\psi} = B_{\phi}^{T} G B_{\psi} = I_r \tag{A16}$$
In Equation (A16), $\Sigma_{\phi} = \left(\Psi A_{\phi}\right)^{T}\left(\Psi A_{\phi}\right)$ is the variance of the samples from the row space of matrix $A$ along the principal components of $A$'s column space. Similarly, $\Sigma_{\psi} = \left(\Phi A_{\psi}\right)^{T}\left(\Phi A_{\psi}\right)$ is the variance of the samples from $A$'s column space along the principal components of $A$'s row space. The kernel matrix $G \in \mathbb{R}^{N\times M}$ reflects the dependence between the row and column samples, and each element $G_{ij} = \kappa(x_i, z_j) = \phi(x_i)^{T}\psi(z_j)$ is computed by either the SNE or T kernel between the $i$th row sample and the $j$th column sample. The orthogonality-preserving constraint highlights the independence of the two feature subspaces in which the correlation of the row and column samples is evaluated in the RKBS. The above is also called the shifted eigenvalue problem, whose feasible solution is as follows:
$$\begin{cases}
G^{T} G B_{\psi} = G^{T} B_{\phi}\Lambda\\[2pt]
G G^{T} B_{\phi} = G B_{\psi}\Lambda
\end{cases}
\;\Rightarrow\;
\begin{cases}
G G^{T}\left(G B_{\psi}\right) = \left(G B_{\psi}\right)\Lambda\Lambda\\[2pt]
G^{T} G\left(G^{T} B_{\phi}\right) = \left(G^{T} B_{\phi}\right)\Lambda\Lambda
\end{cases}\tag{A17}$$
In Equation (A17), $GB_{\psi}$ and $G^{T}B_{\phi}$ are the left and right singular matrices of $G$, and the Lagrange multipliers $\Lambda = \operatorname{diag}([\lambda_1, \ldots, \lambda_r])$ are the singular values of $G$. Compared to SVD, KSVD is able to explore asymmetric nonlinear dependence via the kernel trick. The SVD of the kernel matrix $G$ is formulated as $G = B_{\psi}\Lambda B_{\phi}^{T}$, where $B_{\psi} \in \mathbb{R}^{N\times r}$, $B_{\phi} \in \mathbb{R}^{M\times r}$, and $\Lambda \in \mathbb{R}^{r\times r}$.
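To make the shifted eigenvalue problem in Equation (A17) tangible, the snippet below constructs a stand-in kernel matrix, builds $B_{\phi}$ and $B_{\psi}$ from its truncated SVD, and checks both the orthogonality constraint of Equation (A16) and the shifted eigenvalue relations of Equation (A17). The random matrix $G$, the specific scaling by $\lambda^{-1/2}$, and the variable names are illustrative assumptions rather than the SNE/T kernels and code used in the paper.

```python
import torch

torch.manual_seed(0)
N, M, r = 8, 5, 3

# G stands in for the kernel matrix G_ij = kappa(x_i, z_j) between the N row
# samples and the M column samples of a data matrix A (illustrative random values).
G = torch.randn(N, M, dtype=torch.float64)

# KSVD solution sketch: build B_phi, B_psi from the truncated SVD of G so that
# the orthogonality constraint B_phi^T G B_psi = I_r of Equation (A16) holds.
U, S, Vh = torch.linalg.svd(G, full_matrices=False)
Lam = torch.diag(S[:r])                               # Lagrange multipliers / singular values
B_phi = U[:, :r] / S[:r].sqrt()                       # coefficients in the row-sample space
B_psi = Vh[:r].T / S[:r].sqrt()                       # coefficients in the column-sample space

print(torch.allclose(B_phi.T @ G @ B_psi, torch.eye(r, dtype=torch.float64)))  # constraint of (A16)

# Shifted eigenvalue checks of Equation (A17): G B_psi and G^T B_phi are (scaled)
# left and right singular matrices of G, with eigenvalues Lam^2 of G G^T and G^T G.
left, right = G @ B_psi, G.T @ B_phi
print(torch.allclose(G @ G.T @ left, left @ Lam**2))
print(torch.allclose(G.T @ G @ right, right @ Lam**2))
```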

Appendix D

PyTorch Geometric (PyG) efficiently creates graph data pipelines and accelerates graph convolutional network (GCN) training and inference using sparse matrix operations. The nn.MessagePassing class in PyG serves as the base class for GCNs built on the message passing neural network (MPNN) paradigm. Its member functions (i.e., message, edge_update, update, and forward) can be customized to implement message passing, edge feature updates, node feature updates, and graph convolution operations.
The functional connectivity of all the modules is shown in Figure A1. The structure-aware GCN and the adaptive ChebConv, both based on the MPNN paradigm, involve message passing, aggregation, and update operations for each node in the graph. Similar to GAT, the node PE update performs a biased aggregation of the PEs in the one-hop neighborhood to update each node's PE. In layers with $l > 0$, the update of each edge attribute considers the current edge attribute as well as the target and source node features. Because these operations affect every element of the graph, their base class is required to be PyG's nn.MessagePassing. In contrast, the global asymmetric attention kernel, the node feature update, and the filter coefficient generation are implemented through parallel tensor computation, so their base class is torch.nn.Module. Because the minted package in LaTeX does not allow listings to span across pages, we mainly focus on demonstrating the implementation of the modules using the PyTorch and PyG libraries in the forward inference.
Figure A1. The forward inference of the directed spectral graph transformer (DSGT) layer is implemented through the PyG and PyTorch libraries.
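The following is a minimal, self-contained sketch of an MPNN-style layer of the kind described above, built on PyG's MessagePassing base class. The class name StructureAwareConv, the specific message function, and the residual-style update are illustrative assumptions and do not reproduce the authors' exact DSGT modules.

```python
import torch
from torch import nn
from torch_geometric.nn import MessagePassing

class StructureAwareConv(MessagePassing):
    """Illustrative MPNN-style layer: messages are modulated by edge features,
    aggregated over the one-hop neighborhood, and combined with a residual term."""

    def __init__(self, in_dim: int, edge_dim: int, out_dim: int):
        super().__init__(aggr="add", flow="source_to_target")
        self.lin_node = nn.Linear(in_dim, out_dim)
        self.lin_edge = nn.Linear(edge_dim, out_dim)
        self.act = nn.ReLU()

    def forward(self, x, edge_index, edge_attr):
        # propagate() dispatches to message(), the chosen aggregation, and update().
        return self.propagate(edge_index, x=x, edge_attr=edge_attr)

    def message(self, x_j, edge_attr):
        # x_j: features of source nodes; each message is conditioned on its edge attribute.
        return self.act(self.lin_node(x_j) + self.lin_edge(edge_attr))

    def update(self, aggr_out, x):
        # Residual-style node update after neighborhood aggregation.
        return aggr_out + self.lin_node(x)

# Toy usage: 4 nodes, 5 directed edges with 3-dimensional edge attributes.
x = torch.randn(4, 8)
edge_index = torch.tensor([[0, 1, 2, 3, 0],
                           [1, 2, 3, 0, 2]])
edge_attr = torch.randn(5, 3)
layer = StructureAwareConv(in_dim=8, edge_dim=3, out_dim=8)
print(layer(x, edge_index, edge_attr).shape)  # torch.Size([4, 8])
```

Modules that operate on every node and edge follow this MessagePassing pattern, whereas globally acting components (e.g., the asymmetric attention kernel) can remain plain torch.nn.Module subclasses operating on dense tensors.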

References

  1. Zhang, G.; Li, D.; Gu, H.; Lu, T.; Shang, L.; Gu, N. Simulating News Recommendation Ecosystems for Insights and Implications. IEEE Trans. Comput. Soc. Syst. 2024, 11, 5699–5713. [Google Scholar] [CrossRef]
  2. Jain, S.; Hegade, P. E-commerce Product Recommendation Based on Product Specification and Similarity. In Proceedings of the 2021 International Conference on Innovation and Intelligence for Informatics, Computing, and Technologies (3ICT), Zallaq, Bahrain, 29–30 September 2021; pp. 620–625. [Google Scholar] [CrossRef]
  3. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral Networks and Locally Connected Networks on Graphs. arXiv 2013, arXiv:1312.6203. [Google Scholar] [CrossRef]
  4. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the NIPS’16, 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 3844–3852. [Google Scholar]
  5. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2016, arXiv:1609.02907. [Google Scholar] [CrossRef]
  6. Tong, Z.; Liang, Y.; Sun, C.; Rosenblum, D.S.; Lim, A. Directed Graph Convolutional Network. arXiv 2020, arXiv:2004.13970. [Google Scholar] [CrossRef]
  7. Zhang, X.; He, Y.; Brugnone, N.; Perlmutter, M.; Hirn, M. MagNet: A Neural Network for Directed Graphs. Adv. Neural Inf. Process. Syst. 2021, 34, 27003–27015. [Google Scholar]
  8. Ma, Y.; Hao, J.; Yang, Y.; Li, H.; Jin, J.; Chen, G. Spectral-based Graph Convolutional Network for Directed Graphs. arXiv 2019, arXiv:1907.08990. [Google Scholar] [CrossRef]
  9. Tong, Z.; Liang, Y.; Sun, C.; Li, X.; Rosenblum, D.; Lim, A. Digraph Inception Convolutional Networks. Adv. Neural Inf. Process. Syst. 2020, 33, 17907–17918. [Google Scholar]
  10. Koke, C.; Cremer, D. HoloNets: Spectral Convolutions do extend to Directed Graphs. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024; Available online: https://openreview.net/forum?id=EhmEwfavOW (accessed on 10 October 2024).
  11. Nguyen, K.; Nong, H.; Nguyen, V.; Ho, N.; Osher, S.; Nguyen, T. Revisiting over-smoothing and over-squashing using ollivier-ricci curvature. In Proceedings of the ICML’23, 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
  12. Akansha, S. Over-Squashing in Graph Neural Networks: A Comprehensive survey. arXiv 2023, arXiv:2308.15568. [Google Scholar]
  13. Topping, J.; Di Giovanni, F.; Chamberlain, B.P.; Dong, X.; Bronstein, M.M. Understanding over-squashing and bottlenecks on graphs via curvature. arXiv 2021, arXiv:2111.14522. [Google Scholar] [CrossRef]
  14. Dwivedi, V.P.; Bresson, X. A Generalization of Transformer Networks to Graphs. arXiv 2020, arXiv:2012.09699. [Google Scholar]
  15. Bastos, A.; Nadgeri, A.; Singh, K.; Kanezashi, H.; Suzumura, T.; Mulang’, I.O. Investigating Expressiveness of Transformer in Spectral Domain for Graphs. arXiv 2022, arXiv:2201.09332. [Google Scholar] [CrossRef]
  16. Kreuzer, D.; Beaini, D.; Hamilton, W.; Létourneau, V.; Tossou, P. Rethinking Graph Transformers with Spectral Attention. Adv. Neural Inf. Process. Syst. 2021, 34, 21618–21629. [Google Scholar]
  17. Dwivedi, V.P.; Luu, A.T.; Laurent, T.; Bengio, Y.; Bresson, X. Graph Neural Networks with Learnable Structural and Positional Representations. arXiv 2021, arXiv:2110.07875. [Google Scholar] [CrossRef]
  18. Chen, D.; O’Bray, L.; Borgwardt, K. Structure-Aware Transformer for Graph Representation Learning. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; Volume 162, pp. 3469–3489. [Google Scholar]
  19. Rampášek, L.; Galkin, M.; Dwivedi, V.P.; Luu, A.T.; Wolf, G.; Beaini, D. Recipe for a general, powerful, scalable graph transformer. In Proceedings of the NIPS ’22, 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  20. Bo, D.; Shi, C.; Wang, L.; Liao, R. Specformer: Spectral Graph Neural Networks Meet Transformers. arXiv 2023, arXiv:2303.01028. [Google Scholar] [CrossRef]
  21. Singh, R.; Chakraborty, A.; Manoj, B.S. Graph Fourier transform based on directed Laplacian. In Proceedings of the 2016 International Conference on Signal Processing and Communications (SPCOM), Bangalore, India, 12–15 June 2016; pp. 1–5. [Google Scholar] [CrossRef]
  22. Shafipour, R.; Khodabakhsh, A.; Mateos, G.; Nikolova, E. A Directed Graph Fourier Transform With Spread Frequency Components. IEEE Trans. Signal Process. 2019, 67, 946–960. [Google Scholar] [CrossRef]
  23. Leus, G.; Segarra, S.; Ribeiro, A.; Marques, A.G. The Dual Graph Shift Operator: Identifying the Support of the Frequency Domain. J. Fourier Anal. Appl. 2021, 27, 49. [Google Scholar] [CrossRef]
  24. Yun, S.; Jeong, M.; Kim, R.; Kang, J.; Kim, H.J. Graph transformer networks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  25. Maron, H.; Ben-Hamu, H.; Shamir, N.; Lipman, Y. Invariant and Equivariant Graph Networks. arXiv 2018, arXiv:1812.09902. [Google Scholar]
  26. Lim, D.; Robinson, J.; Zhao, L.; Smidt, T.E.; Sra, S.; Maron, H.; Jegelka, S. Sign and Basis Invariant Networks for Spectral Graph Representation Learning. arXiv 2022, arXiv:2202.13013. [Google Scholar]
  27. Ma, G.; Wang, Y.; Wang, Y. Laplacian Canonization: A Minimalist Approach to Sign and Basis Invariant Spectral Embedding. Adv. Neural Inf. Process. Syst. 2023, 36, 11296–11337. [Google Scholar]
  28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the NIPS’17, 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
  29. Nguyen, T.M.; Nguyen, T.; Ho, N.; Bertozzi, A.L.; Baraniuk, R.G.; Osher, S.J. A Primal-Dual Framework for Transformers and Neural Networks. In Proceedings of the Eleventh International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  30. Chen, Y.; Tao, Q.; Tonin, F.; Suykens, J.A. Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal Representation. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  31. Suykens, J.A. SVD revisited: A new variational principle, compatible feature maps and nonlinear extensions. Appl. Comput. Harmon. Anal. 2016, 40, 600–609. [Google Scholar] [CrossRef]
  32. Geisler, S.; Li, Y.; Mankowitz, D.; Cemgil, A.T.; Günnemann, S.; Paduraru, C. Transformers meet directed graphs. In Proceedings of the ICML’23, 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
  33. Beaini, D.; Passaro, S.; Letourneau, V.; Hamilton, W.L.; Corso, G.; Liò, P. Directional graph networks. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]
  34. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  35. Abadi, M. TensorFlow: Learning functions at scale. In Proceedings of the 21st ACM SIGPLAN International Conference on Functional Programming, Nara, Japan, 18–24 September 2016. [Google Scholar]
  36. Leon, S.J.; De Pillis, L.; De Pillis, L.G. Linear Algebra with Applications; Pearson Prentice Hall: Upper Saddle River, NJ, USA, 2006. [Google Scholar]
  37. Moret, I.; Novati, P. The computation of functions of matrices by truncated Faber series. Numer. Funct. Anal. Optim. 2001, 22, 1–18. [Google Scholar] [CrossRef]
  38. Cai, T.; Luo, S.; Xu, K.; He, D.; Liu, T.y.; Wang, L. Graphnorm: A principled approach to accelerating graph neural network training. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual, 18–24 July 2021; pp. 1204–1215. [Google Scholar]
  39. Maekawa, S.; Sasaki, Y.; Onizuka, M. A Simple and Scalable Graph Neural Network for Large Directed Graphs. arXiv 2023, arXiv:2306.08274. [Google Scholar]
  40. Hu, W.; Fey, M.; Zitnik, M.; Dong, Y.; Ren, H.; Liu, B.; Catasta, M.; Leskovec, J. Open Graph Benchmark: Datasets for Machine Learning on Graphs. Adv. Neural Inf. Process. Syst. 2020, 33, 22118–22133. [Google Scholar]
  41. Lim, D.; Hohne, F.; Li, X.; Huang, S.L.; Gupta, V.; Bhalerao, O.; Lim, S.N. Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods. Adv. Neural Inf. Process. Syst. 2021, 34, 20887–20902. [Google Scholar]
  42. Platonov, O.; Kuznedelev, D.; Diskin, M.; Babenko, A.; Prokhorenkova, L. A critical look at the evaluation of GNNs under heterophily: Are we really making progress? In Proceedings of the Eleventh International Conference on Learning Representations, Virtual Event, 25–29 April 2022.
  43. Fey, M.; Lenssen, J.E. Fast Graph Representation Learning with PyTorch Geometric. arXiv 2019, arXiv:1903.02428. [Google Scholar]
  44. Hussain, M.S.; Zaki, M.J.; Subramanian, D. Global Self-Attention as a Replacement for Graph Convolution. In Proceedings of the KDD ’22, 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 655–665. [Google Scholar] [CrossRef]
Figure 1. The architecture of DSGT with a priori knowledge. The preprocessing obtains a directed node PE and an initial edge embedding, and the forward inference outputs a multi-class joint probability distribution.
Table 1. Summary of existing methods for node-level representation learning in graphs. The table compares three mainstream architectures and our work, i.e., Directed GNNs, Directed GTs, hybrid architectures of GNN and GT, and DSGT (ours), for node-level representation learning in directed graphs across three key capacities: Directed Graph, Global Asymmetry Attention, and Approximate High-Order Filter. The symbol ✓ indicates that the model possesses the capability, whereas ✗ denotes the absence of this function.
Methods | Directed Graph | Global Asymmetry Attention | Approximate High-Order Filter
Directed GNNs
Directed GTs
Hybrid Architectures of GNN and GT
DSGT (our)
Table 2. Summary of datasets. The degree correlation coefficients for the Tolokers, ogbn-arxiv, and arxiv-year datasets are close to 0, indicating that structure-aware information contributes less to node classification than node attributes. The arxiv-year and squirrel datasets exhibit low node homophily, introducing significant noise during aggregation, which is a key factor in poor model performance. In contrast, the higher degree correlation coefficient of the genius dataset suggests that nodes of the same type may play more similar roles, compensating for this dataset's low node homophily.
Dataset | Cora | Squirrel | Genius | Tolokers | Ogbn-Arxiv | Arxiv-Year
Nodes | 19,793 | 5200 | 421,961 | 11,758 | 169,343 | 169,343
Edges | 126,842 | 217,065 | 984,979 | 1,038,000 | 1,166,243 | 1,166,243
Features | 8710 | 2089 | 12 | 10 | 128 | 128
Classes | 70 | 10 | 2 | 2 | 40 | 35
Directed Edge Ratio | 0.942 | 0.828 | 0.874 | 0.000 | 0.986 | 0.986
Degree Assortativity Coefficient | 0.011 | 0.374 | 0.477 | −0.080 | 0.014 | 0.014
Node Homophily | 0.363 | 0.089 | −0.107 | 0.634 | 0.428 | 0.145
Table 3. Mean and standard deviation of macro accuracy for test set. Bold: the best performing model for each dataset.
Dataset (Unit: %) | Cora | squirrel | genius | Tolokers | ogbn-arxiv | arxiv-year
DGCN [6] | 62.87 ± 1.36 | 41.64 ± 0.96 | 67.04 ± 2.26 | 70.79 ± 0.49 | 63.01 ± 1.08 | 44.51 ± 0.16
DiGCN [9] | 61.19 ± 0.97 | 47.74 ± 1.54 | 69.56 ± 3.14 | 71.63 ± 0.36 | 63.36 ± 2.06 | 49.09 ± 0.10
Holonet [10] | 76.46 ± 1.07 | 53.71 ± 1.92 | 72.10 ± 3.15 | 74.69 ± 0.36 | 73.09 ± 0.64 | 46.28 ± 0.29
MagNet [7] | 71.77 ± 6.18 | 41.01 ± 1.93 | 80.30 ± 2.07 | 74.33 ± 0.38 | 76.63 ± 0.06 | 51.78 ± 0.26
MagLapNet-I [32] | 77.00 ± 4.96 | 46.06 ± 2.64 | 83.42 ± 1.37 | 75.09 ± 0.45 | 79.92 ± 0.56 | 56.06 ± 1.34
MagLapNet-II [18] | 73.14 ± 3.31 | 49.14 ± 3.66 | 77.94 ± 0.49 | 78.47 ± 0.27 | 76.35 ± 0.19 | 52.17 ± 0.26
GT-WL [14] | 64.38 ± 2.22 | 50.31 ± 1.19 | 68.81 ± 3.27 | 70.77 ± 1.35 | 66.81 ± 1.47 | 47.15 ± 2.39
DSGT (our) | 79.96 ± 3.67 | 50.36 ± 0.25 | 90.05 ± 0.31 | 84.53 ± 0.55 | 80.33 ± 1.19 | 54.62 ± 1.01
Table 4. Mean and standard deviation of the macro accuracy for the test set (unit: %). The dimension of the node PE is 18. Specifically, the SVD PE of a node is composed of the first $D_p/2$ left and right singular vectors. The hashing encoding is an integer and leverages a trainable 1D embedding vector to obtain the $D_p$-dimensional WL PE. Magnetic LapPE-ABS/REAL and Undirected LapPE come from the top $D_p$ spectral features of the directed GLM. Directed PE (ours) consists of the first $D_p$ column vectors of the feasible solution to the constraint-preserved optimization problem. The bold represents the highest average classification accuracy under each dataset.
Dataset (Unit: %) | Cora | Squirrel | Genius | Tolokers | Ogbn-Arxiv | Arxiv-Year
Magnetic LapPE-ABS [32] | 74.67 ± 0.73 | 46.53 ± 3.63 | 82.08 ± 1.26 | 77.70 ± 5.09 | 74.58 ± 4.01 | 55.13 ± 1.37
Magnetic LapPE-REAL [32] | 73.62 ± 3.01 | 44.07 ± 2.26 | 79.22 ± 1.11 | 77.70 ± 5.09 | 70.44 ± 2.01 | 49.04 ± 0.99
WL PE [14] | 66.59 ± 1.72 | 42.11 ± 1.47 | 77.05 ± 2.06 | 71.55 ± 1.47 | 66.03 ± 3.09 | 44.16 ± 3.25
SVD PE [44] | 78.02 ± 2.35 | 46.96 ± 3.93 | 80.90 ± 5.49 | 81.73 ± 1.59 | 83.34 ± 2.76 | 54.21 ± 2.11
Undirected LapPE [14] | 73.91 ± 1.43 | 43.06 ± 4.14 | 79.85 ± 3.29 | 80.53 ± 0.87 | 75.93 ± 1.26 | 47.14 ± 4.27
Directed PE (ours) [22] | 79.96 ± 3.67 | 50.36 ± 0.25 | 90.05 ± 0.31 | 84.53 ± 0.55 | 80.33 ± 1.19 | 54.62 ± 1.01
Table 5. Mean and standard deviation of the macro accuracy for the test set. DSGT-I: PrimalAttention [30] is replaced with Full GT [14]; DSGT-II: there is no structural encoding [18] before the attention mechanism; DSGT-III: there is no node PE [22], and DGN [33] is replaced with ChebConv [43]; DSGT-V: there is no initial edge encoding; DSGT-IV: adaptive filter coefficient generation [15] is replaced with trainable parameters in directed GCN [10]. The bold represents the highest average classification accuracy under this dataset.
Dataset (Unit: %) | Cora | Squirrel | Genius | Tolokers | Ogbn-Arxiv | Arxiv-Year
DSGT-I | 76.93 ± 2.16 | 49.91 ± 0.63 | 86.15 ± 1.93 | 83.55 ± 0.53 | 78.81 ± 2.08 | 53.33 ± 2.10
DSGT-II | 77.87 ± 2.15 | 48.07 ± 3.21 | 83.38 ± 0.76 | 79.96 ± 6.33 | 77.38 ± 0.94 | 55.40 ± 0.96
DSGT-III | 73.00 ± 3.01 | 40.61 ± 1.99 | 77.01 ± 1.24 | 78.18 ± 3.92 | 66.71 ± 1.08 | 46.06 ± 3.87
DSGT-V | 78.67 ± 1.19 | 49.12 ± 1.21 | 85.74 ± 0.75 | 81.96 ± 0.67 | 75.11 ± 1.64 | 53.00 ± 3.33
DSGT-IV | 75.14 ± 1.68 | 47.65 ± 1.01 | 87.41 ± 2.36 | 80.52 ± 0.31 | 78.89 ± 2.21 | 51.31 ± 4.07
DSGT | 79.96 ± 3.67 | 50.36 ± 0.25 | 90.05 ± 0.31 | 84.53 ± 0.55 | 80.33 ± 1.19 | 54.62 ± 1.01
Table 6. The impact of filter order on model classification performance was studied using the Cora dataset. The dimensions of node PE and initial edge embedding were D p = 18 and D e = 108 , respectively. The bold represents the best average classification accuracy achieved by the model under different hyperparameter configurations.
Filter Order $K$ | 1 | 2 | 3 | 4 | 5 | 6
Magnetic LapPE-ABS | 66.15 ± 4.86 | 71.78 ± 2.33 | 75.51 ± 2.31 | 74.67 ± 1.73 | 74.07 ± 2.69 | 72.91 ± 1.03
Magnetic LapPE-REAL | 66.05 ± 1.19 | 63.37 ± 5.52 | 70.39 ± 4.17 | 70.96 ± 3.38 | 73.62 ± 3.01 | 72.91 ± 3.43
SVD PE | 68.55 ± 3.50 | 72.18 ± 4.04 | 76.01 ± 2.54 | 78.02 ± 2.35 | 77.96 ± 2.01 | 77.48 ± 2.58
Undirected LapPE | 54.18 ± 3.73 | 60.26 ± 2.96 | 67.87 ± 2.24 | 70.71 ± 2.09 | 70.35 ± 2.05 | 70.91 ± 2.06
Directed PE (our) | 69.09 ± 2.93 | 74.91 ± 2.77 | 76.60 ± 2.58 | 79.96 ± 3.67 | 78.81 ± 1.34 | 77.70 ± 1.35
Table 7. Mean and standard deviation of the macro accuracy and computational complexity for the test set. The effect of the dimension of the node PE on the average classification accuracy for the Cora dataset when the order of the filter and the dimension of the initial edge embedding were $K = 4$ and $D_e = 108$, respectively. Considering that SVD PE derives from both left and right singular vectors, the dimension of the node PE was kept even and was increased in steps of two singular-vector pairs. The bold represents the best average classification accuracy achieved by the model under different hyperparameter configurations.
Dimension of Node PE $D_p$ | 10 | 14 | 18 | 22 | 26 | 30
Total Parameter Number (Unit: M) | 4.9 | 5.2 | 5.4 | 5.8 | 6.3 | 6.7
Magnetic LapPE-ABS | 73.43 ± 1.11 | 73.73 ± 2.09 | 74.67 ± 1.73 | 74.16 ± 1.27 | 75.44 ± 1.94 | 75.18 ± 1.68
Magnetic LapPE-REAL | 68.99 ± 1.91 | 69.69 ± 2.73 | 70.96 ± 3.38 | 71.02 ± 2.06 | 70.03 ± 1.03 | 71.22 ± 2.57
SVD PE | 74.44 ± 2.11 | 75.68 ± 1.52 | 78.02 ± 2.35 | 78.00 ± 1.78 | 78.64 ± 1.91 | 78.03 ± 2.08
Undirected LapPE | 62.11 ± 1.77 | 66.63 ± 1.73 | 70.71 ± 2.09 | 71.96 ± 2.23 | 72.02 ± 1.99 | 73.36 ± 1.64
Directed PE (our) | 77.21 ± 1.62 | 77.59 ± 2.37 | 79.96 ± 3.67 | 79.93 ± 2.45 | 80.00 ± 1.86 | 80.32 ± 3.69
Table 8. Mean and standard deviation of the macro accuracy and computational complexity for the test set. The difference in the graph Fourier basis can yield a directional field matrix. By employing three different aggregators and scalers (Appendix A of [33]), directional smoothing and derivative matrices with a feature dimension of 9 can be obtained; therefore, each graph Fourier basis generates an 18-dimensional feature vector. Here, $N$ is the number of graph Fourier bases and $D_e$ denotes the dimension of the initial edge embedding, which is a multiple of 18. The bold represents the best average classification accuracy achieved by the model under different hyperparameter configurations.
Dimension of Initial Edge Embedding $D_e$ | 54 | 72 | 90 | 108 | 126 | 144
Number of Graph Fourier Bases $N$ | 3 | 4 | 5 | 6 | 7 | 8
Total Parameter Number (Unit: M) | 4.7 | 5.0 | 5.2 | 5.4 | 5.6 | 5.8
Cora | 79.24 ± 2.44 | 79.31 ± 3.99 | 79.22 ± 2.63 | 79.96 ± 3.67 | 80.00 ± 2.76 | 79.01 ± 3.03
squirrel | 49.77 ± 1.62 | 49.86 ± 1.23 | 50.11 ± 0.97 | 50.36 ± 0.25 | 50.54 ± 0.61 | 50.44 ± 1.08
genius | 87.53 ± 1.19 | 88.74 ± 0.71 | 89.86 ± 1.10 | 90.05 ± 0.31 | 90.23 ± 0.11 | 90.34 ± 0.23
Tolokers | 83.21 ± 1.00 | 83.00 ± 0.82 | 84.86 ± 1.30 | 84.53 ± 0.55 | 84.07 ± 1.24 | 84.41 ± 0.83
ogbn-arxiv | 77.11 ± 1.20 | 79.23 ± 0.99 | 79.51 ± 1.15 | 80.33 ± 1.19 | 80.34 ± 1.34 | 80.96 ± 0.74
arxiv-year | 53.87 ± 1.37 | 54.16 ± 1.18 | 54.01 ± 1.71 | 54.62 ± 1.01 | 54.44 ± 2.09 | 53.37 ± 1.63
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
