Article

Geometric Graph Learning Network for Node Classification

1 Jilin Gaofen Remote Sensing Application Institute Co., Ltd., Changchun 130012, China
2 College of Geo-Exploration Science and Technology, Jilin University, Changchun 130026, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(3), 696; https://doi.org/10.3390/electronics15030696
Submission received: 28 December 2025 / Revised: 3 February 2026 / Accepted: 4 February 2026 / Published: 5 February 2026

Abstract

Graph attention improves neighbor discrimination, but it remains limited by local receptive fields and by a strong dependence on the input topology, which is often unreliable on heterophilous graphs. We propose the Geometric Graph Learning Network (G2LNet), a structure-learning framework that infers message-passing probabilities from an explicit geometric topology learned in latent Euclidean or hyperbolic spaces. G2LNet combines (i) a geometric mapping module, (ii) distance- or inner-product-based relation operators with perceptual connectivity to control the influence of the given graph, and (iii) end-to-end constraint objectives enforcing stability, sparsity, and (optional) symmetry of the learned topology. This design yields unified local, non-local, and graph-free neighborhoods, enabling systematic analysis of when non-local aggregation helps. Experiments on node classification across nine publicly available benchmark datasets demonstrate that G2LNet’s controlled variant consistently achieves higher accuracy than representative strong baselines, both local and non-local, on most datasets. This establishes a robust alternative for smaller-scale node classification tasks.

1. Introduction

Graph neural networks (GNNs) extend deep learning to graph-structured data by propagating and aggregating information along edges, enabling representation learning on non-Euclidean domains [1,2,3]. They have achieved strong performance in diverse applications, including social networks [4], bioinformatics [5], and recommender systems [6]. Existing GNNs are commonly categorized into spatial and spectral approaches. Spatial methods define neighborhood-wise aggregation rules [4,7,8,9,10], whereas spectral methods derive convolution operators from graph signal processing and Laplacian spectra [3,11,12,13]. Both paradigms can be unified under the message-passing neural network (MPNN) framework [14], where each layer updates node representations through permutation-invariant aggregation of neighborhood information.
Semi-supervised node classification is one of the most representative and practically important tasks for GNNs, where only a small subset of nodes is labeled and the goal is to infer labels for the remaining nodes by jointly exploiting node attributes and relational structure. In this setting, the graph provides an inductive bias for learning: neighborhood aggregation serves as a mechanism to propagate supervisory signals from labeled nodes to unlabeled ones, which can significantly improve generalization when labels are scarce. However, this advantage is inherently conditional on the quality of the relational structure: if observed edges are noisy, missing, or weakly aligned with class semantics, message passing may propagate misleading information and degrade class separability, especially under limited supervision [15,16]. Therefore, semi-supervised node classification offers a stringent testbed for studying the robustness of message passing under diverse graph structural regimes.
However, current semi-supervised node classification methods still face two core limitations. First, many graph learning networks rely heavily on a given topology, or only reweight an existing graph, and the learned edge weights may inherit biases from the input adjacency [9,10,17,18,19,20,21]. This issue becomes pronounced on heterophilous graphs, where connected nodes often carry different labels and observed edges are a weak proxy for semantic similarity. Second, local shared filters provide limited access to long-range yet semantically related nodes [22,23]. Although deeper and wider architectures have been explored [24,25,26,27], their effectiveness is often constrained by oversmoothing and by the difficulty of separating heterogeneous information as depth increases. Recent advances in graph homophily enhancement and non-local aggregation [28,29,30,31] improve long-range dependency modeling, but they typically depend on predefined structures or attention mechanisms under restricted spatial assumptions, leaving the geometry of node signals and the design space of relation operators insufficiently explored.
These two limitations are tightly coupled in practice. When the observed adjacency is unreliable (a common situation on heterophilous graphs), local propagation can introduce label-inconsistent messages; in such cases, merely reweighting observed edges may still be constrained by the same biased candidate set. Conversely, when one attempts to remedy locality by enlarging receptive fields, naive depth expansion may amplify oversmoothing and oversquashing effects, making it difficult to preserve discriminative information while aggregating broader contexts. This suggests that robust semi-supervised learning requires not only better aggregation operators, but also a principled way to control which node pairs communicate and how strongly they do so, under both homophilous and heterophilous regimes.
Motivated by the view that node signals can be regarded as samples from a latent geometric space, we ask whether message-passing probabilities can be learned from geometric relations among node representations rather than being tightly coupled to the observed adjacency. In particular, can mapping node signals into appropriate spaces mitigate correlation distortions induced by local graph convolution approximations, and can such relations induce a more informative topology for aggregation?
From this perspective, the observed graph can be treated as an imperfect measurement of latent relations, while node representations provide an alternative source of evidence for constructing communication patterns. If node signals admit an underlying geometric organization, then distances or inner products in an appropriate space can serve as a proxy for dependency strength and yield a topology that is more aligned with task semantics. Moreover, allowing multiple geometries provides additional flexibility: Euclidean spaces are naturally suited for locally smooth structures, while hyperbolic spaces can better represent uneven, hierarchical, or tree-like organizations that may arise in relational data. This motivates a unified framework that learns a geometry-aware topology from representations, and explicitly regulates the influence of the given adjacency rather than assuming it is fully reliable.
To address these questions, we propose a geometric graph learning paradigm that takes pairs of node signals, potentially after composite mappings across spaces, as input and outputs learnable communication probabilities. The paradigm supports three neighborhood regimes to compare local and non-local attention under both homophily and heterophily: (i) graph-free neighborhoods learned without relying on the input graph, (ii) local neighborhoods restricted to observed edges, and (iii) non-local neighborhoods that extend beyond the observed adjacency.
Based on this paradigm, we develop the Geometric Graph Learning Network (G2LNet), a graph learning architecture compatible with MPNNs. G2LNet learns permutation-invariant node representations and infers geometric topologies via relation operators defined in Euclidean and hyperbolic spaces, together with aggregation schemes and constraint functions tailored to each neighborhood regime. We evaluate G2LNet on nine public benchmarks under multiple standard splits, including three homophilous citation networks and six heterophilous web and actor graphs. Experimental results demonstrate that the controlled variant of G2LNet consistently achieves higher node classification accuracy than representative local and non-local baselines, confirming that geometric topology inference provides an effective pathway for robust message passing across graph mechanisms. Our main contributions include:
  • We propose G2LNet, a geometric graph learning framework that infers message-passing probabilities from geometric relations, reducing reliance on the observed adjacency.
  • We unify three neighborhood regimes (graph-free, local, and non-local) within one architecture, enabling a controlled comparison of aggregation behaviors under homophily and heterophily.
  • We design Euclidean and hyperbolic relation operators (distance and inner-product) with a perceptual connectivity mechanism to regulate the influence of the input graph.
  • We introduce an end-to-end constraint objective to achieve stable structural learning. Experimental results demonstrate that the proposed architecture achieves superior classification performance to strong baseline models on most datasets.

2. Related Work

This section reviews the semi-supervised node classification setting, hyperbolic neural networks, and representative graph learning networks, and clarifies the distinctions between prior work and G2LNet.

2.1. Semi-Supervised Node Classification

We consider a graph $G = (V, E, X)$, where $V$ and $E$ denote the node and edge sets, and $X \in \mathbb{R}^{|V| \times d}$ denotes node features. The adjacency matrix is $A \in \mathbb{R}^{|V| \times |V|}$, where $A_{ij} = 1$ if $(V_i, V_j) \in E$ and $A_{ij} = 0$ otherwise for unweighted graphs. The one-hop neighborhood of node $V_i$ is $N_i = \{ V_j \in V : (V_i, V_j) \in E \}$. Given labels on a subset of nodes, the goal is to learn a classifier $f : V \to Y$ for all nodes. Many GNNs can be viewed under the message passing neural network (MPNN) formulation [14], where node representations are updated by permutation-invariant aggregation of neighborhood messages across multiple rounds.

2.2. Hyperbolic Neural Networks

Beyond Euclidean embeddings, hyperbolic neural networks map representations to non-Euclidean manifolds to better capture hierarchical and tree-like structures [32,33,34]. In this work, we adopt the $d$-dimensional Poincaré ball model with curvature $c$ ($c > 0$),
$$\mathbb{B}_c^d = \{ x \in \mathbb{R}^d : c \|x\|_2^2 < 1 \},$$
equipped with the Riemannian metric
$$g_x^{\mathbb{B}_c^d} = (\lambda_x^c)^2 \, g^{\mathbb{E}^d}, \qquad \lambda_x^c = \frac{2}{1 - c \|x\|_2^2},$$
where $g^{\mathbb{E}^d} = I_d$ is the Euclidean metric tensor. The corresponding geodesic distance is
$$d_{\mathbb{B}_c}(x, y) = \frac{1}{\sqrt{c}} \cosh^{-1}\!\left( 1 + \frac{2 c \|x - y\|_2^2}{(1 - c \|x\|_2^2)(1 - c \|y\|_2^2)} \right).$$
Hyperbolic space is not a vector space; therefore, linear operations are typically performed in tangent spaces. For a point $p \in \mathbb{B}_c^d$, the tangent space $T_p \mathbb{B}_c^d$ is a Euclidean vector space that locally approximates the manifold. Hyperbolic multilayer perceptrons (HMLPs) [34] implement linear transformations in tangent space via logarithmic/exponential maps and then map the results back to the manifold. Concretely, for an input $x^{\mathbb{B}} \in \mathbb{B}_c^d$, a typical HMLP layer can be written as
$$y^{\mathbb{B}} = \sigma^{\otimes_c}\big( (W \otimes_c x^{\mathbb{B}}) \oplus_c b^{\mathbb{B}} \big),$$
where $\otimes_c$ and $\oplus_c$ denote Möbius scalar (matrix) multiplication and Möbius addition, respectively. These operations are realized by mapping $x^{\mathbb{B}}$ to a tangent space using $\log_p^c(\cdot)$, applying the corresponding Euclidean operation, and projecting back to $\mathbb{B}_c^d$ using $\exp_p^c(\cdot)$. Likewise, the pointwise nonlinear activation is computed in the tangent space, which can be simply realized in the Poincaré ball model using the tangential method. The remaining formulas and their symbols are explained in Table A1 of Appendix A.
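As a concrete reference for the maps used above, the following NumPy sketch implements the exponential and logarithmic maps at the origin and the geodesic distance on the Poincaré ball. The function names are illustrative, not taken from any released code.

```python
import numpy as np

def exp0(v, c):
    """Exponential map at the origin of the Poincare ball with curvature c > 0."""
    sqrt_c = np.sqrt(c)
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return np.zeros_like(v)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def log0(x, c):
    """Logarithmic map at the origin (inverse of exp0)."""
    sqrt_c = np.sqrt(c)
    norm = np.linalg.norm(x)
    if norm < 1e-12:
        return np.zeros_like(x)
    return np.arctanh(sqrt_c * norm) * x / (sqrt_c * norm)

def poincare_dist(x, y, c):
    """Geodesic distance on the Poincare ball B_c^d."""
    diff = np.linalg.norm(x - y) ** 2
    denom = (1 - c * np.linalg.norm(x) ** 2) * (1 - c * np.linalg.norm(y) ** 2)
    return (1 / np.sqrt(c)) * np.arccosh(1 + 2 * c * diff / denom)
```

Since `log0` inverts `exp0`, a tangent-space round trip recovers the input vector, which is the property the HMLP construction relies on.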

2.3. Graph Learning Networks

Graph learning networks improve representation learning by adapting edge weights or neighborhood structure during training. Early work learns task-driven graphs from dense parameterizations [35]. Existing methods are broadly grouped into (i) edge reweighting on observed neighborhoods and (ii) regularized topology refinement. In the first group, GAT [10] reweights observed edges via attention, AGCN [17] learns affinities through parameterized distances, and GLCN [18] and its variants [19] replace hand-crafted kernels with learned pairwise similarities, but are largely confined to local neighborhoods. In the second group, GLNN [20] refines an initialized adjacency using Laplacian regularization [36] with sparsity and symmetry constraints, yet remains sensitive to the quality and homophily bias of the input graph. To capture long-range dependencies, non-local connectivity has been explored. Geom-GCN [28] builds non-local neighborhoods in latent spaces via distance-based grouping, but depends on thresholding and distance-only relations. NLGNN [29] models non-local dependencies in Euclidean space and performs well on heterophilous graphs, but uses variants tailored to different graph regimes.
In this work, we derive message-passing probabilities from explicit geometric relationships and seamlessly support three modeling paradigms (graph-free structures, local neighborhoods, and non-local neighborhoods) within the same framework. Unlike Geom-GCN [28], which employs fixed message probabilities, our approach introduces distance- and inner-product-based metric operators to adaptively learn message probabilities between nodes across different geometric spaces, effectively mitigating the model’s excessive reliance on the input graph structure. Furthermore, unlike non-local methods such as NLGNN [29], which often neglect inherent patterns of global interaction, we explicitly regulate the influence of the input graph through a learnable connectivity-perception mechanism and structural constraints, adaptively integrating structural information at both global and local levels.

3. Materials and Methods

This section presents the proposed Geometric Graph Learning Network (G2LNet). As illustrated in Figure 1, each layer consists of four components: (i) node mapping, (ii) geometric relation measure, (iii) neighborhood aggregation, and (iv) constraint regularization. The key idea is to infer a geometric topology, represented by learnable communication probabilities, from explicit relations between node representations in a chosen geometric space, and then perform message passing on the inferred neighborhoods. We denote the geometric space by the superscript $\mathbb{V}$ and the network layer by $\ell$. Given node features $\{x_i^\ell\}_{i=1}^{|V|}$ at layer $\ell$, G2LNet updates features in two coupled steps:
(1) Graph learning. In the graph learning phase, the graph learning function $\mathcal{P}$ computes the message-passing probability $P_{ij}^{\ell+1,\mathbb{V}}$ at the $(\ell+1)$-th layer, in the geometric space $\mathbb{V}$, from the feature signals of the central node $V_i$ and the neighboring node $V_j$. This probability is primarily determined by the metric function $\mathcal{R}$ and the constraint function $\mathcal{L}$, defined as follows:
$$S_{ij}^{\ell+1,\mathbb{V}} = \mathcal{R}\big( \psi(x_i^\ell), \psi(x_j^\ell) \big) + \alpha P_{ij}^{\ell,\mathbb{V}}, \qquad P_{ij}^{\ell+1,\mathbb{V}} = \mathrm{Softmax}\big( S_{ij}^{\ell+1,\mathbb{V}} \big), \qquad \min_{\Theta} \mathcal{L} = \mathcal{L}_{\text{task}} + \mathcal{L}_{G2LNet}^{\ell+1}$$
Extending the above equations to the entire graph yields a geometrically structured neighborhood $\bar{N}_i^\ell = \{ V_j \mid V_j \in V : (V_i, V_j) \in P^{\ell,\mathbb{V}} \}$, where $P_{ij}^{\ell,\mathbb{V}} \in P^{\ell,\mathbb{V}}$, with the initial condition $S^{0,\mathbb{V}} = S$, where $S$ can be either the identity matrix $I$ or the adjacency matrix $A$ of the graph. When $S = I$, the learning function $\mathcal{P}$ captures geometric dependencies from a fully graph-free initialization. The mapping function $\psi$ projects the node signals into a new feature space. The geometric relation function $\mathcal{R}$ combines the projected features and the previous message-passing probabilities to update the passing relations between nodes, while the term $\alpha P_{ij}^{\ell,\mathbb{V}}$ preserves the selectivity of the prior topology (Perceptual Connectivity in Figure 1). The constraint function $\mathcal{L}$ regularizes the update of the geometrically structured neighborhood.
(2) Geometric neighborhood aggregation. In this phase, the central node aggregates, in a weighted manner, neighbor information from the geometrically structured neighborhood $\bar{N}_i$ quantified by the graph learning function $\mathcal{P}$, via the neighborhood aggregation function $\phi$, to update its own feature representation:
$$x_i^{\ell+1} = \big\{ x_i^\ell, \phi\big( x_i^\ell, x_j^\ell, P_{ij}^{\ell+1,\mathbb{V}} \big) \big\}, \quad j \in \{ \bar{N}_i^{\ell+1}, N_i \}, \qquad \phi\big( x_i^\ell, x_j^\ell, P_{ij}^{\ell+1,\mathbb{V}} \big) = \mathrm{AGG}\big\{ P_{ij}^{\ell+1,\mathbb{V}} \, \psi(x_j^\ell) \big\} + \psi(x_i^\ell)$$
where $N_i = \{ V_j \mid V_j \in V : (V_i, V_j) \in E \}$, $\mathrm{AGG}$ is a permutation-invariant aggregation function such as summation, mean, or maximum, and $\psi$ is a single MLP layer. The term $\psi(x_i^\ell)$ is the Center Correction Unit in Figure 1.
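To make the two coupled steps concrete, the following NumPy sketch implements one illustrative Euclidean G2LNet-style layer under simplifying assumptions: dense communication probabilities, sum aggregation with the center correction, and a pluggable relation operator. All names and defaults here are hypothetical, not the authors' implementation.

```python
import numpy as np

def g2l_layer(X, W, P_prev, relation, alpha=0.5):
    """One illustrative G2LNet-style layer in Euclidean space.

    X: (n, d) node features; W: (d, f) shared embedding weights (psi);
    P_prev: (n, n) previous communication probabilities;
    relation: callable mapping embeddings (n, f) -> pairwise scores (n, n).
    """
    H = np.maximum(X @ W, 0.0)             # psi: shared single-layer MLP + ReLU
    S = relation(H) + alpha * P_prev       # metric score + perceptual connectivity
    P = np.exp(S - S.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)      # row-wise softmax -> probabilities
    X_new = P @ H + H                      # sum aggregation + center correction
    return X_new, P
```

Initializing `P_prev` from the identity matrix corresponds to the graph-free regime, while initializing it from the (row-normalized) adjacency corresponds to the local regime.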

3.1. Node Mapping

Classical neural networks have shown significant advantages in dealing with smooth, localized, and combinatorial problems in continuous spaces [37]. Through local receptive fields, parameter sharing, and hierarchical feature extraction, these networks can efficiently learn discriminant functions over abstract features. However, given the local smoothness of the graph signal, its non-Euclidean properties and topological complexity, and the frequency-domain approximation of the graph convolution operator in Euclidean space, the local correlations and global patterns of the graph signal may become distorted. Therefore, to better capture the underlying local structure and patterns of the node signals $x \in \mathbb{R}^d$ on the graph, we construct a permutation-equivariant geometric mapping function $\psi$, composed of the projection function $\psi_\nu$ and the embedding function $\psi_z$, given by the following equations:
$$x^Z = \psi(x) = \psi_z(\psi_\nu(x)),$$
Space Projection. The projection function $\psi_\nu$ maps the graph signal to the vector space $\mathbb{V}$, i.e.:
$$\psi_\nu(x) = \{ \psi_\nu : x \mapsto x^{\mathbb{V}},\ \mathbb{V} \in \{ \mathbb{R}^d, \mathbb{B}^d \} \}$$
When a node signal on the graph is projected into hyperbolic space, i.e., $\mathbb{V} \equiv \mathbb{B}^d$, we use the exponential map to project the feature $x_i \in \mathbb{R}^d$ from Euclidean space to the Poincaré ball model $\mathbb{B}$:
$$x_i^{\ell,\mathbb{B}} = \exp_o^c(x_i^\ell)$$
where $c^\ell$ is the curvature of the $\ell$-th layer, $o$ is the origin, and $x_i^\ell \in T_o \mathbb{B}_c^d$ is interpreted as a node located in the tangent space at the origin.
Feature Embedding. Next, we use the embedding function $\psi_z$ to reduce the dimensionality of the node signals $x_i^{\mathbb{V}}$ in this space and to improve the model’s ability to adapt to nonlinear structures. This embedding function is denoted as:
$$\psi_z(x^{\mathbb{V}}) = \{ \psi_z : x^{\mathbb{V}} \mapsto x^Z,\ Z \in \mathbb{V}^f \}$$
Usually the embedding function $\psi_z$ is a single simple MLP layer [10,18], which can be expressed as $x_i^{\ell,Z} = \mathrm{MLP}(x_i^{\ell,\mathbb{V}})$, where $x_i^{\ell,Z} \in \mathbb{V}^f$; the MLP is shared among all nodes in the graph, and only the nodes themselves are considered, without aggregating information from the neighborhood. In particular, when $\mathbb{V} \equiv \mathbb{B}^d$, the $\mathrm{MLP}(\cdot)$ should be carried out in hyperbolic space, i.e., a single HMLP embedding layer is denoted as:
$$x_i^{\ell,Z} = \sigma^{\otimes_{c_{\ell-1}, c_\ell}}\big( (W^\ell \otimes_{c_{\ell-1}} x_i^{\ell-1,\mathbb{B}}) \oplus_{c_{\ell-1}} b^{\ell,\mathbb{B}} \big)$$
Here, $W^\ell \in \mathbb{R}^{f \times d}$ and $x_i^{\ell,Z} \in \mathbb{B}^f$; the pointwise computation of the nonlinear activation can be simply realized in the Poincaré ball model using the tangential method:
$$\sigma^{\otimes_{c_{\ell-1}, c_\ell}}\big( x^{\ell,\mathbb{B}} \big) = \exp_o^{c_\ell}\big( \sigma( \log_o^{c_{\ell-1}}( x^{\ell,\mathbb{B}} ) ) \big)$$
Compared with the traditional MLP, the advantages of HMLP mainly come from the unique geometric properties of the hyperbolic space.

3.2. Geometric Relation Measure

We first empirically define the “potential geometric dependency between nodes” as a dependency that increases with the similarity between nodes. For in-depth analysis, the geometric metric function $\mathcal{R}$ is then divided into two types: distance-based $\mathcal{R}_d$ and inner-product-based $\mathcal{R}_I$, using Gaussian and linear kernels, respectively.
Assuming that the node signals $x_i$ and $x_j$ have been embedded into the corresponding geometric space by the geometric mapping function $\psi$ in Equation (7), the graphs are then computed using the geometric relation function $\mathcal{R}$. To avoid over-independence of the graphs at each layer and the influence of heterophilous graphs, we establish perceptual connectivity for the graphs at each layer:
$$P_{ij}^{\ell+1,\mathbb{V}} = \mathrm{Softmax}\big( \mathcal{R}( x_i^{\ell,Z}, x_j^{\ell,Z} ) + \alpha P_{ij}^{\ell,\mathbb{V}} \big)$$
where $\mathbb{V} \in \{ \mathbb{R}, \mathbb{B} \}$, $\mathcal{R} \in \{ \mathcal{R}_d, \mathcal{R}_I \}$, and $\alpha$ is the perception factor, which serves as a hyperparameter in the network. Note that while the embedding function can still be a GCN network [29], it is not suitable for deeper GCN networks: its limitation lies in the inability to learn affine functions capable of distinguishing features of nodes belonging to different categories (see Proof A2 in Appendix B for details).
Distance metrics. Distance metrics are crucial in quantifying the similarity of data points. The kernel function, as an efficient similarity metric, possesses smoothness and locality, and its value decreases with increasing distance between nodes, reflecting the tendency of similarity to decrease with increasing distance. Specifically, the distance metric based on the kernel function can be defined as:
$$\mathcal{R}_d( x_i^{\ell,Z}, x_j^{\ell,Z} ) = \exp\!\left( -\frac{ d( x_i^{\ell,Z}, x_j^{\ell,Z} ) }{ \delta } \right)$$
where $d( x_i^{\ell,Z}, x_j^{\ell,Z} ) = \| x_i^{\ell,Z} - x_j^{\ell,Z} \|_2^2$; note that $d( x_i^{\ell,Z}, x_j^{\ell,Z} ) = d_{\mathbb{B}_c}( x_i^{\ell,\mathbb{B}}, x_j^{\ell,\mathbb{B}} )^2$ when $Z \in \mathbb{B}^f$. The bandwidth $\delta$ determines the rate at which the kernel decays with distance: smaller $\delta$ makes the effect of distance on similarity more significant, while larger $\delta$ makes it relatively weaker. In this work, we take $\delta = 1$ in order to contrast it with the shortest path in hyperbolic space.
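The distance relation above can be sketched as a small NumPy routine computing the Gaussian kernel over all pairwise squared Euclidean distances (the function name is illustrative; the hyperbolic variant would substitute squared geodesic distances):

```python
import numpy as np

def relation_distance(Z, delta=1.0):
    """Distance relation R_d: Gaussian kernel on pairwise squared distances.

    Z: (n, f) embedded node signals; returns an (n, n) similarity matrix
    whose entries decay from 1 (identical points) toward 0 with distance."""
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / delta)
```

The resulting matrix is symmetric with unit diagonal, so it can be fed directly into the softmax-based perceptual connectivity step.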
Inner Product Measure. Compared to distance metrics, inner product metrics focus more on how well the node feature vectors are aligned in the same space. To quantify this similarity, this paper uses Pearson’s correlation coefficient to explain the strength of the connection between nodes:
$$\mathcal{R}_I( x_i^{\ell,Z}, x_j^{\ell,Z} ) = \mathrm{dot}\!\left( \frac{ x_i^{\ell,Z} - \mu_i }{ \sigma_i + \epsilon }, \frac{ x_j^{\ell,Z} - \mu_j }{ \sigma_j + \epsilon } \right)^2$$
where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the feature vector $x_i^{\ell,Z}$ of node $V_i$, respectively, and $\epsilon$ is a very small number that avoids division by zero. Since the Pearson correlation $\mathcal{R}_I \in [-1, 1]$, we apply the squaring operation. When $Z \in \mathbb{B}^f$, the dot product is performed in tangent space, i.e., $\mathrm{dot}_o^c( x_i^{\mathbb{B}}, x_j^{\mathbb{B}} ) = \log_o^c( x_i^{\mathbb{B}} ) \cdot \log_o^c( x_j^{\mathbb{B}} )$.
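The inner-product relation can be sketched as follows. One assumption is added for illustration: the dot product of the standardized vectors is normalized by the feature dimension, so that the value is a Pearson correlation in $[-1, 1]$ before squaring (the paper's formula leaves this normalization implicit).

```python
import numpy as np

def relation_inner_product(Z, eps=1e-8):
    """Inner-product relation R_I: squared Pearson correlation between
    per-node standardized feature vectors.

    Z: (n, f) embedded node signals; returns an (n, n) matrix in [0, 1]."""
    mu = Z.mean(axis=1, keepdims=True)       # per-node feature mean
    sd = Z.std(axis=1, keepdims=True)        # per-node feature std
    Zs = (Z - mu) / (sd + eps)               # standardize each node vector
    corr = (Zs @ Zs.T) / Z.shape[1]          # Pearson correlation in [-1, 1]
    return corr ** 2                         # squared, so R_I in [0, 1]
```

Squaring makes strongly anti-correlated pairs score as high as strongly correlated ones, which matches the text's treatment of $\mathcal{R}_I$ as a magnitude of alignment.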

3.3. Neighborhood Aggregation

In graph neural networks, the aggregation function, as the core component of information integration, is designed to efficiently extract the features of nodes and their local neighborhoods through an iterative process at each layer. The design of the aggregation function must satisfy permutation invariance (PI): the aggregation result should remain unchanged no matter how the order of the nodes is permuted. Given an input graph $G = (V, E, X)$, the graph learning function $\mathcal{P}$ constructs a geometrically structured neighborhood $\bar{N}_i = \{ V_j \mid V_j \in V : (V_i, V_j) \in P^{\ell,\mathbb{V}} \}$ under the joint action of the geometric metric function $\mathcal{R}$ in Equation (13) and the constraint function $\mathcal{L}$ in Equation (16), which constitutes a metric space connecting potentially similar nodes in the graph. We can directly use this neighborhood in the aggregation function to aggregate and update node features. To enhance the sensitivity of the aggregation function to different structural features, especially the local structural information of the nodes, we introduce the center correction unit $\psi(x_i)$ to adjust the features of the center node during aggregation. Ultimately, we design a permutation-invariant aggregation function in Equation (6) (see Proof A5 in Appendix B).

3.4. Constraint Function

Learning a topology from features is inherently ill-posed: without explicit regularization, the inferred connectivity can easily collapse to (i) a nearly dense graph (trivial high-connectivity solution) or (ii) a highly smooth topology that accelerates oversmoothing in deep message passing [36]. We therefore train G2LNet with a joint objective that combines the node classification loss and an end-to-end topology regularizer:
$$\mathcal{L} = \mathcal{L}_{\text{task}} + \sum_{\ell=0}^{L-1} \mathcal{L}_{G2LNet}^{\ell+1}, \qquad \mathcal{L}_{\text{task}} = \mathrm{CE}( \hat{Y}, Y ),$$
where $\mathrm{CE}(\cdot)$ is the cross-entropy on labeled nodes, and $\mathcal{L}_{G2LNet}^{\ell+1}$ regularizes the learned communication probabilities $P^{\ell+1,\mathbb{V}}$.
Firstly, to make the regularization well-defined across different embedding spaces, we explicitly specify the spatial state $\mathbb{V}$ in which feature discrepancies are measured. Let $x_i^{\ell,Z} = \psi(x_i^\ell)$ denote the mapped node embedding used for topology inference. When $\mathbb{V} = \mathbb{R}$, we measure differences directly in Euclidean space; when $\mathbb{V} = \mathbb{B}$, we first project hyperbolic embeddings to the tangent space (at $o$) via the logarithmic map, so that standard vector norms are applicable. Concretely,
$$\tilde{x}_i^{\ell,Z} = \begin{cases} x_i^{\ell,Z}, & \mathbb{V} = \mathbb{R}, \\ \log_o^c( x_i^{\ell,Z} ), & \mathbb{V} = \mathbb{B}, \end{cases}$$
and we impose a Laplacian-style smoothness penalty that permits large communication probabilities only when the mapped features are close:
$$\mathcal{L}_{\text{sm}}^{\ell+1} = \lambda_1 \sum_{i,j} P_{ij}^{\ell+1,\mathbb{V}} \, \| \tilde{x}_i^{\ell,Z} - \tilde{x}_j^{\ell,Z} \|_2^2 = \lambda_1 \mathrm{Tr}\big( \tilde{X}^\top L_{P^{\ell+1}} \tilde{X} \big),$$
where $\tilde{X}$ stacks the $\tilde{x}_i^{\ell,Z}$, $L_{P^{\ell+1}}$ is the (possibly normalized) Laplacian of $P^{\ell+1,\mathbb{V}}$, and $\lambda_1$ controls the strength of smoothing. Since $P^{\ell+1,\mathbb{V}}$ is inferred from the geometric relation measure, Equation (18) regularizes how the topology is formed, discouraging high weights between geometrically dissimilar nodes while still permitting non-local edges when they are geometrically supported.
Secondly, to prevent the inferred topology from degenerating into a dense graph and to enforce a selective communication pattern, we penalize the magnitude of $P^{\ell+1,\mathbb{V}}$:
$$\mathcal{L}_{\text{spa}}^{\ell+1} = \lambda_2 \sum_{i,j} \big( P_{ij}^{\ell+1,\mathbb{V}} \big)^2,$$
which is particularly important in the non-local and graph-free regimes, where the number of candidate node pairs grows rapidly. If an undirected topology is required, we further add a symmetry regularizer:
$$\mathcal{L}_{\text{sym}}^{\ell+1} = \frac{\lambda_3}{2} \sum_{i,j} \big( P_{ij}^{\ell+1,\mathbb{V}} - P_{ji}^{\ell+1,\mathbb{V}} \big)^2,$$
where $\lambda_3$ controls the strength of symmetry enforcement. We treat this term as optional because directed communication can be beneficial when $P$ is interpreted as asymmetric information flow.
Finally, the topology regularizer at layer $\ell + 1$ is
$$\mathcal{L}_{G2LNet}^{\ell+1} = \mathcal{L}_{\text{sm}}^{\ell+1} + \mathcal{L}_{\text{spa}}^{\ell+1} + \mathcal{L}_{\text{sym}}^{\ell+1}.$$
Together, these constraints (i) couple topology inference to geometric feature consistency, (ii) avoid trivial dense solutions by enforcing a communication budget, and (iii) stabilize training by regulating propagation strength, thereby supporting topology inference under local, non-local, and graph-free regimes.
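The three constraint terms for a single layer can be sketched as one NumPy routine; this is an illustrative, unnormalized version (no normalized Laplacian), with hypothetical names and default weights.

```python
import numpy as np

def topology_regularizer(P, Z, lam1=1.0, lam2=1.0, lam3=0.0):
    """Illustrative L_G2LNet for one layer: smoothness + sparsity + symmetry.

    P: (n, n) learned communication probabilities;
    Z: (n, f) mapped embeddings (tangent-space vectors in the hyperbolic case)."""
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    l_sm = lam1 * (P * sq).sum()                  # Laplacian-style smoothness penalty
    l_spa = lam2 * (P ** 2).sum()                 # discourage dense topologies
    l_sym = 0.5 * lam3 * ((P - P.T) ** 2).sum()   # optional symmetry enforcement
    return l_sm + l_spa + l_sym
```

Because the whole expression is differentiable in `P` and `Z`, the same form can be added to the task loss and optimized end-to-end in an autodiff framework.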

3.5. Computational Complexity Analysis

We compare G2LNet with representative local and non-local methods, focusing on the additional cost of relation computation. Let | V | be the number of nodes, | E | the number of observed edges, and f the embedding dimension.
Local Aggregation. When candidates are restricted to $j \in N_i$ (i.e., $O(|E|)$ pairs), relation computation scales as $O(|E| f)$ for both inner-product and distance relations (after embedding).
Non-local/graph-free Aggregation. When candidates include all pairs (or a large expanded set), relation computation scales as $O(|V|^2 f)$ for both relation types.

4. Experiments

4.1. Model and Experimental Setup

4.1.1. Datasets

To systematically evaluate the node classification performance of the proposed method under different graph structural regimes, we adopt nine widely used public benchmarks, including three citation networks with strong homophily (Cora, Citeseer, Pubmed) and six web/actor graphs that are commonly regarded as heterophilous benchmarks (Chameleon, Squirrel, Actor, Cornell, Texas, Wisconsin). Their basic statistics and structural attributes are summarized in Table 1. We evaluate all methods on the semi-supervised node classification task.
Homophilous graphs. Cora, Citeseer, and Pubmed were introduced in [38,39], where nodes represent scientific publications and edges represent citation links. Each node is associated with a bag-of-words feature vector extracted from the document content, and the node label corresponds to the research topic/category. For these citation datasets, we follow two widely adopted semi-supervised evaluation protocols:
  • Experimental Setting A. We randomly select 20 labeled nodes per class for training, and follow the standard validation/test split used by GCN [12] and GAT [10]. To reduce the effect of randomness, we repeat experiments under random seeds 0–100 (100 runs) and report the average test accuracy.
  • Experimental Setting B. Following GEOM-GCN [28], nodes in each class are randomly split into 60%/20%/20% for training/validation/testing. We report the average test performance over 10 different random splits.
Heterophilous graphs. Chameleon and Squirrel were introduced by [40], where nodes denote Wikipedia pages, edges correspond to hyperlinks, node features are bag-of-words representations of nouns extracted from pages, and labels are grouped into five classes according to the average monthly traffic. Actor was derived from [41], where nodes represent actors, edges encode co-occurrence relations, features are keyword bag-of-words vectors, and labels correspond to five film/genre topic categories. Cornell, Texas, and Wisconsin are from the WebKB project, where nodes correspond to webpages, edges are hyperlinks, node features are bag-of-words representations, and labels belong to five categories (student, project, course, staff, and faculty). All heterophilous benchmarks are obtained from the unified preprocessing released by GEOM-GCN [28]. For these datasets, we adopt Experimental Setting B to ensure direct comparability with prior work [28,40].
Structural Metrics. The metric $H(G)$ in Table 1 is the graph homophily ratio introduced in [28], which measures the fraction of same-label neighbors in the observed topology. To further quantify the label consistency of the geometrically structured neighborhood $\bar{N}_i^\ell$ learned by G2LNet at layer $\ell$, we extend $H(G)$ to a weighted neighborhood homophily metric and analyze its evolution across layers in Section 4.2.2. Formally, the weighted homophily at layer $\ell$ is defined as
$$H^\ell(G) = \frac{1}{|V|} \sum_{i \in V} \frac{ \sum_{j \in \bar{N}_i^\ell} P_{ij}^\ell \, \mathbb{I}[ Y_i = Y_j ] }{ \sum_{j \in \bar{N}_i^\ell} P_{ij}^\ell },$$
where $P_{ij}^\ell$ denotes the learned communication weight, and $\mathbb{I}[\cdot]$ is the indicator function.
Moreover, considering that the learned geometric neighborhood is typically sparse and may capture non-local dependencies beyond explicit edges, we introduce a weighted neighborhood sparsity rate $S^\ell(G)$ to approximately quantify how many effective connections are activated by the learned topology:
$$S^\ell(G) = \frac{1}{|V|^2} \sum_{i,j \in V} \mathbb{I}\big[ P_{ij}^\ell > \tfrac{1}{|E|} \big].$$
Here, $\frac{1}{|E|}$ serves as a simple threshold separating weak contributions from active information paths: if $P_{ij}^\ell \le \frac{1}{|E|}$, the corresponding message-passing link is considered to have negligible influence and can be approximately ignored. Finally, to revisit the local vs. non-local attention mechanism on heterophilous graphs in Section 4.2.2, we further report the two-hop message-passing upper reference obtained by SGC [42] under Experimental Setting A (corresponding to MA1) and Experimental Setting B (corresponding to MA2), which provides a consistent performance reference for interpreting dataset-dependent behavior.
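Both structural metrics are straightforward to compute from a learned probability matrix; a minimal sketch (with illustrative function names) follows:

```python
import numpy as np

def weighted_homophily(P, y):
    """H^l(G): average fraction of each node's learned communication weight
    that goes to same-label neighbors.

    P: (n, n) learned communication weights; y: (n,) integer labels."""
    same = (y[:, None] == y[None, :]).astype(float)   # label-agreement mask
    num = (P * same).sum(axis=1)                      # weight to same-label nodes
    den = P.sum(axis=1).clip(1e-12)                   # total outgoing weight
    return (num / den).mean()

def weighted_sparsity(P, num_edges):
    """S^l(G): fraction of node pairs whose learned weight exceeds 1/|E|."""
    return (P > 1.0 / num_edges).mean()
```

For example, with three nodes labeled [0, 0, 1] and uniform weights of 1/3, the first two rows keep 2/3 of their weight within their class and the third keeps 1/3, giving a weighted homophily of 5/9.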

4.1.2. Experimental Setting

Implementation and hardware. All experiments are conducted on a workstation equipped with 15 vCPUs (Intel(R) Xeon(R) Platinum 8474C) and one NVIDIA GeForce RTX 4090D GPU (24 GB). We implement our models in PyTorch 2.1.2 [43] and PyTorch Geometric 2.7.0 [44].
Model variants and notation. For clarity, we denote a G2LNet variant by the combination of (i) the geometric space and (ii) the relation operator. Specifically, we write $\mathbb{V} \in \{\mathbb{R}, \mathbb{B}\}$ for Euclidean and hyperbolic (Poincaré ball) spaces, respectively, and $\mathcal{R} \in \{R_d, R_I\}$ for distance- and inner-product-based relation operators. For example, $\mathbb{R}$-G2LNet-$R_d$ denotes the Euclidean-distance variant. Unless otherwise specified, the embedding function is set to $\psi_z = \mathrm{MLP}(\cdot)$.
Training details. Across all node classification tasks, we adopt a two-layer architecture for G2LNet. Bias terms are initialized to zeros (or sampled from a standard normal distribution), while all other trainable parameters are initialized using Glorot initialization [45]. We optimize with Adam [46] and use ReLU activations in all graph convolution layers. The hidden dimension is set to 32 for PubMed and Actor, and to 64 for the remaining datasets. The key hyperparameters, including the number of training epochs, learning rate, weight decay, dropout rate, curvature $c$, perception factor $\alpha$ in Equation (13), and the regularization strengths $\lambda_1$, $\lambda_2$, and $\lambda_3$ for $\mathcal{L}_{\mathrm{sm}}$ (Equation (18)), $\mathcal{L}_{\mathrm{spa}}$ (Equation (19)), and $\mathcal{L}_{\mathrm{sym}}$ (Equation (20)), respectively, are tuned on the validation set and summarized in Table 2. Moreover, we observe stable optimization behavior across G2LNet and all its variants; thus, we do not employ any early-stopping strategy and instead train all models for the pre-specified number of epochs reported in Table 2, within which the training/validation performance reliably converges. Dropout is applied after each graph convolution layer, i.e., after the aggregation operation in Equation (6), and this placement is kept consistent across all variants unless otherwise stated.
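For reference, Glorot initialization draws each weight from $U(-a, a)$ with $a = \sqrt{6/(\text{fan\_in} + \text{fan\_out})}$. A minimal NumPy sketch of the two-layer parameter setup (the feature and class dimensions below are illustrative, e.g. Cora's, and are not prescribed by Table 2):

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=None):
    """Glorot (Xavier) uniform initialization [45]:
    W ~ U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

# Two-layer setup: input dim -> hidden (64; 32 for PubMed/Actor) -> classes.
# Bias terms start at zero, matching one of the two options in the text.
W1, b1 = glorot_uniform(1433, 64), np.zeros(64)   # 1433 = Cora feature dim
W2, b2 = glorot_uniform(64, 7), np.zeros(7)       # 7 = Cora classes
```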

4.1.3. Baselines

To comprehensively evaluate the proposed G2LNet under diverse graph heterogeneity, we consider two widely used semi-supervised protocols (Experimental Setting A and Experimental Setting B; see Section 4.1.2) and compare with representative baselines covering perceptron models, fixed-graph GNNs, and graph-structure learning/non-local aggregation methods.
Experimental Setting A. Following the classical 20-labels-per-class setup [47], baseline methods include: (1) Two linear models: MLP, HMLP; (2) Fixed-graph GNNs: ChebyNet [11], GCN [12], and GWNN [13]; (3) Graph learning networks: MoNet [9], GAT [10], ADSF-RWR [21], GLNN [20], and GLCN [18].
Experimental Setting B. Following GEOM-GCN [28], we adopt the 60/20/20 train/val/test split and compare with: (1) Local aggregation models: MLP, HMLP, GCN, and GAT; (2) Homophily-enhancement model: WRGAT [31]; (3) Non-local aggregation models: GEOM-GCN (three variants) [28], H2GCN (two variants) [30], and NLGNN (three variants) [29].
For baselines where the original papers report carefully tuned best performance under the same protocol, we cite their reported numbers directly. In addition, we re-implement MLP/HMLP/GCN/GAT and our methods in PyTorch [43] and PyTorch Geometric [44], and denote re-implemented baselines by “*”. As discussed in prior work, strictly controlled re-implementations may differ from the numbers reported in [10,12,28,29,30,31].

4.2. Experimental Results

Table 3 and Table 4 summarize the node classification results under Setting A (few labels) and Setting B (more labels), respectively. The results under both experimental settings show that the proposed controlled variants of G2LNet achieve the best overall ranking on nine datasets, demonstrating the effectiveness of the proposed method. To more clearly identify the sources of the performance gains, we further report classification performance under three controlled configurations for both settings: (i) non-local neighborhoods, (ii) local neighborhoods, and (iii) without using the given graph.

4.2.1. Experimental Setting A

Under Setting A (as shown in Table 3), the proposed non-local G2LNet variants consistently outperform traditional local aggregation methods. In particular, the best-performing G2LNet variant achieves the best performance on all three citation network datasets, improving over the best baseline on each dataset by 1.08%, 0.90%, and 2.12% on Cora, CiteSeer, and PubMed, respectively. Notably, even without relying on the graph structure, the proposed G2LNet variants outperform the two linear models (MLP and HMLP) on all three datasets, and achieve competitive performance comparable to GCN and GAT, especially on CiteSeer. Moreover, the classification performance of local methods is substantially lower than that of the non-local G2LNet variants. These results indicate that, under label-scarce scenarios, non-local aggregation is particularly important for homophilous graphs, since modeling non-local dependencies can more effectively capture homophilous connections and long-range label consistency when supervision is limited.

4.2.2. Experimental Setting B

Under Setting B (as shown in Table 4), our non-local G2LNet variants show competitive advantages over the best three GEOM-GCN variants and the best three NLGNN variants on both homophilous and heterophilous graphs. Without relying on the graph structure, the G2LNet variants outperform the two linear models (MLP and HMLP) on all datasets, and obtain performance comparable to GCN and GAT on several datasets (CiteSeer, PubMed, Actor, and Cornell). In particular, they achieve the best performance on the Texas and Wisconsin datasets. These results further validate that the effectiveness and generalization capability of the proposed method originate from the model itself, rather than being an incidental outcome under a specific data split.
We also observe that, on the Actor dataset, all local aggregation methods perform worse than the two linear models, and the improvement brought by non-local methods is limited. In addition, on the Cornell dataset, our method underperforms NLMLP, WRGAT, and two H2GCN variants. This may be attributed to the relatively smooth attention scores on this dataset, where the attention-guided ranking mechanism in NLMLP, the homophily regularization in WRGAT, and the homophily-path aggregation strategy in H2GCN play key roles. To better explain this phenomenon, we visualize the changes in validation accuracy, the average weighted homophily rate, and the two-hop neighborhood weighted sparsity rate of $\mathbb{R}$-G2LNet-$R_I$ on the Actor, Cornell, and Wisconsin datasets over the training iterations. As shown in Figure 2, on Actor the increase in the average weighted neighborhood homophily $H(G)$ is not significant, and the decrease in the weighted sparsity rate $S(G)$ is limited. On Cornell, both the increase in $H(G)$ and the decrease in $S(G)$ remain marginal. In contrast, on Wisconsin, where the model performs well, $H(G)$ increases substantially and $S(G)$ approaches the globally optimal sparsity rate $MS(G)$ reported in Table 1. This phenomenon suggests that, on Actor and Cornell, the correlation between node features and class labels is relatively weak, which makes it difficult for our method to learn effective relational connections among nodes. Conversely, on Wisconsin, node features are more strongly correlated with class labels, enabling our method to more easily learn effective connections and suppress noisy edges.

4.2.3. Controlled Variant Analysis

First, under Setting A, the non-local variants significantly outperform the local variants, indicating that cross-neighborhood dependency modeling constitutes the main source of performance gains in the low-label regime. Under Setting B, the graph-free G2LNet variants achieve better performance than other methods on the Texas and Wisconsin datasets, with NLMLP ranking second. Moreover, the performance of local graph convolution operators is clearly inferior to that of MLP. However, on the heterophilous Chameleon and Squirrel datasets, the non-local G2LNet variants underperform the local G2LNet variants. Similar phenomena are also observed in GEOM-GCN and NLGNN. For example, the locally aggregated GEOM-GCN-(I,P,S)-g outperforms the other GEOM-GCN variants, and NLGCN and NLGAT (which use local filters for embeddings) outperform NLMLP. These results suggest that, on some heterophilous graphs, aggregating neighborhood information may introduce unfavorable noise and degrade performance, while on other heterophilous graphs, neighborhood information can guide effective message passing. This can be explained by the $MA$ values in Table 1. Specifically, the $MA_2$ values of Chameleon and Squirrel are 95%, indicating that local modeling already provides strong reference performance under this protocol. Therefore, non-local propagation is more likely to introduce over-mixing or noise accumulation due to label confusion. In contrast, for datasets with lower $MA_2$ values (e.g., the WebKB series), modeling non-local dependencies is often more helpful for breaking through the local upper bound and improving overall performance.
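The MA reference itself follows the recipe stated in Appendix A: remove inter-class connections from the two-hop neighborhood $A^2$, then test accuracy using SGC. A NumPy sketch of the filtered propagation matrix (our simplification; the SGC training loop itself is omitted):

```python
import numpy as np

def ma_propagation(A, y):
    """Two-hop propagation for the MA upper reference: build A^2 with
    self-loops, drop inter-class links using ground-truth labels, and
    row-normalize. SGC-style features are then P2 @ X, fed to a
    linear classifier."""
    A_hat = A + np.eye(A.shape[0])           # self-loops
    A2 = (A_hat @ A_hat > 0).astype(float)   # two-hop connectivity
    A2 *= (y[:, None] == y[None, :])         # keep intra-class links only
    deg = A2.sum(axis=1, keepdims=True)
    return A2 / np.maximum(deg, 1.0)         # row-stochastic propagation
```

Because label confusion is removed by construction, accuracy on features propagated this way acts as an upper reference; the gap to it indicates how much a given topology can still gain from structure learning.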
Second, under both experimental settings, a notable observation is that the performance gap between $R_d$ and $R_I$ in hyperbolic space is generally smaller than that in Euclidean space, suggesting that the flexibility introduced by curvature reduces the sensitivity of the model to the choice of relational operators. In addition, we observe that, under both experimental settings, hyperbolic variants are overall more competitive on most datasets, especially on Chameleon, where the hyperbolic variants significantly outperform the Euclidean variants. To further validate this phenomenon, we visualize the Poincaré disk embeddings of partial nodes on the Chameleon and Cora datasets using $\mathbb{R}$-G2LNet-$R_I$ and $\mathbb{B}$-G2LNet-$R_I$. As shown in Figure 3, on Chameleon, $\mathbb{R}$-G2LNet-$R_I$ can hardly capture hierarchical structure, whereas $\mathbb{B}$-G2LNet-$R_I$ preserves the node hierarchy. On Cora, both $\mathbb{R}$-G2LNet-$R_I$ and $\mathbb{B}$-G2LNet-$R_I$ reflect hierarchical structures, while $\mathbb{B}$-G2LNet-$R_I$ separates them more clearly. Therefore, on Chameleon, $\mathbb{B}$-G2LNet-$R_I$ leads to better class separation and thus better performance, whereas on Cora the performances are comparable. These results indicate that hyperbolic space is intuitively more suitable for characterizing uneven and hierarchical graph relations on certain datasets, thereby making non-local message passing more robust.

4.2.4. Computational Efficiency Analysis

We evaluate computational efficiency on PubMed (a larger-scale graph) and Texas (a smaller-scale graph) by comparing GCN, GAT, and GEOM-GCN-I, as well as the proposed local/non-local variants of $\mathbb{R}$-G2LNet-$R_I$ and $\mathbb{B}$-G2LNet-$R_I$ in Euclidean and hyperbolic spaces. We report the peak GPU memory usage and the average runtime per epoch in Figure 4. Overall, the proposed local variants remain efficient and exhibit a computational profile comparable to standard message-passing baselines, indicating that introducing topology refinement does not significantly increase the overhead in the local regime. In addition, hyperbolic variants are consistently slower than their Euclidean counterparts, since hyperbolic operations require extra exponential and logarithmic mappings between the manifold and tangent spaces. In contrast, the non-local variants incur substantially higher memory consumption and runtime on larger-scale graphs, mainly due to the quadratic complexity induced by dense non-local interactions; this is consistent with GEOM-GCN-I, as both approaches involve quadratic-cost topology modeling. Moreover, our non-local variants introduce additional quadratic overhead from the topology-constraining objective, as the loss terms also require pairwise relation computations. Consequently, non-local topology learning becomes considerably more expensive on large graphs, which also explains the out-of-memory (OOM) issue observed for $\mathbb{B}$-G2LNet-$R_d$ in Table 3 and Table 4 on the PubMed dataset: the hyperbolic distance formulation introduces an extra $O(N^2)$ computational factor compared with the hyperbolic angle, making it more demanding in practice. Meanwhile, the non-local variants remain feasible on small graphs. These results highlight a clear trade-off between global dependency modeling and computational efficiency, and support using local variants as the default choice in larger-scale scenarios.
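The quadratic footprint is easy to bound from first principles: a single dense float32 relation matrix over $N$ nodes already costs $4N^2$ bytes, before autograd buffers and the pairwise loss terms are stored. A quick sanity check:

```python
def pairwise_memory_gb(n, bytes_per_el=4):
    """Memory of one dense n x n float32 matrix, in GiB."""
    return n * n * bytes_per_el / 1024**3

# PubMed has ~19,717 nodes: one dense P alone needs ~1.45 GiB, and
# gradient storage plus pairwise regularizers multiply this several
# times over, consistent with OOM on a 24 GB GPU for the costliest
# non-local hyperbolic-distance variant.
pubmed_gb = pairwise_memory_gb(19717)
```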

4.3. Ablation Analysis

In the previous subsection, we showed that the proposed non-local variants achieve the best performance under the low-label Setting A and remain competitive under the more-label Setting B. In this subsection, we conduct an ablation study of the overall best-performing $\mathbb{B}$-G2LNet-$R_I$ on the homophilous Cora and PubMed datasets (Setting A) as well as the heterophilous Chameleon, Cornell, and Wisconsin datasets (Setting B). Specifically, we remove the center correction unit (CCU, Equation (6)), the perception connection (PC, Equation (13)), and the constraint objective $\mathcal{L}_{\mathrm{G2LNet}}$ (Equation (21)) to quantify their contributions. The results are summarized in Table 5.
As reported in Table 5, both CCU and PC consistently improve accuracy over the plain backbone across all datasets. Notably, CCU yields larger improvements on heterophilous graphs, e.g., improving Chameleon from 75.28% to 76.98% and Wisconsin from 82.17% to 86.43%, which is consistent with its role of enhancing the individuality of the center node in a residual-like manner and preventing the center representation from being overwhelmed by potentially label-inconsistent neighbor messages. By contrast, PC provides more evident gains on homophilous datasets, e.g., improving Cora from 84.14% to 84.77% and PubMed from 81.11% to 81.73%, since it explicitly enlarges the influence of neighbors and thus better exploits label-consistent neighborhood correlations. Combining CCU and PC further improves performance across all datasets, confirming their complementarity in preserving center individuality while strengthening neighbor influence. Moreover, introducing $\mathcal{L}_{\mathrm{G2LNet}}$ further strengthens the model, and the best results are achieved when it is jointly applied with CCU and PC. The full model (CCU + PC + $\mathcal{L}_{\mathrm{G2LNet}}$) attains the highest accuracy on all evaluated datasets, reaching 86.58% on Cora and 87.16% on Wisconsin, which demonstrates the effectiveness of jointly optimizing topology learning and feature propagation.
We further decompose $\mathcal{L}_{\mathrm{G2LNet}}$ in Equation (21) into three regularizers: the hyperbolic global smoothness regularization $\mathcal{L}_{\mathrm{sm}}$ (Equation (18)), the sparsity regularization $\mathcal{L}_{\mathrm{spa}}$ (Equation (19)), and the symmetry regularization $\mathcal{L}_{\mathrm{sym}}$ (Equation (20)). The results are summarized in Table 6, where the baseline corresponds to the architecture with CCU and PC enabled (i.e., without $\mathcal{L}_{\mathrm{G2LNet}}$). As shown in Table 6, $\mathcal{L}_{\mathrm{sm}}$ mainly benefits homophilous graphs by promoting globally consistent relation learning (e.g., improving Cora to 85.36% and PubMed to 82.64%), while $\mathcal{L}_{\mathrm{spa}}$ yields clearer improvements on heterophilous graphs by suppressing redundant or noisy interactions (e.g., improving Cornell to 73.83% and Wisconsin to 86.59%). Moreover, combining multiple regularizers further enhances performance, and applying all three terms jointly achieves the best overall results (e.g., reaching 83.32% on PubMed and 74.51% on Cornell). Overall, these ablation results confirm that the improvements of $\mathbb{B}$-G2LNet-$R_I$ originate from the synergy among (i) CCU for preserving center individuality, (ii) PC for strengthening neighbor influence, and (iii) $\mathcal{L}_{\mathrm{G2LNet}}$ for providing complementary inductive biases on smoothness, sparsity, and symmetry, which jointly enhance robust topology learning under both homophilous and heterophilous regimes.
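To make the roles of the three terms concrete, the sketch below implements generic stand-ins: the exact forms of Equations (18)-(20) are not reproduced here (in particular, $\mathcal{L}_{\mathrm{sm}}$ is defined in hyperbolic space, which this Euclidean Dirichlet-energy surrogate only approximates), so these are illustrative rather than the paper's definitions.

```python
import numpy as np

def sparsity_reg(P):
    """L1-style surrogate for L_spa: a small mean magnitude pushes the
    learned communication matrix toward few active links."""
    return np.abs(P).mean()

def symmetry_reg(P):
    """Surrogate for L_sym: penalizes asymmetric communication,
    i.e., P_ij transmitting much more than P_ji."""
    return ((P - P.T) ** 2).mean()

def smoothness_reg(H, P):
    """Euclidean Dirichlet-energy surrogate for L_sm: strongly
    connected node pairs are encouraged to have close embeddings."""
    d2 = ((H[:, None, :] - H[None, :, :]) ** 2).sum(-1)
    return (P * d2).mean()
```

A combined objective then reads $\mathcal{L}_{\mathrm{task}} + \lambda_1 \mathcal{L}_{\mathrm{sm}} + \lambda_2 \mathcal{L}_{\mathrm{spa}} + \lambda_3 \mathcal{L}_{\mathrm{sym}}$, mirroring the weighting tuned in Section 4.4.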

4.4. Hyper-Parameter Analysis

In this subsection, we perform a key hyper-parameter study for the overall best-performing $\mathbb{B}$-G2LNet-$R_I$ variant on the homophilous Cora dataset under Setting A and the heterophilous Wisconsin dataset under Setting B. We analyze four critical hyper-parameters: the perception factor $\alpha$ in Equation (13), the regularization strength $\lambda_1$ of the hyperbolic global smoothness term $\mathcal{L}_{\mathrm{sm}}$ in Equation (18), the regularization strength $\lambda_2$ of the sparsity term $\mathcal{L}_{\mathrm{spa}}$ in Equation (19), and the regularization strength $\lambda_3$ of the symmetry term $\mathcal{L}_{\mathrm{sym}}$ in Equation (20). The results are reported in Figure 5a–d.
As shown in Figure 5a, $\alpha$ plays distinct roles on homophilous and heterophilous graphs. On Cora (homophilous), increasing $\alpha$ consistently improves performance, and the best accuracy is achieved with a relatively large $\alpha$ (e.g., $\alpha = 0.9$). This observation is intuitive, since a larger $\alpha$ increases the contribution of the learned topology and strengthens neighborhood-driven propagation, which is beneficial when neighbors are more likely to share consistent labels. In contrast, on Wisconsin (heterophilous), performance degrades noticeably as $\alpha$ increases, and the best result is obtained when $\alpha$ is close to 0. This suggests that aggressively amplifying topology-aware aggregation may introduce label-inconsistent messages on heterophilous graphs, where neighbors are not necessarily semantically aligned. Therefore, we choose a moderate $\alpha$ that balances topology utilization and center-feature reliability across different homophily regimes.
As shown in Figure 5b, $\lambda_1$ controls the strength of the global smoothness regularization. We observe that a moderate smoothness constraint benefits Cora, indicating that encouraging globally consistent relations helps exploit long-range label agreement on homophilous graphs. However, an overly large $\lambda_1$ tends to harm Wisconsin, since enforcing excessive smoothness may over-regularize the learned topology and blur discriminative patterns when the graph is heterophilous. These results support the use of a dataset-adaptive smoothness intensity.
As shown in Figure 5c, $\lambda_2$ encourages the learned connections to be sparse. Moderate sparsity generally improves robustness on both datasets, while overly strong sparsity may remove informative edges and reduce performance. This trend is more critical for heterophilous graphs, where sparsity acts as an effective mechanism to suppress potentially noisy neighbor interactions, thus preventing harmful message passing.
As shown in Figure 5d, $\lambda_3$ promotes the symmetry of learned connections. We find that a small but non-zero $\lambda_3$ yields stable improvements, suggesting that symmetry regularization helps stabilize relation construction and reduces biased propagation. As with the other regularizers, an excessively large $\lambda_3$ may limit the flexibility of relation learning, especially on heterophilous graphs. Overall, the best performance is obtained when $\lambda_1$, $\lambda_2$, and $\lambda_3$ are jointly set to moderate values, balancing smoothness, sparsity, and symmetry for robust topology learning.

5. Conclusions

This paper proposes the Geometric Graph Learning Network (G2LNet), which treats the attention mechanism as a learnable communication probability induced by explicit geometric relationships between node representations, thereby enabling topological inference beyond observed adjacency relations. G2LNet unifies local, non-local, and graph-free neighborhoods within a single architecture, supports relational operators in both Euclidean and hyperbolic spaces, and introduces an end-to-end constrained objective to regularize the inferred connections. Experiments across nine benchmarks demonstrate that G2LNet's controlled variants achieve the best overall classification performance on most datasets.
Future work will prioritize scaling non-local and graph-free inference, developing principled criteria for selecting relation operators and neighborhood regimes, and extending the framework to dynamic and multi-relational graphs. Beyond semi-supervised node classification, the proposed geometric topology inference is applicable to settings with noisy or missing edges, including web-scale information networks, recommendation, and other heterogeneous relational learning problems.

Author Contributions

Conceptualization, L.W.; methodology, L.W. and X.X.; software, L.W., X.X. and Z.L.; validation, L.W. and X.X.; writing–original draft preparation, L.W.; writing–review and editing, X.X. and Z.L.; supervision, X.X. and Z.L.; project administration, Z.L.; funding acquisition, Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by The Project for the Construction of Regional Innovation System in Jilin Province (YDZJ202504QYCX002).

Data Availability Statement

This study utilized publicly available datasets. The dataset can be accessed at https://github.com/bingzhewei/geom-gcn (accessed on 14 October 2025).

Conflicts of Interest

Author Lei Wang was employed by the company Jilin Gaofen Remote Sensing Application Institute Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A. Notations

The key mathematical symbols used in this work are listed in Table A1.
Table A1. Summary of Notations.
$G$: a graph
$V$: the vertex set of a graph
$E$: the edge set of a graph
$G(A, X)$: a graph $G$ with its adjacency matrix $A$ and node feature matrix $X$
$v_i$: the $i$-th node in a graph
$N_i$: the neighborhood of $v_i$
$A$: $A \in \mathbb{R}^{N \times N}$, the adjacency matrix of a graph
$\bar{A}$: the neighborhood matrix with self-loops, $\bar{A} = A + I$
$D$: the diagonal degree matrix, $D_{ii} = \sum_{k=1}^{N} A_{ik}$, and $D_{ij} = 0$ if $i \neq j$
$|V|$: the number of nodes in a graph $G$
$d$: the dimensionality of node features
$f$: the dimensionality of hidden units in the GNN
$X$: $X \in \mathbb{R}^{N \times d}$, node features
$x_i$: $x_i \in \mathbb{R}^{d}$, the features of node $v_i$
$C$: the number of node label categories
$Z$: $Z \in \mathbb{R}^{N \times C}$, model outputs
$Z_i$: $Z_i \in \mathbb{R}^{C}$, the output for node $v_i$
$Y$: $Y \in \mathbb{R}^{N \times C}$, one-hot ground-truth labels; $Y_{ic} = 1$ if $v_i$ is of class $c$, otherwise $Y_{ic} = 0$
$Y_i$: $Y_i \in \mathbb{R}^{C}$, the $i$-th row vector of $Y$
$\mathbb{B}_c^d$: the Poincaré manifold, $\mathbb{B}_c^d = \{x \in \mathbb{R}^d : \|x\|_2^2 < 1/c\}$
$c$: the curvature in the Poincaré ball model
$g_x^{\mathbb{B}_c^d}$: the metric of the Poincaré ball model, $g_x^{\mathbb{B}_c^d} = (\lambda_x^c)^2 g^{E_d}$, where $\lambda_x^c = \frac{2}{1 + c\|x\|_2^2}$ and $g^{E_d} = I_d$
$T_o\mathbb{B}^d$: the tangent space at the origin
$d_{\mathbb{B}_c}(x, y)$: the induced distance, $d_{\mathbb{B}_c}(x, y) = \frac{1}{\sqrt{|c|}} \cosh^{-1}\!\left(1 - \frac{2c\|x - y\|_2^2}{(1 + c\|x\|_2^2)(1 + c\|y\|_2^2)}\right)$
$\log_x^c(y)$: the logarithmic map, $\log_x^c(y) = \frac{2}{\sqrt{|c|}\,\lambda_x^c} \tanh^{-1}\!\left(\sqrt{|c|}\,\|{-x} \oplus_c y\|_2\right) \frac{-x \oplus_c y}{\|{-x} \oplus_c y\|_2}$
$\exp_x^c(v)$: the exponential map, $\exp_x^c(v) = x \oplus_c \left(\tanh\!\left(\sqrt{|c|}\,\frac{\lambda_x^c \|v\|_2}{2}\right) \frac{v}{\sqrt{|c|}\,\|v\|_2}\right)$
$PT_{x \to y}^c(v)$: parallel transport, $PT_{x \to y}^c(v) = \frac{\lambda_x^c}{\lambda_y^c}\,\mathrm{gyr}[y, -x]\, v$
$x \oplus_c y$: the Möbius addition in $\mathbb{B}_c^d$, $x \oplus_c y = \frac{(1 + 2c\langle x, y\rangle + c\|y\|^2)\,x + (1 - c\|x\|^2)\,y}{1 + 2c\langle x, y\rangle + c^2\|x\|^2\|y\|^2}$
$w \otimes_c x$: the Möbius scalar multiplication, $w \otimes_c x = \frac{1}{\sqrt{c}} \tanh\!\left(w \tanh^{-1}(\sqrt{c}\,\|x\|)\right) \frac{x}{\|x\|}$, where $c > 0$, $x \in T_o\mathbb{B}^d$, $w \in \mathbb{R}$
$P$: a square matrix determining the probability of message transmission between nodes
$\mathcal{R}$: a function measuring the similarity between node features
$\mathcal{L}_{\mathrm{task}}$: the supervised classification (cross-entropy) loss
$\mathcal{L}$: a loss function for optimizing the probability of message passing between nodes
$S$: the initial state of the messaging matrix, $S = I$ or $S = A$
$\phi$: a function aggregating neighbor information
$\mathrm{AGG}$: a function that sums, averages, or takes the maximum of the input feature vectors
$H(G)$: the average fraction of each node's neighbors that share its label
$\psi_\nu$: maps node features to other spaces
$\psi_z$: embeds features of that space into the hidden-variable space
$S(G)$: the proportion of active connections in a neighborhood
$MA_1$: with the same experimental setup as GCN, remove inter-class connections in $A^2$, then test accuracy using SGC
$MA_2$: with the same experimental setup as GEOM-GCN, remove inter-class connections in $A^2$, then test accuracy using SGC
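At the origin, $\lambda_o^c = 2$ and Möbius addition with the origin is trivial, so the exponential and logarithmic maps above reduce to simple closed forms. A small NumPy sanity check of their mutual inverse (assuming the $c > 0$ convention of the Möbius scalar multiplication row; the helper names are ours):

```python
import numpy as np

def exp0(v, c):
    """Exponential map at the origin of the Poincare ball (c > 0):
    maps a tangent vector v onto the ball of radius 1/sqrt(c)."""
    n = np.linalg.norm(v)
    return np.tanh(np.sqrt(c) * n) * v / (np.sqrt(c) * n)

def log0(y, c):
    """Logarithmic map at the origin: the inverse of exp0."""
    n = np.linalg.norm(y)
    return np.arctanh(np.sqrt(c) * n) * y / (np.sqrt(c) * n)

v = np.array([0.3, -0.2, 0.1])   # tangent vector at the origin
y = exp0(v, c=1.0)               # lands strictly inside the unit ball
v_back = log0(y, c=1.0)          # recovers v up to float error
```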

Appendix B. Relevant Theories and Proofs

Lemma A1
(Asymptotic indistinguishability under deep propagation). Consider a simplified GCN-style propagation on a connected graph with self-loops,
$$
H^{(\ell+1)} = \hat{A} H^{(\ell)}, \qquad \hat{A} = D^{-1}(A + I),
$$
where $D$ is the diagonal degree matrix of $A + I$, so that $\hat{A}$ is row-stochastic. Then, as $\ell \to \infty$, node representations become indistinguishable up to a rank-one limit, i.e., $H^{(\ell)}$ converges to the subspace spanned by the leading eigenvector of $\hat{A}$, and differences between node embeddings vanish.
Proof. 
Since $\hat{A}$ is row-stochastic and the graph is connected (with self-loops guaranteeing aperiodicity), $\hat{A}$ admits a dominant eigenvalue $1$ with a corresponding stationary distribution $\pi$. Moreover, $\hat{A}^{\ell}$ converges to a rank-one matrix of the form $\mathbf{1}\pi^{\top}$ under these standard ergodicity conditions. Therefore,
$$
H^{(\ell)} = \hat{A}^{\ell} H^{(0)} \longrightarrow \left(\mathbf{1}\pi^{\top}\right) H^{(0)}.
$$
The right-hand side has identical rows (each row equals $\pi^{\top} H^{(0)}$), implying that node embeddings become indistinguishable in the limit. This captures the oversmoothing phenomenon of deep propagation. □
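Lemma A1 is easy to verify numerically: on any small connected graph, iterating the row-stochastic propagation collapses all rows of $H^{(\ell)}$ onto $\pi^{\top} H^{(0)}$.

```python
import numpy as np

# Numerical check of Lemma A1: repeated row-stochastic propagation
# drives all node embeddings together (oversmoothing).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_loop = A + np.eye(4)                               # add self-loops
A_hat = A_loop / A_loop.sum(axis=1, keepdims=True)   # D^{-1}(A + I)

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))                          # random H^(0)
for _ in range(200):                                 # H^(l+1) = A_hat H^(l)
    H = A_hat @ H

# all rows converge to pi^T H^(0): pairwise differences vanish
spread = np.abs(H - H[0]).max()
```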
Lemma A2
(Connectivity becomes hard to distinguish under deep GCN embeddings). If a deep GCN-style embedding yields $x_i^{(\ell)} \approx x_j^{(\ell)}$ for most node pairs as $\ell$ grows, then any relation operator based on distances or inner products of embeddings becomes unable to reliably separate inter-class from intra-class connectivity.
Proof. 
By Lemma A1, for sufficiently large $\ell$, we have $x_1^{(\ell)} \approx x_2^{(\ell)} \approx x_3^{(\ell)}$ even when labels differ. Assume $Y_1 = Y_2$ but $Y_3 \neq Y_2$. A relation operator that aims to separate classes typically requires
$$
\left\| \psi(x_1^{(\ell)}) - \psi(x_2^{(\ell)}) \right\| < \varepsilon \quad \text{and} \quad \left\| \psi(x_2^{(\ell)}) - \psi(x_3^{(\ell)}) \right\| > k,
$$
with $k \gg \varepsilon$. However, if $x_1^{(\ell)} \approx x_2^{(\ell)} \approx x_3^{(\ell)}$, then continuity of $\psi$ implies $\psi(x_1^{(\ell)}) \approx \psi(x_2^{(\ell)}) \approx \psi(x_3^{(\ell)})$, which contradicts the requirement that $\left\| \psi(x_2^{(\ell)}) - \psi(x_3^{(\ell)}) \right\|$ be large. Hence, deep GCN embeddings make connectivity difficult to distinguish by geometric relations. □
Definition A1
(Permutation invariance and permutation equivariance). Let $P \in \mathbb{R}^{n \times n}$ be any permutation matrix. A mapping $f$ is permutation-invariant if $f(PX) = f(X)$. A mapping $f$ is permutation-equivariant if $f(PX) = P f(X)$.
Lemma A3.
Let $g(X) = f_2(f_1(X))$. If $f_1$ is permutation-invariant, then $g$ is permutation-invariant. If $f_1$ is permutation-equivariant and $f_2$ is permutation-invariant, then $g$ is permutation-invariant.
Proof. 
For any permutation matrix P,
g ( P X ) = f 2 ( f 1 ( P X ) ) .
If f 1 is permutation-invariant, then f 1 ( P X ) = f 1 ( X ) and thus g ( P X ) = f 2 ( f 1 ( X ) ) = g ( X ) . If f 1 is permutation-equivariant, then f 1 ( P X ) = P f 1 ( X ) ; since f 2 is permutation-invariant, f 2 ( P f 1 ( X ) ) = f 2 ( f 1 ( X ) ) , giving g ( P X ) = g ( X ) . □
Lemma A4.
If $f_1$ and $f_2$ are permutation-invariant, then $f_1 \pm f_2$, $f_1 \cdot f_2$, and $f_1 / f_2$ are permutation-invariant.
Proof. 
Take multiplication as an example. For any permutation matrix $P$,
$$
(f_1 \cdot f_2)(PX) = f_1(PX) \cdot f_2(PX) = f_1(X) \cdot f_2(X) = (f_1 \cdot f_2)(X).
$$
The other operations follow similarly. □
Lemma A5.
The aggregation function $\phi$ in Equation (6) is permutation-invariant.
Proof. 
The node-wise mappings used in $\psi$ (e.g., MLP and pointwise projections) are permutation-equivariant, while the neighborhood aggregation operator $\mathrm{AGG}$ (sum/mean/max) is permutation-invariant by definition. By Lemma A3, composing equivariant node-wise mappings with invariant aggregation yields a permutation-invariant output. Finally, by Lemma A4, adding residual/center-correction terms preserves permutation invariance. Therefore, $\phi$ is permutation-invariant. □
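A simplified, graph-free instance of this composition argument can be checked numerically: an equivariant node-wise map $\mathrm{ReLU}(XW)$ followed by an invariant sum over nodes (the helper names below are ours):

```python
import numpy as np

def agg_sum(X):
    """Permutation-invariant AGG: sum over the node axis."""
    return X.sum(axis=0)

def phi(X, W):
    """Equivariant node-wise map ReLU(XW) followed by invariant AGG,
    mirroring the composition argument of Lemmas A3 and A5."""
    return agg_sum(np.maximum(X @ W, 0.0))

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
W = rng.normal(size=(3, 2))
Pm = np.eye(5)[rng.permutation(5)]   # random permutation matrix
invariant = np.allclose(phi(Pm @ X, W), phi(X, W))
```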

References

  1. Gori, M.; Monfardini, G.; Scarselli, F. A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks; IEEE: New York, NY, USA, 2005; Volume 2, pp. 729–734. [Google Scholar]
  2. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 2008, 20, 61–80. [Google Scholar] [CrossRef] [PubMed]
  3. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv 2013, arXiv:1312.6203. [Google Scholar]
  4. Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems; NeurIPS: Sydney, Australia, 2017; Volume 30. [Google Scholar]
  5. Zhang, M.; Chen, Y. Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems; NeurIPS: Sydney, Australia, 2018; Volume 31. [Google Scholar]
  6. Ying, R.; He, R.; Chen, K.; Eksombatchai, P.; Hamilton, W.L.; Leskovec, J. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 974–983. [Google Scholar]
  7. Duvenaud, D.K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; Adams, R.P. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems; NeurIPS: Sydney, Australia, 2015; Volume 28. [Google Scholar]
  8. Atwood, J.; Towsley, D. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems; NeurIPS: Sydney, Australia, 2016; Volume 29. [Google Scholar]
  9. Monti, F.; Boscaini, D.; Masci, J.; Rodola, E.; Svoboda, J.; Bronstein, M.M. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5115–5124. [Google Scholar]
  10. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
  11. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems; NeurIPS: Sydney, Australia, 2016; Volume 29. [Google Scholar]
  12. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  13. Xu, B.; Shen, H.; Cao, Q.; Qiu, Y.; Cheng, X. Graph wavelet neural network. arXiv 2019, arXiv:1904.07785. [Google Scholar] [CrossRef]
  14. Gilmer, J.; Schoenholz, S.S.; Riley, P.F.; Vinyals, O.; Dahl, G.E. Neural message passing for quantum chemistry. In Proceedings of the International Conference On Machine Learning; PMLR: Cambridge, MA, USA, 2017; pp. 1263–1272. [Google Scholar]
  15. Li, X.; Zhang, Y.; Xu, Y.; Xu, X. NWGformer: A linear graph transformer with non-linear re-weighting of attention scores. Knowl.-Based Syst. 2025, 332, 114815. [Google Scholar] [CrossRef]
  16. Zhang, Y.; Li, X.; Xu, Y.; Xu, X.; Wang, Z. A graph transformer with optimized attention scores for node classification. Sci. Rep. 2025, 15, 30015. [Google Scholar] [CrossRef]
  17. Li, R.; Wang, S.; Zhu, F.; Huang, J. Adaptive graph convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  18. Jiang, B.; Zhang, Z.; Lin, D.; Tang, J.; Luo, B. Semi-supervised learning with graph learning-convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11313–11320. [Google Scholar]
  19. Dornaika, F.; Bi, J.; Zhang, C. A unified deep semi-supervised graph learning scheme based on nodes re-weighting and manifold regularization. Neural Netw. 2023, 158, 188–196. [Google Scholar] [CrossRef] [PubMed]
  20. Gao, X.; Hu, W.; Guo, Z. Exploring structure-adaptive graph learning for robust semi-supervised classification. In Proceedings of the 2020 IEEE International Conference on Multimedia and Expo (ICME); IEEE: New York, NY, USA, 2020; pp. 1–6. [Google Scholar]
  21. Zhang, K.; Zhu, Y.; Wang, J.; Zhang, J. Adaptive structural fingerprints for graph attention networks. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  22. Luan, S.; Hua, C.; Lu, Q.; Zhu, J.; Zhao, M.; Zhang, S.; Chang, X.W.; Precup, D. Revisiting heterophily for graph neural networks. Adv. Neural Inf. Process. Syst. 2022, 35, 1362–1375. [Google Scholar]
  23. Platonov, O.; Kuznedelev, D.; Diskin, M.; Babenko, A.; Prokhorenkova, L. A critical look at the evaluation of GNNs under heterophily: Are we really making progress? arXiv 2023, arXiv:2302.11640. [Google Scholar]
  24. Xu, K.; Li, C.; Tian, Y.; Sonobe, T.; Kawarabayashi, K.i.; Jegelka, S. Representation learning on graphs with jumping knowledge networks. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2018; pp. 5453–5462. [Google Scholar]
  25. Li, G.; Muller, M.; Thabet, A.; Ghanem, B. DeepGCNs: Can GCNs go as deep as CNNs? In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9267–9276. [Google Scholar]
  26. Abu-El-Haija, S.; Kapoor, A.; Perozzi, B.; Lee, J. N-GCN: Multi-scale graph convolution for semi-supervised node classification. In Proceedings of the Conference on Uncertainty in Artificial Intelligence; PMLR: Cambridge, MA, USA, 2020; pp. 841–851. [Google Scholar]
  27. Liu, M.; Gao, H.; Ji, S. Towards deeper graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, 6–10 July 2020; pp. 338–348. [Google Scholar]
  28. Pei, H.; Wei, B.; Chang, K.C.C.; Lei, Y.; Yang, B. Geom-GCN: Geometric graph convolutional networks. arXiv 2020, arXiv:2002.05287. [Google Scholar] [CrossRef]
  29. Liu, M.; Wang, Z.; Ji, S. Non-local graph neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 10270–10276. [Google Scholar] [CrossRef]
  30. Zhu, J.; Yan, Y.; Zhao, L.; Heimann, M.; Akoglu, L.; Koutra, D. Beyond homophily in graph neural networks: Current limitations and effective designs. Adv. Neural Inf. Process. Syst. 2020, 33, 7793–7804. [Google Scholar]
  31. Suresh, S.; Budde, V.; Neville, J.; Li, P.; Ma, J. Breaking the limit of graph neural networks by improving the assortativity of graphs with local mixing patterns. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual Event, 14–18 August 2021; pp. 1541–1551. [Google Scholar]
  32. Muscoloni, A.; Thomas, J.M.; Ciucci, S.; Bianconi, G.; Cannistraci, C.V. Machine learning meets complex networks via coalescent embedding in the hyperbolic space. Nat. Commun. 2017, 8, 1615. [Google Scholar] [CrossRef]
  33. Nickel, M.; Kiela, D. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems; NeurIPS: Sydney, Australia, 2017; Volume 30. [Google Scholar]
  34. Ganea, O.; Bécigneul, G.; Hofmann, T. Hyperbolic neural networks. In Advances in Neural Information Processing Systems; NeurIPS: Sydney, Australia, 2018; Volume 31. [Google Scholar]
  35. Henaff, M.; Bruna, J.; LeCun, Y. Deep convolutional networks on graph-structured data. arXiv 2015, arXiv:1506.05163. [Google Scholar] [CrossRef]
  36. Shuman, D.I.; Narang, S.K.; Frossard, P.; Ortega, A.; Vandergheynst, P. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Process. Mag. 2013, 30, 83–98. [Google Scholar] [CrossRef]
  37. Hoff, P.D.; Raftery, A.E.; Handcock, M.S. Latent space approaches to social network analysis. J. Am. Stat. Assoc. 2002, 97, 1090–1098. [Google Scholar] [CrossRef]
  38. Namata, G.; London, B.; Getoor, L.; Huang, B.; Edu, U. Query-driven active surveying for collective classification. In Proceedings of the 10th International Workshop on Mining and Learning with Graphs, Scotland, UK, 1 July 2012; Volume 8, p. 1. [Google Scholar]
  39. Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Galligher, B.; Eliassi-Rad, T. Collective classification in network data. AI Mag. 2008, 29, 93. [Google Scholar] [CrossRef]
  40. Rozemberczki, B.; Allen, C.; Sarkar, R. Multi-scale attributed node embedding. J. Complex Netw. 2021, 9, cnab014. [Google Scholar] [CrossRef]
  41. Tang, J.; Sun, J.; Wang, C.; Yang, Z. Social influence analysis in large-scale networks. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 28 June–1 July 2009; pp. 807–816. [Google Scholar]
  42. Wu, F.; Souza, A.; Zhang, T.; Fifty, C.; Yu, T.; Weinberger, K. Simplifying graph convolutional networks. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2019; pp. 6861–6871. [Google Scholar]
  43. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic Differentiation in Pytorch; NIPS Workshop Autodiff Submission: Long Beach, CA, USA, 2017. [Google Scholar]
  44. Fey, M.; Lenssen, J.E. Fast graph representation learning with PyTorch Geometric. arXiv 2019, arXiv:1903.02428. [Google Scholar] [CrossRef]
  45. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, Sardinia, Italy, 13–15 May 2010; pp. 249–256. [Google Scholar]
  46. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  47. Yang, Z.; Cohen, W.; Salakhudinov, R. Revisiting semi-supervised learning with graph embeddings. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2016; pp. 40–48. [Google Scholar]
Figure 1. One layer of G2LNet. Node features are mapped to a geometric space, geometric relations induce communication probabilities (topology inference), messages are aggregated on the inferred neighborhoods, and constraint functions regularize topology learning.
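The layer pipeline described in the Figure 1 caption can be sketched in a few lines. The following NumPy toy is purely illustrative and not the authors' implementation: the function name `g2lnet_layer`, the tanh feature mapping, the temperature `tau`, and the k-nearest-neighbor sparsification are our assumptions, standing in for the paper's geometric mapping module, distance-based relation operator, and learned sparsity constraints.

```python
import numpy as np

def g2lnet_layer(X, W, tau=1.0, k=2):
    """Illustrative sketch of one G2LNet-style layer (hypothetical API):
    (1) map node features to a latent Euclidean space,
    (2) turn pairwise latent distances into communication probabilities
        (topology inference), and
    (3) aggregate messages over the inferred, possibly non-local, neighborhood."""
    Z = np.tanh(X @ W)                       # geometric mapping (Euclidean latent space)
    sq = np.sum(Z ** 2, axis=1)              # pairwise squared distances in latent space
    D2 = sq[:, None] + sq[None, :] - 2 * Z @ Z.T
    np.fill_diagonal(D2, np.inf)             # no self-communication
    idx = np.argsort(D2, axis=1)[:, :k]      # keep k nearest latent neighbors (sparsity)
    mask = np.zeros_like(D2, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    # distance-based relation operator -> row-stochastic message probabilities
    logits = np.where(mask, -D2 / tau, -np.inf)
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    return P @ Z                             # message aggregation on inferred topology
```

Because the neighborhood comes from latent distances rather than the input edges, the same routine covers the non-local and graph-free regimes compared in the experiments.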
Figure 2. Evolution of validation-set accuracy, one-hop neighborhood average band-weight homogeneity H(G), and two-hop neighborhood band-weight sparsity S(G) of RG2LNet-RI on several datasets over the training iterations. Solid curves denote the mean over repeated runs; shaded areas indicate ±1 standard deviation.
Figure 3. Poincaré-disc visualization of a subset of nodes for RG2LNet-RI and BG2LNet-RI on the Chameleon and Cora datasets.
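For reference, distances on the Poincaré disc shown in Figure 3 follow the standard hyperbolic metric of the unit ball. A minimal NumPy helper (illustrative, not taken from the paper's code; the function name and `eps` guard are ours):

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points strictly inside the unit Poincare ball:
    d(u, v) = arccosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    diff2 = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    # eps guards against division by zero for points numerically on the boundary
    return float(np.arccosh(1.0 + 2.0 * diff2 / max(denom, eps)))
```

Note that the distance blows up as points approach the boundary, which is why hierarchical structure tends to spread leaf-like nodes toward the rim of the disc.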
Figure 4. Comparison of peak GPU memory usage and average runtime per epoch across different models on the PubMed and Texas datasets under identical training settings and latent dimensions.
Figure 5. Hyper-parameter sensitivity of BG2LNet-RI on Cora (homophilous, Experimental Setting A) and Wisconsin (heterophilous, Experimental Setting B).
Table 1. The nine datasets used in the experiment.
| Dataset | Cora | Citeseer | PubMed | Chameleon | Squirrel | Actor | Cornell | Texas | Wisconsin |
|---|---|---|---|---|---|---|---|---|---|
| Nodes | 2708 | 3327 | 19,717 | 2277 | 5201 | 7600 | 183 | 183 | 251 |
| Edges | 5429 | 4732 | 44,338 | 36,101 | 217,073 | 33,544 | 295 | 309 | 499 |
| Features | 1433 | 3703 | 500 | 2325 | 2089 | 931 | 1703 | 1703 | 1703 |
| Classes | 7 | 6 | 3 | 5 | 5 | 5 | 5 | 5 | 5 |
| H(G) | 0.83 | 0.71 | 0.79 | 0.25 | 0.22 | 0.24 | 0.11 | 0.06 | 0.16 |
| MA1 | 0.90 | 0.77 | 0.84 | – | – | – | – | – | – |
| MA2 | 0.96 | 0.85 | 0.96 | 0.95 | 0.95 | 0.73 | 0.59 | 0.75 | 0.82 |
| MS(G) | 0.17 | 0.17 | 0.35 | 0.20 | 0.20 | 0.21 | 0.28 | 0.37 | 0.32 |
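The H(G) row above is a homophily statistic; a widely used definition is edge homophily, the fraction of edges whose endpoints share a class label (high for Cora, low for the WebKB graphs). A small sketch, assuming each undirected edge is listed once as an index pair; the function name `edge_homophily` is ours:

```python
import numpy as np

def edge_homophily(edges, labels):
    """Fraction of edges joining same-class endpoints (the edge-homophily ratio).
    edges:  iterable of (u, v) node-index pairs
    labels: per-node class labels, indexable by node id."""
    edges = np.asarray(edges)
    labels = np.asarray(labels)
    # compare the labels at both endpoints of every edge
    return float(np.mean(labels[edges[:, 0]] == labels[edges[:, 1]]))
```

On a 4-cycle with two classes split 2/2, half of the edges cross class boundaries, so the ratio is 0.5.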
Table 2. Hyperparameter configuration used for G2LNet and its variants on each dataset.
| Dataset | Epochs | Learning Rate | Weight Decay | Dropout | c | α | λ1 | λ2 | λ3 |
|---|---|---|---|---|---|---|---|---|---|
| Cora | 150 | 0.01 | 5 × 10⁻⁴ | 0.2 | 1 × 10⁻³ | 0.9 | 5 × 10⁻⁵ | 1 × 10⁻⁶ | 1 × 10⁻⁶ |
| Citeseer | 150 | 0.02 | 5 × 10⁻³ | 0.2 | 1 × 10⁻³ | 0.8 | 5 × 10⁻⁵ | 1 × 10⁻⁶ | 1 × 10⁻⁶ |
| Pubmed | 100 | 0.01 | 1 × 10⁻³ | 0.0 | 5 × 10⁻² | 0.8 | 5 × 10⁻⁵ | 1 × 10⁻⁶ | 1 × 10⁻⁶ |
| Chameleon | 200 | 0.03 | 1 × 10⁻⁶ | 0.0 | 0.1 | 0.9 | 1 × 10⁻⁶ | 1 × 10⁻⁶ | 1 × 10⁻⁶ |
| Squirrel | 200 | 0.03 | 0.0 | 0.0 | 0.1 | 0.9 | 1 × 10⁻⁶ | 1 × 10⁻⁶ | 1 × 10⁻⁷ |
| Actor | 300 | 0.03 | 0.0 | 0.2 | 3.5 | 1 × 10⁻³ | 5 × 10⁻⁵ | 1 × 10⁻⁵ | 1 × 10⁻⁷ |
| Cornell | 300 | 0.02 | 1 × 10⁻⁵ | 0.2 | 0.1 | 1 × 10⁻³ | 1 × 10⁻⁶ | 1 × 10⁻⁵ | 1 × 10⁻⁷ |
| Texas | 300 | 0.05 | 1 × 10⁻⁵ | 0.0 | 3.5 | 1 × 10⁻³ | 1 × 10⁻⁶ | 1 × 10⁻⁵ | 1 × 10⁻⁷ |
| Wisconsin | 300 | 0.05 | 1 × 10⁻⁶ | 0.0 | 3.5 | 1 × 10⁻³ | 1 × 10⁻⁶ | 1 × 10⁻⁵ | 1 × 10⁻⁷ |
Table 3. Experimental Setting A: comparison of node classification accuracy (%) across different methods on various datasets (mean accuracy ± standard deviation over 100 independent runs). Methods marked with "*" are our re-implementations under the same protocol. In the table, red, blue, and bold black indicate the best, second-best, and third-best results, respectively.
| Dataset | ChebNet | GCN | GWNN | MoNet | GAT | ADSF-RWR | GLNN | GLCN | MLP * | HMLP * | GCN * | GAT * |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cora | 81.2 | 81.5 | 82.8 | 81.7 ± 0.5 | 83.0 ± 0.7 | 85.4 ± 0.3 | 83.4 | 85.5 | 57.28 ± 0.32 | 57.56 ± 0.27 | 81.55 ± 0.31 | 82.57 ± 0.42 |
| CiteSeer | 69.8 | 71.9 | 71.7 | – | 72.5 ± 0.7 | 74.3 ± 0.4 | 72.4 | 72.0 | 55.30 ± 0.54 | 56.18 ± 0.42 | 68.48 ± 0.41 | 70.23 ± 0.55 |
| PubMed | 74.4 | 71.9 | 79.1 | 78.8 | 79.0 ± 0.3 | 81.2 ± 0.3 | 78.3 | 78.3 | 72.71 ± 0.32 | 72.72 ± 0.26 | 79.23 ± 0.28 | 77.50 ± 0.53 |

| Aggregation | Variant | Cora | CiteSeer | PubMed |
|---|---|---|---|---|
| Non-local | RG2LNet-Rd | 82.31 ± 0.31 | 71.23 ± 0.45 | 81.31 ± 0.33 |
| Non-local | RG2LNet-RI | 86.55 ± 0.38 | 74.69 ± 0.48 | 82.57 ± 0.37 |
| Non-local | BG2LNet-Rd | 84.46 ± 0.41 | 75.20 ± 0.36 | OOM |
| Non-local | BG2LNet-RI | 86.58 ± 0.33 | 74.75 ± 0.41 | 83.32 ± 0.31 |
| Local | RG2LNet-Rd | 82.18 ± 0.39 | 69.51 ± 0.45 | 78.40 ± 0.39 |
| Local | RG2LNet-RI | 82.34 ± 0.42 | 69.72 ± 0.41 | 78.36 ± 0.35 |
| Local | BG2LNet-Rd | 82.32 ± 0.40 | 68.87 ± 0.39 | 78.13 ± 0.41 |
| Local | BG2LNet-RI | 82.26 ± 0.44 | 69.03 ± 0.41 | 78.25 ± 0.34 |
| Not using given graph | RG2LNet-Rd | 69.82 ± 0.48 | 69.16 ± 0.53 | 71.33 ± 0.46 |
| Not using given graph | RG2LNet-RI | 70.10 ± 0.45 | 69.41 ± 0.46 | 72.92 ± 0.43 |
| Not using given graph | BG2LNet-Rd | 70.72 ± 0.41 | 70.17 ± 0.48 | OOM |
| Not using given graph | BG2LNet-RI | 70.54 ± 0.42 | 70.25 ± 0.42 | 73.47 ± 0.42 |
Table 4. Experimental Setting B: comparison of node classification accuracy (%) across different methods on various datasets (mean accuracy ± standard deviation over 10 independent runs). Methods marked with "*" are our re-implementations under the same protocol. In the table, red, blue, and bold black indicate the best, second-best, and third-best results, respectively.
| Method | Cora | CiteSeer | PubMed | Chameleon | Squirrel | Actor | Cornell | Texas | Wisconsin |
|---|---|---|---|---|---|---|---|---|---|
| MLP * | 72.62 ± 1.31 | 71.03 ± 1.89 | 86.62 ± 0.57 | 44.12 ± 1.59 | 32.11 ± 1.86 | 34.14 ± 0.88 | 71.66 ± 3.22 | 78.97 ± 4.99 | 84.15 ± 6.53 |
| HMLP * | 75.15 ± 1.19 | 70.55 ± 1.66 | 87.68 ± 0.61 | 43.81 ± 1.43 | 30.51 ± 1.82 | 33.46 ± 1.22 | 75.12 ± 2.69 | 80.58 ± 4.86 | 84.90 ± 6.22 |
| GCN * | 87.38 ± 1.27 | 75.78 ± 1.66 | 88.16 ± 0.69 | 70.23 ± 2.59 | 53.39 ± 1.35 | 23.76 ± 0.81 | 47.03 ± 4.67 | 62.46 ± 5.25 | 50.02 ± 6.91 |
| GAT * | 87.70 ± 1.06 | 75.24 ± 1.33 | 86.05 ± 0.52 | 61.16 ± 1.94 | 41.65 ± 2.32 | 27.38 ± 1.74 | 45.92 ± 3.35 | 58.98 ± 4.46 | 52.15 ± 8.74 |
| GEOM-GCN-I | 85.1 | 77.9 | 90.05 | 60.3 | 33.3 | 29.0 | 56.7 | 57.5 | 58.2 |
| GEOM-GCN-P | 84.9 | 75.1 | 88.0 | 60.9 | 38.1 | 31.6 | 60.8 | 67.5 | 64.1 |
| GEOM-GCN-S | 85.2 | 74.7 | 84.7 | 59.9 | 36.2 | 30.3 | 55.6 | 59.7 | 56.6 |
| H2GCN-1 | 86.92 ± 1.37 | **77.07 ± 1.64** | **89.40 ± 0.34** | 57.11 ± 1.58 | 36.42 ± 1.89 | **35.86 ± 1.03** | 82.16 ± 4.80 | 84.86 ± 6.77 | 86.67 ± 4.69 |
| H2GCN-2 | 87.81 ± 1.35 | 76.88 ± 1.77 | 89.59 ± 0.33 | 59.39 ± 1.98 | 37.90 ± 2.02 | 35.62 ± 1.30 | 82.16 ± 6.00 | 82.16 ± 5.28 | 85.88 ± 4.22 |
| WRGAT | 88.20 ± 2.26 | 76.81 ± 1.89 | 88.52 ± 0.92 | 65.24 ± 0.87 | 48.85 ± 0.78 | 36.53 ± 0.77 | **81.62 ± 3.90** | 83.62 ± 5.50 | 86.98 ± 3.78 |
| NLMLP | 76.9 ± 1.8 | 73.4 ± 1.9 | 88.2 ± 0.5 | 50.7 ± 2.2 | 33.7 ± 1.5 | 37.9 ± 1.3 | 84.9 ± 5.7 | 85.4 ± 3.8 | **87.3 ± 4.3** |
| NLGCN | 88.1 ± 1.0 | 75.2 ± 1.4 | 89.0 ± 0.5 | 70.1 ± 2.9 | 59.0 ± 1.2 | 31.6 ± 1.0 | 57.6 ± 5.5 | 65.5 ± 6.6 | 60.2 ± 5.3 |
| NLGAT | 88.5 ± 1.8 | 76.2 ± 1.6 | 88.2 ± 0.3 | 65.7 ± 1.4 | 56.8 ± 2.5 | 29.5 ± 1.3 | 54.7 ± 7.6 | 62.6 ± 7.1 | 56.9 ± 7.3 |
| **Non-local aggregation** | | | | | | | | | |
| RG2LNet-Rd | 87.42 ± 1.15 | 76.27 ± 1.42 | 88.40 ± 0.38 | 67.71 ± 2.11 | 57.95 ± 2.77 | 33.58 ± 1.13 | 73.70 ± 6.88 | 80.22 ± 4.51 | 83.11 ± 5.10 |
| RG2LNet-RI | 88.55 ± 1.03 | **77.25 ± 1.87** | 88.45 ± 0.41 | 73.62 ± 1.79 | **70.36 ± 2.17** | 34.90 ± 1.45 | 74.32 ± 7.11 | 83.27 ± 4.63 | 86.47 ± 4.32 |
| BG2LNet-Rd | 87.26 ± 1.77 | 76.91 ± 1.94 | OOM | 73.22 ± 1.56 | 54.67 ± 2.44 | 35.35 ± 1.28 | 74.05 ± 6.92 | 85.40 ± 4.12 | 86.04 ± 4.87 |
| BG2LNet-RI | **88.41 ± 1.22** | 77.49 ± 1.35 | 89.67 ± 0.33 | 77.46 ± 1.62 | 66.43 ± 2.53 | 35.29 ± 1.42 | 74.51 ± 7.08 | 86.28 ± 7.14 | 87.16 ± 4.95 |
| **Local aggregation** | | | | | | | | | |
| RG2LNet-Rd | 83.32 ± 1.67 | 73.27 ± 1.66 | 84.11 ± 0.98 | 77.95 ± 1.31 | 69.26 ± 2.67 | 20.25 ± 1.98 | 47.30 ± 6.36 | 59.33 ± 4.98 | 59.65 ± 5.87 |
| RG2LNet-RI | 85.40 ± 1.53 | 75.16 ± 1.74 | 84.26 ± 0.82 | 78.92 ± 1.44 | 71.40 ± 2.11 | 21.66 ± 1.58 | 50.58 ± 7.14 | 61.00 ± 5.25 | 62.71 ± 5.33 |
| BG2LNet-Rd | 85.57 ± 1.28 | 75.68 ± 1.74 | 84.47 ± 0.86 | 79.12 ± 1.32 | 71.49 ± 2.65 | 23.51 ± 1.42 | 52.17 ± 7.27 | 63.40 ± 5.71 | 63.32 ± 5.15 |
| BG2LNet-RI | 85.52 ± 1.42 | 75.72 ± 1.82 | 83.92 ± 0.72 | 79.08 ± 1.50 | 69.39 ± 2.18 | 23.63 ± 1.69 | 53.72 ± 7.08 | 62.16 ± 5.27 | 61.39 ± 5.27 |
| **Not using given graph** | | | | | | | | | |
| RG2LNet-Rd | 76.24 ± 2.08 | 72.42 ± 1.52 | 84.16 ± 0.84 | 51.85 ± 2.46 | 33.26 ± 7.21 | 35.13 ± 1.57 | 74.18 ± 7.87 | 84.25 ± 4.32 | 86.43 ± 4.53 |
| RG2LNet-RI | 78.35 ± 2.10 | 75.40 ± 1.56 | 87.27 ± 0.95 | 52.31 ± 2.12 | 33.78 ± 7.23 | 35.28 ± 1.29 | 76.43 ± 7.25 | **86.43 ± 4.56** | 88.46 ± 4.16 |
| BG2LNet-Rd | 77.71 ± 2.18 | 75.46 ± 1.87 | 87.46 ± 0.80 | 51.94 ± 2.42 | 34.16 ± 7.54 | 35.77 ± 1.61 | 76.43 ± 7.21 | 86.46 ± 4.17 | 87.11 ± 4.66 |
| BG2LNet-RI | 78.36 ± 2.56 | 75.51 ± 1.39 | 87.65 ± 1.14 | 52.82 ± 2.08 | 33.46 ± 7.53 | 35.59 ± 1.65 | 76.64 ± 7.33 | 86.51 ± 4.78 | 88.51 ± 4.52 |
Table 5. Ablation study of BG2LNet-RI on multiple datasets with different module combinations. Accuracy (%).
| CC | UPC | L_G2LNet | Cora | PubMed | Chameleon | Cornell | Wisconsin |
|---|---|---|---|---|---|---|---|
| | | | 84.14 ± 0.37 | 81.11 ± 0.40 | 75.28 ± 1.74 | 72.03 ± 7.28 | 82.17 ± 5.78 |
| | | | 84.35 ± 0.43 | 81.25 ± 0.37 | 76.98 ± 1.55 | 73.05 ± 7.75 | 86.43 ± 4.77 |
| | | | 84.77 ± 0.46 | 81.73 ± 0.36 | 76.13 ± 1.80 | 72.96 ± 7.19 | 84.32 ± 4.83 |
| | | | 84.94 ± 0.41 | 81.98 ± 0.43 | 77.27 ± 1.66 | 73.11 ± 7.65 | 86.42 ± 4.51 |
| | | | 84.86 ± 0.50 | 82.26 ± 0.35 | 76.63 ± 1.65 | 73.17 ± 7.87 | 83.96 ± 5.35 |
| | | | 85.32 ± 0.42 | 82.55 ± 0.45 | 76.89 ± 1.75 | 74.21 ± 6.95 | 86.48 ± 5.13 |
| | | | 85.57 ± 0.45 | 82.71 ± 0.39 | 77.01 ± 1.83 | 73.27 ± 7.34 | 85.74 ± 5.29 |
| | | | **86.58 ± 0.33** | **83.32 ± 0.31** | **77.46 ± 1.62** | **74.51 ± 7.08** | **87.16 ± 4.95** |
Table 6. Ablation study of the decomposed regularizers in L_G2LNet. Accuracy (%).
| L_sm | L_spa | L_sym | Cora | PubMed | Chameleon | Cornell | Wisconsin |
|---|---|---|---|---|---|---|---|
| | | | 84.94 ± 0.41 | 81.98 ± 0.43 | 77.27 ± 1.66 | 73.11 ± 7.65 | 86.42 ± 4.51 |
| | | | 85.36 ± 0.45 | 82.64 ± 0.31 | 77.28 ± 1.59 | 73.28 ± 7.18 | 86.47 ± 4.49 |
| | | | 85.13 ± 0.42 | 82.14 ± 0.30 | 77.35 ± 1.82 | 73.83 ± 7.44 | 86.59 ± 4.35 |
| | | | 85.89 ± 0.45 | 82.77 ± 0.46 | 77.41 ± 1.61 | 73.83 ± 7.72 | 86.89 ± 4.28 |
| | | | 84.98 ± 0.47 | 82.21 ± 0.39 | 77.33 ± 1.39 | 73.77 ± 7.56 | 86.58 ± 4.93 |
| | | | 85.61 ± 0.41 | 82.85 ± 0.37 | 77.33 ± 1.76 | 73.88 ± 6.15 | 86.98 ± 4.88 |
| | | | 85.22 ± 0.36 | 82.33 ± 0.42 | 77.43 ± 1.56 | 74.06 ± 7.52 | 87.13 ± 4.83 |
| | | | **86.58 ± 0.33** | **83.32 ± 0.31** | **77.46 ± 1.62** | **74.51 ± 7.08** | **87.16 ± 4.95** |

Share and Cite

MDPI and ACS Style

Wang, L.; Xu, X.; Li, Z. Geometric Graph Learning Network for Node Classification. Electronics 2026, 15, 696. https://doi.org/10.3390/electronics15030696
