Article

LDC-GAT: A Lyapunov-Stable Graph Attention Network with Dynamic Filtering and Constraint-Aware Optimization

1 School of Science, Zhejiang Sci-Tech University, Hangzhou 310018, China
2 School of Computer Science, Zhejiang Sci-Tech University, Hangzhou 310018, China
* Author to whom correspondence should be addressed.
Axioms 2025, 14(7), 504; https://doi.org/10.3390/axioms14070504
Submission received: 6 June 2025 / Revised: 25 June 2025 / Accepted: 26 June 2025 / Published: 27 June 2025
(This article belongs to the Section Mathematical Analysis)

Abstract

Graph attention networks are pivotal for modeling non-Euclidean data, yet they face dual challenges: training oscillations induced by projection-based high-dimensional constraints and gradient anomalies due to poor adaptation to heterophilic structure. To address these issues, we propose LDC-GAT (Lyapunov-Stable Graph Attention Network with Dynamic Filtering and Constraint-Aware Optimization), which jointly optimizes both forward and backward propagation processes. In the forward path, we introduce Dynamic Residual Graph Filtering, which integrates a tunable self-loop coefficient to balance neighborhood aggregation and self-feature retention. This filtering mechanism, constrained by a lower bound on Dirichlet energy, improves multi-head attention via multi-scale fusion and mitigates overfitting. In the backward path, we design the Fro-FWNAdam, a gradient descent algorithm guided by a learning-rate-aware perceptron. An explicit Frobenius norm bound on weights is derived from Lyapunov theory to form the basis of the perceptron. This stability-aware optimizer is embedded within a Frank–Wolfe framework with Nesterov acceleration, yielding a projection-free constrained optimization strategy that stabilizes training dynamics. Experiments on six benchmark datasets show that LDC-GAT outperforms GAT by 10.54% in classification accuracy, which demonstrates strong robustness on heterophilic graphs.

1. Introduction

Graph-structured data, as a mathematical representation of complex relationships in non-Euclidean spaces, has become a fundamental tool for modeling intricate systems, including social network analysis [1] and molecular interaction prediction [2]. In recent years, graph neural networks (GNNs) have enabled end-to-end learning through spectral-domain frequency response filtering [3] and spatial-domain attention-based aggregation [4]. Notably, the adaptive spectral filtering model introduced by Defferrard et al. [5] leverages differentiable frequency response functions to overcome the heuristic tuning constraints inherent in traditional graph convolutional networks (GCNs). Nevertheless, contemporary approaches to complex graph modeling usually face an inherent conflict between training stability and topological adaptability [6,7,8], which substantially impairs their ability to distinguish structures [9] and constrains their practical efficacy.

1.1. Related Work

Although attention-based graph neural networks have demonstrated superior performance in node classification tasks [4,10], their deployment in practical systems remains hindered by three critical bottlenecks.
Firstly, the use of fixed self-loop weights limits the model’s ability to dynamically adapt the graph filtering process to heterophilic structural patterns. As demonstrated in Bronstein et al.’s geometric deep learning framework [11], this constraint can induce signal distortion in the spectral domain. GNNs fundamentally operate through implicit modulation of graph signals [12], where traditional neighborhood aggregation acts as a low-pass filter. However, this approach inherently assumes homophily [13], while overlooking the possibility that, in heterophilic graphs, high-frequency components may carry essential discriminative cues [14]. Consequently, GATs with static filtering patterns have been shown to suffer from excessive Dirichlet energy decay [15], lacking the capacity to modulate filtering strength in response to local structural variation.
Secondly, current research lacks rigorous theoretical characterization of the stability of iterative feature propagation. In deep GNN architectures, discretization errors may accumulate across layers, significantly degrading model robustness [16]. These observations highlight the need for a principled analysis of energy preservation and system dynamics in recursive graph signal processing.
Finally, the issue of unstable optimization caused by topological sensitivity remains largely unaddressed. Traditional graph attention networks, when updating multi-head weights independently, lack spectral constraints in high-dimensional parameter spaces, which may lead to gradient instability during training [9,17]. The gradient dynamics of neural networks are shaped not only by graph topology but also by the geometry of the parameter space itself [18]. While projected gradient methods incur high computational overhead due to frequent re-projections, alternatives such as SimGrad [11] often ignore the influence of graph structure on the spectral radius of weight matrices, potentially leading to unstable convergence behavior.
In summary, the limits in frequency adaptability, propagation stability, and constrained optimization [19,20,21,22,23] indicate the need for a unified framework that jointly addresses signal-level filtering and parameter-level training dynamics in GATs.

1.2. Our Approach

To address the aforementioned challenges, we present Lyapunov-Stable Graph Attention Network with Dynamic Filtering and Constraint-Aware Optimization (LDC-GAT), which systematically integrates graph signal processing theory with dynamic system stability analysis. The key innovations of our work can be summarized as follows:
  • We introduce Dynamic Residual Graph Filtering (DRG-Filtering), where a tunable self-loop factor dynamically regulates the coupling between neighborhood and self-node features. Moreover, we refer to the energy conservation idea of the Hamiltonian graph dynamics model [24] to construct a DRG-Filtering that maintains the Dirichlet energy lower bound.
  • We derive a Frobenius norm constraint for multi-head attention weights using Lyapunov theory, which is subsequently used to design the learning-rate-aware perceptron. This strategy dynamically gauges the compatibility between the current learning rate and local loss landscape characteristics by computing the relative relationship between the Frobenius norm of the weight matrix and a predefined critical threshold.
  • We pioneer the design of a projection-free optimization algorithm FWNAdam by integrating the Nesterov momentum mechanism into the Frank–Wolfe framework. By replacing high-dimensional projection steps with linear searches, and embedding the proposed learning-rate-aware perceptron into FWNAdam, we introduce Fro-FWNAdam, which ensures feasibility under convex constraints while enabling progressively stable network optimization.
Beyond classification, LDC-GAT is well suited for combinatorial constraint-aware GNNs, which are increasingly important in solving combinatorial problems like subgraph tracing and subgraph selection, where graph constraints are critical. Recent work in this area, such as the DeepTrace framework by P. Yu et al. [25], demonstrates the power of GNNs in optimizing contact tracing in epidemic networks, a task requiring efficient handling of graph constraints. Similarly, Rumor Centrality by D. Shah et al. [26] introduces a method for handling network-based combinatorial problems. Both works emphasize the growing need for optimization techniques tailored to graph-based combinatorial tasks, aligning well with the potential applications of LDC-GAT.

2. Preliminaries

2.1. Graph Attention Networks

Graph attention networks (GATs) [4] utilize a self-attention mechanism to dynamically adjust connection weights between nodes, improving the performance of graph convolutional networks. The graph structure is denoted as $G = (V, E, H)$, where $V = \{v_i\}_{i=1}^{N}$ is the set of nodes, $E \subseteq V \times V$ is the set of edges, and $X \in \mathbb{R}^{N \times F_0}$ is the initial input feature matrix, with $F_0$ being the dimensionality of the original features. The node feature matrix at the $l$-th layer is $H^{(l)} \in \mathbb{R}^{N \times F}$, where $F$ is the corresponding feature dimension, and $H^{(0)} = X$.
The original feature vector $h_i \in \mathbb{R}^{F}$ of each node $v_i$ is first linearly transformed into a higher-dimensional feature $h_i' = W h_i$, $h_i' \in \mathbb{R}^{F'}$, using a weight matrix $W \in \mathbb{R}^{F' \times F}$, where $F'$ is the feature dimension at the $(l+1)$-th layer and $W$ is a learnable parameter matrix.
The attention coefficient for a node pair $(v_i, v_j)$ is defined as $e_{ij} = a(W h_i, W h_j)$, where $a \in \mathbb{R}^{2F'}$ is the attention mechanism parameter vector. The attention coefficients within the neighborhood $\mathcal{N}_i$ of node $v_i$ are normalized into $\alpha_{ij}$ using the LeakyReLU activation and the softmax function as follows:
$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp\big(\mathrm{LeakyReLU}(a^{\top} [W h_i \,\|\, W h_j])\big)}{\sum_{k \in \mathcal{N}_i} \exp\big(\mathrm{LeakyReLU}(a^{\top} [W h_i \,\|\, W h_k])\big)}$$
To enhance the model’s fault tolerance and expressive power, GATs introduce the multi-head attention mechanism. Let there be $K$ independent attention heads working in parallel, with corresponding weight matrices $\{W^{(k)} \in \mathbb{R}^{F' \times F}\}_{k=1}^{K}$ and attention parameter sets $\{a^{(k)} \in \mathbb{R}^{2F'}\}_{k=1}^{K}$. After transformation by the $k$-th head, the output features of the $K$ independent heads are integrated through average pooling, yielding the $(l+1)$-th layer output feature matrix $H^{(l+1)} \in \mathbb{R}^{N \times F'}$ after multi-head average fusion:
$$H^{(l+1)} = \sigma\!\left(\frac{1}{K} \sum_{k=1}^{K} \tilde{A}^{(k)} H^{(l)} W^{(k)}\right)$$
where $\tilde{A}^{(k)} \in \mathbb{R}^{N \times N}$ is the normalized attention matrix for the $k$-th head.
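To make this aggregation concrete, the following is a minimal NumPy sketch of one multi-head GAT layer with average fusion. The toy graph, dimensions, LeakyReLU slope, and the use of ReLU for $\sigma$ are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gat_layer(H, adj, Ws, As, slope=0.2):
    """One GAT layer: per-head attention, neighbourhood aggregation, K-head average fusion."""
    outputs = []
    for W, a in zip(Ws, As):
        Hp = H @ W                                    # (N, F') linear transform W h_i
        Fp = Hp.shape[1]
        # e_ij = LeakyReLU(a^T [W h_i || W h_j]), split a into its two halves
        e = (Hp @ a[:Fp])[:, None] + (Hp @ a[Fp:])[None, :]
        e = np.where(e > 0, e, slope * e)             # LeakyReLU
        e = np.where(adj > 0, e, -1e9)                # restrict to the neighbourhood N_i
        alpha = np.exp(e - e.max(axis=1, keepdims=True))
        alpha = alpha / alpha.sum(axis=1, keepdims=True)   # row-wise softmax
        outputs.append(alpha @ Hp)                    # attention-weighted aggregation
    return np.maximum(np.mean(outputs, axis=0), 0.0)  # average fusion, ReLU as sigma

# toy usage
rng = np.random.default_rng(0)
N, F, Fp, K = 5, 4, 3, 2
adj = (rng.random((N, N)) < 0.4) + np.eye(N)          # random toy graph with self-loops
H = rng.standard_normal((N, F))
Ws = [rng.standard_normal((F, Fp)) for _ in range(K)]
As = [rng.standard_normal(2 * Fp) for _ in range(K)]
print(gat_layer(H, adj, Ws, As).shape)                # (5, 3)
```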
While existing theories provide effective frameworks for GATs, three critical limitations remain: dynamic graph filtering stability [19], parameter optimization robustness [20], and computational efficiency [21]. To better understand GAT behavior in various structural scenarios, it is essential to introduce a spectral tool that quantifies signal variation across the graph.

2.2. Dirichlet Energy and Graph Filtering

In graph neural networks and graph signal processing, the Dirichlet energy quantifies the variation of signals across the graph structure, serving as a key measure of smoothness and expressive capacity. For a graph signal H, it is defined as follows.
Definition 1. 
Given a graph signal matrix $H \in \mathbb{R}^{N \times d}$, the Dirichlet energy is defined as
$$E(H) = \frac{1}{2} \sum_{(i,j) \in E} \| H_i - H_j \|_2^2,$$
where $H_i \in \mathbb{R}^{d}$ is the feature vector at node $i$, and $E$ is the set of graph edges.
Maintaining a lower bound on Dirichlet energy prevents the complete suppression of high-frequency components, enabling the graph filter to preserve discriminative information while enforcing smoothness. This balance enhances model expressiveness and improves robustness to structural perturbations, contributing to better generalization on diverse graph topologies.
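As a quick illustration of Definition 1, the sketch below computes the Dirichlet energy of two toy signals on a path graph; the edge list and signals are made-up examples, and each undirected edge is assumed to appear once.

```python
import numpy as np

def dirichlet_energy(H, edges):
    """E(H) = 1/2 * sum over (i, j) in E of ||H_i - H_j||_2^2  (Definition 1)."""
    return 0.5 * sum(np.sum((H[i] - H[j]) ** 2) for i, j in edges)

# a smooth signal has lower energy than a high-frequency one
edges = [(0, 1), (1, 2), (2, 3)]                 # path graph on 4 nodes
smooth = np.array([[1.0], [1.1], [1.2], [1.3]])
rough = np.array([[1.0], [-1.0], [1.0], [-1.0]])
print(dirichlet_energy(smooth, edges))           # about 0.015
print(dirichlet_energy(rough, edges))            # 6.0
```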
While Dirichlet energy helps evaluate the behavior of graph filters during forward propagation, model training stability also depends heavily on how parameters are optimized over time [27,28]. We next introduce the underlying principles of optimization dynamics.

2.3. Lyapunov Stability in Discrete Dynamical Systems

To analyze and constrain training dynamics, we adopt Lyapunov stability theory, a classical tool in control systems used to evaluate the stability of discrete-time processes. Consider a general iterative system:
$$H^{(n+1)} = H^{(n)} + \Delta t \cdot f(H^{(n)}),$$
where $f(\cdot)$ is a residual update function and $\Delta t$ is the step size. The system is said to be asymptotically stable if
$$\| H^{(n+1)} - H^{(n)} \|_F \le \gamma_0 \| H^{(n)} \|_F, \quad 0 < \gamma_0 < 1.$$
In the context of GNNs, this criterion allows us to derive Frobenius norm bounds on weight matrices, which can be used to regulate training dynamics and ensure convergence stability. To further quantify and evaluate the long-term learning behavior of such optimizers, we turn to regret analysis [17].
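The criterion can be checked numerically for a single step of the iterative system; the residual map f, the weight scale, and the value of $\gamma_0$ below are illustrative assumptions, not settings used in the paper.

```python
import numpy as np

def contraction_ratio(H, f, dt):
    """||H_{n+1} - H_n||_F / ||H_n||_F for one step of H_{n+1} = H_n + dt * f(H_n)."""
    return np.linalg.norm(dt * f(H), "fro") / np.linalg.norm(H, "fro")

rng = np.random.default_rng(1)
W = 0.1 * rng.standard_normal((8, 8))              # small weights keep the update contractive
f = lambda H: np.tanh(H @ W) - H                   # residual map f(H) = sigma(HW) - H
H = rng.standard_normal((16, 8))
gamma0 = 0.9
print(contraction_ratio(H, f, dt=0.5) <= gamma0)   # stability criterion for this step
```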

2.4. Regret Bounds

In constrained optimization problems involving learnable parameters $\theta \in \mathcal{F}$, especially when $\mathcal{F}$ is a convex and compact subset of $\mathbb{R}^d$, the learning process is often modeled as an online learning procedure across time steps $t = 1, 2, \ldots, T$. At each step, the learner selects $\theta_t \in \mathcal{F}$, then observes a convex loss function $L_t(\cdot)$ and incurs loss $L_t(\theta_t)$. A key performance measure in this setting is the cumulative regret, defined as
$$R_T := \sum_{t=1}^{T} \big( L_t(\theta_t) - L_t(\theta^*) \big),$$
where $\theta^* = \arg\min_{\theta \in \mathcal{F}} \sum_{t=1}^{T} L_t(\theta)$ is the best fixed decision in hindsight. A learning algorithm is said to have no regret if
$$\lim_{T \to \infty} \frac{R_T}{T} = 0,$$
i.e., the average regret vanishes asymptotically. One classic class of methods in this domain is Frank–Wolfe-type projection-free algorithms, which avoid expensive Euclidean projections by using linear optimization steps. However, existing variants such as FWAdam typically suffer from regret bounds as high as $O(T^{4/3})$ [29]. In contrast, the proposed Fro-FWNAdam combines adaptive learning rate control with gradient normalization under Frobenius norm constraints, and achieves an improved regret bound of $O(\sqrt{T})$. This theoretical foundation is critical for the analysis in subsequent sections.
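The no-regret property is easy to visualize on a toy online problem. The sketch below runs projected online gradient descent on one-dimensional quadratic losses over the assumed feasible set F = [-1, 1] (an illustrative example, unrelated to Fro-FWNAdam) and prints the average regret, which shrinks as T grows.

```python
import numpy as np

# Average regret of online gradient descent on losses L_t(theta) = (theta - c_t)^2 over [-1, 1].
rng = np.random.default_rng(2)
T = 2000
cs = rng.uniform(-1, 1, size=T)
theta, losses = 0.0, []
for t, c in enumerate(cs, start=1):
    losses.append((theta - c) ** 2)
    grad = 2 * (theta - c)
    theta = np.clip(theta - grad / np.sqrt(t), -1.0, 1.0)   # projected OGD step

theta_star = cs.mean()                         # best fixed decision in hindsight
best = ((theta_star - cs) ** 2).sum()
print((sum(losses) - best) / T)                # average regret, decays toward 0 as T grows
```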

3. Dynamic Residual Graph Filtering

GNNs fundamentally operate through implicit response modulation of graph signals [12]. Traditional neighborhood aggregation leverages low-pass filtering to suppress high-frequency components, thereby promoting the smoothness of node representations. However, this approach heavily relies on the assumption of graph homophily [13], while high-frequency features in heterophilic networks often carry critical discriminative information through inter-node differences [14]. Consequently, existing GATs, due to their static self-loop weights and fixed filtering patterns, are prone to uncontrollable Dirichlet energy decay and lack frequency-domain adaptability [15]. As a result, they are unable to dynamically adjust filtering patterns according to graph heterophily metrics.
To address this, this paper introduces an adjustable self-loop influence factor $\beta$, which is integrated into the attention weight matrix to regulate the coupling strength between node features. The variable attention weight matrix for the $k$-th head, $A_{\mathrm{att}}^{(k)}$, is defined as follows:
$$A_{\mathrm{att}}^{(k)} = \tilde{A}^{(k)} + \beta I, \quad \beta \in [-1, 1]$$
where $I \in \mathbb{R}^{N \times N}$ is the identity matrix. Existing research indicates that the strong coupling between graph filters and weight matrices can induce spectral degradation, limiting the model’s adaptability to diverse graph structures.
This paper projects the original features $H^{(0)}$ into the latent space through the mapping $\phi : \mathbb{R}^{N \times F_0} \to \mathbb{R}^{N \times F'}$, constructing cross-layer feature pathways:
$$H^{(l+1)} = \sigma\!\left(\frac{1}{K} \sum_{k=1}^{K} A_{\mathrm{att}}^{(k)} H^{(l)} W^{(k)} + \alpha\, \phi(H^{(0)})\right)$$
where $\alpha \in (-1, 1)$ is the learnable spectral balance coefficient that dynamically adjusts the cross-layer signal strength through a gating mechanism. This structure mitigates Dirichlet energy decay through the residual term.
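The propagation rule can be sketched directly; below is a minimal NumPy version assuming $\phi$ is realized as a learnable linear map and $\sigma$ is ReLU. The uniform stand-in attention matrices and all shapes are illustrative, not the paper's configuration.

```python
import numpy as np

def drg_filtering(H_l, H0, A_att_list, W_list, phi_W, alpha):
    """H^(l+1) = sigma( (1/K) * sum_k A_att^(k) H^(l) W^(k) + alpha * phi(H^(0)) )."""
    K = len(A_att_list)
    agg = sum(A @ H_l @ W for A, W in zip(A_att_list, W_list)) / K
    return np.maximum(agg + alpha * (H0 @ phi_W), 0.0)   # ReLU as sigma, phi as a linear map

rng = np.random.default_rng(3)
N, F0, F, K, beta, alpha = 6, 5, 4, 2, 0.3, 0.2
A_tilde = [np.full((N, N), 1.0 / N) for _ in range(K)]   # stand-in normalized attention
A_att = [A + beta * np.eye(N) for A in A_tilde]          # A_att^(k) = A~^(k) + beta * I
H0 = rng.standard_normal((N, F0))
H_l = rng.standard_normal((N, F))
W_list = [rng.standard_normal((F, F)) for _ in range(K)]
phi_W = rng.standard_normal((F0, F))                     # assumed linear phi: R^{N x F0} -> R^{N x F}
print(drg_filtering(H_l, H0, A_att, W_list, phi_W, alpha).shape)   # (6, 4)
```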
Theorem 1. 
Given the residual propagation rule
$$H^{(l+1)} = \sigma\!\left(\frac{1}{K} \sum_{k=1}^{K} A_{\mathrm{att}}^{(k)} H^{(l)} W^{(k)} + \alpha\, \phi(H^{(0)})\right),$$
assume the activation function $\sigma(\cdot)$ satisfies the contraction property $\|\sigma(X)\|_2 \le \|X\|_2$, and that the linear mapping $\phi(\cdot)$ satisfies $\|\phi(X)\|_F \le \|X\|_F$. If the attention matrices $A_{\mathrm{att}}^{(k)}$ are properly normalized, then the Dirichlet energy of the propagated features is bounded from below by
$$E(H^{(l+1)}) \ge \alpha^2 E(\phi(H^{(0)})) + (1-\alpha)^2 E(H^{(l)}).$$
The Proof of Theorem 1 is given in Appendix A. Based on Definition 1, maintaining a lower bound on the Dirichlet energy ensures the coherent integration of multi-frequency signals, preventing the loss of critical feature information caused by reliance on a single filtering mode.
The residual term $\phi(H^{(0)})$ preserves high-frequency details, while the attention filtering term captures low-frequency topological patterns. Together, these components enable adaptive frequency band fusion through a differentiable mechanism, enhancing overall network performance. While the residual filtering mechanism enhances forward signal propagation, uncontrolled parameter growth during training can still lead to instability. We now derive a Frobenius-norm-based constraint to ensure the stability and effectiveness of weight updates in high-dimensional spaces.

4. The Frobenius Norm Constraints for Multi-Head Weights

Traditional graph attention networks, when updating multi-head weights independently, lack spectral constraints in high-dimensional parameter spaces, leading to gradient instability during training [9]. Therefore, leveraging Lyapunov stability theory, this paper establishes explicit Frobenius norm bounds for multi-head weight matrices and dynamically adjusts the learning rate based on the exceedance of the norm, feeding this information into Fro-FWNAdam for autonomous parameter space optimization.
Based on Section 2.3, the inter-layer propagation rule of the multi-head GAT is reconstructed into a discretized dynamical system with a time step $\Delta t$:
$$H^{(n+1)} = H^{(n)} + \Delta t \cdot f\!\left(\frac{1}{K} \sum_{k=1}^{K} A_{\mathrm{att}}^{(k)} H^{(n)} W^{(k)} + \alpha\, \phi(H^{(0)})\right)$$
where $f(X) = \sigma(X) - X$ is a linear residual mapping function, and $\sigma(\cdot)$ is the Lipschitz-continuous activation function. According to Lyapunov stability theory, the system is asymptotically stable if the following condition holds:
$$\| H^{(n+1)} - H^{(n)} \|_F \le \gamma_0 \| H^{(n)} \|_F \quad (0 < \gamma_0 < 1).$$
Substituting into the discretized equation gives
$$\Delta t \left\| f\!\left(\frac{1}{K} \sum_{k=1}^{K} A_{\mathrm{att}}^{(k)} H^{(n)} W^{(k)} + \alpha\, \phi(H^{(0)})\right) \right\|_F \le \gamma_0 \| H^{(n)} \|_F.$$
Assuming that the activation function $\sigma(\cdot)$ satisfies the Lipschitz condition $\|\sigma(X)\|_F \le L_\sigma \|X\|_F$, the Lipschitz constant of the residual mapping $f(X)$ is given by
$$L_\Phi = \max\{ |1 - L_\sigma|,\ |L_\sigma - 1| \}.$$
For the ReLU family of functions ($L_\sigma = 1$), $L_\Phi = 0$, and in this case, system stability is dominated by the linear terms. By neglecting higher-order nonlinear terms ($L_\Phi = 0$), the stability condition degenerates to
$$\frac{\Delta t}{K} \left\| \sum_{k=1}^{K} A_{\mathrm{att}}^{(k)} H^{(n)} W^{(k)} \right\|_F \le \gamma_0 \| H^{(n)} \|_F.$$
Using the matrix norm compatibility property $\|AB\|_F \le \|A\|_F \|B\|_F$, we obtain
$$\frac{\Delta t}{K} \sum_{k=1}^{K} \| A_{\mathrm{att}}^{(k)} \|_F \, \| W^{(k)} \|_F \le \gamma_0.$$
Assuming the attention matrix is normalized ($\| A_{\mathrm{att}}^{(k)} \|_F \le 1$), the weights for each head must satisfy
$$\| W^{(k)} \|_F \le \frac{\gamma_0 K}{\Delta t}.$$
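The resulting bound is straightforward to evaluate. The sketch below computes $\tau = \gamma_0 K / \Delta t$ and the current exceedance of a head's weight norm; $\gamma_0$, $K$, and the weight matrix are chosen arbitrarily for illustration, while $\Delta t = 0.01$ matches the experimental setting reported later.

```python
import numpy as np

def frobenius_bound(gamma0, K, dt):
    """Per-head weight bound tau = gamma0 * K / dt derived from the stability condition."""
    return gamma0 * K / dt

def norm_violation(W, tau):
    """How far ||W||_F exceeds the Lyapunov-derived bound (0 if feasible)."""
    return max(np.linalg.norm(W, "fro") - tau, 0.0)

tau = frobenius_bound(gamma0=0.9, K=8, dt=0.01)    # illustrative gamma0 < 1 and head count K
rng = np.random.default_rng(4)
W = rng.standard_normal((64, 64))
print(tau, norm_violation(W, tau))                 # bound and current exceedance
```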
We consider using the above weight constraint to adjust the learning rate, helping the model tune parameters accurately and accelerating convergence. Although a full-time-domain constraint adjustment mechanism can precisely maintain parameter feasibility, its real-time computational overhead may lead to significant performance degradation in deep graph networks. In practice, once the Frobenius norm of the weight matrix enters an $\epsilon$-neighborhood of the bound, the probability of constraint violation decays rapidly, so the following adjustment critical point is set to reduce computational overhead.
Lemma 1. 
Let the parameter space be a convex and compact set $\mathcal{F} = \{ W \in \mathbb{R}^{F' \times F} \mid \| W \|_F \le \tau \}$, and suppose the initial parameter satisfies $\delta_1 := \| W_1^{(k)} - \tau \|_F \le C_0$. Assume that whenever $\| W_t^{(k)} \|_F > \tau$, the learning rate satisfies $\eta_t^{(k)} \ge \frac{\eta \tau}{(\tau + \epsilon)\sqrt{t}}$, and that there exists a constant $G > 0$ such that $\| g_t \|_F \le G$. Then, there exists a finite time
$$t_0 = \frac{2 \eta \tau G}{(\tau + \epsilon)\, \epsilon^2},$$
such that for all $t > t_0$, the Frobenius norm satisfies $\| W_t^{(k)} - \tau \|_F \le \epsilon$.
The Proof of Lemma 1 is given in Appendix B.

5. Optimization Under Norm Constraints: The Fro-FWNAdam Algorithm

The gradient dynamics of neural networks [17,18] are subject to dual constraints arising from both topological sensitivity and the coupling of high-dimensional parameter spaces. To address this issue, this paper introduces Fro-FWNAdam, based on Frank–Wolfe and Nesterov momentum [30]. This approach accelerates convergence on the loss surface by leveraging Nesterov momentum and replaces high-dimensional projections with the Frank–Wolfe linear search [29], making it better suited for constrained optimization scenarios.
Let the feasible domain of the multi-head weight matrices be a convex compact set $\mathcal{F} \subset \mathbb{R}^d$, subject to the Frobenius norm constraint:
$$\mathcal{F} = \{ W^{(k)} \in \mathbb{R}^{F' \times F} \mid \| W^{(k)} \|_F \le \tau \}$$
where $\tau$ is determined by asymptotic stability theory. In the online convex optimization framework, the composite loss function is defined as
$$L_t(\theta_t) = -\lambda_1 \sum_i y_i \log(\hat{y}_i) + \lambda_2 \big( \max(\| W \|_F - \tau,\, 0) \big)^2 + \lambda_3 \| \tilde{m}_t \|$$
At the $t$-th iteration, the algorithm selects $\theta_t = \{ W_t^{(k)} \}_{k=1}^{K} \in \mathcal{F}$ to minimize the cumulative regret $R_T = \sum_{t=1}^{T} \big( L_t(\theta_t) - L_t(\theta^*) \big)$, where $\theta^*$ denotes the global optimum. The gradient $g_t$ at $\theta_t$ is computed as
$$g_t = \nabla_{\theta_t} L_t(\theta_t)$$
The gradient of the cross-entropy term is computed through backpropagation, and the gradient of the regularization term is
$$\nabla_{W^{(k)}} \big( \max(\| W^{(k)} \|_F - \tau,\, 0) \big)^2 = 2 \max(\| W^{(k)} \|_F - \tau,\, 0) \cdot \frac{W^{(k)}}{\| W^{(k)} \|_F}$$
Introduce the decay factor $\beta_{1t} = \beta_1 \lambda^{(t-1)}$ ($\lambda \in (0, 1)$), and update the first-order momentum $m_t$; then calculate the Nesterov-corrected momentum $\hat{m}_t$:
$$m_t = \beta_{1t} m_{t-1} + (1 - \beta_{1t}) g_t, \qquad \hat{m}_t = \beta_{1t} m_t + (1 - \beta_{1t}) g_t$$
Let $\beta_2 \in [0, 1)$ be the second-order momentum coefficient. Update the second-order momentum $v_t$, then perform further bias correction for $\hat{m}_t$ and $v_t$:
$$\tilde{m}_t = \frac{\hat{m}_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} = \frac{\beta_2 v_{t-1} + (1 - \beta_2) g_t^2}{1 - \beta_2^t}$$
This paper constructs an auxiliary function to assist the Frank–Wolfe algorithm in determining the search direction:
$$F_t(\theta) = \eta_t \left\langle \sum_{\tau=1}^{t} \tilde{m}_\tau,\ \theta \right\rangle + \| \theta - \theta_1 \|_2^2$$
where $\eta_t$ is the learning rate at the $t$-th iteration. The following linear subproblem is then solved to determine the descent direction:
$$S_t = \arg\min_{\theta \in \mathcal{F}} \langle \nabla F_t(\theta_t),\ \theta \rangle$$
where $\nabla F_t(\theta_t) = \eta_t \tilde{m}_t + 2(\theta_t - \theta_1)$. Finally, the update rule for the weight matrix parameters is as follows:
$$\theta_{t+1} = \theta_t + \frac{\eta_t^{(k)} (S_t - \theta_t)}{\sqrt{\hat{v}_t} + \epsilon}$$
where the Frobenius-based dynamic learning rate adaptation strategy introduced in Fro-FWNAdam is as follows:
$$\eta_t^{(k)} = \begin{cases} \eta_{t0}^{(k)} \cdot \dfrac{\tau}{\max(\| W^{(k)} \|_F,\, \epsilon)} \cdot \dfrac{1}{\sqrt{t}}, & \text{if } \| W^{(k)} \|_F > \tau \text{ and } t < t_0, \\[6pt] \eta_{t0}^{(k)} \cdot \dfrac{1}{\sqrt{t}}, & \text{otherwise.} \end{cases}$$
Here, the norm constraint boundary is defined as $\tau = \frac{\gamma_0 K}{\Delta t}$, where $\eta_{t0}^{(k)}$ represents the original learning rate for the $k$-th head at the $t$-th iteration of Fro-FWNAdam, and $\eta_t^{(k)}$ denotes the adjusted adaptive learning rate. Moreover, building on Lemma 1, we propose a phased learning rate adjustment strategy: during the initial stage ($t < t_0$), dense boundary monitoring is employed, while boundary constraints are relaxed once the parameters converge within the feasible region.
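The phased schedule can be written directly as a small helper. The function below is a sketch of the two-branch rule above; the toy weight matrices and thresholds are chosen purely for illustration.

```python
import numpy as np

def adaptive_lr(eta0, W, tau, t, t0, eps=1e-8):
    """Phased Frobenius-aware schedule: extra decay while ||W||_F > tau and t < t0,
    otherwise the plain 1/sqrt(t) schedule."""
    norm = np.linalg.norm(W, "fro")
    if norm > tau and t < t0:
        return eta0 * tau / max(norm, eps) / np.sqrt(t)
    return eta0 / np.sqrt(t)

# the rate shrinks when the head violates the bound, and relaxes once feasible
W_bad, W_ok, tau = 3.0 * np.eye(4), 0.5 * np.eye(4), 2.0
print(adaptive_lr(0.01, W_bad, tau, t=10, t0=100))   # penalized rate
print(adaptive_lr(0.01, W_ok, tau, t=10, t0=100))    # standard 0.01 / sqrt(10)
```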
Based on the execution flow of the proposed Fro-FWNAdam and Section 2.4, its core innovations—namely, the integration of dynamic learning rate modulation and projection-free constrained optimization—require rigorous theoretical justification. To this end, we present the following theorem, which establishes the cumulative regret of the proposed algorithm.
Theorem 2. 
Let the parameter space $\mathcal{F}$ satisfy $\mathrm{diam}(\mathcal{F}) \le D$ and $\| \theta - \theta^* \|_\infty \le D_\infty$. Assume that when $\| W_t^{(k)} \|_F > \tau$, the learning rate satisfies $\eta_t^{(k)} \ge \frac{\eta \tau}{(\tau + \epsilon)\sqrt{t}}$, and there exists a finite time $t_0 = \frac{2 \eta \tau G}{(\tau + \epsilon)\, \epsilon^2}$ such that for all $t > t_0$ we have $\| W_t^{(k)} \|_F \le \tau + \epsilon$. Furthermore, suppose there exist constants $G, G_\infty > 0$ such that for any $\theta \in \mathcal{F}$, the gradient satisfies $L_t(\theta^*) - L_t(\theta) \ge \langle \nabla L_t(\theta),\ \theta^* - \theta \rangle$, $\| \nabla L_t(\theta) \|_F \le G$, and $\| \nabla L_t(\theta) \|_\infty \le G_\infty$. Then, the cumulative regret $R_T = \sum_{t=1}^{T} \big( L_t(\theta_t) - L_t(\theta^*) \big)$ is upper-bounded by
$$R_T \le \frac{D^2 \sqrt{T \hat{v}_T}}{2 \eta (1 - \beta_1)} + \frac{\eta (\beta_1 + 1) G_\infty}{(1 - \beta_1) \sqrt{1 - \beta_2}} \sum_{i=1}^{d} \| g_{1:T,i} \|_2 + \frac{D_\infty^2 G_\infty \sqrt{1 - \beta_2}}{2 \eta (1 - \lambda)^2}.$$
The proof of Theorem 2 is given in Appendix C.
Theorem 2 implies that when the data features are sparse and the gradients are bounded, the summation term can be significantly smaller than its theoretical upper bound, $\sum_{i=1}^{d} \| g_{1:T,i} \|_2 \ll d\, G_\infty \sqrt{T}$, which indicates that, similar to Adam, the Fro-FWNAdam algorithm also achieves a regret bound of $O(d\sqrt{T})$.
Corollary 1. 
Assume the loss function $L : \mathbb{R}^d \to \mathbb{R}$ has bounded gradients such that, for all $\theta \in \mathbb{R}^d$ and $t \ge 1$, $\| g_t \|_2 \le G$ and $\| \nabla L(\theta) \|_\infty \le G_\infty$. Then, there exists a constant $D > 0$ such that the distance between any two iterates $\theta_m$ and $\theta_n$ generated by the Fro-FWNAdam algorithm is uniformly bounded: $\| \theta_n - \theta_m \|_2 \le D$ for all $m, n \in \mathbb{N}$ with $m \ne n$. Moreover, for all $T \ge 1$, the average regret of Fro-FWNAdam satisfies the sublinear bound $\frac{R(T)}{T} \le \frac{C}{\sqrt{T}}$ for some constant $C > 0$. As a result, Fro-FWNAdam achieves asymptotic non-regret:
$$\lim_{T \to \infty} \frac{R(T)}{T} = 0.$$
The regret of our algorithm over $T$ iterations is thus bounded by $O(\sqrt{T})$, a significant improvement over the $O(T^{4/3})$ theoretical bound of the FWAdam optimizer.

6. LDC-GAT

To address the limitations of existing graph attention networks in frequency adaptability, propagation stability, and optimization robustness, we propose LDC-GAT (Figure 1).
LDC-GAT first linearly projects initial features and fuses them with multi-head attention aggregates. Simultaneously, it computes Frobenius norms for each attention head, monitoring boundary deviations: heads exceeding the threshold receive weight penalties with Fro-FWNAdam learning rate adaptation, while others retain their original parameters for global loss optimization by Fro-FWNAdam. The core idea of LDC-GAT is to jointly enhance the forward feature propagation and backward parameter update processes under a unified stability-aware framework.
In the forward path, we introduce the Dynamic Residual Graph Filtering, which integrates a learnable self-loop modulation factor and a residual feature pathway to enable frequency-adaptive filtering. This mechanism enforces a lower bound on the Dirichlet energy during propagation, ensuring that high-frequency signals are not excessively attenuated.
In the backward path, we develop Fro-FWNAdam, a projection-free optimizer that incorporates a Frobenius norm constraint derived from Lyapunov stability theory. This constraint ensures that the weight matrices remain within a bounded range, preventing training divergence caused by uncontrolled norm growth.
At each iteration, Fro-FWNAdam first computes the gradient of a composite loss function that includes both a cross-entropy term and a Frobenius norm penalty applied when the weight norm exceeds a predefined threshold $\tau$. The learning rate is then dynamically adjusted: if $\| W^{(k)} \|_F$ exceeds $\tau$, the learning rate decays proportionally to the inverse of both the norm magnitude and the square root of the iteration count; otherwise, a standard inverse square root schedule is applied. This adaptive mechanism promotes stability in the early training phase and efficiency in later stages.
The optimizer further incorporates Nesterov momentum and bias-corrected moment estimates, following the Adam family of methods, but avoids costly Euclidean projections by adopting a Frank–Wolfe update rule. Specifically, it computes a linear minimization direction over the feasible parameter set and interpolates between the current iterate and this direction, scaled by a normalized step size. As a result, Fro-FWNAdam achieves convergence without explicit projection steps, maintaining both computational efficiency and theoretical boundedness. The details of Fro-FWNAdam are provided in Algorithm 1.
Algorithm 1 Fro-FWNAdam
Require: $H^{(l)}$, adjacency matrix $A$, step size $\Delta t$, initial parameters $\theta_1$, norm threshold $\tau$, contraction rate $\gamma = 0.5$
Ensure: Updated parameters $\theta_{t+1}$, feature matrix $H^{(l+1)}$, learning rates $\eta_t^{(k)}$
1: Initialize: $\theta_1 \in \mathcal{F}$, $\beta_1, \beta_2 \in [0, 1)$, $\lambda \in (0, 1)$
2: Set $\beta_{1t} \leftarrow \beta_1 \cdot \lambda^{t-1}$
3: for $t = 1$ to $T$ do
4:   for $k = 1$ to $K$ do
5:     $H_k^{(l)} \leftarrow H^{(l)} W^{(k)}$
6:     $A_{ij}^{(k)} \leftarrow (a^{(k)})^{\top} [\, H_{k,i}^{(l)} \,\|\, H_{k,j}^{(l)} \,]$
7:     $\| W^{(k)} \|_F \leftarrow \min(\| W^{(k)} \|_F,\ \tau)$
8:     $g_t \leftarrow \nabla L_t(\theta_t)$
9:     if $\| W^{(k)} \|_F > \tau$ and $t < t_0$ then
10:      $\eta_t^{(k)} \leftarrow \eta_{t0}^{(k)} \cdot \frac{\tau}{\max(\| W^{(k)} \|_F,\ \epsilon)} \cdot \frac{1}{\sqrt{t}}$
11:    else
12:      $\eta_t^{(k)} \leftarrow \frac{\eta_{t0}^{(k)}}{\sqrt{t}}$
13:    end if
14:  end for
15:  $H^{(l+1)} \leftarrow \sigma\!\left(\frac{1}{K} \sum_{k=1}^{K} A_{\mathrm{att}}^{(k)} H^{(l)} W^{(k)} + \alpha \cdot \phi(H^{(0)})\right)$
16:  $m_t \leftarrow \beta_{1t} m_{t-1} + (1 - \beta_{1t}) g_t$
17:  $\hat{m}_t \leftarrow \beta_{1t} m_t + (1 - \beta_{1t}) g_t$
18:  $v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
19:  $\tilde{m}_t \leftarrow \frac{\hat{m}_t}{1 - \beta_1^t}$,   $\hat{v}_t \leftarrow \frac{v_t}{1 - \beta_2^t}$
20:  $S_t \leftarrow \arg\min_{\theta \in \mathcal{F}} \langle \nabla F_t(\theta_t),\ \theta \rangle$
21:  $\theta_{t+1} \leftarrow \theta_t + \frac{\eta_t^{(k)} \cdot (S_t - \theta_t)}{\sqrt{\hat{v}_t} + \epsilon}$
22: end for
By combining structure-aware gradient control, norm-constrained learning rates, and projection-free updates, Fro-FWNAdam achieves stable and efficient training dynamics tailored to the demands of graph-structured data.
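For concreteness, the following is a self-contained NumPy sketch of the update loop in the spirit of Algorithm 1, for a single parameter block and under simplifying assumptions: the feasible set is the Frobenius ball of radius $\tau$ (so the Frank–Wolfe linear subproblem has a closed-form solution), and the auxiliary-function gradient is approximated by the bias-corrected momentum alone. The toy least-squares problem, hyperparameter values, and function names are illustrative, not the authors' code.

```python
import numpy as np

def fro_fwnadam(grad_fn, theta, tau, eta0, T, t0,
                beta1=0.9, beta2=0.999, lam=0.99, eps=1e-8):
    """Simplified Fro-FWNAdam-style loop for one parameter block.

    grad_fn(theta) returns the composite-loss gradient; the feasible set is the
    Frobenius ball {||theta||_F <= tau}, whose Frank-Wolfe linear subproblem is
    S_t = -tau * d / ||d||_F for a descent direction d.
    """
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, T + 1):
        g = grad_fn(theta)
        beta1_t = beta1 * lam ** (t - 1)              # decayed first-moment coefficient
        m = beta1_t * m + (1 - beta1_t) * g
        m_hat = beta1_t * m + (1 - beta1_t) * g       # Nesterov-style look-ahead
        v = beta2 * v + (1 - beta2) * g ** 2
        m_tilde = m_hat / (1 - beta1 ** t)            # bias corrections
        v_hat = v / (1 - beta2 ** t)
        # Frobenius-aware phased learning rate (Section 5)
        norm = np.linalg.norm(theta, "fro")
        scale = tau / max(norm, eps) if (norm > tau and t < t0) else 1.0
        eta = eta0 * scale / np.sqrt(t)
        # projection-free Frank-Wolfe step over the Frobenius ball
        S = -tau * m_tilde / (np.linalg.norm(m_tilde, "fro") + eps)
        theta = theta + eta * (S - theta) / (np.sqrt(v_hat) + eps)
    return theta

# toy usage: least squares toward a target outside the ball ||theta||_F <= 2;
# the iterate approaches the constrained optimum 2 * target / ||target||_F = diag(1.2, 1.6)
target = np.diag([3.0, 4.0])
theta = fro_fwnadam(lambda th: 2 * (th - target), np.zeros((2, 2)),
                    tau=2.0, eta0=0.5, T=2000, t0=100)
print(np.round(theta, 2))
print(np.linalg.norm(theta, "fro") <= 2.0 + 1e-6)     # stays feasible without projections
```

In this toy run the iterate drifts to the boundary point closest to the target while every update uses only a closed-form linear minimization, which is the design choice that lets the method avoid Euclidean projections.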

7. Simulations

This paper presents ablation experiments of LDC-GAT on publicly available homophilic graph datasets, including Cora, Citeseer, and Pubmed [31], as well as heterophilic graph datasets, such as Chameleon, Texas, and Squirrel [5]. Dataset details are shown in Table 1.
It is compared with classical baseline models, including MLP, Graph SAGE [24], ChebNet [5], GAT [4], GIN [13], and GCN [8]. Our experimental setup adopts a time step $\Delta t$ of 0.01, 64-dimensional embeddings, a dropout rate of 0.5, and a self-node factor $\alpha$ of 0.5. The experiments validate the effectiveness of this work by recording the training and testing performance of each network on the semi-supervised node classification task.
In homophilic graphs, LDC-GAT enhances the geometric fidelity of node feature representations by adaptively modulating feature coupling through DRG-Filtering, and improves training stability and accuracy through Fro-FWNAdam. The training process is shown in Figure 2 and Figure 3.
As demonstrated in Figure 2, LDC-GAT, incorporating the Fro-FWNAdam optimizer, achieves faster convergence to higher accuracy compared to the vanilla GAT. The subfigures in Figure 2 show the performance of LDC-GAT on three different datasets: Cora, Citeseer, and PubMed. Each subfigure has been separated and enlarged in Appendix D for better clarity and interpretation.
Correspondingly, Figure 3 shows that LDC-GAT exhibits a steeper loss reduction, reaching lower asymptotic loss values more rapidly. The following figure illustrates the loss dynamics over training epochs.
Each subplot in Figure 3 provides an insight into the loss reduction pattern for Cora, Citeseer, and PubMed, with LDC-GAT showing a more rapid and consistent decrease in loss compared to GAT. Each subfigure of Figure 2 and Figure 3 has been separated and enlarged in Appendix D for better clarity and interpretation.
To comprehensively evaluate the overall effectiveness of the improvement strategy on homophilic and heterophilic graphs, this paper conducts a series of ablation experiments on LDC-GAT, followed by a comparison with GAT. Results are shown in Table 2.
To ensure conciseness given the extensive dataset collection, this paper focuses on Cora for homophilic graph ablation studies and Chameleon for heterophilic graph experiments. Results are shown in Table 3.
The proposed triple collaborative enhancement mechanism effectively captures graph structural features, leading to improved accuracy and reduced loss. To address the performance limitations of traditional optimizers under constraints [21], we introduce the FWNAdam algorithm, which efficiently corrects gradient updates. To further validate the reliability of our network, we re-ran all experiments 10 times with different random seeds, reporting the mean accuracy, standard deviation, and 95% confidence intervals.
The experimental results on homophilic graph datasets (Cora, Citeseer, and PubMed) demonstrate that LDC-GAT consistently outperforms GAT in terms of accuracy. Specifically, LDC-GAT achieves a 5.70% improvement on Citeseer. These improvements are statistically significant, with p-values of 0.011, 0.002, and 0.025, respectively. This indicates that the enhancements provided by LDC-GAT are reliable and not due to random fluctuations. The 95% confidence intervals for these results further confirm the robustness of LDC-GAT’s performance on homophilic graphs. Specific results are shown in Table 4 below.
For heterophilic graph datasets such as Chameleon, Texas, and Squirrel, LDC-GAT demonstrates significant improvements over GAT, as summarized in Table 5. In particular, LDC-GAT outperforms GAT, with improvements that are all statistically significant, with p-values of 0.0005, 0.0001, and 0.0012, respectively. These results confirm LDC-GAT’s superior adaptability in handling complex heterophilic graph structures, where nodes with diverse features or labels are involved.
To assess the superiority of LDC-GAT, we compared its performance with six classic baseline models. As shown in Table 6, LDC-GAT consistently outperformed other models.
GCN and GAT face performance limitations on heterophilic graphs due to their neglect of cross-type feature interactions [32]. LDC-GAT addresses this by leveraging a residual-enhanced multi-scale attention propagation rule and an adaptive learning rate adjustment strategy to enhance mutual information between heterophilic nodes. Consequently, LDC-GAT achieves a 0.1% to 0.9% improvement in average test accuracy on heterophilic graph datasets such as Chameleon, Texas, and Squirrel, surpassing other top baseline models.
To comparatively assess the robustness of GAT and LDC-GAT under graph sparsity conditions, we systematically removed 10–50% of edges from benchmark datasets to simulate topological degradation and evaluated model performance (Table 7).
The experimental results demonstrate that as the edge removal ratio increases from 10% to 50%, GAT exhibits an average performance degradation of 6.7%, whereas LDC-GAT shows only a 4.2% reduction. Notably, on datasets including Citeseer and PubMed, LDC-GAT maintains high accuracy even at 50% edge removal. Under extreme sparsity conditions (50% edge deletion), LDC-GAT outperforms GAT by an average margin of 13.2% in accuracy. These findings substantiate the superior adaptability of our method in sparse graph scenarios.

8. Conclusions

This paper mitigates the inherent limitations of graph attention networks in dynamic topology adaptation and training stability by constructing a full-stack solution for stabilizing graph neural network training from three dimensions: theoretical modeling based on discrete ODE stability, algorithm design, and architectural innovation. We creatively propose the Lyapunov-Stable Graph Attention Network with Dynamic Filtering and Constraint-Aware Optimization (LDC-GAT) featuring residual graph filtering and Frobenius norm threshold-based dynamic learning rate adaptation. The core innovations achieved are as follows:
  • In semi-supervised classification tasks, LDC-GAT demonstrates an average accuracy improvement of approximately 10.54% over vanilla GAT across six benchmark datasets (including Cora and PubMed), validating the effectiveness of multi-band signal fusion enabled by its dynamic residual filtering mechanism for cross-layer feature injection.
  • For constrained optimization, LDC-GAT pioneers explicit parameter-space compression through Frobenius norm constraint boundaries on multi-head weight matrices. The proposed collaborative optimization algorithm, Fro-FWNAdam, integrates dynamic learning rate adaptation, Nesterov momentum acceleration, and Frank–Wolfe direction search. It improves convergence speed by 10.6% while ensuring constraint feasibility, making it well suited for constrained scenarios.
In future work, the stability-aware design of LDC-GAT could be extended to dynamic or temporal graphs, where topological changes over time further challenge propagation stability. Additionally, generalizing Fro-FWNAdam to support alternative constraint types such as spectral norm or structured sparsity may enhance its adaptability to broader optimization scenarios in graph learning. LDC-GAT’s potential in combinatorial optimization for real-world applications like contact tracing and social network analysis should also be explored. The DeepTrace framework [25] and Rumor Centrality method [26] highlight the importance of graph constraint-aware optimization, with applications in epidemic modeling and rumor propagation. LDC-GAT’s dynamic filtering and optimization strategies could significantly contribute to solving graph-constrained optimization problems, advancing the field of combinatorial optimization in networked systems.

Author Contributions

Conceptualization, L.C.; Methodology, L.C.; Validation, H.Z.; Resources, S.H.; Data curation, H.Z.; Writing—original draft, L.C.; Writing—review & editing, S.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China grant number 12471304.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Proof of Theorem 1

Definition A1. 
Given a graph signal matrix $H \in \mathbb{R}^{N \times d}$, the Dirichlet energy is defined as
$$E(H) = \frac{1}{2} \sum_{(i,j) \in E} \| H_i - H_j \|_2^2,$$
where $H_i \in \mathbb{R}^{d}$ denotes the feature vector of node $i$, and $E$ is the set of graph edges.
$$E(H^{(l)}) \ge \alpha^2 E(H^{(0)}) + (1 - \alpha)^2 E(H^{(0)})$$
Theorem A1. 
Given the residual propagation rule
$$H^{(l+1)} = \sigma\!\left(\frac{1}{K} \sum_{k=1}^{K} A_{\mathrm{att}}^{(k)} H^{(l)} W^{(k)} + \alpha\, \phi(H^{(0)})\right),$$
assume the activation function $\sigma(\cdot)$ satisfies the contraction property $\|\sigma(X)\|_2 \le \|X\|_2$, and the linear mapping $\phi(\cdot)$ satisfies $\|\phi(X)\|_F \le \|X\|_F$. If the attention matrices $A_{\mathrm{att}}^{(k)}$ are properly normalized, then the Dirichlet energy of the propagated features is bounded from below by
$$E(H^{(l+1)}) \ge \alpha^2 E(\phi(H^{(0)})) + (1 - \alpha)^2 E(H^{(l)}).$$
Proof. 
Assume the activation function $\sigma(X)$ is linear, such that $\sigma(X) = X$. Then, the residual propagation simplifies to
$$H^{(l+1)} = \frac{1}{K} \sum_{k=1}^{K} A_{\mathrm{att}}^{(k)} H^{(l)} W^{(k)} + \alpha\, \phi(H^{(0)}).$$
Let $T_1 = \frac{1}{K} \sum_{k=1}^{K} A_{\mathrm{att}}^{(k)} H^{(l)} W^{(k)}$ and $T_2 = \alpha\, \phi(H^{(0)})$. Applying the vector norm inequality
$$\| a + b \|^2 \ge (1 - \beta) \| a \|^2 + \left(1 - \frac{1}{\beta}\right) \| b \|^2 \quad (\beta > 0), \quad \text{with } \beta = \frac{1}{1 - \alpha},$$
the Dirichlet energy can be bounded below as
$$E(H^{(l+1)}) \ge (1 - \alpha) E(T_1) + \alpha E(T_2).$$
For $T_1 = \frac{1}{K} \sum_{k=1}^{K} A_{\mathrm{att}}^{(k)} H^{(l)} W^{(k)}$, using the normalization of attention matrices:
$$E(T_1) \ge \frac{1}{K} \sum_{k=1}^{K} \| A_{\mathrm{att}}^{(k)} \|_F^2 \, \| H^{(l)} W^{(k)} \|_F^2.$$
Assuming $\| W^{(k)} \|_F \le \tau$ and $\| A_{\mathrm{att}}^{(k)} \|_F \le 1$, it follows that
$$E(T_1) \ge E(H^{(l)}).$$
For the residual term $T_2 = \phi(H^{(0)})$, we have the following by assumption:
$$E(T_2) = E(\phi(H^{(0)})) \ge \kappa\, E(H^{(0)}),$$
where $\kappa$ is the energy-preserving coefficient of the mapping $\phi(\cdot)$. Combining the above results yields
$$E(H^{(l+1)}) \ge (1 - \alpha) E(H^{(l)}) + \alpha \kappa\, E(H^{(0)}).$$
By recursively applying the inequality up to layer $L$, the final lower bound becomes
$$E(H^{(L)}) \ge \alpha^L \kappa^L E(H^{(0)}) + (1 - \alpha)^L E(H^{(0)}).$$
 □

Appendix B. Proof of Lemma 1

Lemma A1. 
Let the parameter space be a convex and compact set $\mathcal{F} = \{ W \in \mathbb{R}^{F' \times F} \mid \| W \|_F \le \tau \}$, and suppose the initial parameter satisfies $\delta_1 := \| W_1^{(k)} - \tau \|_F \le C_0$. Assume that whenever $\| W_t^{(k)} \|_F > \tau$, the learning rate satisfies $\eta_t^{(k)} \ge \frac{\eta \tau}{(\tau + \epsilon)\sqrt{t}}$, and that there exists a constant $G > 0$ such that $\| g_t \|_F \le G$. Then, there exists a finite time
$$t_0 = \frac{2 \eta \tau G}{(\tau + \epsilon)\, \epsilon^2},$$
such that for all $t > t_0$, the Frobenius norm satisfies $\| W_t^{(k)} - \tau \|_F \le \epsilon$.
Proof. 
When $\| W_t^{(k)} \|_F > \tau$, the parameter update follows
$$W_{t+1}^{(k)} = \Pi_{\mathcal{F}}\big( W_t^{(k)} - \eta_t^{(k)} g_t \big),$$
where $\Pi_{\mathcal{F}}$ denotes the projection onto the convex compact set $\mathcal{F} = \{ W \mid \| W \|_F \le \tau \}$. By the non-expansiveness of the projection operator, we have
$$\| \Pi_{\mathcal{F}}(W) - \tau \|_F \le \| W - \tau \|_F, \quad \forall W \in \mathbb{R}^{F' \times F}.$$
Applying the triangle inequality:
$$\delta_{t+1} = \| W_{t+1}^{(k)} - \tau \|_F \le \| W_t^{(k)} - \tau \|_F + \eta_t^{(k)} \| g_t \|_2.$$
Expanding the squared norm:
$$\delta_{t+1}^2 \le \| W_t^{(k)} - \tau \|_F^2 - 2 \eta_t^{(k)} \langle W_t^{(k)} - \tau,\ g_t \rangle + \big( \eta_t^{(k)} \big)^2 \| g_t \|_2^2.$$
Since the loss function $L_t$ is convex, $\langle W_t^{(k)} - \tau,\ g_t \rangle \ge 0$, and assuming $\| g_t \|_F \le G$, we obtain
$$\delta_{t+1} \le \delta_t - \eta_t^{(k)} G.$$
Substituting the learning rate lower bound $\eta_t^{(k)} \ge \frac{\eta \tau}{(\tau + \epsilon)\sqrt{t}}$ and summing over $t = 1$ to $T - 1$ yields
$$\delta_T \le \delta_1 - \sum_{t=1}^{T-1} \frac{\eta \tau G}{(\tau + \epsilon)\sqrt{t}}.$$
Using the integral comparison $\sum_{t=1}^{T-1} \frac{1}{\sqrt{t}} \ge \int_{1}^{T} \frac{1}{\sqrt{t}}\, dt = 2(\sqrt{T} - 1)$, we obtain
$$\delta_T \le \delta_1 - \frac{2 \eta \tau G}{(\tau + \epsilon)} (\sqrt{T} - 1).$$
To ensure $\delta_T \le \epsilon$, we solve
$$\epsilon \ge \delta_1 - \frac{2 \eta \tau G}{(\tau + \epsilon)} (\sqrt{T} - 1),$$
which leads to
$$\sqrt{T} \ge \frac{(\tau + \epsilon)(\delta_1 - \epsilon)}{2 \eta \tau G} + 1.$$
Assuming $T \gg 1$, we simplify to
$$t_0 \ge \frac{2 \eta \tau G}{(\tau + \epsilon)\, \epsilon^2}.$$
Taking the smallest integer, we obtain
$$t_0 = \left\lceil \frac{2 \eta \tau G}{(\tau + \epsilon)\, \epsilon^2} \right\rceil.$$
 □

Appendix C. Proof of Theorem 2

Definition A2. 
Let the parameter space be a convex and compact set $\mathcal{F} = \{ W^{(k)} \in \mathbb{R}^{F' \times F} \mid \| W^{(k)} \|_F \le \tau \}$. The loss function $L_t : \mathcal{F} \to \mathbb{R}$ is said to be convex if, for any $\theta_1, \theta_2 \in \mathcal{F}$ and $\lambda \in [0, 1]$, the following condition holds:
$$L_t\big( \lambda \theta_1 + (1 - \lambda) \theta_2 \big) \le \lambda L_t(\theta_1) + (1 - \lambda) L_t(\theta_2),$$
where
$$L_t(\theta) = L_{\mathrm{task}} + \lambda_2 \max(\| W \|_F - \tau,\, 0)^2 + \lambda_3 \| \tilde{m}_t \|.$$
Lemma A2. 
Suppose there exist constants $G, G_\infty > 0$ such that for any $\theta \in \mathcal{F}$ the gradient satisfies the following conditions:
$$L_t(\theta^*) - L_t(\theta) \ge \langle \nabla L_t(\theta),\ \theta^* - \theta \rangle, \quad \| \nabla L_t(\theta) \|_F \le G, \quad \| \nabla L_t(\theta) \|_\infty \le G_\infty.$$
Then, for each parameter dimension $i$, the gradient component $g_{t,i}$ satisfies
$$\sum_{t=1}^{T} \frac{g_{t,i}^2}{\sqrt{t}} \le 2 G_\infty \| g_{1:T,i} \|_2.$$
Lemma A3. 
Let the momentum coefficient be defined as $\beta_{1t} = \beta_1 \lambda^{t-1}$, and assume the condition $\gamma \triangleq \beta_1^2 / \sqrt{\beta_2} < 1$ holds. Let the second-moment estimate be given by $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$. Then, for each coordinate $i$, the following inequality holds:
$$\sum_{t=1}^{T} \frac{\hat{m}_{t,i}^2}{\sqrt{t\, \hat{v}_{t,i}}} \le \frac{2}{(1 - \gamma)\sqrt{1 - \beta_2}} \| g_{1:T,i} \|_2,$$
where $\hat{m}_t$ denotes the bias-corrected first-moment (momentum) estimate.
Theorem A2. 
Let the parameter space $\mathcal{F}$ satisfy $\mathrm{diam}(\mathcal{F}) \le D$ and $\| \theta - \theta^* \|_\infty \le D_\infty$. Assume that when $\| W_t^{(k)} \|_F > \tau$, the learning rate satisfies $\eta_t^{(k)} \ge \frac{\eta \tau}{(\tau + \epsilon)\sqrt{t}}$, and there exists a finite time $t_0 = \frac{2 \eta \tau G}{(\tau + \epsilon)\, \epsilon^2}$ such that for all $t > t_0$ we have $\| W_t^{(k)} \|_F \le \tau + \epsilon$. Furthermore, suppose there exist constants $G, G_\infty > 0$ such that for any $\theta \in \mathcal{F}$ the gradient satisfies $L_t(\theta^*) - L_t(\theta) \ge \langle \nabla L_t(\theta),\ \theta^* - \theta \rangle$, $\| \nabla L_t(\theta) \|_F \le G$, and $\| \nabla L_t(\theta) \|_\infty \le G_\infty$. Then, the cumulative regret $R_T = \sum_{t=1}^{T} \big( L_t(\theta_t) - L_t(\theta^*) \big)$ is upper-bounded by
$$R_T \le \frac{D^2 \sqrt{T \hat{v}_T}}{2 \eta (1 - \beta_1)} + \frac{\eta (\beta_1 + 1) G_\infty}{(1 - \beta_1)\sqrt{1 - \beta_2}} \sum_{i=1}^{d} \| g_{1:T,i} \|_2 + \frac{D_\infty^2 G_\infty \sqrt{1 - \beta_2}}{2 \eta (1 - \lambda)^2}.$$
Proof. 
From the convexity inequality (Definition A2), the instantaneous regret satisfies
$$L_t(\theta_t) - L_t(\theta^*) \le \langle g_t,\ \theta_t - \theta^* \rangle.$$
Thus, the cumulative regret is
$$R(T) \le \sum_{t=1}^{T} \langle g_t,\ \theta_t - \theta^* \rangle.$$
Consider the update rule
$$\theta_{t+1} = \Pi_{\mathcal{F}}\!\left( \theta_t - \frac{\eta_t \tilde{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \right),$$
and by non-expansiveness of the projection,
$$\| \theta_{t+1} - \theta^* \|_F^2 \le \left\| \theta_t - \frac{\eta_t \tilde{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \theta^* \right\|_F^2.$$
Expanding and simplifying gives
$$\langle g_t,\ \theta_t - \theta^* \rangle \le \frac{\| \theta_t - \theta^* \|_F^2 - \| \theta_{t+1} - \theta^* \|_F^2}{2 \eta_t} + \frac{\eta_t \| \tilde{m}_t \|_F^2}{2 (\sqrt{\hat{v}_t} + \epsilon)^2} + \langle g_t - \tilde{m}_t,\ \theta_t - \theta^* \rangle.$$
From Lemma A3,
$$\sum_{t=1}^{T} \frac{\hat{m}_{t,i}^2}{\sqrt{t\, \hat{v}_{t,i}}} \le \frac{2}{(1 - \gamma)\sqrt{1 - \beta_2}} \| g_{1:T,i} \|_2.$$
Combining with Lemma A2,
$$\sum_{t=1}^{T} \frac{\hat{m}_{t,i}^2}{\sqrt{t\, \hat{v}_{t,i}}} \le \frac{2 G_\infty}{(1 - \gamma)^2 \sqrt{1 - \beta_2}} \| g_{1:T,i} \|_2.$$
The total regret is partitioned into two phases: the initial phase ($t \le t_0$) and the stable phase ($t > t_0$).
In the initial phase ($t \le t_0$), by Lemma 1, the learning rate satisfies $\eta_t \ge \frac{\eta \tau}{(\tau + \epsilon)\sqrt{t}}$. Combined with Lemma A3, we obtain
$$\sum_{t=1}^{t_0} \frac{\| \tilde{m}_t \|_F^2}{\sqrt{t\, \hat{v}_t}} \le \frac{2 G_\infty}{(1 - \gamma)^2 \sqrt{1 - \beta_2}} \sum_{i=1}^{d} \| g_{1:t_0,i} \|_2.$$
Substituting into the parameter deviation expression yields
$$\sum_{t=1}^{t_0} \langle g_t,\ \theta_t - \theta^* \rangle \le \frac{D^2}{2 \eta} \sum_{t=1}^{t_0} \frac{1}{\sqrt{t}} + \frac{2 \eta \tau G_\infty}{(1 - \gamma)^2 (1 - \beta_2)} \sum_{i=1}^{d} \| g_{1:t_0,i} \|_2.$$
In the stable phase ($t > t_0$), the learning rate decays as $\eta_t = \frac{\eta}{\sqrt{t}}$, and the parameters satisfy $\| W^{(k)} \|_F \le \tau + \epsilon$. Applying Lemma A3,
$$\sum_{t=t_0+1}^{T} \langle g_t,\ \theta_t - \theta^* \rangle \le \frac{D^2 \sqrt{T \hat{v}_T}}{2 \eta (1 - \beta_1)} + \frac{\eta (\beta_1 + 1) G_\infty}{(1 - \beta_1)\sqrt{1 - \beta_2}} \sum_{i=1}^{d} \| g_{1:T,i} \|_2.$$
The accumulated bias is bounded as
$$\sum_{t=1}^{T} \langle g_t - \tilde{m}_t,\ \theta_t - \theta^* \rangle \le D_\infty \sum_{t=1}^{T} \| g_t - \tilde{m}_t \|.$$
Using the geometric series bound $\sum_{t=1}^{T} \| g_t - \tilde{m}_t \| \le \frac{G_\infty \sqrt{1 - \beta_2}}{(1 - \lambda)^2}$:
$$\sum_{t=1}^{T} \langle g_t - \tilde{m}_t,\ \theta_t - \theta^* \rangle \le \frac{D_\infty G_\infty \sqrt{1 - \beta_2}}{(1 - \lambda)^2}.$$
Finally, we obtain
$$R(T) \le \frac{D^2 \sqrt{T \hat{v}_T}}{2 \eta (1 - \beta_1)} + \frac{\eta (\beta_1 + 1) G_\infty}{(1 - \beta_1)\sqrt{1 - \beta_2}} \sum_{i=1}^{d} \| g_{1:T,i} \|_2 + \frac{D_\infty^2 G_\infty \sqrt{1 - \beta_2}}{2 \eta (1 - \lambda)^2}.$$
 □

Appendix D. Detailed Presentation of Subfigures in Figure 2 and Figure 3

In this section, we provide a detailed presentation and analysis of the accuracy and loss changes during the training processes of LDC-GAT and GAT. These results are crucial to understanding the advantages of LDC-GAT in terms of convergence speed and stability.
Figure A1. Accuracy comparison on the Cora dataset between LDC-GAT and vanilla GAT.
Figure A1 corresponds to the subfigure on the left in Figure 2 and illustrates the accuracy changes of LDC-GAT and GAT on the Cora dataset. As shown, LDC-GAT achieves faster and smoother convergence without significant oscillations, demonstrating the effectiveness of our carefully tuned hyperparameters and the model’s robustness in adapting to the dataset.
Figure A2. Loss comparison on the Cora dataset between LDC-GAT and vanilla GAT.
Figure A2 corresponds to the subfigure on the left in Figure 3 and shows the loss progression for LDC-GAT and vanilla GAT on the Cora dataset. It is evident that LDC-GAT reduces the loss more quickly and converges faster than vanilla GAT, confirming the effectiveness of the Fro-FWNAdam optimizer in speeding up training and minimizing loss.
Figure A3. Accuracy comparison on the Citeseer dataset between LDC-GAT and vanilla GAT.
Figure A3 corresponds to the subfigure on the center in Figure 2 and demonstrates the accuracy variation of LDC-GAT and vanilla GAT on the Citeseer dataset. We observe that LDC-GAT not only converges faster but also achieves higher accuracy at convergence, indicating its superior performance over vanilla GAT.
Figure A4. Loss comparison on the Citeseer dataset between LDC-GAT and vanilla GAT.
Figure A4 corresponds to the subfigure on the center in Figure 3 and displays the loss values of LDC-GAT and vanilla GAT on the Citeseer dataset. It is clear that LDC-GAT converges faster and achieves a lower loss at convergence, further validating the efficacy of the Fro-FWNAdam optimizer.
Figure A5. Accuracy comparison on the PubMed dataset between LDC-GAT and vanilla GAT.
Figure A5 corresponds to the subfigure on the right in Figure 2 and shows the accuracy changes for LDC-GAT and vanilla GAT on the PubMed dataset. As in the previous datasets, LDC-GAT converges more quickly and smoothly, demonstrating the effectiveness of our hyperparameter tuning in achieving rapid convergence and stable accuracy improvement.
Figure A6. Loss comparison on the PubMed dataset between LDC-GAT and vanilla GAT.
Figure A6 corresponds to the subfigure on the right in Figure 3 and presents the loss reduction dynamics for LDC-GAT and vanilla GAT on the PubMed dataset. LDC-GAT again shows a faster loss reduction and quicker convergence compared to vanilla GAT, highlighting the significant contribution of the Fro-FWNAdam optimizer in accelerating the training process and improving model stability.

References

  1. Grover, A.; Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 855–864. [Google Scholar]
  2. Gilmer, J.; Schoenholz, S.S.; Riley, P.F.; Vinyals, O.; Dahl, G.E. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1263–1272. [Google Scholar]
  3. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv 2013, arXiv:1312.6203. [Google Scholar]
  4. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph attention networks. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  5. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems 29; Curran Associates: Barcelona, Spain, 2016; pp. 3844–3852. [Google Scholar]
  6. Li, Q.; Han, Z.; Wu, X.-M. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 3538–3545. [Google Scholar]
  7. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  8. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  9. Oono, K.; Suzuki, T. Graph neural networks exponentially lose expressive power for node classification. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  10. Pareja, A.; Domeniconi, G.; Chen, J.; Ma, T.; Suzumura, T.; Kanezashi, H.; Kaler, T. EvolveGCN: Evolving graph convolutional networks for dynamic graphs. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 5363–5370. [Google Scholar]
  11. Bronstein, M.M.; Bruna, J.; LeCun, Y.; Szlam, A.; Vandergheynst, P. Geometric deep learning: Going beyond Euclidean data. IEEE Signal Process. Mag. 2017, 34, 18–42. [Google Scholar] [CrossRef]
  12. Balcilar, M.; Héroux, P.; Gaüzère, B.; Taminato, R.J.; Vazirgiannis, M.; Malliaros, F.D. Breaking the limits of message passing graph neural networks. In Proceedings of the 38th International Conference on Machine Learning, Virtual Event, 18–24 July 2021; pp. 599–608. [Google Scholar]
  13. Xu, K.; Hu, W.; Leskovec, J.; Jegelka, S. How powerful are graph neural networks? In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  14. Newman, M.E.J. Mixing patterns in networks. Phys. Rev. E 2003, 67, 026126. [Google Scholar] [CrossRef] [PubMed]
  15. He, W.; Wei, Z.; Huang, Z.; Li, J.; Li, H. BernNet: Learning arbitrary graph spectral filters via Bernstein approximation. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS), Virtual Event, 6–14 December 2021. [Google Scholar]
  16. Chen, D.; Lin, Y.; Li, W.; Li, P.; Zhou, J.; Sun, X. Stability and generalization of graph convolutional neural networks. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 1539–1547. [Google Scholar]
  17. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  18. McCallum, A.K.; Nigam, K.; Rennie, J.; Seymore, K. Automating the construction of internet portals with machine learning. In Proceedings of the 17th International Conference on Machine Learning, Stanford, CA, USA, 29 June–2 July 2000; pp. 327–334. [Google Scholar]
  19. Kim, Y.; Lee, S.; Kim, J.; Park, J. Stabilizing multi-head attention in transformers through gradient regularization. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  20. Lv, S.; Shen, Y.; Qian, H.; Li, Y.; Wang, X. Robust graph neural networks under distribution shifts: A causal invariance approach. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
  21. Kavis, M.J.; Xu, Z.; Goldfarb, D. Projection-free adaptive methods for constrained deep learning. J. Mach. Learn. Res. 2023, 24, 1–38. [Google Scholar]
  22. Ngo, T.; Yin, J.; Ge, Y.-F.; Wang, H. Optimizing IoT intrusion detection—A graph neural network approach with attribute-based graph construction. Information 2025, 16, 499. [Google Scholar] [CrossRef]
  23. Chen, L.; Zhu, H.; Han, S. Stability-optimized graph convolutional network: A novel propagation rule with constraints derived from ODEs. Mathematics 2025, 13, 761. [Google Scholar] [CrossRef]
  24. Hamilton, W.L.; Ying, R.; Leskovec, J. Inductive representation learning on large graphs. In Proceedings of the Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, 4–9 December 2017; pp. 1024–1034. [Google Scholar]
  25. Tan, C.W.; Yu, P.-D.; Chen, S.; Poor, H.V. DeepTrace: Learning to optimize contact tracing in epidemic networks with graph neural networks. IEEE Trans. Signal Inf. Process. Netw. 2025, 11, 97–113. [Google Scholar] [CrossRef]
  26. Shah, D.; Zaman, T. Rumors in a network: Who’s the culprit? IEEE Trans. Inf. Theory 2011, 57, 5163–5181. [Google Scholar] [CrossRef]
  27. Yan, J.; Duan, Y. Momentum cosine similarity gradient optimization for graph convolutional networks. Comput. Eng. Appl. 2024, 60, 133–143. [Google Scholar]
  28. Gama, F.; Bruna, J.; Ribeiro, A. Stability properties of graph neural networks. IEEE Trans. Signal Process. 2019, 68, 5680–5695. [Google Scholar] [CrossRef]
  29. Zhang, M.; Zhou, Y.; Quan, W.; Wang, Y.; Zhao, Q. Online learning for IoT optimization: A Frank–Wolfe Adam-based algorithm. IEEE Internet Things J. 2020, 7, 8228–8237. [Google Scholar] [CrossRef]
  30. Chang, B.; Chen, M.; Haber, E.; Chi, E.H. AntisymmetricRNN: A dynamical system view on recurrent neural networks. arXiv 2019, arXiv:1902.09689. [Google Scholar]
  31. Pei, H.; Wei, B.; Chang, K.C.-C.; Lei, Y.; Yang, B. Geom-GCN: Geometric graph convolutional networks. arXiv 2020, arXiv:2002.05287. [Google Scholar]
  32. Bo, D. Research on Key Technologies of Spectral Domain Graph Neural Networks. Ph.D. Thesis, Beijing University of Posts and Telecommunications, Beijing, China, 2023. [Google Scholar]
Figure 1. The Workflow of LDC-GAT: The synergistic enhancement of the DRG-Filtering, Frobenius norm constraints for each attention head, and Fro-FWNAdam.
Figure 2. Accuracy change diagram of LDC-GAT training process. Subfigure ’cora_acc’ on the left illustrates the accuracy comparison for Cora, ’Citeseer_acc’ in the center presents the accuracy comparison for Citeseer, and ’PubMed_acc’ on the right shows the accuracy comparison for PubMed.
Figure 3. Loss change diagram of LDC-GAT training process. Subfigure ’cora_loss’ on the left illustrates the loss comparison for Cora, ’Citeseer_loss’ in the center presents the loss comparison for Citeseer, and ’PubMed_loss’ on the right shows the loss comparison for PubMed.
Table 1. Dataset details.
Dataset | Nodes | Edges | Classes | Features | Homophily Level
Cora | 2708 | 5429 | 7 | 1433 | 0.81
Citeseer | 3327 | 4732 | 6 | 3703 | 0.74
PubMed | 19,717 | 44,338 | 3 | 500 | 0.80
Chameleon | 2277 | 36,101 | 3 | 2325 | 0.18
Texas | 183 | 309 | 5 | 1703 | 0.11
Squirrel | 5201 | 217,073 | 3 | 2089 | 0.018
Table 2. The ablation experiment results of LDC-GAT on Cora.
DRG-Filtering | Weight Norm Constraints | Fro-FWNAdam | Acc | Loss
× | × | × | 83.90% | 1.0893
✓ | × | × | 83.94% | 1.0725
✓ | ✓ | × | 84.28% | 1.0549
✓ | ✓ | ✓ | 84.50% | 1.0433
Table 3. The ablation experiment results of LDC-GAT on Chameleon.
DRG-Filtering | Weight Norm Constraints | Fro-FWNAdam | Acc | Loss
× | × | × | 60.31% | 1.2416
✓ | × | × | 63.14% | 1.1317
✓ | ✓ | × | 65.74% | 1.0288
✓ | ✓ | ✓ | 67.79% | 0.9677
Table 4. Model performance on homophilic graph datasets: Mean accuracy, standard deviation, confidence interval, and statistical significance of GAT and LDC-GAT.
Metric | Cora | Citeseer | PubMed
GAT mean accuracy | 83.90% | 48.75% | 77.70%
LDC-GAT mean accuracy | 84.50% | 54.45% | 78.20%
Standard deviation (±) | ±0.0031 | ±0.0042 | ±0.0045
Confidence interval (95%) | [84.2%, 84.8%] | [53.9%, 55.0%] | [78.0%, 78.4%]
p-value | 0.011 | 0.002 | 0.025
Table 5. Model performance on heterophilic graph datasets: Mean accuracy, standard deviation, confidence interval, and statistical significance of GAT and LDC-GAT.
Metric | Chameleon | Texas | Squirrel
GAT mean accuracy | 60.31% | 60.31% | 42.65%
LDC-GAT mean accuracy | 67.79% | 81.88% | 70.13%
Standard deviation (±) | ±0.0050 | ±0.0020 | ±0.0050
Confidence interval (95%) | [67.4%, 68.2%] | [81.6%, 82.1%] | [69.8%, 70.5%]
p-value | 0.0005 | 0.0001 | 0.0012
Table 6. Mean accuracy (%) comparison of model performance on homophilic and heterophilic graph datasets.
 | Homophilic Graph Datasets | | | Heterophilic Graph Datasets | |
Model | Cora | Citeseer | PubMed | Chameleon | Texas | Squirrel
Graph SAGE | 78.00 | 52.19 | 76.00 | 66.67 | 56.76 | 56.20
MLP | 57.00 | 53.35 | 72.90 | 50.44 | 78.38 | 42.56
ChebNet | 73.60 | 53.85 | 69.00 | 58.77 | 81.78 | 39.75
GCN | 83.60 | 50.23 | 77.90 | 47.55 | 37.84 | 69.80
GIN | 76.40 | 47.60 | 77.00 | 66.89 | 56.75 | 52.16
GAT | 83.90 | 48.75 | 77.70 | 61.31 | 60.31 | 42.65
LDC-GAT | 84.50 | 54.45 | 78.20 | 67.79 | 81.88 | 70.13
Table 7. Mean accuracy (%) comparison of GAT and LDC-GAT when the edge removal ratio increases from 10% to 50%.
Edge Removal Ratio | Drop 10% | | Drop 30% | | Drop 50% |
Model | GAT | LDC-GAT | GAT | LDC-GAT | GAT | LDC-GAT
Cora | 83.1 | 83.6 | 78.6 | 81.9 | 76.6 | 79.6
Citeseer | 46.17 | 52.79 | 45.38 | 51.89 | 44.13 | 51.42
PubMed | 75.2 | 77.92 | 72.3 | 75.68 | 71.8 | 75.2
Chameleon | 59.65 | 66.94 | 59.64 | 66.91 | 59.36 | 65.8
Squirrel | 40.46 | 70.08 | 40.25 | 69.88 | 39.88 | 69.7
Texas | 57.76 | 79.76 | 56.76 | 78.46 | 54.05 | 76.05