STAGE: LLM-Driven Semantic and Topological Augmented Graph Embedding for Text-Attributed Graphs

Huang, Shiwei; Xiao, Shunxin; Zhang, Xu-Yao; Zhu, Shunzhi; Liu, Luoqi; Wang, Da-Han

doi:10.3390/math14091568


by Shiwei Huang ¹, Shunxin Xiao ¹, Xu-Yao Zhang ², Shunzhi Zhu ¹, Luoqi Liu ³ and Da-Han Wang ¹,*

¹ College of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361024, China

² State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, 95 Zhongguancun East Road, Beijing 100190, China

³ Meitu Inc., Xiamen 361100, China

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(9), 1568; https://doi.org/10.3390/math14091568

Submission received: 8 April 2026 / Revised: 30 April 2026 / Accepted: 3 May 2026 / Published: 6 May 2026

(This article belongs to the Special Issue Causal Inference and Machine Learning: Mathematical Modeling, Analysis and Applications)


Abstract

Text-attributed graphs (TAGs) require models to jointly exploit node text and graph structure, yet doing so effectively remains difficult when node text is sparse and the structural context is large. Here, we propose STAGE (Semantic and Topological Augmented Graph Embedding), a two-stage framework for representation learning on TAGs. In Stage I, a frozen large language model is used offline to generate explanatory text that enriches compressed node attributes without introducing online LLM training cost. In Stage II, STAGE performs structure-aware representation learning under a fixed global token budget by combining random-walk-based structural context with graph-conditioned token reduction before PLM encoding. This design preserves informative semantic content while preventing unconstrained sequence expansion. Experiments on seven benchmark datasets show that STAGE consistently outperforms strong baselines under the same evaluation setting and maintains favorable efficiency under bounded input-length constraints.

Keywords: text-attributed graphs; large language models; graph neural networks; semantic augmentation; representation learning

MSC: 68T05; 05C82; 05C85

1. Introduction

Graphs serve as a fundamental data structure for modeling entities and their complex interrelationships through nodes and edges [1,2,3,4]. In particular, many real-world graphs are associated with rich node text, such as paper abstracts, product descriptions, or user-generated content. These data are commonly studied as text-attributed graphs (TAGs), where nodes carry textual attributes and edges describe relations between them. For this setting, Graph Neural Networks (GNNs) have become a standard choice because they can aggregate information over local neighborhoods and make effective use of graph connectivity, especially in low-label settings [5,6,7,8].

A practical difficulty, however, lies in how node text is represented before graph propagation begins. Early TAG methods often encode text with shallow features such as Bag-of-Words or Skip-gram embeddings [5]. These representations are easy to use, but they usually miss contextual meaning and domain-specific distinctions. Once such weak features are fed into a GNN, message passing may propagate irrelevant or weakly informative features across neighboring nodes, which limits the quality of the learned node representations [9]. Later methods improve the text encoder itself. For example, PLM-based models such as BERT can be combined with GNNs, and approaches like GraphBridge [10] further incorporate neighboring text during encoding. These models provide substantially stronger contextual representations than shallow text features. However, they still rely mainly on the original node text as the primary semantic source. In many specialized TAGs, that text is short, technical, and semantically compressed, so even a strong pretrained encoder may receive insufficient evidence to recover the missing background knowledge from the input alone [9,10].

Large Language Models (LLMs) make this limitation more tractable because they can supply missing context, definitions, and semantic expansion that are difficult to obtain directly from raw node attributes alone [11,12]. However, using LLMs inside TAG pipelines is not straightforward. Direct fine-tuning is expensive, and prompt-based use does not automatically solve how graph structure should be incorporated. Existing LLM-based methods can enrich node features, yet they often process nodes one by one and leave structural interaction to a later stage [13]. In addition, LLM outputs may be unreliable in highly specialized domains, and standard self-attention is not well suited to large graph neighborhoods without substantial computational cost [13].

Figure 1 illustrates the practical tension behind current approaches. Methods built on shallow or handcrafted features are lightweight, but they leave much of the node semantics under-modeled. Joint text–structure models improve structural reasoning, yet they are still constrained by the quality of the original text. Generative LLM-based methods can enrich semantics, but the generated information is often attached to nodes independently, with limited interaction between semantic augmentation and graph propagation. In other words, current TAG methods tend to be stronger on either semantic enrichment or structural modeling, but less convincing when both are needed at the same time.

We address this issue with STAGE (Semantic and Topological Augmented Graph Embedding), a decoupled framework for TAG representation learning. STAGE separates semantic enrichment from structure-aware encoding. In the first stage, a frozen LLM is used offline to generate explanatory text that complements sparse node text. In the second stage, the enriched text is encoded together with structural context, but under an explicit token budget. To make this feasible, we introduce a graph-conditioned token reduction module that keeps informative tokens from the structural neighborhood before PLM encoding. This design allows the model to use richer semantic inputs without letting sequence length grow uncontrollably as more graph context is considered.

The main contributions of this paper are as follows:

We present STAGE, a decoupled framework that combines offline LLM-based semantic enrichment with GNN-based structural propagation for text-attributed graphs.
We introduce a graph-conditioned token reduction mechanism that performs token selection before deep encoding. Unlike topology-agnostic pruning, it uses structural context to retain informative semantic anchors while keeping the PLM input length within a fixed computational budget.
We evaluate STAGE on seven benchmark datasets and show that it outperforms strong baselines under our evaluation protocol, while maintaining favorable efficiency on graphs with different characteristics.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the proposed STAGE framework. Section 4 reports the experimental results. Section 5 provides additional discussion and analysis. Finally, Section 6 concludes the paper.

2. Related Work

2.1. TAG Representation Learning

Research on text-attributed graphs (TAGs) has progressed from relatively separate modeling of text and structure to more tightly coupled architectures. Early graph-based methods such as GCN [5] and GraphSAGE [14] are effective at neighborhood aggregation, but they typically rely on shallow text features such as Bag-of-Words. As a result, their initial node representations may miss contextual meaning and domain-specific terminology [15]. In contrast, text-centric encoders based on PLMs such as BERT [16] provide stronger contextual representations, but they process nodes independently and therefore do not directly model graph connectivity [10,17].

Subsequent work moved toward the joint modeling of node text and graph structure. Methods such as GLEM [17] and SimTeG [18] combine PLMs and GNNs through iterative training or distillation. GraphBridge [10] further allows the language encoder to incorporate neighboring text during encoding, making structure available earlier in the representation process. These methods improve the interaction between text and topology, but they still depend heavily on the quality of the original node attributes. In specialized TAG settings, the raw text is often short, technical, and incomplete, which limits how much semantic information can be recovered from the input alone [9,10]. Related work has also explored language model pretraining on text-rich networks [19], but this line of work does not focus on offline semantic generation under a fixed token budget.

2.2. LLM-Based Semantic Augmentation for Graph Learning

Recent work has used LLMs to enrich graph data before downstream learning. TAPE [13], for example, generates rationales or predictions that are then attached to nodes as additional features. KEA [15] extracts entities and generates short definitions to reduce the risks associated with direct zero-shot use of LLMs, while still using the generated content as semantic augmentation rather than end-to-end graph reasoning. More broadly, graph LLM studies have explored prompting, explanation generation, and graph-aware adaptation as ways to inject richer semantics into graph learning pipelines [20,21,22].

These methods improve the semantic content of node attributes, but the generation step is typically performed for each node separately. As a result, the added information is not tightly coupled with graph structure at the time it is produced or encoded [23]. In other words, they strengthen semantic enrichment, but they do not fully address how the enriched text should be filtered and integrated once structural context becomes large.

Recent LLM applications have shown that LLMs can support a variety of text-centered analysis and decision support tasks, such as mental health text analysis [24], cyber threat intelligence [25], fake or real tweet classification [26], virtual assistant command classification [27], and LLM-based evaluation [28]. These studies suggest that LLMs can provide useful semantic cues for downstream models, but they also indicate that LLM-generated information should be used carefully in specialized or high-stakes domains. This observation further motivates our decoupled design: STAGE uses the LLM as an offline semantic enhancer, while the downstream PLM and GNN components remain responsible for supervised structure-aware representation learning.

2.3. Efficient Transformers and Token Reduction

The Transformer architecture has emerged as the de facto standard for representation learning across a wide array of domains [29,30,31]. A separate line of research studies how to reduce the cost of Transformer encoding. Since standard self-attention scales quadratically with sequence length [32], prior research has explored sparse attention, low-rank approximations, and token reduction methods such as BigBird [33], PoWER-BERT [34], ToMe [35], and learned token pruning [36,37]. However, most of these methods are designed for plain text or image inputs, where token importance is determined mainly by redundancy within a single sequence.

In TAGs, the situation is different because token importance also depends on how the text relates to structural context. This is particularly relevant when semantic augmentation and graph propagation are combined: broader neighborhoods can provide useful context, but they can also quickly increase sequence length and introduce irrelevant tokens. STAGE is designed for this setting. Compared with joint text–structure models, it supplements sparse node text with offline LLM-generated explanations. Compared with existing LLM-based augmentation methods, it introduces graph-conditioned token selection before PLM encoding so that structural context affects which tokens are retained under a fixed computational budget.

3. Materials and Methods

3.1. Preliminaries

We first define the text-attributed graph setting studied in this paper and summarize the modeling components used in STAGE. These include Graph Neural Networks for structural propagation, Pretrained Language Models for text encoding, and prompted large language models for offline semantic augmentation. We also briefly describe common ways of incorporating LLMs into graph learning to clarify where STAGE fits.

Text-Attributed Graphs. A text-attributed graph is denoted as $G_{T} = (V, E, T)$ , where $V$ is the node set and $E$ is the edge set. Each node $v \in V$ is associated with a raw text sequence $t_{v} \in T$ and a corresponding discrete label $y_{v} \in Y$ . In this work, we focus on semi-supervised node classification, which aims to predict the labels of unlabeled nodes $V_{U}$ given the graph topology, textual attributes, and a small subset of labeled nodes $V_{L}$ .
GNN-Based Paradigm. Graph Neural Networks (GNNs) learn node representations by recursively aggregating information from topological neighbors. Formally, the hidden representation of node v at the l-th layer is updated as

$h_{v}^{(l)} = UPD (h_{v}^{(l - 1)}, AGG (\{h_{u}^{(l - 1)} : u \in N (v)\})),$

(1)

where $N (v)$ denotes the set of neighbors of v, and $AGG (\cdot)$ and $UPD (\cdot)$ represent the aggregation and update functions, respectively. The initial feature $h_{v}^{(0)}$ is typically derived from the textual attribute $t_{v}$ .
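
As a minimal sketch of the update rule in Equation (1), the following pure-Python function performs one message-passing layer. The choice of a neighbor mean for AGG and a simple average of self and aggregate for UPD is an illustrative assumption, not the aggregator used in STAGE; the function name `mean_aggregate_layer` is likewise hypothetical.

```python
from typing import Dict, List


def mean_aggregate_layer(
    features: Dict[int, List[float]],
    neighbors: Dict[int, List[int]],
) -> Dict[int, List[float]]:
    """One message-passing layer in the form of Equation (1).

    Assumed for illustration: AGG is the element-wise mean over neighbors,
    and UPD averages the node's previous state with the aggregated message.
    """
    updated: Dict[int, List[float]] = {}
    for v, h_v in features.items():
        nbrs = neighbors.get(v, [])
        if nbrs:
            # AGG: element-wise mean of neighbor features from the previous layer.
            agg = [
                sum(features[u][i] for u in nbrs) / len(nbrs)
                for i in range(len(h_v))
            ]
        else:
            # Isolated node: fall back to its own previous feature.
            agg = list(h_v)
        # UPD: combine the previous state with the aggregated message.
        updated[v] = [(a + b) / 2.0 for a, b in zip(h_v, agg)]
    return updated
```

Stacking this function l times corresponds to l layers of propagation; a learned GNN would replace the fixed averaging with trainable transformations.
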
Pretraining of PLMs. Pretrained Language Models (PLMs), such as BERT and RoBERTa, generate contextualized text representations via self-attention mechanisms. Given a raw text $t_{v}$ , it is first tokenized into a sequence $S_{v} = \{s_{v, i}\}_{i = 1}^{| S_{v} |}$ . A PLM encodes this sequence into contextualized hidden states:

$H_{v} = PLM (S_{v}) \in R^{| S_{v} | \times d} .$

(2)

In standard sequence classification tasks, the embedding of the special [CLS] token is commonly extracted as the global semantic representation of the entire text.
LLMs with Prompting. Unlike PLMs that are typically fine-tuned to adapt to downstream tasks, Large Language Models (LLMs) are often utilized in a frozen state via prompting. Let $M$ denote a frozen LLM. A transformation function $P (\cdot)$ wraps the raw text $t_{v}$ into a natural language prompt. The model then generates tokens autoregressively according to

$y_{t} \sim P_{M} (y_{t} ∣ P (t_{v}), y_{< t}) .$

(3)

This formulation highlights the role of prompting in extracting or generating semantic information without updating the parameters of the LLM.
Taxonomy of LLM Integration. Existing work uses LLMs in graph learning in different ways [15,20,21,22]. One distinction is whether the language model is directly tunable or only accessible through prompting. Another is whether the model is used to predict labels directly or to enrich node attributes before downstream learning. STAGE belongs to the latter setting. It uses a prompted, frozen LLM to generate additional semantic context, while a trainable PLM and a GNN are responsible for structure-aware representation learning.

3.2. The Proposed STAGE Framework

Figure 2 shows the overall design of STAGE. Instead of coupling semantic augmentation and graph propagation in a single training loop, STAGE separates them into two stages. The first stage enriches node text offline with LLM-generated explanations. The second stage learns structure-aware representations from the enriched text under a fixed token budget. This separation keeps the semantic augmentation step simple to apply while making the subsequent encoding and propagation stages easier to control computationally.

3.2.1. Stage I: Generative Semantic Injection

In many TAG datasets, the raw text attached to each node is short and highly compressed. Citation graphs are a typical example: abstracts often contain technical terms but leave out the background needed to interpret them reliably. To supplement this missing context, we use a frozen LLM offline to generate explanatory text for each node. Because this stage is detached from the downstream training loop, it enriches node attributes without introducing the cost of LLM fine-tuning into the representation learning process.

Knowledge Extraction via Prompting. For each node v, we construct a prompt $P (t_{v})$ based on its raw text $t_{v}$ . The prompt asks the LLM to identify technical terms in the input text and generate a short description for each term. To keep the prompting strategy consistent across datasets with different text styles, we use a general extraction template rather than a dataset-specific domain instruction. The prompt template used in our implementation was as follows:
You should work like a named entity recognizer. Text: [text]. Extract the technical terms from this text and output a description for each term in the format of a Python (version 3.10) dictionary, e.g., {’XX’: ’XXX’, ’YY’: ’YYY’}.

The generation process can be written in shorthand as

$e_{v} = M (P (t_{v})),$

(4)

where $M$ denotes the frozen LLM introduced in the preliminaries, and $e_{v}$ is the generated explanatory text. In practice, this step makes the semantics of brief or specialized node text more explicit before graph-based learning begins.

Implementation Details of Stage I. In our implementation, Stage I uses a unified general-text prompt for all datasets. For each node, we apply a lightweight input-length control step before prompting and then submit the processed text to the frozen LLM using the same extraction template. In our OpenRouter-based API calls, we did not explicitly set temperature, max_tokens, top_p, frequency penalty, or presence penalty, so these values followed the model- and provider-side defaults. We report this configuration explicitly so that no manual setting is left undocumented. The same API configuration is applied to all nodes and datasets. The generated response is post-processed into plain explanatory text and concatenated with the original node attribute. Malformed outputs and empty generations are filtered by simple format checking, and nodes that fail this step fall back to their original text only. Since Stage I is executed offline, the generation cost is incurred once and does not appear in the optimization of the PLM or GNN components.
Semantic Enrichment. We concatenate the original node text with the generated explanation to obtain an enriched attribute:

${\hat{t}}_{v} = [t_{v} \oplus e_{v}],$

(5)

where ⊕ denotes concatenation. The resulting set $\hat{T} = \{{\hat{t}}_{v}\}_{v \in V}$ is then used in the second stage. Since the LLM is only used offline, its computation does not appear in the training loop of the PLM or the GNN.
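
The Stage I steps described above (prompt construction, format checking, fallback, and concatenation as in Equation (5)) can be sketched as follows. This is a sketch under stated assumptions rather than the authors' implementation: the function names, the use of `ast.literal_eval` for the format check, and the "term: description" flattening are all hypothetical choices, and the LLM call itself is left out so that only the pre- and post-processing is shown.

```python
import ast
from typing import Optional

# Template text taken from the prompt quoted above; {text} replaces [text],
# and straight quotes are used in the example dictionary.
PROMPT_TEMPLATE = (
    "You should work like a named entity recognizer. Text: {text}. "
    "Extract the technical terms from this text and output a description "
    "for each term in the format of a Python (version 3.10) dictionary, "
    "e.g., {{'XX': 'XXX', 'YY': 'YYY'}}."
)


def build_prompt(node_text: str) -> str:
    """Wrap the raw node text t_v into the extraction prompt P(t_v)."""
    return PROMPT_TEMPLATE.format(text=node_text)


def parse_explanation(response: str) -> Optional[str]:
    """Format check (assumed): return flattened text, or None if malformed."""
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = ast.literal_eval(response[start:end + 1])
    except (ValueError, SyntaxError, TypeError):
        return None
    if not isinstance(parsed, dict):
        return None
    parts = [
        f"{term}: {desc}"
        for term, desc in parsed.items()
        if isinstance(term, str) and isinstance(desc, str) and desc.strip()
    ]
    # An empty dictionary counts as an empty generation.
    return " ".join(parts) if parts else None


def enrich(node_text: str, response: str) -> str:
    """Concatenate t_v with e_v; fall back to t_v alone on failure."""
    explanation = parse_explanation(response)
    if explanation is None:
        return node_text
    return f"{node_text} {explanation}"
```

Because the enrichment is a pure function of the node text and the stored response, it can be run once offline and cached per node.
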

3.2.2. Stage II: Structure-Aware Representation Learning

Although the enriched attributes $\hat{T}$ contain more semantic information, using them together with large structural neighborhoods can quickly make PLM encoding too expensive. A straightforward concatenation of neighbor text increases sequence length, which in turn raises the cost of self-attention quadratically [32]. Existing token reduction methods such as ToMe [35] can shorten sequences, but they are usually designed for plain sequences and do not consider graph structure. STAGE addresses this issue with a graph-conditioned token reduction module. Before PLM encoding, the module selects tokens from the structural context under a fixed budget, so the model can incorporate neighborhood information without letting the input length grow unchecked.

The selector is intentionally lightweight because it is applied before PLM encoding and is used to score candidate tokens under a fixed budget. A more expressive cross-attention module could model richer token–neighbor interactions, but it would also introduce additional computation before token reduction and weaken the efficiency motivation of the proposed design.

Random Walk Context Sampling. To capture local structural context, we construct a walk-based subgraph for each target node v. Starting from v, we perform a random walk with restart and collect the visited nodes into a sampled structural view $G_{v}^{(L)}$ , where L denotes the walk depth. Compared with BFS-style neighborhood expansion, this strategy provides a more controllable way to explore local context without deterministically including all nodes within a fixed radius. Compared with attention-based neighbor sampling, it does not introduce an additional trainable sampling module. In practice, the sampled subgraph is later linearized into a sequence by concatenating the reduced texts of its constituent nodes under a fixed global token budget.
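
A random walk with restart of the kind described above can be sketched in a few lines. The restart probability, the interpretation of the walk depth as a number of steps, and the function name `rwr_sample` are assumptions made for illustration; the text does not specify these details.

```python
import random
from typing import Dict, List


def rwr_sample(
    adj: Dict[int, List[int]],
    start: int,
    walk_length: int,
    restart_prob: float = 0.5,  # hypothetical value, not taken from the paper
    seed: int = 0,
) -> List[int]:
    """Collect the distinct nodes visited by a random walk with restart.

    The returned list is in first-visit order and always begins with `start`.
    """
    rng = random.Random(seed)
    visited = [start]
    seen = {start}
    current = start
    for _ in range(walk_length):
        nbrs = adj.get(current, [])
        # Restart at the target node on a dead end or with probability restart_prob.
        if not nbrs or rng.random() < restart_prob:
            current = start
            continue
        current = rng.choice(nbrs)
        if current not in seen:
            seen.add(current)
            visited.append(current)
    return visited
```

Unlike a BFS expansion, the number of sampled nodes is capped by the number of steps rather than by the neighborhood size, which is what makes the sampled view controllable.
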
Graph-Conditioned Token Selection. To enforce a fixed computational budget $L_{max}$ , STAGE introduces a graph-conditioned token selector $S_{ϕ} (\cdot)$ . In our implementation, the selector is trained at the node level. For each target node v, let $\{e_{v, k}\}_{k = 1}^{| t_{v} |}$ denote the token embeddings of its enriched text, and let $n_{v}$ denote a structural conditioning vector obtained from multi-hop neighborhood aggregation. The selector then produces a relevance score for each token:

$α_{v, k} = S_{ϕ} (e_{v, k}, n_{v}),$

(6)

where $α_{v, k} \in (0, 1)$ reflects the importance of token k in the presence of graph context.

During selector training, we use a soft aggregation scheme rather than hard truncation. Specifically, the token embeddings are summarized as

$s_{v} = \sum_{k} α_{v, k} e_{v, k},$

(7)

and the selector parameters are optimized with a supervised classification loss together with a regularization term:

$L_{sel} = L_{cls} + β L_{reg},$

(8)

where $L_{cls}$ is the cross-entropy loss on the node label and $L_{reg}$ regularizes the score distribution over valid tokens.
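
To make Equations (6) and (7) concrete, the sketch below scores each token against a structural conditioning vector and forms the soft summary. The selector architecture is not specified in the text, so the single linear layer over the concatenation of the token embedding and the conditioning vector, followed by a sigmoid, is purely an assumption; `score_tokens` and `soft_summary` are hypothetical names.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function with outputs in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def score_tokens(
    token_embs: Sequence[Sequence[float]],
    context: Sequence[float],
    weights: Sequence[float],
    bias: float = 0.0,
) -> List[float]:
    """Equation (6) with an assumed selector: sigmoid(w . [e_k ; n_v] + b)."""
    scores = []
    for e in token_embs:
        features = list(e) + list(context)
        logit = sum(w * x for w, x in zip(weights, features)) + bias
        scores.append(sigmoid(logit))
    return scores


def soft_summary(
    token_embs: Sequence[Sequence[float]],
    scores: Sequence[float],
) -> List[float]:
    """Equation (7): score-weighted sum of token embeddings."""
    dim = len(token_embs[0])
    return [
        sum(a * e[i] for a, e in zip(scores, token_embs))
        for i in range(dim)
    ]
```

The soft summary keeps the computation differentiable with respect to the scores, which is why it can stand in for hard truncation during selector training.
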

In practice, the regularization term encourages the selector to remain well behaved over the valid token positions instead of collapsing prematurely to an overly sharp distribution. Let $m_{v}$ denote the binary attention mask over valid tokens. We define a normalized reference distribution over valid positions as

$u_{v} = \frac{m_{v}}{\sum_{j} m_{v, j}} .$

(9)

Let ${\bar{α}}_{v}$ denote the normalized selector distribution over valid tokens:

${\bar{α}}_{v, k} = \frac{α_{v, k} m_{v, k}}{\sum_{j} α_{v, j} m_{v, j}} .$

(10)

We then define the regularization term as

$L_{reg} = \mathrm{KL} ({\bar{α}}_{v} ∥ u_{v}) = \sum_{k : m_{v, k} = 1} {\bar{α}}_{v, k} \log \frac{{\bar{α}}_{v, k}}{u_{v, k}},$

(11)

where ${\bar{α}}_{v}$ denotes the normalized selector distribution over the valid tokens of node v.
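
The regularizer in Equations (9) to (11) can be checked numerically with a short sketch. The function name `kl_to_uniform` is hypothetical, and returning zero for an empty or all-zero input is a convention chosen here for illustration.

```python
import math
from typing import List


def kl_to_uniform(scores: List[float], mask: List[int]) -> float:
    """KL divergence between normalized selector scores and the uniform
    distribution over valid (mask == 1) token positions."""
    valid = [a for a, m in zip(scores, mask) if m == 1]
    total = sum(valid)
    if not valid or total <= 0.0:
        return 0.0  # convention for degenerate inputs (assumption)
    u = 1.0 / len(valid)  # Equation (9): uniform mass on each valid position
    kl = 0.0
    for a in valid:
        p = a / total  # Equation (10): normalized selector score
        if p > 0.0:
            kl += p * math.log(p / u)  # Equation (11)
    return kl
```

The value is zero when all valid tokens receive the same score and grows as the scores concentrate on fewer tokens, which matches the stated purpose of discouraging an overly sharp distribution. Padded positions do not contribute.
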

After the selector is trained and frozen, we use it during sequence construction. For each sampled subgraph $G_{v}^{(L)}$ , we allocate a per-node token quota under the fixed global budget $L_{max}$ , retain the highest-scoring tokens for each node, and concatenate the reduced node texts with separator tokens. In this way, the final PLM input remains bounded even when the sampled structural context becomes larger.
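
The sequence construction step can be sketched as below. The text does not state how the per-node quota is computed, so the even split of the budget (after reserving one separator per node) is an assumption, as are the function name `build_bounded_sequence` and the `[SEP]` separator string.

```python
from typing import List


def build_bounded_sequence(
    node_tokens: List[List[str]],
    node_scores: List[List[float]],
    budget: int,
    sep: str = "[SEP]",
) -> List[str]:
    """Keep the top-scoring tokens of each sampled node and concatenate them
    so that the final length never exceeds `budget`."""
    n = len(node_tokens)
    if n == 0:
        return []
    # Assumed quota rule: reserve one separator per node, split the rest evenly.
    quota = max((budget - n) // n, 0)
    sequence: List[str] = []
    for tokens, scores in zip(node_tokens, node_scores):
        ranked = sorted(
            range(len(tokens)), key=lambda i: scores[i], reverse=True
        )[:quota]
        # Restore the original token order among the retained positions.
        for i in sorted(ranked):
            sequence.append(tokens[i])
        sequence.append(sep)
    # Guard for the corner case where the budget is smaller than the node count.
    return sequence[:budget]
```

With this rule, adding more nodes to the sampled view shrinks each node's quota but leaves the total length at or below the budget, which is the bounded-input behavior the paragraph describes.
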

Hierarchical Encoding and Aggregation. The term hierarchical refers to the fact that STAGE operates at two levels: sequence-level encoding and graph-level aggregation. After token selection, we obtain a reduced text sequence ${\tilde{t}}_{v}$ from the enriched attribute and its sampled structural context. The PLM then encodes this reduced sequence to produce a dense node embedding $h_{v} \in R^{d}$ :

$h_{v} = Encoder ({\tilde{t}}_{v}) .$

(12)

Because the input sequence already integrates filtered random walk contexts, $h_{v}$ captures local structural information at the sequence level. Subsequently, these embeddings serve as initial features for a GNN aggregator, which propagates them over the graph topology to capture higher-order dependencies:

$z_{v} = GNN (h_{v}, A; Θ_{G}) .$

(13)

For node classification, we further apply a linear classifier on top of $z_{v}$ to obtain the prediction

${\hat{y}}_{v} = softmax (W_{c} z_{v} + b_{c}),$

(14)

and optimize the GNN stage with the supervised cross-entropy loss

$L_{gnn} = - \sum_{v \in V_{L}} y_{v}^{⊤} \log {\hat{y}}_{v},$

(15)

where $V_{L}$ denotes the set of labeled nodes, and $y_{v}$ is the one-hot ground-truth label of node v.

3.2.3. Optimization and Complexity Analysis

Algorithm 1 summarizes the training pipeline of STAGE.

Cascaded Optimization Strategy. We train STAGE in three steps instead of optimizing all components jointly. This design keeps the pipeline manageable and avoids passing gradients through the full LLM–PLM–GNN stack at once.

Step 1: Train the Token Selector. We first optimize the token selector $S_{ϕ}$ at the node level. For each target node, the selector receives token embeddings from the enriched text together with a structural conditioning vector derived from neighborhood aggregation. Because hard Top-K truncation is not differentiable, we train the selector with a soft aggregation scheme and optimize it using the loss $L_{s e l} = L_{c l s} + β L_{r e g}$ . The selector is frozen after this stage.
Step 2: Fine-Tune the PLM Encoder. After freezing the selector, we construct a walk-based subgraph for each target node and allocate a per-node token quota under the fixed global budget $L_{m a x}$ . The selector scores are then used to retain the most informative tokens for each node, and the resulting reduced node texts are concatenated into a bounded PLM input sequence. The PLM encoder is fine-tuned on these reduced inputs with the classification loss $L_{e n c}$ .
Step 3: Train the GNN Aggregator. We then freeze the PLM encoder, use it to produce node representations $h_{v}$ , and train the GNN aggregator on top of these representations with the supervised downstream loss $L_{g n n}$ . Since only one trainable module is optimized at a time, the overall training process is easier to control in memory and implementation complexity than a fully joint alternative.

Algorithm 1 Training Pipeline of the Proposed STAGE

Require: Text-attributed graph $G_{T} = (V, E, T)$ , frozen LLM $M$ , token selector $S_{ϕ}$ , PLM encoder, GNN aggregator, global token budget $L_{max}$ .

% Stage I: Offline Generative Semantic Injection
for each node $v \in V$ do
    Construct a prompt $P (t_{v})$ from raw text $t_{v}$
    Generate explanatory text $e_{v} \leftarrow M (P (t_{v}))$
    Form enriched attribute ${\hat{t}}_{v} \leftarrow [t_{v} \oplus e_{v}]$
end for
% Stage II: Structure-Aware Representation Learning
Step 1: Train the token selector
for each target node v in batch do
    Encode the enriched text of v into token embeddings $\{e_{v, k}\}$
    Construct a structural conditioning vector $n_{v}$ from multi-hop neighborhood aggregation
    Compute token relevance scores $α_{v, k} \leftarrow S_{ϕ} (e_{v, k}, n_{v})$
    Form a soft summary $s_{v} \leftarrow \sum_{k} α_{v, k} e_{v, k}$
    Update $ϕ$ using selector loss $L_{sel}$
end for
Step 2: Fine-tune the PLM encoder
for each target node v in batch do
    Sample a walk-based structural view $G_{v}^{(L)}$
    Allocate a per-node token quota under the fixed global budget $L_{max}$
    Retain the highest-scoring tokens for each node in $G_{v}^{(L)}$
    Concatenate the reduced node texts into a bounded sequence ${\tilde{t}}_{v}$
    Encode ${\tilde{t}}_{v}$ with the PLM encoder to obtain $h_{v}$
    Update the encoder using classification loss $L_{enc}$
end for
Step 3: Train the GNN aggregator
for each target node v in batch do
    Propagate PLM-based node representations through the graph: $z_{v} \leftarrow GNN (h_{v}, A; Θ_{G})$
    Update GNN parameters using downstream supervision $L_{gnn}$
end for
return Trained selector $S_{ϕ}$ , PLM encoder, and GNN aggregator

Table 1 provides an analytical comparison of representative TAG encoding strategies.

Formal Properties of Graph-Conditioned Reduction. Let $R_{v}$ denote the number of raw candidate tokens collected from the sampled structural view $G_{v}^{(L)}$ for node v, and let ${\tilde{L}}_{v}$ denote the final number of tokens retained for PLM encoding after graph-conditioned reduction. By construction, STAGE enforces the bounded-input property:

${\tilde{L}}_{v} \leq L_{m a x}, \forall v \in V .$

(16)

This property implies that the PLM encoder never receives an input sequence whose length grows without bound as the sampled structural context expands. In particular, although $R_{v}$ may increase substantially with walk depth L, the final encoded sequence is always upper-bounded by the global token budget.

A second implication is a decoupling effect between context growth and PLM-side cost. Increasing L enlarges the sampled subgraph and therefore increases graph construction and sequence assembly overhead, but it does not allow the quadratic self-attention cost of the PLM to scale directly with the full raw neighborhood text. In this sense, graph-conditioned reduction changes the dominant cost from unconstrained PLM attention over long sequences to a bounded-input encoding stage together with a larger preprocessing stage.

A Compositional View of STAGE. STAGE can be viewed as a composition of four operators over graph–text inputs. Let $A$ denote the offline semantic augmentation operator, $R$ the graph-conditioned reduction operator, $E$ the PLM encoding operator, and $P_{G}$ the graph propagation operator. Then, the overall pipeline can be written schematically as

$G_{T} \overset{A}{⟶} \hat{T} \overset{R}{⟶} \tilde{T} \overset{E}{⟶} H \overset{P_{G}}{⟶} Z,$

(17)

where $\hat{T}$ denotes the semantically enriched node text, $\tilde{T}$ the reduced structure-aware text sequences, $H$ the PLM-based node embeddings, and $Z$ the final graph-aware representations. This view highlights that STAGE is not simply a sequential engineering pipeline, but a compositional mapping in which semantic expansion, structure-aware token filtering, contextual encoding, and graph propagation play distinct roles.
Time and Space Complexity. We now summarize the main computational costs of STAGE.

Stage I: Offline semantic augmentation. Let $C_{LLM}$ denote the average generation cost per node. Since Stage I is executed once for each node and is detached from the downstream training loop, its total complexity is

$O (| V | \cdot C_{LLM}) .$

(18)
Stage II: Structure-aware representation learning. For each target node v, let $R_{v}$ denote the number of raw candidate tokens collected from the sampled subgraph. The graph-conditioned selector scores these tokens and constructs a reduced sequence of length ${\tilde{L}}_{v} \leq L_{m a x}$ . The token scoring and sequence construction costs therefore scale with the candidate context size, which may increase with walk depth. In contrast, the dominant self-attention cost of the PLM with respect to sequence length is bounded by the fixed token budget:

$O ({\tilde{L}}_{v}^{2} d) \leq O (L_{m a x}^{2} d),$

(19)

where d is the hidden dimension of the PLM. Thus, compared with an unconstrained concatenation strategy whose self-attention cost would scale as $O (R_{v}^{2} d)$ , STAGE replaces the dependence on the full raw neighborhood text with a dependence on the fixed budget $L_{m a x}$ .
GNN propagation. After PLM encoding, the graph propagation stage follows the complexity of a standard message-passing GNN. For a K-layer GNN with hidden dimension $d_{h}$ , the propagation cost is proportional to the graph size and can be written in the usual form as

$O (K | E | d_{h}) .$

(20)

Overall, STAGE does not remove all cost associated with larger structural exploration. Instead, it bounds the PLM-side sequence cost while allowing the preprocessing overhead—including graph construction, token scoring, and sequence assembly—to absorb most of the growth in structural context. This analytical view is consistent with the empirical scalability trends reported in Section 4.5.
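
The bound in Equation (19) can be illustrated with a small arithmetic sketch. The budget of 512 tokens and the raw context sizes are hypothetical values chosen for the example; the hidden dimension of 768 corresponds to RoBERTa-base. The cost model counts only the leading self-attention term and ignores constants and other layers.

```python
def attention_cost(seq_len: int, hidden_dim: int) -> int:
    """Leading-order self-attention cost, proportional to seq_len^2 * d."""
    return seq_len * seq_len * hidden_dim


def bounded_length(raw_tokens: int, budget: int) -> int:
    """Length seen by the PLM after reduction: never more than the budget."""
    return min(raw_tokens, budget)


if __name__ == "__main__":
    L_MAX = 512   # hypothetical global token budget
    D = 768       # hidden size of RoBERTa-base
    for raw in (256, 512, 1024, 2048):  # hypothetical raw context sizes R_v
        unconstrained = attention_cost(raw, D)
        bounded = attention_cost(bounded_length(raw, L_MAX), D)
        print(raw, unconstrained, bounded, unconstrained // bounded)
```

Under these assumed numbers, quadrupling the raw context from 512 to 2048 tokens multiplies the unconstrained attention term by 16, while the bounded term stays fixed; the growth is instead absorbed by the preprocessing stage, as noted above.
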

4. Results

In this section, we evaluate STAGE and answer the following research questions:

RQ1: Can STAGE outperform existing baselines across diverse text-attributed graphs?
RQ2: How do semantic injection and token retention strategies affect the performance of STAGE?
RQ3: How sensitive is STAGE to the balance between random walk depth and a fixed token budget?
RQ4: How does STAGE scale in preprocessing cost, PLM-side cost, and downstream performance as the structural context grows?

4.1. Implementation Details

Datasets and Baselines. We evaluate STAGE on seven benchmark datasets from citation, social, and e-commerce domains. Our experimental protocol follows the settings used in GraphBridge [10] and TAPE [13]. We compare STAGE with 14 baselines from three groups: traditional methods (e.g., GCN, GraphSAGE, BERT, and RoBERTa), joint text–structure methods (e.g., GLEM, ENGINE, and GraphBridge), and LLM-augmented methods (e.g., TAPE and KEA). Dataset statistics are reported in Table 2. For all seven datasets, we use the same dataset versions and train/validation/test splits as GraphBridge [10] (following TAPE [13] where applicable), without introducing any additional re-splitting. Following GraphBridge [10], Table 2 reports the number of nodes, edges, classes, and the average number of tokens per node.
Experimental Setup. All experiments were conducted on a single NVIDIA GeForce RTX 4090 GPU with 24 GB VRAM. In Stage I, we use GPT-3.5-turbo to generate the offline semantic explanations. To keep the semantic augmentation procedure consistent across datasets with different text styles, we use a unified general-text extraction prompt for all datasets. The generated outputs are post-processed into explanatory text, and malformed or empty generations are filtered before concatenation. In Stage II, RoBERTa-base is adopted as the PLM backbone and GraphSAGE as the GNN aggregator. The hyperparameters of STAGE are tuned separately for each dataset based on validation performance. We tune the reduction module, the PLM fine-tuning stage, and the GNN stage independently, and report the final test results using the configuration that achieves the best validation accuracy for each dataset. When prior work provides established search ranges, we use them as references for the tuning procedure. Unless otherwise specified, we report mean accuracy and standard deviation over ten runs with different random seeds.
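The Stage I post-processing step (filter malformed or empty generations, then concatenate) can be sketched as follows. The paper does not specify the filtering criteria or the concatenation format, so the rules below are assumptions made for illustration: a generation is treated as unusable if it is missing, has fewer than `min_words` words, or contains no alphabetic characters, and a single space is used as the separator. The helper names are hypothetical.

```python
from typing import Optional


def clean_explanation(raw: Optional[str], min_words: int = 3) -> Optional[str]:
    """Return a whitespace-normalized explanation, or None if it looks unusable."""
    if raw is None:
        return None
    text = " ".join(raw.split())  # collapse newlines and repeated spaces
    if len(text.split()) < min_words:  # empty or too short to add context
        return None
    if not any(ch.isalpha() for ch in text):  # e.g. punctuation-only output
        return None
    return text


def enrich_node_text(original: str, raw_explanation: Optional[str]) -> str:
    """Append a valid explanation to the original text; otherwise keep the original."""
    explanation = clean_explanation(raw_explanation)
    if explanation is None:
        return original
    return f"{original} {explanation}"


print(enrich_node_text("paper on GNNs", "GNN: a  neural network\nfor graphs"))
print(enrich_node_text("paper on GNNs", "   "))
```

Falling back to the original text when a generation is rejected keeps every node usable downstream, so a failed generation degrades a node to its non-enriched form rather than removing it.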

4.2. Overall Comparison (RQ1)

Table 3 reports the node classification accuracy on all seven datasets. Several patterns are worth noting.

Observation 1: Both text semantics and graph structure matter for TAG learning. The results show that models relying mainly on one source of information are often limited. On ArXiv-2023, for example, structure-oriented methods such as GCN remain well below the strongest results, suggesting that shallow initial features are not sufficient for this dataset. In contrast, text-based PLMs capture richer semantics, but they do not fully exploit graph connectivity. This gap is consistent with the nature of TAGs, where useful signals come from both node text and graph structure rather than either one alone.
Observation 2: Semantic enrichment is more useful when it is followed by structure-aware learning. Generative baselines such as TAPE enrich node attributes with additional semantic content, but the generated information is typically produced for each node independently. Joint models such as GraphBridge make stronger use of structural information, yet they are still constrained by the quality of the original text. STAGE combines these two steps in sequence: it first enriches the node text offline and then learns graph-aware representations from the enriched attributes. The comparison suggests that this separation is effective in practice.
Observation 3: STAGE achieves the strongest overall results under our evaluation protocol. STAGE obtains the highest mean accuracy on all seven benchmarks among the compared methods. The gain over the best baseline varies by dataset, with the largest margins on WikiCS (1.40 points) and OGBN-Products (1.28 points) and the smallest on Ele-Photo (0.17 points). In absolute terms, STAGE reaches 92.80% on Cora and 85.66% on ArXiv-2023, indicating that the proposed design remains effective across graphs with different scales and characteristics.
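The per-dataset margins can be recomputed directly from Table 3. The sketch below subtracts the best baseline mean from the STAGE mean for each dataset; the values are copied from the table, and the variable names are illustrative. These are differences of means only and do not account for the reported standard deviations, some of which exceed the margin itself.

```python
# Mean accuracy (%) from Table 3: STAGE and the best-performing baseline per dataset.
stage = {
    "Cora": 92.80, "WikiCS": 81.99, "CiteSeer": 86.18, "ArXiv-2023": 85.66,
    "Ele-Photo": 84.01, "OGBN-Products": 81.08, "OGBN-ArXiv": 75.47,
}
best_baseline = {
    "Cora": 92.14,           # GraphBridge
    "WikiCS": 80.59,         # GraphBridge
    "CiteSeer": 85.32,       # GraphBridge
    "ArXiv-2023": 85.23,     # KEA
    "Ele-Photo": 83.84,      # GraphBridge
    "OGBN-Products": 79.80,  # GraphBridge
    "OGBN-ArXiv": 74.89,     # GraphBridge
}

# Margin in percentage points, rounded to the precision of the table.
margins = {name: round(stage[name] - best_baseline[name], 2) for name in stage}
ranked = sorted(margins, key=margins.get, reverse=True)

for name in ranked:
    print(f"{name:14s} +{margins[name]:.2f}")
```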

4.3. Component Ablation and Diagnostic Analysis (RQ2)

We conducted a controlled component ablation study to examine the individual contributions of the two main components of STAGE: offline semantic injection and graph-conditioned token reduction. Since the benefit of token reduction is expected to become more evident when the sampled structural context is larger, we report the complete ablation on OGBN-Products under a larger walk depth. This setting creates a larger and noisier candidate token pool, making it suitable for evaluating whether graph-conditioned token selection provides additional benefit over simple retention strategies.

All variants use the same train/validation/test split, PLM backbone, GNN architecture, token budget, walk depth, and downstream training configuration. Only the corresponding component is removed or replaced. Specifically, STAGE w/o Semantic Injection removes the LLM-generated explanatory text and uses only the original node text; STAGE w/o Token Selector replaces the learned graph-conditioned selector with head truncation; and STAGE w/ Random Token Selection randomly retains tokens under the same token budget.
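The three retention strategies compared in this ablation differ only in which positions survive under the budget, which the following sketch makes explicit. Head truncation and random retention follow their descriptions above. The third function is a generic score-based top-k selection: it assumes per-token relevance scores are already available (in STAGE they would come from the learned graph-conditioned selector, which is not reproduced here) and assumes that the retained tokens keep their original order. All function names are hypothetical.

```python
import random
from typing import List, Sequence


def head_truncation(tokens: Sequence[str], budget: int) -> List[str]:
    """Keep the first `budget` tokens regardless of content."""
    return list(tokens[:budget])


def random_retention(tokens: Sequence[str], budget: int, seed: int = 0) -> List[str]:
    """Keep `budget` uniformly sampled tokens, preserving their original order."""
    if len(tokens) <= budget:
        return list(tokens)
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(tokens)), budget))
    return [tokens[i] for i in keep]


def score_based_selection(tokens: Sequence[str], scores: Sequence[float], budget: int) -> List[str]:
    """Keep the `budget` highest-scoring tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:budget]
    return [tokens[i] for i in sorted(top)]


tokens = ["target", "text", "filler", "neighbor", "noise", "keyword"]
scores = [0.9, 0.8, 0.1, 0.4, 0.05, 0.7]  # placeholder scores, not model outputs
print(head_truncation(tokens, 3))
print(random_retention(tokens, 3))
print(score_based_selection(tokens, scores, 3))
```

With head truncation, any token placed after the cutoff is lost no matter how relevant it is, which is one way to read why positional retention may degrade as more neighbor text is appended to the sequence.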

Table 4 shows that removing semantic injection leads to the lowest performance, indicating that the offline LLM-generated explanations provide useful semantic context. Replacing the learned selector with head truncation improves over the non-enriched variant, but remains lower than the full model. Random token selection performs better than head truncation in this setting, suggesting that fixed positional retention may become less reliable when more neighbor tokens are included. The full STAGE model achieves the best accuracy, improving over the no-semantic variant by 2.49 percentage points, over head truncation by 1.42 percentage points, and over random selection by 0.76 percentage points. These results support the joint contribution of semantic enrichment and graph-conditioned token reduction under larger structural contexts.

4.4. Parameter Sensitivity (RQ3)

We now study how STAGE behaves under different combinations of random walk length L and a fixed global token budget of 512. This setting reflects the trade-off between using more structural context and preserving more tokens from the central node.

Effect of graph characteristics. Figure 3 shows that the preferred walk length depends on the dataset. On Cora, performance improves as L becomes larger and reaches its best value at $L = 32$, suggesting that broader structural context is useful in this graph. On ArXiv-2023, the best results appear at smaller values such as $L \in \{0, 8\}$. A likely reason is that, once the node text has already been enriched by the LLM, allocating too much of the fixed token budget to distant neighbors can remove useful local content from the target node itself.
Interpreting the fluctuations. The performance curves are not perfectly monotonic. This is expected because the hard truncation step is discrete: when L changes, the token allocation across the target node and its neighbors can shift abruptly. As a result, some informative local tokens may be replaced with neighbor tokens at certain settings. Even so, the overall trends remain stable enough to show that the selector can work under a fixed budget across graphs with different properties.

4.5. Scalability Analysis (RQ4)

We examine how the computational cost of STAGE changes as the structural context grows under a fixed global token budget. Since the selector enforces an upper bound before PLM encoding, increasing the walk depth does not allow the PLM input length to grow freely. Instead, it enlarges the sampled structural view and the raw candidate context, while the final PLM input remains bounded by the token budget.

Resource profile under increasing structural context. Table 5 reports the scalability statistics of STAGE on OGBN-ArXiv under different walk depths. As the walk depth increases from 0 to 32, the sampled subgraph becomes larger, with the average number of nodes per graph increasing from 1.00 to 7.87, and the average raw candidate tokens per graph increasing from 336.46 to 2692.87. This confirms that larger walk depths indeed expose the model to substantially broader structural context.

At the same time, the final PLM input remains bounded. Although the raw candidate context grows roughly eightfold (from 336.46 to 2692.87 tokens), the average number of final tokens per graph increases only from 337.38 to 507.01, approaching but not exceeding the fixed token budget. This indicates that the graph-conditioned reduction stage effectively constrains the input length before PLM encoding.

The additional cost introduced by larger walk depths is mainly reflected in preprocessing. Sequence build time and token selection time increase steadily as the sampled context grows, while graph construction time rises from 27.51 s at walk depth 0 to 63.99 s at depth 24, with a slight dip to 61.42 s at depth 32. In contrast, the peak GPU memory used by the LM stage remains stable at 11.78 GB across all tested settings. The LM-side epoch time is also nearly stable up to walk depth 24 (about 1122 to 1133 s) and increases only at depth 32 (1334.78 s). These results suggest that STAGE does not eliminate all costs of larger structural exploration, but shifts the dominant overhead from PLM self-attention to the graph construction and sequence assembly stages.
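The growth factors discussed above can be recomputed from Table 5. The sketch below copies the relevant columns from the table and derives, for each walk depth, the growth of raw and final token counts relative to depth 0, the fraction of raw tokens retained, and the summed preprocessing time (graph build, sequence build, and token selection). The variable and key names are illustrative.

```python
# Columns copied from Table 5 (OGBN-ArXiv):
# (walk_steps, raw_tokens, final_tokens, graph_build_s, seq_build_s, token_sel_s)
rows = [
    (0, 336.46, 337.38, 27.51, 49.08, 15.74),
    (8, 1066.49, 486.49, 52.17, 87.49, 35.43),
    (16, 1678.03, 503.46, 59.07, 98.51, 44.73),
    (24, 2216.42, 506.47, 63.99, 121.53, 57.74),
    (32, 2692.87, 507.01, 61.42, 128.88, 67.49),
]

base_raw, base_final = rows[0][1], rows[0][2]
summary = []
for steps, raw, final, graph_s, seq_s, sel_s in rows:
    summary.append({
        "walk_steps": steps,
        "raw_growth": raw / base_raw,        # growth of candidate context vs. depth 0
        "final_growth": final / base_final,  # growth of the PLM input vs. depth 0
        "retained": min(1.0, final / raw),   # share of raw tokens that reach the PLM
        "preprocess_s": graph_s + seq_s + sel_s,
    })

for s in summary:
    print(
        f"L={s['walk_steps']:2d}  raw x{s['raw_growth']:.2f}  final x{s['final_growth']:.2f}  "
        f"retained {s['retained']:.0%}  preprocessing {s['preprocess_s']:.2f} s"
    )
```

The `retained` value is clipped at 1.0 because the table reports slightly more final than raw tokens at depth 0; a plausible reason is the addition of special tokens, but the paper does not state this.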

From the downstream performance perspective, broader structural context does not improve accuracy on OGBN-ArXiv under this bounded-input regime. The best GNN result is achieved at walk depth 0, while deeper walks yield slightly lower but broadly similar performance. This suggests that, for this dataset, allocating more of the fixed budget to distant neighbors may reduce the amount of useful local semantic content preserved for the target node.

The reduction module is trained once offline and kept fixed across the walk-depth settings; its one-time training cost is 10,642.37 s with a peak GPU memory of 5.64 GB.

5. Discussion

5.1. Discussion on LLM Choices and Prompt Design

Since STAGE uses an LLM only in the offline generation stage, the framework is not tied to gradient-based fine-tuning of a particular LLM. In our experiments, GPT-3.5-turbo is used to generate explanatory text for all datasets under a unified extraction prompt. This design keeps the semantic augmentation stage detached from downstream PLM and GNN training, and the generated explanations are treated as additional node attributes rather than as direct predictions.

Nevertheless, the choice of the generative model and the prompt template may still affect the quality of the generated explanations. For example, a stronger LLM may produce more accurate definitions for specialized terms, while a weaker model may generate incomplete or noisy descriptions. Similarly, prompts that are too domain-specific may improve one dataset but reduce generality across heterogeneous TAGs. For this reason, we adopted a simple and label-free extraction prompt in the main experiments, which asks the LLM to identify technical terms and provide short descriptions without using class labels or downstream supervision.

We do not claim that STAGE is insensitive to all possible LLM backbones or prompt formulations. Instead, the current design aims to reduce this dependency by using the LLM as an offline semantic enhancer rather than an end-to-end predictor. Conducting a more systematic evaluation of different LLM backbones, decoding settings, and prompt templates remains an important direction for future work.

5.2. Why Graph-Conditioned Token Reduction Matters

Section 4.5 shows that token reduction helps control the cost of PLM encoding as the structural context grows. Beyond efficiency, this step also matters for representation quality.

A natural alternative would be to encode each node’s enriched text into a single dense vector first (for example, with the [CLS] token) and then apply graph propagation on top of these node-level embeddings. This late-fusion design is simpler, but it compresses the enriched text before the model has a chance to decide which parts of the structural context are worth preserving. In settings where node text is short, technical, and semantically uneven, this compression can remove useful lexical cues too early.

This issue becomes more pronounced in graphs with heterophily or weak graph signal, where neighboring text may be topically misaligned with the target node [38,39,40]. A related phenomenon has also been observed in long-context NLP, where irrelevant context can distract attention-based models [41].

STAGE instead applies token selection before PLM encoding. The selector operates under graph context, so the retained tokens are not chosen only by local redundancy within a sequence. They are chosen with respect to the structural neighborhood from which the sequence was constructed. This gives the encoder access to a more targeted input: it can preserve informative local content together with structurally relevant neighbor tokens, while still staying within a fixed token budget.

In this sense, graph-conditioned token reduction is not only a way to save memory. It is also part of how STAGE combines semantic enrichment with structure-aware learning. The offline generation stage adds missing context to node text, and the selector determines which parts of that enriched context should remain available when structural information is introduced.

5.3. Failure Case Analysis and Limitations

Although STAGE performs well on the evaluated benchmarks, it still has several limitations.

First, the quality of Stage I depends on the generated explanations. If the LLM provides incomplete, ambiguous, or misleading descriptions for key terms, the enrichment stage may introduce noise rather than useful context. This issue is more likely in highly specialized domains where the relevant terminology is uncommon or rapidly evolving.

Second, the token budget in Stage II creates an explicit trade-off. A larger structural context allows the model to consider more neighbors, but it also leaves fewer tokens available for the target node under a fixed budget. As shown in Section 4.4, this trade-off does not have the same optimum on every dataset. The best setting depends on how much signal comes from local text versus broader graph context.

Third, STAGE is decoupled by design. This makes the training process easier to manage, but it also means that the offline semantic generation stage is not updated using downstream feedback. In some cases, a tighter interaction between generation and encoding could produce better task-specific explanations than a frozen offline stage.

These identified limitations are consistent with broader observations on hallucination in language generation and graph-aware LLM reasoning [42,43]. One possible extension is to ground the generation stage with retrieval-augmented generation [44], so that the offline explanations are supported by external evidence rather than the parametric knowledge of the LLM alone.

Another limitation is that the current framework is evaluated on static TAG benchmarks. In dynamic or time-varying TAGs, node text, edge text, and graph topology may evolve over time, which would require incremental semantic enrichment and token selection rather than one-time offline generation. Extending STAGE to dynamic TAGs is therefore an important direction for future work.

In addition, this study focuses on semi-supervised node classification, which is the standard setting for the evaluated TAG benchmarks. Whether the enriched representations produced by STAGE can generalize to auxiliary tasks such as link prediction or graph classification remains to be further investigated.

6. Conclusions

This paper presents STAGE, a decoupled framework for representation learning on text-attributed graphs. The framework uses a frozen LLM offline to generate explanatory text that supplements sparse node attributes, and then learns structure-aware representations from the enriched text with a graph-conditioned token reduction module. This design makes it possible to introduce additional semantic context without letting the PLM input length grow freely with the structural neighborhood.

Across seven benchmark datasets, STAGE shows consistent improvements over strong baselines under our evaluation protocol. The results suggest that separating semantic enrichment from graph-based propagation is a practical way to balance representation quality and computational cost in TAG learning.

There are several directions for future work. One is to study whether tighter interaction between the semantic augmentation stage and the downstream encoder can further improve performance. Another is to examine how stable the framework remains under stronger domain shifts or more challenging low-resource settings.

Author Contributions

Conceptualization, S.H., S.X. and D.-H.W.; methodology, S.H.; software, S.H.; validation, S.H.; formal analysis, S.H.; investigation, S.H.; data curation, S.H.; writing—original draft preparation, S.H.; writing—review and editing, S.X., X.-Y.Z., S.Z., L.L. and D.-H.W.; visualization, S.H.; supervision, S.X. and D.-H.W.; project administration, D.-H.W.; funding acquisition, D.-H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant No. 62576301, the Major Science and Technology Plan Project on the Future Industry Fields of Xiamen City under Grant No. 3502Z20241027, and the Open Project of the State Key Laboratory of Multimodal Artificial Intelligence Systems under Grant No. MAIS2024101.

Data Availability Statement

Publicly available datasets were used in this study. The datasets analyzed in this study are available from their original public sources. The processed data and implementation details supporting the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

Author Luoqi Liu was employed by Meitu Inc. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Wang, X.; Zhang, X.; Zeng, Z.; Wu, Q.; Zhang, J. Unsupervised spectral feature selection with l1-norm graph. Neurocomputing 2016, 200, 47–54.
2. Fan, Y.; Liu, J.; Weng, W.; Chen, B.; Chen, Y.; Wu, S. Multi-label feature selection with constraint regression and adaptive spectral graph. Knowl.-Based Syst. 2021, 212, 106621.
3. Lin, K.; Xie, X.; Weng, W.; Du, X. Global-local graph attention: Unifying global and local attention for node classification. Comput. J. 2024, 67, 2959–2969.
4. Hong, B.; Lu, P.; Chen, R.; Lin, K.; Yang, F. Health Insurance Fraud Detection via Multiview Heterogeneous Information Networks with Augmented Graph Structure Learning. IEEE Trans. Comput. Soc. Syst. 2024, 12, 2297–2317.
5. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 24–26 April 2017; pp. 1–10.
6. Hong, C.; Chen, L.; Liang, Y.; Zeng, Z. Stacked capsule graph autoencoders for geometry-aware 3D head pose estimation. Comput. Vis. Image Underst. 2021, 208, 103224.
7. Kaibiao, L.; Chen, J.; Ruicong, C.; Fan, Y.; Yang, Z.; Min, L.; Ping, L. Adaptive neighbor graph aggregated graph attention network for heterogeneous graph embedding. ACM Trans. Knowl. Discov. Data 2023, 18, 3616377.
8. Ma, Y.; Lou, H.; Yan, M.; Sun, F.; Li, G. Spatio-temporal fusion graph convolutional network for traffic flow forecasting. Inf. Fusion 2024, 104, 102196.
9. Zhang, D.C.; Yang, M.; Ying, R.; Lauw, H.W. Text-attributed graph representation learning: Methods, applications, and challenges. In Proceedings of the ACM Web Conference, Singapore, 13–17 May 2024; pp. 1298–1301.
10. Wang, Y.; Zhu, Y.; Zhang, W.; Zhuang, Y.; Li, Y.; Tang, S. Bridging local details and global context in text-attributed graphs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 14830–14841.
11. Chen, W.; Liu, W.; Zheng, J.; Zhang, X. Leveraging large language model as news sentiment predictor in stock markets: A knowledge-enhanced strategy. Discov. Comput. 2025, 28, 74.
12. Chen, W.; Hussain, W.; Chen, J. GLMTopic: A hybrid Chinese topic model leveraging large language models. Comput. Mater. Contin. 2025, 85, 1559–1583.
13. He, X.; Bresson, X.; Laurent, T.; Perold, A.; LeCun, Y.; Hooi, B. Harnessing explanations: LLM-to-LM interpreter for enhanced text-attributed graph representation learning. arXiv 2023, arXiv:2305.19523.
14. Hamilton, W.; Ying, Z.; Leskovec, J. Inductive Representation Learning on Large Graphs. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 1024–1034.
15. Chen, Z.; Mao, H.; Li, H.; Jin, W.; Wen, H.; Wei, X.; Wang, S.; Yin, D.; Fan, W.; Liu, H.; et al. Exploring the potential of large language models (LLMs) in learning on graphs. ACM SIGKDD Explor. Newsl. 2024, 25, 42–61.
16. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
17. Zhao, J.; Qu, M.; Li, C.; Yan, H.; Qian, L.; Li, P.; Zhou, J.; Tang, J. Learning on large-scale text-attributed graphs via variational inference. In Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
18. Duan, K.; Liu, Q.; Chua, T.S.; Yan, S.; Ooi, W.T.; Xie, Q.; He, J. Simteg: A frustratingly simple approach improves textual graph learning. arXiv 2023, arXiv:2308.02565.
19. Jin, B.; Han, W.; Pan, Y.; Jiang, Y.; Ji, H.; Han, J. Patton: Language model pretraining on text-rich networks. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; pp. 8320–8339.
20. Zhao, J.; Zhuo, L.; Shen, Y.; Qu, M.; Liu, K.; Bronstein, M.; Zhu, Z.; Tang, J. Graphtext: Graph reasoning in text space. arXiv 2023, arXiv:2310.01089.
21. Tang, J.; Ding, Y.; Zhao, W.X.; Gong, Y.; Tian, Q.; Nie, J.Y.; Wen, J.R. GraphGPT: Graph instruction tuning for large language models. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; pp. 491–500.
22. Fatemi, B.; Halcrow, J.; Perozzi, B. Talk like a graph: Encoding graphs for large language models. In Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024; pp. 1–12.
23. Zhu, Y.; Wang, Y.; Shi, H.; Tang, S. Efficient tuning and inference for large language models on textual graphs. In Proceedings of the 33rd International Joint Conference on Artificial Intelligence, Jeju, Republic of Korea, 3–9 August 2024; pp. 5734–5742.
24. Kermani, A.; Perez-Rosas, V.; Metsis, V. A Systematic Evaluation of LLM Strategies for Mental Health Text Analysis: Fine-tuning vs. Prompt Engineering vs. RAG. arXiv 2025, arXiv:2503.24307.
25. Shafee, S.; Bessani, A.; Ferreira, P.M. False Alarms, Real Damage: Adversarial Attacks Using LLM-based Models on Text-based Cyber Threat Intelligence Systems. arXiv 2025, arXiv:2507.06252.
26. Alnabi, D.L.A. Fake and Real Tweet Classification Using a Pre-Trained GPT-3 Approach. Adv. Eng. Intell. Syst. 2025, 4, 91–103.
27. Nurpatsha, S. A New Analysis of Web Customer Service Text Classification of Alexa Virtual Assistant Commands Using a Deep Learning Model. J. Artif. Intell. Syst. Model. 2025, 3, 76–90.
28. Gu, J.; Jiang, X.; Shi, Z.; Tan, H.; Zhai, X.; Xu, C.; Li, W.; Shen, Y.; Ma, S.; Liu, H.; et al. A Survey on LLM-as-a-Judge. arXiv 2024, arXiv:2411.15594.
29. Weng, W.; Hou, F.; Gong, S.; Chen, F.; Lin, D. Attribute graph clustering via transformer and graph attention autoencoder. Intell. Data Anal. 2025, 29, 306–319.
30. Qiao, J.; Guo, X.; Jin, J.; Wang, D.; Li, K.; Gao, W.; Cui, F.; Zhang, Z.; Shi, H.; Wei, L. Taco-DDI: Accurate prediction of drug-drug interaction events using graph transformer-based architecture and dynamic co-attention matrices. Neural Netw. 2025, 189, 107655.
31. Liang, J.; Luo, Y.; Lin, H.; Lin, Y.; Guo, J.M. Structure-aware transformer for enhanced low-resolution human pose estimation. Vis. Comput. 2026, 42, 86.
32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
33. Zaheer, M.; Guruganesh, G.; Dubey, K.A.; Ainslie, J.; Alberti, C.; Ontanon, S.; Pham, P.; Ravula, A.; Wang, Q.; Yang, L.; et al. Big bird: Transformers for longer sequences. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; pp. 17283–17297.
34. Goyal, S.; Choudhary, A.R.; Raje, S.; Chakaravarthy, V.; Sabharwal, Y.; Ashari, A. PoWER-BERT: Accelerating BERT inference via progressive word-vector elimination. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 12–18 July 2020; pp. 3690–3699.
35. Bolya, D.; Fu, C.Y.; Dai, X.; Zhang, P.; Hoffmann, C.; Hoffman, J. Token merging: Your ViT but faster. In Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023; pp. 1–12.
36. Kim, S.; Shen, S.; Thorsley, D.; Gholami, A.; Hassner, T.; Keutzer, K. Learned token pruning for transformers. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 784–794.
37. Tay, Y.; Dehghani, M.; Bahri, D.; Metzler, D. Efficient transformers: A survey. ACM Comput. Surv. 2022, 55, 109.
38. Zhu, J.; Yan, Y.; Zhao, L.; Heimann, M.; Leman, A.; Koutra, D. Beyond homophily in graph neural networks: Current limitations and effective designs. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; pp. 7793–7804.
39. Ma, Y.; Liu, X.; Shah, N.; Tang, J. Is homophily a necessity for graph neural networks? In Proceedings of the 10th International Conference on Learning Representations, Virtual, 25–29 April 2022; pp. 1–12.
40. Hou, Y.; Zhang, J.; Cheng, J.; Ma, K.; Ma, R.T.; Chen, H.; Yang, M.C. Measuring and improving the use of graph information in graph neural networks. In Proceedings of the 8th International Conference on Learning Representations, Virtual, 26 April–1 May 2020; pp. 1–11.
41. Shi, F.; Chen, X.; Misra, K.; Scales, N.; Dohan, D.; Chi, E.H.; Schärli, N.; Zhou, D. Large language models can be easily distracted by irrelevant context. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 31210–31227.
42. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.; Madotto, A.; Fung, P. Survey of hallucination in natural language generation. ACM Comput. Surv. 2023, 55, 1–38.
43. Merrer, E.L.; Trédan, G. LLMs hallucinate graphs too: A structural perspective. In Proceedings of the 13th International Conference on Complex Networks and Their Applications, Istanbul, Turkey, 10–12 December 2024; pp. 233–245.
44. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; pp. 9459–9474.

Figure 1. Comparison of representative training paradigms for text-attributed graphs and the design motivation of STAGE.

Figure 2. Overall framework of STAGE. In Stage I, a frozen LLM generates explanatory text for each node to enrich sparse raw text. In Stage II, STAGE samples structural context, applies graph-conditioned token reduction under a fixed global budget, encodes the retained tokens with a PLM, and finally performs graph-based propagation with a GNN for downstream prediction.

Figure 3. Parameter sensitivity of walk steps under a fixed global token budget.

Table 1. Analytical comparison of representative TAG encoding strategies.

Strategy	Structural Text Usage	PLM-Side Sequence Cost	Main Limitation
Node-text-only PLM	Target node text only	$O(|t_{v}|^{2} d)$	Does not encode structural context in the PLM input
Unconstrained joint context encoding	Raw target and neighbor text	$O(R_{v}^{2} d)$	Cost grows rapidly with sampled context length
Topology-agnostic truncation	Truncated context sequence	$O(L_{\max}^{2} d)$	Retention ignores graph-conditioned token relevance
STAGE	Graph-conditioned reduced context	$O(L_{\max}^{2} d)$	Requires preprocessing for sampling and token scoring

Table 2. Statistics of the seven datasets used in our experiments.

Dataset	# Nodes	# Edges	# Avg. Tokens	# Classes
Cora	2708	5429	194	7
WikiCS	11,701	215,863	545	10
CiteSeer	3186	4277	196	6
ArXiv-2023	46,198	78,543	253	40
Ele-Photo	48,362	500,928	185	12
OGBN-Products (subset)	54,025	74,420	163	47
OGBN-ArXiv	169,343	1,166,243	231	40

# denotes the number of the corresponding item.

Table 3. Node classification accuracy (%) comparisons. The best, second-best, and third-best results are highlighted in bold and red, underlined and blue, and italicized and green, respectively. “–” indicates out-of-memory (OOM) or incompatibility. Traditional methods include GNNs and PLMs; joint methods model structure and text together; LLM-Aug. methods utilize generative augmentation.

Category	Method	Cora	WikiCS	CiteSeer	ArXiv-2023	Ele-Photo	OGBN-Prod.	OGBN-ArXiv
Traditional (GNN-based)	MLP	76.12 ± 1.51	68.11 ± 0.76	70.28 ± 1.13	65.41 ± 0.16	62.21 ± 0.17	58.11 ± 0.23	62.57 ± 0.11
	GCN	88.12 ± 1.13	76.82 ± 0.62	71.98 ± 1.32	66.99 ± 0.19	80.11 ± 0.09	69.84 ± 0.52	70.78 ± 0.10
	GraphSAGE	87.60 ± 1.40	76.65 ± 0.84	72.44 ± 1.11	68.76 ± 0.51	79.79 ± 0.23	70.64 ± 0.20	71.72 ± 0.21
	GAT	85.13 ± 0.95	77.04 ± 0.55	72.73 ± 1.18	67.61 ± 0.24	80.38 ± 0.37	69.70 ± 0.25	70.85 ± 0.17
	NodeFormer	88.48 ± 0.33	75.47 ± 0.46	75.74 ± 0.54	67.44 ± 0.42	77.30 ± 0.06	67.26 ± 0.71	69.60 ± 0.08
Traditional (PLM-based)	BERT	79.70 ± 1.70	78.13 ± 0.63	71.92 ± 1.07	77.15 ± 0.09	68.79 ± 0.11	76.23 ± 0.19	72.75 ± 0.09
	RoBERTa-base	78.49 ± 1.36	76.91 ± 0.69	71.66 ± 1.18	77.33 ± 0.16	69.12 ± 0.15	76.01 ± 0.14	72.51 ± 0.03
	RoBERTa-large	79.79 ± 1.31	77.79 ± 0.89	72.26 ± 1.80	77.70 ± 0.35	71.22 ± 0.09	76.29 ± 0.27	73.20 ± 0.13
Joint Structure–Text	GLEM	87.61 ± 0.19	78.11 ± 0.61	77.51 ± 0.63	79.18 ± 0.21	81.47 ± 0.52	76.15 ± 0.32	74.46 ± 0.27
	SimTeG	86.85 ± 1.81	79.77 ± 0.68	78.69 ± 1.12	79.31 ± 0.49	81.61 ± 0.18	76.46 ± 0.55	74.31 ± 0.14
	ENGINE	87.56 ± 1.48	77.97 ± 0.94	76.79 ± 1.38	78.34 ± 0.15	80.50 ± 0.33	77.80 ± 1.20	73.59 ± 0.14
	GraphBridge	92.14 ± 1.03	80.59 ± 0.47	85.32 ± 1.39	84.07 ± 0.34	83.84 ± 0.07	79.80 ± 0.19	74.89 ± 0.23
LLM-Augmented	TAPE	87.82 ± 0.91	–	–	80.11 ± 0.20	–	79.46 ± 0.11	74.66 ± 0.07
	KEA	90.44 ± 1.62	80.48 ± 0.31	75.55 ± 1.24	85.23 ± 0.47	78.27 ± 0.21	76.99 ± 0.37	73.40 ± 0.19
Ours	STAGE	92.80 ± 1.65	81.99 ± 0.33	86.18 ± 1.36	85.66 ± 0.48	84.01 ± 0.22	81.08 ± 0.11	75.47 ± 0.16

Bold: best; Underlined with blue: second-best; Italicized with green: third-best. Gray background highlights our proposed method.

Table 4. Complete component ablation on OGBN-Products under a larger walk depth. Accuracy (%) is reported as mean ± standard deviation over ten runs.

Variant	Main Difference	Accuracy (%)
STAGE w/o Semantic Injection	Original text only	78.59 ± 0.22
STAGE w/o Token Selector	Head truncation	79.66 ± 0.17
STAGE w/ Random Token Selection	Random retention	80.32 ± 0.09
Full STAGE	Graph-conditioned selector	81.08 ± 0.11

Bold value indicates the best performance.
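
As a reading aid for Table 4, the sketch below recomputes how far each ablated variant falls below Full STAGE. The accuracies are transcribed from the table; the names `FULL_STAGE_ACC`, `ABLATION_ACC`, and `accuracy_drop` are illustrative and do not come from the authors' code. The gaps are differences of reported means only and do not account for the standard deviations.

```python
# Mean accuracies (%) transcribed from Table 4 (OGBN-Products).
# All identifiers here are illustrative, not taken from the paper's code.
FULL_STAGE_ACC = 81.08

ABLATION_ACC = {
    "w/o Semantic Injection (original text only)": 78.59,
    "w/o Token Selector (head truncation)": 79.66,
    "w/ Random Token Selection": 80.32,
}


def accuracy_drop(variant_acc: float, full_acc: float = FULL_STAGE_ACC) -> float:
    """Return the gap in percentage points between Full STAGE and a variant."""
    return round(full_acc - variant_acc, 2)


if __name__ == "__main__":
    # Print variants from the largest gap to the smallest.
    ranked = sorted(ABLATION_ACC.items(), key=lambda kv: accuracy_drop(kv[1]), reverse=True)
    for name, acc in ranked:
        print(f"{name}: -{accuracy_drop(acc):.2f} points")
```

Under this reading, removing semantic injection costs the most, and random token retention is closer to the full model than head truncation is.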

Table 5. Scalability statistics of STAGE on OGBN-ArXiv under different walk depths. The reduction stage is trained once offline and shared across all walk-depth settings.

Walk Steps	Avg. Nodes	Avg. Raw Tokens	Avg. Final Tokens	Graph Build (s)	Seq. Build (s)	Token Sel. (s)	LM Epoch (s)	LM Mem. (GB)	GNN Acc. (%)
0	1.00	336.46	337.38	27.51	49.08	15.74	1121.81	11.78	76.07 ± 0.19
8	3.13	1066.49	486.49	52.17	87.49	35.43	1132.67	11.78	75.26 ± 0.19
16	4.91	1678.03	503.46	59.07	98.51	44.73	1128.82	11.78	75.09 ± 0.15
24	6.48	2216.42	506.47	63.99	121.53	57.74	1131.99	11.78	75.32 ± 0.14
32	7.87	2692.87	507.01	61.42	128.88	67.49	1334.78	11.78	75.16 ± 0.17
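
To make the bounded-input behaviour in Table 5 easier to see, the following sketch derives two quantities from the reported values: the fraction of raw tokens that survive reduction, and the summed preprocessing time per walk depth. Both derived quantities, as well as the names `WalkRow`, `retention_ratio`, and `preprocessing_seconds`, are our own illustrative choices rather than metrics defined by the authors.

```python
from typing import NamedTuple


class WalkRow(NamedTuple):
    """One row of Table 5 (OGBN-ArXiv); field names are illustrative."""
    walk_steps: int
    raw_tokens: float      # Avg. Raw Tokens
    final_tokens: float    # Avg. Final Tokens
    graph_build_s: float   # Graph Build (s)
    seq_build_s: float     # Seq. Build (s)
    token_sel_s: float     # Token Sel. (s)
    lm_epoch_s: float      # LM Epoch (s)


# Values transcribed from Table 5.
ROWS = [
    WalkRow(0, 336.46, 337.38, 27.51, 49.08, 15.74, 1121.81),
    WalkRow(8, 1066.49, 486.49, 52.17, 87.49, 35.43, 1132.67),
    WalkRow(16, 1678.03, 503.46, 59.07, 98.51, 44.73, 1128.82),
    WalkRow(24, 2216.42, 506.47, 63.99, 121.53, 57.74, 1131.99),
    WalkRow(32, 2692.87, 507.01, 61.42, 128.88, 67.49, 1334.78),
]


def retention_ratio(row: WalkRow) -> float:
    """Fraction of the raw token count that remains after reduction."""
    return row.final_tokens / row.raw_tokens


def preprocessing_seconds(row: WalkRow) -> float:
    """Sum of graph build, sequence build, and token selection time."""
    return row.graph_build_s + row.seq_build_s + row.token_sel_s


if __name__ == "__main__":
    for row in ROWS:
        print(
            f"steps={row.walk_steps:>2}  "
            f"retention={retention_ratio(row):.3f}  "
            f"preprocessing={preprocessing_seconds(row):.2f}s  "
            f"lm_epoch={row.lm_epoch_s:.2f}s"
        )
```

Read this way, the average final length levels off near 507 tokens while the average raw length grows to roughly 2693, so the retained fraction shrinks steadily as the walk deepens and the reported LM memory stays at 11.78 GB.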


Share and Cite

MDPI and ACS Style

Huang, S.; Xiao, S.; Zhang, X.-Y.; Zhu, S.; Liu, L.; Wang, D.-H. STAGE: LLM-Driven Semantic and Topological Augmented Graph Embedding for Text-Attributed Graphs. Mathematics 2026, 14, 1568. https://doi.org/10.3390/math14091568

