STAGE: LLM-Driven Semantic and Topological Augmented Graph Embedding for Text-Attributed Graphs
Abstract
1. Introduction
- We present STAGE, a decoupled framework that combines offline LLM-based semantic enrichment with GNN-based structural propagation for text-attributed graphs.
- We introduce a graph-conditioned token reduction mechanism that performs token selection before deep encoding. Unlike topology-agnostic pruning, it uses structural context to retain informative semantic anchors while keeping the PLM input length within a fixed computational budget.
- We evaluate STAGE on seven benchmark datasets and show that it outperforms strong baselines under our evaluation protocol, while maintaining favorable efficiency on graphs with different characteristics.
2. Related Work
2.1. TAG Representation Learning
2.2. LLM-Based Semantic Augmentation for Graph Learning
2.3. Efficient Transformers and Token Reduction
3. Materials and Methods
3.1. Preliminaries
- Text-Attributed Graphs. A text-attributed graph is denoted as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the node set and $\mathcal{E}$ is the edge set. Each node $v \in \mathcal{V}$ is associated with a raw text sequence $t_v$ and a corresponding discrete label $y_v$. In this work, we focus on semi-supervised node classification, which aims to predict the labels of unlabeled nodes given the graph topology, textual attributes, and a small subset of labeled nodes $\mathcal{V}_L \subset \mathcal{V}$.
- GNN-Based Paradigm. Graph Neural Networks (GNNs) learn node representations by recursively aggregating information from topological neighbors. Formally, the hidden representation of node v at the l-th layer is updated as $$h_v^{(l)} = \mathrm{UPD}^{(l)}\big(h_v^{(l-1)}, \mathrm{AGG}^{(l)}(\{h_u^{(l-1)} : u \in \mathcal{N}(v)\})\big),$$ where $\mathcal{N}(v)$ denotes the set of neighbors of v, and $\mathrm{AGG}^{(l)}$ and $\mathrm{UPD}^{(l)}$ represent the aggregation and update functions, respectively. The initial feature $h_v^{(0)}$ is typically derived from the textual attribute $t_v$.
- Pretraining of PLMs. Pretrained Language Models (PLMs), such as BERT and RoBERTa, generate contextualized text representations via self-attention mechanisms. Given a raw text $t_v$, it is first tokenized into a sequence $(x_1, \dots, x_n)$. A PLM encodes this sequence into contextualized hidden states: $$H_v = \mathrm{PLM}(x_1, \dots, x_n) \in \mathbb{R}^{n \times d}.$$ In standard sequence classification tasks, the embedding of the special [CLS] token is commonly extracted as the global semantic representation of the entire text.
- LLMs with Prompting. Unlike PLMs that are typically fine-tuned to adapt to downstream tasks, Large Language Models (LLMs) are often utilized in a frozen state via prompting. Let $\mathcal{M}$ denote a frozen LLM. A transformation function $\pi(\cdot)$ wraps the raw text $t_v$ into a natural language prompt. The model then generates output tokens $o = (o_1, \dots, o_m)$ autoregressively according to $$p_{\mathcal{M}}(o \mid \pi(t_v)) = \prod_{i=1}^{m} p_{\mathcal{M}}(o_i \mid o_{<i}, \pi(t_v)).$$ This formulation highlights the role of prompting in extracting or generating semantic information without updating the parameters of the LLM.
- Taxonomy of LLM Integration. Existing work uses LLMs in graph learning in different ways [15,20,21,22]. One distinction is whether the language model is directly tunable or only accessible through prompting. Another is whether the model is used to predict labels directly or to enrich node attributes before downstream learning. STAGE belongs to the latter setting. It uses a prompted, frozen LLM to generate additional semantic context, while a trainable PLM and a GNN are responsible for structure-aware representation learning.
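To make the aggregate-and-update recursion from the GNN paragraph concrete, the following is a minimal sketch of one message-passing layer in plain Python. The mean aggregator, the parameter-free update rule, and the toy graph are illustrative assumptions; practical GNN layers use learned weight matrices, and nothing here is taken from the STAGE implementation.

```python
from typing import Dict, List

Vector = List[float]


def mean_aggregate(neighbor_states: List[Vector], dim: int) -> Vector:
    """AGG (assumed): element-wise mean of neighbor states, zeros if isolated."""
    if not neighbor_states:
        return [0.0] * dim
    return [sum(column) / len(neighbor_states) for column in zip(*neighbor_states)]


def update(self_state: Vector, aggregated: Vector) -> Vector:
    """UPD (assumed): average own state with the message, then apply ReLU."""
    return [max(0.0, 0.5 * (s + a)) for s, a in zip(self_state, aggregated)]


def message_passing_layer(h: Dict[int, Vector], adj: Dict[int, List[int]]) -> Dict[int, Vector]:
    """Compute h^(l) for every node from h^(l-1) and the adjacency lists."""
    dim = len(next(iter(h.values())))
    return {
        v: update(h[v], mean_aggregate([h[u] for u in adj.get(v, [])], dim))
        for v in h
    }


# Toy path graph 0 - 1 - 2 with two-dimensional initial features.
adj = {0: [1], 1: [0, 2], 2: [1]}
h0 = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
h1 = message_passing_layer(h0, adj)
print(h1)
```

Stacking several such layers lets each node see progressively larger neighborhoods, which is the role the GNN aggregator plays on top of the PLM embeddings.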
3.2. The Proposed STAGE Framework
3.2.1. Stage I: Generative Semantic Injection
- Knowledge Extraction via Prompting. For each node v, we construct a prompt based on its raw text $t_v$. The prompt asks the LLM to identify technical terms in the input text and generate a short description for each term. To keep the prompting strategy consistent across datasets with different text styles, we use a general extraction template rather than a dataset-specific domain instruction. The prompt template used in our implementation is as follows:
  "You should work like a named entity recognizer. Text: [text]. Extract the technical terms from this text and output a description for each term in the format of a Python (version 3.10) dictionary, e.g., {'XX': 'XXX', 'YY': 'YYY'}."
- Implementation Details of Stage I. Stage I uses a unified general-text prompt for all datasets. For each node, we apply a lightweight input-length control step before prompting and then submit the processed text to the frozen LLM using the same extraction template. In our OpenRouter-based API calls, we do not explicitly set temperature, max_tokens, top_p, frequency penalty, or presence penalty, so these values follow the model- and provider-side defaults. We state this explicitly so that no undocumented manual setting is implied. The same API configuration is applied to all nodes and datasets. The generated response is post-processed into plain explanatory text and concatenated with the original node attribute. Malformed and empty generations are detected by simple format checking, and the affected nodes fall back to their original text only. Since Stage I is executed offline, the generation cost is incurred once and does not enter the optimization of the PLM or GNN components.
- Semantic Enrichment. We concatenate the original node text $t_v$ with the generated explanation $e_v$ to obtain an enriched attribute: $$\tilde{t}_v = t_v \oplus e_v,$$ where ⊕ denotes concatenation. The resulting set $\{\tilde{t}_v\}_{v \in \mathcal{V}}$ is then used in the second stage. Since the LLM is only used offline, its computation does not appear in the training loop of the PLM or the GNN.
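The Stage I steps above (prompt construction, format checking, fallback, and concatenation) can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names, the character-based length cap `max_chars`, and the `"term: description"` rendering are assumptions, and the LLM call itself is left out so that the response is passed in as a string.

```python
import ast
from typing import Optional

# Template wording follows the prompt quoted in the text; straight quotes are used.
PROMPT_TEMPLATE = (
    "You should work like a named entity recognizer. Text: {text}. "
    "Extract the technical terms from this text and output a description "
    "for each term in the format of a Python (version 3.10) dictionary, "
    "e.g., {{'XX': 'XXX', 'YY': 'YYY'}}."
)


def build_prompt(text: str, max_chars: int = 2000) -> str:
    """Cap the input length (hypothetical rule) and fill the template."""
    return PROMPT_TEMPLATE.format(text=text[:max_chars])


def parse_explanation(response: str) -> Optional[str]:
    """Return plain explanatory text, or None if the output is malformed."""
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = ast.literal_eval(response[start:end + 1])
    except (ValueError, SyntaxError, TypeError):
        return None
    if not isinstance(parsed, dict):
        return None
    parts = [
        f"{term}: {desc}"
        for term, desc in parsed.items()
        if isinstance(term, str) and isinstance(desc, str)
    ]
    return "; ".join(parts) if parts else None


def enrich(text: str, response: str) -> str:
    """Concatenate text and explanation; fall back to the original text."""
    explanation = parse_explanation(response) if response else None
    return text if explanation is None else f"{text} {explanation}"


print(enrich("GCN paper", "{'GCN': 'graph convolutional network'}"))
print(enrich("GCN paper", "Sorry, I cannot help with that."))
```

Using `ast.literal_eval` rather than `eval` keeps the parsing step safe on untrusted model output, since it only accepts Python literals.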
3.2.2. Stage II: Structure-Aware Representation Learning
- Random Walk Context Sampling. To capture local structural context, we construct a walk-based subgraph for each target node v. Starting from v, we perform a random walk with restart and collect the visited nodes into a sampled structural view $\mathcal{S}_v^{(L)}$, where L denotes the walk depth. Compared with BFS-style neighborhood expansion, this strategy provides a more controllable way to explore local context without deterministically including all nodes within a fixed radius. Compared with attention-based neighbor sampling, it does not introduce an additional trainable sampling module. In practice, the sampled subgraph is later linearized into a sequence by concatenating the reduced texts of its constituent nodes under a fixed global token budget.
- Graph-Conditioned Token Selection. To enforce a fixed computational budget $B$, STAGE introduces a graph-conditioned token selector $f_\theta$. In our implementation, the selector is trained at the node level. For each target node v, let $\{x_{v,k}\}_{k=1}^{n_v}$ denote the token embeddings of its enriched text, and let $c_v$ denote a structural conditioning vector obtained from multi-hop neighborhood aggregation. The selector then produces a relevance score for each token: $$s_{v,k} = f_\theta(x_{v,k}, c_v),$$ where $s_{v,k}$ reflects the importance of token k in the presence of graph context.
- Hierarchical Encoding and Aggregation. The term hierarchical refers to the two levels of the STAGE model: sequence-level encoding and graph-level aggregation. After token selection, we obtain a reduced text sequence $\hat{t}_v$ from the enriched attribute and its sampled structural context. The PLM then encodes this reduced sequence to produce a dense node embedding $z_v$: $$z_v = \mathrm{PLM}(\hat{t}_v).$$ Because the input sequence already integrates filtered random walk contexts, $z_v$ captures local structural information at the sequence level. Subsequently, these embeddings serve as initial features for a GNN aggregator, which propagates them over the graph topology to capture higher-order dependencies: $$h_v = \mathrm{GNN}\big(z_v, \{z_u : u \in \mathcal{N}(v)\}\big).$$ For node classification, we further apply a linear classifier on top of $h_v$ to obtain the prediction $$\hat{y}_v = \mathrm{softmax}(W h_v + b)$$ and optimize the GNN stage with the supervised cross-entropy loss $$\mathcal{L}_{\mathrm{GNN}} = -\sum_{v \in \mathcal{V}_L} y_v^{\top} \log \hat{y}_v,$$ where $\mathcal{V}_L$ denotes the set of labeled nodes, and $y_v$ is the one-hot ground-truth label of node v.
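The sampling and reduction steps described above can be illustrated with a small sketch. It assumes a restart probability of 0.5, an equal per-node split of the budget, and token scores supplied as plain lists; in STAGE the scores come from the trained graph-conditioned selector, and the actual quota rule is not specified in the text, so these choices are placeholders.

```python
import random
from typing import Dict, List, Sequence


def random_walk_with_restart(adj: Dict[int, List[int]], start: int, walk_depth: int,
                             restart_prob: float = 0.5, seed: int = 0) -> List[int]:
    """Return the distinct nodes visited by a restart walk, in visiting order."""
    rng = random.Random(seed)
    visited = [start]
    current = start
    for _ in range(walk_depth):
        neighbors = adj.get(current, [])
        if not neighbors or rng.random() < restart_prob:
            current = start  # restart, or recover from a dead end
        else:
            current = rng.choice(neighbors)
        if current not in visited:
            visited.append(current)
    return visited


def select_tokens(tokens: Sequence[str], scores: Sequence[float], quota: int) -> List[str]:
    """Keep the `quota` highest-scoring tokens while preserving text order."""
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:quota]
    return [tokens[i] for i in sorted(top)]


def build_reduced_sequence(nodes: List[int], node_tokens: Dict[int, List[str]],
                           node_scores: Dict[int, List[float]], budget: int) -> List[str]:
    """Concatenate reduced node texts so the total never exceeds `budget`."""
    quota = budget // len(nodes)  # equal split: an illustrative assumption
    sequence: List[str] = []
    for v in nodes:
        sequence.extend(select_tokens(node_tokens[v], node_scores[v], quota))
    return sequence


adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
nodes = random_walk_with_restart(adj, start=0, walk_depth=8, seed=1)
tokens = {v: [f"tok{v}_{i}" for i in range(10)] for v in adj}
scores = {v: [float(i % 4) for i in range(10)] for v in adj}
reduced = build_reduced_sequence(nodes, tokens, scores, budget=7)
print(nodes, len(reduced))
```

Because each node contributes at most `budget // len(nodes)` tokens, the concatenated sequence stays within the budget regardless of how many nodes the walk collects.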
3.2.3. Optimization and Complexity Analysis
- Cascaded Optimization Strategy. We train STAGE in three steps instead of optimizing all components jointly. This design keeps the pipeline manageable and avoids passing gradients through the full LLM–PLM–GNN stack at once.
- Step 1: Train the Token Selector. We first optimize the token selector at the node level. For each target node, the selector receives token embeddings from the enriched text together with a structural conditioning vector derived from neighborhood aggregation. Because hard Top-K truncation is not differentiable, we train the selector with a soft aggregation scheme and optimize it using the loss $\mathcal{L}_{\mathrm{sel}}$. The selector is frozen after this stage.
- Step 2: Fine-Tune the PLM Encoder. After freezing the selector, we construct a walk-based subgraph for each target node and allocate a per-node token quota under the fixed global budget $B$. The selector scores are then used to retain the most informative tokens for each node, and the resulting reduced node texts are concatenated into a bounded PLM input sequence. The PLM encoder is fine-tuned on these reduced inputs with the classification loss $\mathcal{L}_{\mathrm{PLM}}$.
- Step 3: Train the GNN Aggregator. We then freeze the PLM encoder, use it to produce node representations $z_v$, and train the GNN aggregator on top of these representations with the supervised downstream loss $\mathcal{L}_{\mathrm{GNN}}$. Since only one trainable module is optimized at a time, the overall training process is easier to control in memory and implementation complexity than a fully joint alternative.
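Step 1 relies on a soft aggregation scheme because hard Top-K truncation has no useful gradient. The text does not specify the exact scheme, so the sketch below shows one common choice as an assumption: softmax-weighted pooling of token embeddings, where every token contributes in proportion to its score. The temperature parameter and function names are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over token scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_pool(token_embeddings: List[List[float]], scores: List[float],
              temperature: float = 1.0) -> List[float]:
    """Differentiable surrogate: score-weighted average of token embeddings."""
    weights = softmax(scores, temperature)
    dim = len(token_embeddings[0])
    return [sum(w * emb[j] for w, emb in zip(weights, token_embeddings))
            for j in range(dim)]


def hard_top_k(scores: List[float], k: int) -> List[int]:
    """Non-differentiable selection used at inference: indices of the k best tokens."""
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(best)


embeddings = [[1.0, 0.0], [0.0, 1.0]]
print(soft_pool(embeddings, [0.0, 0.0]))         # equal scores: plain mean
print(soft_pool(embeddings, [10.0, 0.0], 0.1))   # sharp scores: close to token 0
print(hard_top_k([0.2, 0.9, 0.5], k=2))
```

As the temperature decreases, the soft pooling concentrates on the highest-scoring tokens, which is why a selector trained this way can later be used with hard truncation once it is frozen.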
Algorithm 1. Training Pipeline of the Proposed STAGE
Require: Text-attributed graph $\mathcal{G}$, frozen LLM $\mathcal{M}$, token selector $f_\theta$, PLM encoder, GNN aggregator, global token budget $B$.
- Formal Properties of Graph-Conditioned Reduction. Let $n_v^{\mathrm{raw}}$ denote the number of raw candidate tokens collected from the sampled structural view for node v, and let $n_v^{\mathrm{final}}$ denote the final number of tokens retained for PLM encoding after graph-conditioned reduction. By construction, STAGE enforces the bounded-input property: $$n_v^{\mathrm{final}} \le B.$$ This property implies that the PLM encoder never receives an input sequence whose length grows without bound as the sampled structural context expands. In particular, although $n_v^{\mathrm{raw}}$ may increase substantially with walk depth L, the final encoded sequence is always upper-bounded by the global token budget.
- A Compositional View of STAGE. STAGE can be viewed as a composition of four operators over graph–text inputs. Let $\Phi_{\mathrm{aug}}$ denote the offline semantic augmentation operator, $\Phi_{\mathrm{red}}$ the graph-conditioned reduction operator, $\Phi_{\mathrm{PLM}}$ the PLM encoding operator, and $\Phi_{\mathrm{GNN}}$ the graph propagation operator. Then, the overall pipeline can be written schematically as $$\{t_v\} \xrightarrow{\Phi_{\mathrm{aug}}} \{\tilde{t}_v\} \xrightarrow{\Phi_{\mathrm{red}}} \{\hat{t}_v\} \xrightarrow{\Phi_{\mathrm{PLM}}} \{z_v\} \xrightarrow{\Phi_{\mathrm{GNN}}} \{h_v\},$$ where $\{\tilde{t}_v\}$ denotes the semantically enriched node text, $\{\hat{t}_v\}$ the reduced structure-aware text sequences, $\{z_v\}$ the PLM-based node embeddings, and $\{h_v\}$ the final graph-aware representations. This view highlights that STAGE is not simply a sequential engineering pipeline, but a compositional mapping in which semantic expansion, structure-aware token filtering, contextual encoding, and graph propagation play distinct roles.
- Time and Space Complexity. We now summarize the main computational costs of STAGE.
- Stage I: Offline semantic augmentation. Let $C_{\mathrm{gen}}$ denote the average generation cost per node. Since Stage I is executed once for each node and is detached from the downstream training loop, its total complexity is $$\mathcal{O}(|\mathcal{V}| \cdot C_{\mathrm{gen}}).$$
- Stage II: Structure-aware representation learning. For each target node v, let $n_v^{\mathrm{raw}}$ denote the number of raw candidate tokens collected from the sampled subgraph. The graph-conditioned selector scores these tokens and constructs a reduced sequence of length $n_v^{\mathrm{final}} \le B$. The token scoring and sequence construction costs therefore scale with the candidate context size, which may increase with walk depth. In contrast, the dominant self-attention cost of the PLM with respect to sequence length is bounded by the fixed token budget: $$\mathcal{O}\big((n_v^{\mathrm{final}})^2 d\big) \le \mathcal{O}(B^2 d),$$ where d is the hidden dimension of the PLM. Thus, compared with an unconstrained concatenation strategy whose self-attention cost would scale as $\mathcal{O}\big((n_v^{\mathrm{raw}})^2 d\big)$, STAGE replaces the dependence on the full raw neighborhood text with a dependence on the fixed budget $B$.
- GNN propagation. After PLM encoding, the graph propagation stage follows the complexity of a standard message-passing GNN. For a K-layer GNN with hidden dimension $d_g$, the propagation cost is proportional to the graph size and can be written in the usual form as $$\mathcal{O}\big(K(|\mathcal{E}|\, d_g + |\mathcal{V}|\, d_g^2)\big).$$
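The cost terms above can be turned into a small calculator to see how the fixed budget caps the PLM-side cost. The sketch below keeps only leading-order terms and drops constants and layer counts. The budget of 512 tokens and the example raw lengths are assumptions chosen for illustration; 768 is the hidden size of RoBERTa-base.

```python
def final_length(raw_tokens: int, budget: int) -> int:
    """Bounded-input property: the encoded length never exceeds the budget."""
    return min(raw_tokens, budget)


def attention_cost(seq_len: int, hidden_dim: int) -> int:
    """Leading-order self-attention cost, n^2 * d."""
    return seq_len * seq_len * hidden_dim


def gnn_cost(num_layers: int, num_nodes: int, num_edges: int, hidden_dim: int) -> int:
    """Leading-order message-passing cost, K * (|E| * d + |V| * d^2)."""
    return num_layers * (num_edges * hidden_dim + num_nodes * hidden_dim ** 2)


BUDGET = 512      # assumed global token budget
HIDDEN_DIM = 768  # RoBERTa-base hidden size

for raw in (300, 1000, 2700):  # hypothetical raw candidate lengths
    reduced = final_length(raw, BUDGET)
    ratio = attention_cost(raw, HIDDEN_DIM) / attention_cost(reduced, HIDDEN_DIM)
    print(f"raw={raw:5d} final={reduced:4d} unconstrained/bounded cost ratio={ratio:6.2f}")
```

Once the raw context exceeds the budget, the ratio grows quadratically with the raw length, while the bounded cost stays constant.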
4. Results
- RQ1: Can STAGE outperform existing baselines across diverse text-attributed graphs?
- RQ2: How do semantic injection and token retention strategies affect the performance of STAGE?
- RQ3: How sensitive is STAGE to the balance between random walk depth and a fixed token budget?
- RQ4: How does STAGE scale in preprocessing cost, PLM-side cost, and downstream performance as the structural context grows?
4.1. Implementation Details
- Datasets and Baselines. We evaluate STAGE on seven benchmark datasets from citation, social, and e-commerce domains. Our experimental protocol follows the settings used in GraphBridge [10] and TAPE [13]. We compare STAGE with 14 baselines from three groups: traditional methods (e.g., GCN, GraphSAGE, BERT, and RoBERTa), joint text–structure methods (e.g., GLEM, ENGINE, and GraphBridge), and LLM-augmented methods (e.g., TAPE and KEA). Dataset statistics are reported in Table 2. For all seven datasets, we use the same dataset versions and train/validation/test splits as GraphBridge [10] (following TAPE [13] where applicable), without introducing any additional re-splitting. Following GraphBridge [10], Table 2 reports the number of nodes, edges, classes, and the average number of tokens per node.
- Experimental Setup. All experiments were conducted on a single NVIDIA GeForce RTX 4090 GPU with 24 GB VRAM. In Stage I, we use GPT-3.5-turbo to generate the offline semantic explanations. To keep the semantic augmentation procedure consistent across datasets with different text styles, we use a unified general-text extraction prompt for all datasets. The generated outputs are post-processed into explanatory text, and malformed or empty generations are filtered before concatenation. In Stage II, RoBERTa-base is adopted as the PLM backbone and GraphSAGE as the GNN aggregator. The hyperparameters of STAGE are tuned separately for each dataset based on validation performance. We tune the reduction module, the PLM fine-tuning stage, and the GNN stage independently, and report the final test results using the configuration that achieves the best validation accuracy for each dataset. When prior work provides established search ranges, we use them as references for the tuning procedure. Unless otherwise specified, we report mean accuracy and standard deviation over ten runs with different random seeds.
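The reporting protocol (choose the configuration with the best validation accuracy, then report test mean and standard deviation over seeds) can be sketched as follows. The configuration names and accuracy values are made up for illustration, and the use of the sample standard deviation is an assumption, since the text does not state which estimator is used.

```python
import statistics
from typing import Dict, List, Tuple


def summarize(accuracies: List[float]) -> str:
    """Format accuracies from repeated runs as 'mean ± std'."""
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)  # sample standard deviation (assumed)
    return f"{mean:.2f} ± {std:.2f}"


def select_config(results: Dict[str, Dict[str, List[float]]]) -> Tuple[str, str]:
    """Pick the configuration by mean validation accuracy; report its test summary."""
    best = max(results, key=lambda name: statistics.mean(results[name]["val"]))
    return best, summarize(results[best]["test"])


# Hypothetical results for two configurations, two seeds each.
results = {
    "config_a": {"val": [80.0, 82.0], "test": [79.0, 81.0]},
    "config_b": {"val": [70.0, 71.0], "test": [90.0, 91.0]},
}
print(select_config(results))
```

Note that `config_b` has the higher test accuracy in this toy example but is not selected, because selection uses validation accuracy only.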
4.2. Overall Comparison (RQ1)
- Observation 1: Both text semantics and graph structure matter for TAG learning. The results show that models relying mainly on one source of information are often limited. On ArXiv-2023, for example, structure-oriented methods such as GCN remain well below the strongest results, suggesting that shallow initial features are not sufficient for this dataset. In contrast, text-based PLMs capture richer semantics, but they do not fully exploit graph connectivity. This gap is consistent with the nature of TAGs, where useful signals come from both node text and graph structure rather than either one alone.
- Observation 2: Semantic enrichment is more useful when it is followed by structure-aware learning. Generative baselines such as TAPE enrich node attributes with additional semantic content, but the generated information is typically produced for each node independently. Joint models such as GraphBridge make stronger use of structural information, yet they are still constrained by the quality of the original text. STAGE combines these two steps in sequence: it first enriches the node text offline and then learns graph-aware representations from the enriched attributes. The comparison suggests that this separation is effective in practice.
- Observation 3: STAGE achieves the strongest overall results under our evaluation protocol. Across the seven benchmarks, STAGE achieves the best mean accuracy among the methods compared in our experiments. The gains are visible across all seven datasets, although their magnitude varies by dataset, with the largest margins over the strongest baseline on WikiCS (81.99% vs. 80.59%) and OGBN-Products (81.08% vs. 79.80%). In absolute terms, STAGE reaches 92.80% on Cora and 85.66% on ArXiv-2023, indicating that the proposed design remains effective across graphs with different scales and characteristics.
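As a quick check on how the gains vary by dataset, the margins of STAGE over the strongest baseline can be computed from the mean accuracies in the main results table. Only GraphBridge and KEA are included below because, in that table, the best baseline on every dataset is one of these two; standard deviations are ignored, so this is a descriptive comparison rather than a significance test.

```python
# Mean accuracies (%) copied from the main results table.
DATASETS = ["Cora", "WikiCS", "CiteSeer", "ArXiv-2023",
            "Ele-Photo", "OGBN-Products", "OGBN-ArXiv"]
GRAPHBRIDGE = [92.14, 80.59, 85.32, 84.07, 83.84, 79.80, 74.89]
KEA = [90.44, 80.48, 75.55, 85.23, 78.27, 76.99, 73.40]
STAGE = [92.80, 81.99, 86.18, 85.66, 84.01, 81.08, 75.47]

# Margin of STAGE over the better of the two baselines, in accuracy points.
margins = {
    name: round(stage - max(gb, kea), 2)
    for name, gb, kea, stage in zip(DATASETS, GRAPHBRIDGE, KEA, STAGE)
}

# Datasets ranked from largest to smallest margin.
ranked = sorted(margins.items(), key=lambda item: item[1], reverse=True)
for name, margin in ranked:
    print(f"{name:14s} +{margin:.2f}")
```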
4.3. Component Ablation and Diagnostic Analysis (RQ2)
4.4. Parameter Sensitivity (RQ3)
- Effect of graph characteristics. Figure 3 shows that the preferred walk length depends on the dataset. On Cora, performance improves as L becomes larger and reaches its best value at , suggesting that broader structural context is useful in this graph. On ArXiv-2023, the best results appear at smaller values such as . A likely reason is that, once the node text has already been enriched by the LLM, allocating too much of the fixed token budget to distant neighbors can remove useful local content from the target node itself.
- Interpreting the fluctuations. The performance curves are not perfectly monotonic. This is expected because the hard truncation step is discrete: when L changes, the token allocation across the target node and its neighbors can shift abruptly. As a result, some informative local tokens may be replaced with neighbor tokens at certain settings. Even so, the overall trends remain stable enough to show that the selector can work under a fixed budget across graphs with different properties.
4.5. Scalability Analysis (RQ4)
- Resource profile under increasing structural context. Table 5 reports the scalability statistics of STAGE on OGBN-ArXiv under different walk depths. As the walk depth increases from 0 to 32, the sampled subgraph becomes larger, with the average number of nodes per graph increasing from 1.00 to 7.87, and the average raw candidate tokens per graph increasing from 336.46 to 2692.87. This confirms that larger walk depths indeed expose the model to substantially broader structural context.
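The effect of the reduction step can be read directly from the token columns of Table 5. The sketch below computes, for each walk depth, the fraction of raw candidate tokens that is retained and the implied ratio of quadratic attention cost between the raw and the reduced sequence. The ratio is a leading-order estimate based on average lengths, not a measured speedup.

```python
# Walk depth -> (avg. raw tokens, avg. final tokens), copied from Table 5.
TABLE5 = {
    0: (336.46, 337.38),
    8: (1066.49, 486.49),
    16: (1678.03, 503.46),
    24: (2216.42, 506.47),
    32: (2692.87, 507.01),
}


def retained_fraction(raw: float, final: float) -> float:
    """Share of raw candidate tokens that reaches the PLM."""
    return final / raw


def quadratic_cost_ratio(raw: float, final: float) -> float:
    """Estimated attention cost of the raw sequence relative to the reduced one."""
    return (raw / final) ** 2


for depth, (raw, final) in TABLE5.items():
    print(f"L={depth:2d} retained={retained_fraction(raw, final):.3f} "
          f"cost ratio={quadratic_cost_ratio(raw, final):.1f}")
```

The final token count grows by only about 21 tokens between walk depths 8 and 32 while the raw count more than doubles, which is consistent with the bounded-input behavior described in the method section.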
5. Discussion
5.1. Discussion on LLM Choices and Prompt Design
5.2. Why Graph-Conditioned Token Reduction Matters
5.3. Failure Case Analysis and Limitations
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Wang, X.; Zhang, X.; Zeng, Z.; Wu, Q.; Zhang, J. Unsupervised spectral feature selection with l1-norm graph. Neurocomputing 2016, 200, 47–54.
2. Fan, Y.; Liu, J.; Weng, W.; Chen, B.; Chen, Y.; Wu, S. Multi-label feature selection with constraint regression and adaptive spectral graph. Knowl.-Based Syst. 2021, 212, 106621.
3. Lin, K.; Xie, X.; Weng, W.; Du, X. Global-local graph attention: Unifying global and local attention for node classification. Comput. J. 2024, 67, 2959–2969.
4. Hong, B.; Lu, P.; Chen, R.; Lin, K.; Yang, F. Health Insurance Fraud Detection via Multiview Heterogeneous Information Networks with Augmented Graph Structure Learning. IEEE Trans. Comput. Soc. Syst. 2024, 12, 2297–2317.
5. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 24–26 April 2017; pp. 1–10.
6. Hong, C.; Chen, L.; Liang, Y.; Zeng, Z. Stacked capsule graph autoencoders for geometry-aware 3D head pose estimation. Comput. Vis. Image Underst. 2021, 208, 103224.
7. Kaibiao, L.; Chen, J.; Ruicong, C.; Fan, Y.; Yang, Z.; Min, L.; Ping, L. Adaptive neighbor graph aggregated graph attention network for heterogeneous graph embedding. ACM Trans. Knowl. Discov. Data 2023, 18, 3616377.
8. Ma, Y.; Lou, H.; Yan, M.; Sun, F.; Li, G. Spatio-temporal fusion graph convolutional network for traffic flow forecasting. Inf. Fusion 2024, 104, 102196.
9. Zhang, D.C.; Yang, M.; Ying, R.; Lauw, H.W. Text-attributed graph representation learning: Methods, applications, and challenges. In Proceedings of the ACM Web Conference, Singapore, 13–17 May 2024; pp. 1298–1301.
10. Wang, Y.; Zhu, Y.; Zhang, W.; Zhuang, Y.; Li, Y.; Tang, S. Bridging local details and global context in text-attributed graphs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 14830–14841.
11. Chen, W.; Liu, W.; Zheng, J.; Zhang, X. Leveraging large language model as news sentiment predictor in stock markets: A knowledge-enhanced strategy. Discov. Comput. 2025, 28, 74.
12. Chen, W.; Hussain, W.; Chen, J. GLMTopic: A hybrid Chinese topic model leveraging large language models. Comput. Mater. Contin. 2025, 85, 1559–1583.
13. He, X.; Bresson, X.; Laurent, T.; Perold, A.; LeCun, Y.; Hooi, B. Harnessing explanations: LLM-to-LM interpreter for enhanced text-attributed graph representation learning. arXiv 2023, arXiv:2305.19523.
14. Hamilton, W.; Ying, Z.; Leskovec, J. Inductive Representation Learning on Large Graphs. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 1024–1034.
15. Chen, Z.; Mao, H.; Li, H.; Jin, W.; Wen, H.; Wei, X.; Wang, S.; Yin, D.; Fan, W.; Liu, H.; et al. Exploring the potential of large language models (LLMs) in learning on graphs. ACM SIGKDD Explor. Newsl. 2024, 25, 42–61.
16. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
17. Zhao, J.; Qu, M.; Li, C.; Yan, H.; Qian, L.; Li, P.; Zhou, J.; Tang, J. Learning on large-scale text-attributed graphs via variational inference. In Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
18. Duan, K.; Liu, Q.; Chua, T.S.; Yan, S.; Ooi, W.T.; Xie, Q.; He, J. SimTeG: A frustratingly simple approach improves textual graph learning. arXiv 2023, arXiv:2308.02565.
19. Jin, B.; Han, W.; Pan, Y.; Jiang, Y.; Ji, H.; Han, J. Patton: Language model pretraining on text-rich networks. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; pp. 8320–8339.
20. Zhao, J.; Zhuo, L.; Shen, Y.; Qu, M.; Liu, K.; Bronstein, M.; Zhu, Z.; Tang, J. GraphText: Graph reasoning in text space. arXiv 2023, arXiv:2310.01089.
21. Tang, J.; Ding, Y.; Zhao, W.X.; Gong, Y.; Tian, Q.; Nie, J.Y.; Wen, J.R. GraphGPT: Graph instruction tuning for large language models. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; pp. 491–500.
22. Fatemi, B.; Halcrow, J.; Perozzi, B. Talk like a graph: Encoding graphs for large language models. In Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024; pp. 1–12.
23. Zhu, Y.; Wang, Y.; Shi, H.; Tang, S. Efficient tuning and inference for large language models on textual graphs. In Proceedings of the 33rd International Joint Conference on Artificial Intelligence, Jeju, Republic of Korea, 3–9 August 2024; pp. 5734–5742.
24. Kermani, A.; Perez-Rosas, V.; Metsis, V. A Systematic Evaluation of LLM Strategies for Mental Health Text Analysis: Fine-tuning vs. Prompt Engineering vs. RAG. arXiv 2025, arXiv:2503.24307.
25. Shafee, S.; Bessani, A.; Ferreira, P.M. False Alarms, Real Damage: Adversarial Attacks Using LLM-based Models on Text-based Cyber Threat Intelligence Systems. arXiv 2025, arXiv:2507.06252.
26. Alnabi, D.L.A. Fake and Real Tweet Classification Using a Pre-Trained GPT-3 Approach. Adv. Eng. Intell. Syst. 2025, 4, 91–103.
27. Nurpatsha, S. A New Analysis of Web Customer Service Text Classification of Alexa Virtual Assistant Commands Using a Deep Learning Model. J. Artif. Intell. Syst. Model. 2025, 3, 76–90.
28. Gu, J.; Jiang, X.; Shi, Z.; Tan, H.; Zhai, X.; Xu, C.; Li, W.; Shen, Y.; Ma, S.; Liu, H.; et al. A Survey on LLM-as-a-Judge. arXiv 2024, arXiv:2411.15594.
29. Weng, W.; Hou, F.; Gong, S.; Chen, F.; Lin, D. Attribute graph clustering via transformer and graph attention autoencoder. Intell. Data Anal. 2025, 29, 306–319.
30. Qiao, J.; Guo, X.; Jin, J.; Wang, D.; Li, K.; Gao, W.; Cui, F.; Zhang, Z.; Shi, H.; Wei, L. Taco-DDI: Accurate prediction of drug-drug interaction events using graph transformer-based architecture and dynamic co-attention matrices. Neural Netw. 2025, 189, 107655.
31. Liang, J.; Luo, Y.; Lin, H.; Lin, Y.; Guo, J.M. Structure-aware transformer for enhanced low-resolution human pose estimation. Vis. Comput. 2026, 42, 86.
32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
33. Zaheer, M.; Guruganesh, G.; Dubey, K.A.; Ainslie, J.; Alberti, C.; Ontanon, S.; Pham, P.; Ravula, A.; Wang, Q.; Yang, L.; et al. Big bird: Transformers for longer sequences. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; pp. 17283–17297.
34. Goyal, S.; Choudhary, A.R.; Raje, S.; Chakaravarthy, V.; Sabharwal, Y.; Ashari, A. PoWER-BERT: Accelerating BERT inference via progressive word-vector elimination. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 12–18 July 2020; pp. 3690–3699.
35. Bolya, D.; Fu, C.Y.; Dai, X.; Zhang, P.; Hoffmann, C.; Hoffman, J. Token merging: Your ViT but faster. In Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023; pp. 1–12.
36. Kim, S.; Shen, S.; Thorsley, D.; Gholami, A.; Hassner, T.; Keutzer, K. Learned token pruning for transformers. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 784–794.
37. Tay, Y.; Dehghani, M.; Bahri, D.; Metzler, D. Efficient transformers: A survey. ACM Comput. Surv. 2022, 55, 109.
38. Zhu, J.; Yan, Y.; Zhao, L.; Heimann, M.; Leman, A.; Koutra, D. Beyond homophily in graph neural networks: Current limitations and effective designs. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; pp. 7793–7804.
39. Ma, Y.; Liu, X.; Shah, N.; Tang, J. Is homophily a necessity for graph neural networks? In Proceedings of the 10th International Conference on Learning Representations, Virtual, 25–29 April 2022; pp. 1–12.
40. Hou, Y.; Zhang, J.; Cheng, J.; Ma, K.; Ma, R.T.; Chen, H.; Yang, M.C. Measuring and improving the use of graph information in graph neural networks. In Proceedings of the 8th International Conference on Learning Representations, Virtual, 26 April–1 May 2020; pp. 1–11.
41. Shi, F.; Chen, X.; Misra, K.; Scales, N.; Dohan, D.; Chi, E.H.; Schärli, N.; Zhou, D. Large language models can be easily distracted by irrelevant context. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 31210–31227.
42. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.; Madotto, A.; Fung, P. Survey of hallucination in natural language generation. ACM Comput. Surv. 2023, 55, 1–38.
43. Merrer, E.L.; Trédan, G. LLMs hallucinate graphs too: A structural perspective. In Proceedings of the 13th International Conference on Complex Networks and Their Applications, Istanbul, Turkey, 10–12 December 2024; pp. 233–245.
44. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; pp. 9459–9474.



| Strategy | Structural Text Usage | PLM-Side Sequence Cost | Main Limitation |
|---|---|---|---|
| Node-text-only PLM | Target node text only | | Does not encode structural context in the PLM input |
| Unconstrained joint context encoding | Raw target and neighbor text | | Cost grows rapidly with sampled context length |
| Topology-agnostic truncation | Truncated context sequence | | Retention ignores graph-conditioned token relevance |
| STAGE | Graph-conditioned reduced context | | Requires preprocessing for sampling and token scoring |

| Dataset | # Nodes | # Edges | # Avg. Tokens | # Classes |
|---|---|---|---|---|
| Cora | 2708 | 5429 | 194 | 7 |
| WikiCS | 11,701 | 215,863 | 545 | 10 |
| CiteSeer | 3186 | 4277 | 196 | 6 |
| ArXiv-2023 | 46,198 | 78,543 | 253 | 40 |
| Ele-Photo | 48,362 | 500,928 | 185 | 12 |
| OGBN-Products (subset) | 54,025 | 74,420 | 163 | 47 |
| OGBN-ArXiv | 169,343 | 1,166,243 | 231 | 40 |
| Category | Method | Cora | WikiCS | CiteSeer | ArXiv-2023 | Ele-Photo | OGBN-Prod. | OGBN-ArXiv |
|---|---|---|---|---|---|---|---|---|
| Traditional (GNN-based) | MLP | 76.12 ± 1.51 | 68.11 ± 0.76 | 70.28 ± 1.13 | 65.41 ± 0.16 | 62.21 ± 0.17 | 58.11 ± 0.23 | 62.57 ± 0.11 |
| | GCN | 88.12 ± 1.13 | 76.82 ± 0.62 | 71.98 ± 1.32 | 66.99 ± 0.19 | 80.11 ± 0.09 | 69.84 ± 0.52 | 70.78 ± 0.10 |
| | GraphSAGE | 87.60 ± 1.40 | 76.65 ± 0.84 | 72.44 ± 1.11 | 68.76 ± 0.51 | 79.79 ± 0.23 | 70.64 ± 0.20 | 71.72 ± 0.21 |
| | GAT | 85.13 ± 0.95 | 77.04 ± 0.55 | 72.73 ± 1.18 | 67.61 ± 0.24 | 80.38 ± 0.37 | 69.70 ± 0.25 | 70.85 ± 0.17 |
| | NodeFormer | 88.48 ± 0.33 | 75.47 ± 0.46 | 75.74 ± 0.54 | 67.44 ± 0.42 | 77.30 ± 0.06 | 67.26 ± 0.71 | 69.60 ± 0.08 |
| Traditional (PLM-based) | BERT | 79.70 ± 1.70 | 78.13 ± 0.63 | 71.92 ± 1.07 | 77.15 ± 0.09 | 68.79 ± 0.11 | 76.23 ± 0.19 | 72.75 ± 0.09 |
| | RoBERTa-base | 78.49 ± 1.36 | 76.91 ± 0.69 | 71.66 ± 1.18 | 77.33 ± 0.16 | 69.12 ± 0.15 | 76.01 ± 0.14 | 72.51 ± 0.03 |
| | RoBERTa-large | 79.79 ± 1.31 | 77.79 ± 0.89 | 72.26 ± 1.80 | 77.70 ± 0.35 | 71.22 ± 0.09 | 76.29 ± 0.27 | 73.20 ± 0.13 |
| Joint Structure–Text | GLEM | 87.61 ± 0.19 | 78.11 ± 0.61 | 77.51 ± 0.63 | 79.18 ± 0.21 | 81.47 ± 0.52 | 76.15 ± 0.32 | 74.46 ± 0.27 |
| | SimTeG | 86.85 ± 1.81 | 79.77 ± 0.68 | 78.69 ± 1.12 | 79.31 ± 0.49 | 81.61 ± 0.18 | 76.46 ± 0.55 | 74.31 ± 0.14 |
| | ENGINE | 87.56 ± 1.48 | 77.97 ± 0.94 | 76.79 ± 1.38 | 78.34 ± 0.15 | 80.50 ± 0.33 | 77.80 ± 1.20 | 73.59 ± 0.14 |
| | GraphBridge | 92.14 ± 1.03 | 80.59 ± 0.47 | 85.32 ± 1.39 | 84.07 ± 0.34 | 83.84 ± 0.07 | 79.80 ± 0.19 | 74.89 ± 0.23 |
| LLM-Augmented | TAPE | 87.82 ± 0.91 | – | – | 80.11 ± 0.20 | – | 79.46 ± 0.11 | 74.66 ± 0.07 |
| | KEA | 90.44 ± 1.62 | 80.48 ± 0.31 | 75.55 ± 1.24 | 85.23 ± 0.47 | 78.27 ± 0.21 | 76.99 ± 0.37 | 73.40 ± 0.19 |
| Ours | STAGE | 92.80 ± 1.65 | 81.99 ± 0.33 | 86.18 ± 1.36 | 85.66 ± 0.48 | 84.01 ± 0.22 | 81.08 ± 0.11 | 75.47 ± 0.16 |
| Variant | Main Difference | Accuracy (%) |
|---|---|---|
| STAGE w/o Semantic Injection | Original text only | 78.59 ± 0.22 |
| STAGE w/o Token Selector | Head truncation | 79.66 ± 0.17 |
| STAGE w/ Random Token Selection | Random retention | 80.32 ± 0.09 |
| Full STAGE | Graph-conditioned selector | 81.08 ± 0.11 |
| Walk Steps | Avg. Nodes | Avg. Raw Tokens | Avg. Final Tokens | Graph Build (s) | Seq. Build (s) | Token Sel. (s) | LM Epoch (s) | LM Mem. (GB) | GNN Acc. (%) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.00 | 336.46 | 337.38 | 27.51 | 49.08 | 15.74 | 1121.81 | 11.78 | 76.07 ± 0.19 |
| 8 | 3.13 | 1066.49 | 486.49 | 52.17 | 87.49 | 35.43 | 1132.67 | 11.78 | 75.26 ± 0.19 |
| 16 | 4.91 | 1678.03 | 503.46 | 59.07 | 98.51 | 44.73 | 1128.82 | 11.78 | 75.09 ± 0.15 |
| 24 | 6.48 | 2216.42 | 506.47 | 63.99 | 121.53 | 57.74 | 1131.99 | 11.78 | 75.32 ± 0.14 |
| 32 | 7.87 | 2692.87 | 507.01 | 61.42 | 128.88 | 67.49 | 1334.78 | 11.78 | 75.16 ± 0.17 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Huang, S.; Xiao, S.; Zhang, X.-Y.; Zhu, S.; Liu, L.; Wang, D.-H. STAGE: LLM-Driven Semantic and Topological Augmented Graph Embedding for Text-Attributed Graphs. Mathematics 2026, 14, 1568. https://doi.org/10.3390/math14091568

