1. Introduction
The rapid development of large language models (LLMs) has made retrieval-augmented generation (RAG) an important paradigm for reducing hallucinations and improving factual grounding through access to external knowledge. In this paradigm, dense retrieval serves as a critical interface between large-scale knowledge sources and downstream generators, and its effectiveness often determines the upper bound of the overall system performance [1, 2]. Although existing dense retrievers have achieved strong results on factoid retrieval and open-domain question answering, their performance remains limited on reasoning-intensive tasks that require multi-hop entity tracing, implicit evidence composition, and cross-document logical inference [3, 4].
We argue that a major bottleneck of dense retrieval in reasoning-intensive settings is the lack of suitable training supervision. Existing training triplets are usually constructed from relevance labels, heuristic negatives, or semantically similar candidates. Although these signals are often sufficient for topical matching, they are inadequate for modeling structured reasoning behavior [5, 6]. This limitation can be understood from two complementary perspectives.
First, there exists a logic gap between the training signal and the reasoning demands of downstream tasks. In many existing retrieval datasets, positive passages are associated with queries primarily through surface-level semantic relevance or lexical correspondence. However, reasoning-intensive retrieval often requires the retriever to identify evidence connected through latent relation chains, rather than merely matching isolated concepts. As a result, current supervision rarely teaches the model how to retrieve documents that are not only relevant in meaning but also consistent with an underlying multi-step reasoning path [3, 7].
Second, there exists a decision boundary gap in negative supervision. Conventional hard negative mining strategies typically rely on lexical overlap, BM25 retrieval, or embedding similarity, and therefore mainly expose the model to semantically confusing but not necessarily structurally misleading examples [6, 8, 9]. Yet, in reasoning-intensive retrieval, the most harmful distractors are often documents that appear highly relevant because they share entities, topics, or local descriptions with the query, while failing to support the critical relation chain required for correct reasoning. Without such challenging negatives during training, retrievers tend to overfit shallow semantic cues and remain vulnerable to logically inconsistent but semantically attractive candidates.
Recent studies have explored LLM-based synthetic data generation for retrieval, showing that generated queries, pseudo-documents, and weak supervision can substantially improve retriever training when annotations are limited [10, 11, 12]. However, most of these methods target semantic plausibility rather than structural correctness. They diversify queries or documents, but offer limited control over whether a generated positive preserves a valid reasoning path or whether a negative contradicts that path in a way that sharpens the decision boundary. In parallel, graph-enhanced retrieval and knowledge-aware reasoning methods have demonstrated the value of explicit relational structure, but they usually inject structure during online retrieval, graph expansion, or downstream evidence organization, often at the cost of added architectural complexity or inference overhead. For instance, HybRAG combines semantic node retrieval with structure-aware path retrieval inside the online reasoning loop [13], while KG-guided RAG frameworks such as KG2RAG expand and organize retrieved chunks with KG signals after seed retrieval [14]. This raises an important question: Can explicit structural knowledge be injected into retriever training through data construction alone, without modifying the underlying retrieval architecture or increasing online retrieval latency?
To address this question, we propose S-Gens, a structure-aware synthetic data generation framework for reasoning-intensive dense retrieval. Instead of using knowledge graphs only during retrieval or reranking, S-Gens uses relation paths in an external knowledge graph as reasoning scaffolds for offline training data synthesis. Specifically, S-Gens first extracts multi-hop relation paths and uses them to guide an LLM in generating queries and structurally consistent positive samples. It then constructs semantic-decoy hard negatives, namely, documents that remain semantically close to the query while being structurally inconsistent with the target reasoning path. To further improve data reliability, S-Gens incorporates a Siamese graph neural network (GNN)-based consistency filtering module for automatic scoring and filtering of synthetic instances. Because the framework operates entirely at the data level, it is model-agnostic, preserves the original inference-time architecture, and serves as an upstream complement to downstream graph-guided reasoning or RAG modules.
We evaluate S-Gens on five benchmark datasets: NQ, TriviaQA, WebQSP, HotpotQA, and MS MARCO. Experimental results show that S-Gens consistently improves a wide range of trainable retrievers, with the most pronounced gains observed on the reasoning-intensive benchmarks WebQSP and HotpotQA. These findings suggest that structure-aware synthetic supervision is an effective and practical way to alleviate the shortage of reasoning-oriented training signals in dense retrieval.
The main contributions of this work are summarized as follows:
- We propose S-Gens, an offline structure-aware synthetic data generation framework that uses knowledge graph relation paths to construct reasoning-oriented supervision for dense retrieval while preserving the original retrieval architecture at inference time.
- We introduce a semantic-decoy hard negative mining strategy that improves retriever robustness against semantically similar but logically inconsistent candidates.
- We develop a Siamese-GNN-based consistency filtering mechanism for filtering low-quality synthetic training instances.
- We demonstrate through extensive experiments that S-Gens is a plug-and-play and model-agnostic data augmentation framework that consistently improves diverse retrievers, especially on multi-hop reasoning tasks, and is complementary to graph-guided pipelines that use structure online.
3. Methodology
In this section, we present S-Gens, a structure-aware synthetic data generation framework designed to improve dense retrieval for reasoning-intensive tasks. The core idea is to move structural reasoning signals from the inference stage to the training data construction stage. Instead of relying solely on semantically relevant query–document pairs, S-Gens uses an external knowledge graph (KG) to provide explicit relational scaffolds for synthetic supervision. This design targets the quality of first-stage dense retriever supervision and is therefore orthogonal to graph-guided pipelines that inject structure during online retrieval or downstream reasoning.
The framework consists of three main components. First, we construct structurally grounded positive samples by extracting multi-hop relation paths from the KG and using them to guide query generation. Second, we mine semantic-decoy hard negatives, which are semantically similar to the query but structurally inconsistent with the target reasoning path. Third, we apply a Siamese graph neural network (GNN)-based consistency filtering module to score and filter synthetic instances before integrating them into retriever training. Because all structure-aware operations are performed offline during data construction, S-Gens does not require any modification to the downstream retriever architecture and adds no online inference cost. As illustrated in Figure 1, the pipeline has three stages: path-based positive synthesis, structural hard negative construction, and Siamese-GNN-based consistency filtering.
In Figure 1, the final training block should be interpreted generically as the optimization of a target downstream dense retriever rather than as a student-specific architecture. Likewise, the candidate analysis stage in the figure corresponds to the structural inconsistency scoring and semantic-decoy selection process formalized in Section 3.4 and Section 3.5.
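As a minimal sketch of the semantic-decoy idea, the snippet below keeps candidate passages that are close to the query in embedding space yet do not support the target relation path. The `Candidate` fields, the exact-match consistency test, and the `min_sim` threshold are illustrative assumptions; they are not the scoring functions formalized in Section 3.4 and Section 3.5.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import sqrt


@dataclass
class Candidate:
    doc_id: str
    embedding: list[float]      # dense passage embedding (assumed precomputed)
    relations: tuple[str, ...]  # relation sequence the passage supports (assumed given)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_semantic_decoys(query_emb, target_path, candidates, min_sim=0.7, top_n=2):
    """Return ids of passages that look similar to the query but break the target path."""
    decoys = []
    for cand in candidates:
        sim = cosine(query_emb, cand.embedding)
        # Hypothetical consistency test: the passage must support the same relation chain.
        structurally_consistent = cand.relations == tuple(target_path)
        if sim >= min_sim and not structurally_consistent:
            decoys.append((sim, cand.doc_id))
    decoys.sort(reverse=True)  # most similar (hardest) decoys first
    return [doc_id for _, doc_id in decoys[:top_n]]


query_emb = [1.0, 0.0]
target_path = ("directed_by", "born_in")
candidates = [
    Candidate("pos", [0.9, 0.1], ("directed_by", "born_in")),
    Candidate("decoy", [0.95, 0.05], ("directed_by", "died_in")),
    Candidate("easy", [0.0, 1.0], ("capital_of",)),
]
decoy_ids = select_semantic_decoys(query_emb, target_path, candidates)
print(decoy_ids)
```

In this toy example, the structurally consistent passage and the dissimilar passage are both excluded, so only the high-similarity, path-violating passage is retained as a hard negative.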
3.7. Complexity Analysis and Discussion
In this subsection, we analyze the computational characteristics of S-Gens and discuss its practical deployment properties. As S-Gens is designed as an offline data augmentation framework, its additional cost is incurred primarily during synthetic data construction rather than during online retrieval. As a result, the final retriever maintains the same inference architecture and serving complexity as the original backbone model.
3.7.1. Offline Construction Cost
The offline cost of S-Gens mainly comes from three stages: reasoning path extraction, synthetic instance generation, and consistency filtering.
For path extraction, let $N_a$ denote the number of anchor entities involved in the data construction process, and let $b$ and $L$ denote the average branching factor and the maximum path length in the knowledge graph, respectively. Under bounded depth-first search, the worst-case complexity of path discovery is approximately
$$O\left(N_a \cdot b^{L}\right).$$
In practice, however, this cost is substantially reduced by restricting the search depth, pruning low-frequency relations, and retaining only a small number of high-quality candidate paths for each anchor pair.
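The bounded depth-first search can be sketched as follows. This is an illustrative enumeration over a toy adjacency-list graph and omits the relation pruning and candidate ranking mentioned above; the function name `extract_paths` and the graph format are assumptions made for the example. Because each anchor expands at most b neighbors per hop for at most L hops, the number of visited partial paths per anchor is bounded by b + b^2 + ... + b^L.

```python
def extract_paths(graph, anchor, max_len):
    """Enumerate simple relation paths of 1..max_len hops starting at `anchor`.

    `graph` maps an entity to a list of (relation, neighbor) edges.
    Each path is a tuple of (head, relation, tail) triples.
    """
    paths = []

    def dfs(node, path, visited):
        if path:
            paths.append(tuple(path))
        if len(path) == max_len:  # depth bound L
            return
        for relation, neighbor in graph.get(node, []):
            if neighbor in visited:  # keep paths simple (no revisits)
                continue
            visited.add(neighbor)
            path.append((node, relation, neighbor))
            dfs(neighbor, path, visited)
            path.pop()
            visited.remove(neighbor)

    dfs(anchor, [], {anchor})
    return paths


toy_graph = {
    "Inception": [("directed_by", "Christopher Nolan")],
    "Christopher Nolan": [("born_in", "London"), ("directed", "Inception")],
    "London": [("capital_of", "United Kingdom")],
}
paths = extract_paths(toy_graph, "Inception", max_len=2)
for p in paths:
    print(" -> ".join(f"{h} [{r}] {t}" for h, r, t in p))
```

With `max_len=2`, the search stops at London and never expands the third hop, which is how the depth bound keeps the per-anchor cost under control.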
For synthetic query and document construction, let $N_p$ denote the number of retained reasoning paths and let $C_{\mathrm{gen}}$ denote the average generation cost per path under the language model. The total generation cost can be written as
$$O\left(N_p \cdot C_{\mathrm{gen}}\right).$$
Because this stage is performed offline and can be parallelized across paths, it does not affect retrieval-time latency.
For consistency filtering, let $N_s$ denote the number of synthetic instances and let $C_{\mathrm{GNN}}$ denote the average cost of encoding one graph pair using the Siamese GNN. The overall filtering cost is
$$O\left(N_s \cdot C_{\mathrm{GNN}}\right).$$
Similar to generation, this stage is also fully offline and can be batched efficiently.
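To illustrate why the filtering cost grows linearly with the number of instances, the sketch below scores each instance with one pass of a shared encoder over a graph pair and keeps instances above a threshold. The encoder here is a parameter-free mean-aggregation stand-in for a trained relational GNN, and the graph format, function names, and threshold value are all assumptions made for the example.

```python
from math import sqrt


def encode_graph(node_feats, edges):
    """Shared encoder: one round of mean-neighbor aggregation, then mean pooling."""
    neighbors = {n: [] for n in node_feats}
    for src, dst in edges:
        neighbors[src].append(dst)
        neighbors[dst].append(src)
    dim = len(next(iter(node_feats.values())))
    updated = []
    for node, feat in node_feats.items():
        group = [feat] + [node_feats[m] for m in neighbors[node]]
        updated.append([sum(vec[i] for vec in group) / len(group) for i in range(dim)])
    return [sum(vec[i] for vec in updated) / len(updated) for i in range(dim)]


def consistency_score(graph_a, graph_b):
    """Cosine similarity of the two graph embeddings produced by the same encoder."""
    u, v = encode_graph(*graph_a), encode_graph(*graph_b)
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_instances(instances, threshold=0.8):
    """Keep instances whose path graph and text-derived graph agree closely enough."""
    return [name for name, path_graph, text_graph in instances
            if consistency_score(path_graph, text_graph) >= threshold]


path_graph = ({"a": [1.0, 0.0], "b": [0.0, 1.0]}, [("a", "b")])
matching_text_graph = ({"a": [1.0, 0.0], "b": [0.0, 1.0]}, [("a", "b")])
drifted_text_graph = ({"x": [1.0, 0.0], "y": [1.0, 0.0]}, [("x", "y")])

kept = filter_instances([
    ("good", path_graph, matching_text_graph),
    ("bad", path_graph, drifted_text_graph),
])
print(kept)
```

Both inputs of a pair go through the same `encode_graph` function, which is the defining property of a Siamese design; in a trained model the aggregation would carry learned, relation-specific weights.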
To complement the asymptotic analysis above, we further report a compact practical cost summary under the default matched-budget setting across the five benchmarks. In this setting, we use , , up to two anchor entities per instance, at most five candidate paths per anchor pair, Qwen2.5-32B-Instruct for synthetic query generation, and a two-layer R-GCN consistency filter. On average, the pipeline retains about reasoning paths per instance and keeps approximately 180,000 filtered synthetic triplets in total for training across the five benchmarks. The offline query-generation stage takes about h, consistency filtering takes about h, and the total offline preprocessing time is about h on a machine with four RTX 3090 GPUs and one 20-core CPU. The remaining preprocessing time is mainly spent on entity linking, path extraction, and pruning. These figures should be understood as representative orders of magnitude rather than fixed constants, as the exact runtime varies with the benchmark and backbone. Importantly, all of these additional costs are incurred only during offline supervision construction, while the trained retriever preserves the same inference-time architecture and online complexity as the original backbone.
5. Conclusions and Future Work
In this work, we proposed S-Gens, a structure-aware synthetic data generation framework for enhancing dense retrieval in reasoning-intensive settings. The central motivation is that existing dense retrievers are still largely trained with supervision signals centered on shallow semantic relevance, which are often insufficient for tasks requiring multi-hop reasoning, implicit evidence composition, and relational chain preservation. To address this limitation, S-Gens shifts structural reasoning signals from online inference to offline data construction.
Specifically, the proposed framework introduces three complementary components. First, it uses multi-hop relation paths extracted from an external knowledge graph to synthesize structurally grounded positive samples, thereby reducing the mismatch between training supervision and the reasoning requirements of downstream retrieval tasks. Second, it constructs semantic-decoy hard negatives that remain semantically plausible while being structurally inconsistent with the target reasoning path, enabling the retriever to learn cleaner and more robust decision boundaries. Third, it incorporates a Siamese-GNN-based consistency filtering module to assess the structural reliability of generated instances and suppress low-quality synthetic supervision.
Extensive experiments on five benchmark datasets demonstrate that S-Gens consistently improves a diverse range of trainable retrievers, including classical dual-encoder models, distillation-based retrievers, and recent large embedding models. In particular, the gains are more pronounced on reasoning-intensive benchmarks such as WebQSP and HotpotQA, indicating that structure-aware synthetic supervision is especially effective when successful retrieval depends on latent relational structure rather than direct semantic overlap alone. Additional ablation studies further verify the importance of semantic-decoy negatives, consistency filtering, and an appropriate synthetic data ratio at the retriever level.
Overall, our findings suggest that improving dense retrieval for complex reasoning tasks does not necessarily require modifying the inference-time architecture or introducing expensive online reasoning modules. Instead, carefully designed structure-aware supervision at the data level can already provide substantial and generalizable benefits for retrieval-stage evidence acquisition. In this sense, S-Gens offers a practical and model-agnostic way to bridge the gap between semantic retrieval training and reasoning-oriented retrieval demands, and it should be viewed as a training-time complement to downstream graph-guided retrieval or RAG pipelines rather than as a substitute for them.
Despite these encouraging results, several limitations remain. First, the quality of the generated supervision still depends on the coverage and reliability of the external knowledge graph. In domains where relational structure is sparse, noisy, or incomplete, the effectiveness of path-based synthesis may be constrained. Second, although the framework is inference-efficient, the offline data construction pipeline introduces additional computational cost due to path extraction, synthetic generation, and consistency filtering. Third, the current framework mainly focuses on text retrieval and does not explicitly model multimodal evidence or interactive retrieval scenarios. Fourth, while our results show consistent gains in retrieval-stage evidence coverage and decoy rejection, the full downstream impact on end-to-end question answering or generation quality still requires dedicated reader-side evaluation.
In future work, we plan to extend S-Gens in several directions. One promising direction is to incorporate richer sources of structural knowledge, such as domain-specific ontologies or dynamically induced graphs, in order to improve coverage beyond fixed knowledge graphs. Another direction is to explore adaptive synthetic data scheduling, where the ratio and difficulty of generated instances are adjusted according to retriever training dynamics. It is also worthwhile to investigate whether structure-aware synthetic supervision can benefit related tasks such as reranking, retrieval-augmented generation, and agentic multi-step information seeking, and to evaluate these gains under fixed downstream readers or generators. Finally, we believe that combining structure-aware supervision with stronger generator models and more reliable automatic verification mechanisms may further improve the scalability and generalization of reasoning-oriented retrieval systems.
Author Contributions
Conceptualization, Z.L. and Y.X.; methodology, Y.X.; software, Y.X.; validation, Y.X. and S.C.; formal analysis, Z.L.; investigation, Y.X.; resources, S.C.; data curation, Y.X.; writing—original draft preparation, Y.X.; writing—review and editing, Z.L. and S.C.; visualization, Y.X.; supervision, Z.L. and S.C.; project administration, S.C.; funding acquisition, Z.L. and S.C. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded in part by the Ministry of Education industry–university cooperative education project grant number 231101418285337 and in part by Shanghai University under grant number 22H00324.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data presented in this study are available from the corresponding author upon reasonable request.
Acknowledgments
During the preparation of this manuscript, the author(s) used AI-assisted technologies strictly for the purposes of language polishing and English grammar correction. All scientific reasoning, experimental design, and data analysis were conducted independently by the authors. The authors have carefully reviewed and validated all outputs and take full responsibility for the final content of this publication.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| ANCE | Approximate Nearest Neighbor Negative Contrastive Learning |
| BM25 | Best Matching 25 |
| DPR | Dense Passage Retrieval |
| DR@10 | Decoy Rejection at 10 |
| FiD | Fusion-in-Decoder |
| GNN | Graph Neural Network |
| KG | Knowledge Graph |
| LLM | Large Language Model |
| LoRA | Low-Rank Adaptation |
| MDR | Multi-hop Dense Retrieval |
| MRR@10 | Mean Reciprocal Rank at 10 |
| NQ | Natural Questions |
| R@20 | Recall at 20 |
| RAG | Retrieval-Augmented Generation |
| S-Gens | Structure-Aware Synthetic Data Generation |
| WebQSP | WebQuestionsSP |
References
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.T.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 9459–9474.
- Izacard, G.; Grave, E. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 19–23 April 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 874–880.
- Xiong, W.; Li, X.L.; Iyer, S.; Du, J.; Lewis, P.; Wang, W.Y.; Mehdad, Y.; Yih, S.; Riedel, S.; Kiela, D.; et al. Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval. In Proceedings of the 9th International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021; OpenReview.net: Alameda, CA, USA, 2021; pp. 12489–12507.
- Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; Hajishirzi, H. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. In Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024; OpenReview.net: Alameda, CA, USA, 2024; pp. 9112–9141.
- Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.T. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Online, 16–20 November 2020; pp. 6769–6781.
- Xiong, L.; Xiong, C.; Li, Y.; Tang, K.; Liu, J.; Bennett, P.N.; Ahmed, J.; Overwijk, A. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. In Proceedings of the 9th International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021; OpenReview.net: Alameda, CA, USA, 2021; pp. 12357–12372.
- Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W.; Salakhutdinov, R.; Manning, C.D. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 2369–2380.
- Qu, Y.; Ding, Y.; Liu, J.; Liu, K.; Ren, R.; Zhao, W.X.; Dong, D.; Wu, H.; Wang, H. RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 5835–5847.
- Hofstätter, S.; Althammer, S.; Schröder, M.; Sertkan, M.; Hanbury, A. Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation. arXiv 2020, arXiv:2010.02666.
- Bonifacio, L.; Abonizio, H.; Fadaee, M.; Nogueira, R. InPars: Unsupervised Dataset Generation for Information Retrieval. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 11–15 July 2022; pp. 2387–2392.
- Wang, L.; Yang, N.; Wei, F. Query2doc: Query Expansion with Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 9414–9423.
- Gao, L.; Ma, X.; Lin, J.; Callan, J. Precise Zero-Shot Dense Retrieval without Relevance Labels. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 1762–1777.
- Lee, H.; Lim, S. Hybrid Retrieval-Augmented Generation: Semantic and Structural Integration for Large Language Model Reasoning. Appl. Sci. 2026, 16, 2244.
- Zhu, X.; Xie, Y.; Liu, Y.; Li, Y.; Hu, W. Knowledge Graph-Guided Retrieval Augmented Generation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Albuquerque, NM, USA, 29 April–4 May 2025; pp. 8912–8924.
- Hofstätter, S.; Lin, S.C.; Yang, J.H.; Lin, J.; Hanbury, A. Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 11–15 July 2021; pp. 113–122.
- Sun, H.; Dhingra, B.; Zaheer, M.; Mazaitis, K.; Salakhutdinov, R.; Cohen, W. Open Domain Question Answering Using Early Fusion of Knowledge Bases and Text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 4231–4242.
- Sun, H.; Bedrax-Weiss, T.; Cohen, W. PullNet: Open Domain Question Answering with Iterative Retrieval on Knowledge Bases and Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, 3–7 November 2019; pp. 2380–2390.
- Edge, D.; Trinh, H.; Cheng, N.; Bradley, J.; Chao, A.; Mody, A.; Truitt, S.; Larson, J. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv 2024, arXiv:2404.16130.
- Gutiérrez, B.J.; Shu, Y.; Gu, Y.; Yasunaga, M.; Su, Y. HippoRAG: Neurobiologically inspired long-term memory for large language models. In Proceedings of the 38th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 10–15 December 2024; pp. 59532–59569.
- Dai, Z.; Zhao, V.Y.; Ma, J.; Luan, Y.; Ni, J.; Lu, J.; Bakalov, A.; Guu, K.; Hall, K.B.; Chang, M. Promptagator: Few-shot Dense Retrieval from 8 Examples. In Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023; OpenReview.net: Alameda, CA, USA, 2023; pp. 31694–31715.
- Berant, J.; Chou, A.; Frostig, R.; Liang, P. Semantic Parsing on Freebase from Question-Answer Pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; Yarowsky, D., Baldwin, T., Korhonen, A., Livescu, K., Bethard, S., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2013; pp. 1533–1544.
- Nguyen, T.; Rosenberg, M.; Song, X.; Gao, J.; Tiwary, S.; Majumder, R.; Deng, L. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches 2016 Co-Located with the 30th Annual Conference on Neural Information Processing Systems, Barcelona, Spain, 9 December 2016; CEUR-WS.org: Bonn, Germany, 2016; Volume 1773, pp. 1–10.
- Kwiatkowski, T.; Palomaki, J.; Redfield, O.; Collins, M.; Parikh, A.; Alberti, C.; Epstein, D.; Polosukhin, I.; Devlin, J.; Lee, K.; et al. Natural Questions: A Benchmark for Question Answering Research. Trans. Assoc. Comput. Linguist. 2019, 7, 452–466.
- Joshi, M.; Choi, E.; Weld, D.; Zettlemoyer, L. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1601–1611.
- Robertson, S.E.; Jones, K.S. Relevance weighting of search terms. J. Am. Soc. Inf. Sci. Technol. 1976, 27, 129–146.
- Chen, J.; Xiao, S.; Zhang, P.; Luo, K.; Lian, D.; Liu, Z. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. In Proceedings of the Findings of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024; pp. 2318–2335.
- Wang, L.; Yang, N.; Huang, X.; Yang, L.; Majumder, R.; Wei, F. Improving Text Embeddings with Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; pp. 11897–11916.
- Lee, C.; Roy, R.; Xu, M.; Raiman, J.; Shoeybi, M.; Catanzaro, B.; Ping, W. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models. In Proceedings of the 13th International Conference on Learning Representations, Singapore, 24–28 April 2025; OpenReview.net: Alameda, CA, USA, 2025; pp. 54876–54899.
Figure 1. Overall framework of S-Gens. The framework first extracts multi-hop reasoning paths from an external knowledge graph to synthesize structurally consistent positive samples, then constructs semantically plausible but structurally inconsistent hard negatives, and, finally, applies a Siamese-GNN-based consistency filtering module before integrating the generated data into dense retriever training.
Figure 2. Performance comparison of representative retrievers before and after applying S-Gens on reasoning-intensive benchmarks. The gains are consistently larger on WebQSP and HotpotQA than on general retrieval datasets, highlighting the advantage of structure-aware synthetic supervision in multi-hop reasoning scenarios.
Figure 3. Ablation study of the core components in S-Gens. Path-based positives consistently improve retrieval performance over the raw ANCE baseline, while the full S-Gens framework yields the best results on WebQSP (R@20), HotpotQA (R@20), and HotpotQA (DR@10).
Figure 4. Sensitivity analysis of the synthetic data ratio on WebQSP and HotpotQA. Both benchmarks achieve the best performance at a ratio of 30%, indicating that a moderate amount of structure-aware synthetic supervision provides the best balance between reasoning enhancement and distributional stability.
Table 1. Positioning of S-Gens relative to representative graph-guided retrieval and RAG methods.
| Method | Structure Used At | Online Graph Use | Inference Cost Increase | Relation to S-Gens |
|---|---|---|---|---|
| HybRAG [13] | Retrieval/reasoning | Yes | Yes | Complementary |
| KG2RAG [14] | Chunk expansion/organization | Yes | Yes | Complementary |
| GraphRAG/HippoRAG | Retrieval/evidence organization | Yes | Yes | Complementary |
| S-Gens (ours) | Training-time supervision construction | No | No | – |
Table 2. Main experimental results of different retrieval models on MS MARCO, NQ, and TriviaQA. BM25 is a non-trainable sparse baseline and is therefore not fine-tuned. Values in parentheses denote the absolute improvement brought by S-Gens.
| Model | MS MARCO (MRR@10) | NQ (R@20) | TriviaQA (R@20) |
|---|---|---|---|
| BM25 | 0.187 | 59.1 | 66.9 |
| DPR | 0.322 (+0.014) | 78.4 (+2.1) | 79.4 (+1.8) |
| ANCE | 0.341 (+0.012) | 81.9 (+1.7) | 80.3 (+1.6) |
| RocketQA | 0.370 (+0.010) | 83.2 (+1.5) | 82.1 (+1.4) |
| Margin-MSE | 0.375 (+0.009) | 83.8 (+1.3) | 82.5 (+1.5) |
| BGE-M3 | 0.385 (+0.008) | 84.9 (+1.1) | 84.2 (+1.1) |
| E5-Mistral-7B-Instruct | 0.392 (+0.007) | 85.5 (+1.0) | 85.0 (+1.0) |
| NV-Embed-v2 | 0.401 (+0.006) | 86.4 (+0.9) | 85.8 (+0.9) |
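For reference, the two metric families reported above can be computed as in the following sketch. MRR@10 follows its standard definition (reciprocal rank of the first relevant passage within the top 10, zero if none appears). The hit-based reading of R@k, where a query counts as solved when any relevant passage appears in the top k, is an assumption made here; the input dictionaries are hypothetical toy data.

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean reciprocal rank of the first relevant passage within the top k."""
    total = 0.0
    for qid, ranked in rankings.items():
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant[qid]:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


def hit_rate_at_k(rankings, relevant, k=20):
    """Fraction of queries with at least one relevant passage in the top k (assumed R@k)."""
    hits = sum(
        1 for qid, ranked in rankings.items()
        if any(doc_id in relevant[qid] for doc_id in ranked[:k])
    )
    return hits / len(rankings)


rankings = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d6"]}
relevant = {"q1": {"d1"}, "q2": {"d9"}}

mrr = mrr_at_k(rankings, relevant, k=10)
hit = hit_rate_at_k(rankings, relevant, k=20)
print(mrr, hit)
```

Here the first query contributes 1/2 because its relevant passage is ranked second, and the second query contributes zero to both metrics.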
Table 3. Main experimental results of different retrieval models on the reasoning-intensive benchmarks WebQSP and HotpotQA. Values in parentheses denote the absolute improvement brought by S-Gens.
| Model | WebQSP (R@20) | HotpotQA (R@20) |
|---|---|---|
| BM25 | 55.0 | 57.8 |
| DPR | 71.8 (+3.6) | 61.2 (+4.2) |
| ANCE | 76.5 (+3.4) | 71.0 (+3.7) |
| RocketQA | 74.5 (+2.9) | 69.5 (+3.3) |
| Margin-MSE | 75.2 (+2.6) | 70.1 (+3.2) |
| BGE-M3 | 77.8 (+2.4) | 72.5 (+2.9) |
| E5-Mistral-7B-Instruct | 78.6 (+2.2) | 73.8 (+2.5) |
| NV-Embed-v2 | 79.5 (+1.9) | 75.1 (+2.2) |
Table 4. Matched-budget comparison between semantic-only and structure-aware synthetic augmentation. All results are averaged over three runs and reported as mean ± standard deviation under the same generation and training budget. “Struct.-Neg.” denotes structural negatives only.
| Backbone | Setting | WebQSP (R@20) | HotpotQA (R@20) | HotpotQA (DR@10) |
|---|---|---|---|---|
| ANCE | Original | 73.1 ± 0.17 | 67.3 ± 0.20 | 71.8 ± 0.40 |
| ANCE | Semantic-only | 74.0 ± 0.20 | 68.1 ± 0.20 | 73.1 ± 0.30 |
| ANCE | Path-Guided Positives | 74.8 ± 0.20 | 69.0 ± 0.20 | 74.2 ± 0.30 |
| ANCE | Struct.-Neg. Only | 74.4 ± 0.20 | 68.7 ± 0.20 | 76.8 ± 0.40 |
| ANCE | Full S-Gens | 76.5 ± 0.30 | 71.0 ± 0.30 | 78.6 ± 0.40 |
| BGE-M3 | Original | 75.4 ± 0.14 | 69.6 ± 0.18 | – |
| BGE-M3 | Semantic-only | 76.1 ± 0.17 | 70.4 ± 0.20 | – |
| BGE-M3 | Path-Guided Positives | 76.7 ± 0.16 | 71.0 ± 0.21 | – |
| BGE-M3 | Struct.-Neg. Only | 76.5 ± 0.18 | 70.8 ± 0.19 | – |
| BGE-M3 | Full S-Gens | 77.8 ± 0.22 | 72.5 ± 0.24 | – |
Table 5. Ablation results of the core components in S-Gens using ANCE as the backbone retriever.
| Variant | WebQSP (R@20) | HotpotQA (R@20) | HotpotQA (DR@10) |
|---|---|---|---|
| Raw ANCE | 73.1 | 67.3 | 71.8 |
| + Path-Positives Only | 74.8 | 69.0 | 74.2 |
| + Path-Positives & BM25 Negatives | 75.1 | 69.3 | 74.5 |
| Full S-Gens | 76.5 | 71.0 | 78.6 |
Table 6. Comparison of different filtering strategies using ANCE as the backbone retriever.
| Filter | WebQSP (R@20) | HotpotQA (R@20) | HotpotQA (DR@10) |
|---|---|---|---|
| No Filter | 75.2 | 69.8 | 76.0 |
| Heuristic Filter | 75.7 | 70.2 | 76.8 |
| Semantic Similarity Filter | 75.9 | 70.4 | 77.1 |
| Siamese GNN Filter | 76.5 | 71.0 | 78.6 |
Table 7. Manual data quality analysis on a sampled subset of generated instances from WebQSP and HotpotQA.
| Setting | Positive Structural Faithfulness | Negative Decoy Validity |
|---|---|---|
| Before Filtering | 82% | 78% |
| Heuristic Filter | 87% | 84% |
| Siamese GNN Filter | 92% | 89% |
Table 8. Reasoning-oriented retrieval evaluation on HotpotQA using both-support Recall@20.
| Backbone | Setting | HotpotQA Both-Support R@20 |
|---|---|---|
| ANCE | Original | 49.6 |
| ANCE | Semantic-only | 50.8 |
| ANCE | Path-Guided Positives | 52.1 |
| ANCE | Struct.-Neg. Only | 51.5 |
| ANCE | Full S-Gens | 53.9 |
| BGE-M3 | Original | 52.8 |
| BGE-M3 | Semantic-only | 53.6 |
| BGE-M3 | Path-Guided Positives | 54.5 |
| BGE-M3 | Full S-Gens | 56.1 |
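A both-support recall metric can be sketched as below, under the assumption that a query counts as covered only when every gold supporting passage appears in the top k. The function name and the toy rankings are hypothetical and serve only to make the definition concrete.

```python
def both_support_recall_at_k(rankings, supports, k=20):
    """Fraction of queries whose top-k list contains all gold supporting passages."""
    covered = sum(
        1 for qid, ranked in rankings.items()
        if set(supports[qid]) <= set(ranked[:k])
    )
    return covered / len(rankings)


rankings = {
    "q1": ["a", "x", "b"],  # both supporting passages retrieved
    "q2": ["c", "y", "z"],  # only one of the two retrieved
}
supports = {"q1": {"a", "b"}, "q2": {"c", "d"}}

score_k3 = both_support_recall_at_k(rankings, supports, k=3)
score_k2 = both_support_recall_at_k(rankings, supports, k=2)
print(score_k3, score_k2)
```

This criterion is stricter than hit-based recall: the second query retrieves one supporting passage but still scores zero, which is why the metric is more sensitive to whether the full reasoning chain is recovered.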
Table 9. Performance under different synthetic data ratios.
| Metric | 10% | 20% | 30% | 40% | 50% |
|---|---|---|---|---|---|
| WebQSP (R@20) | 74.3 | 75.6 | 76.5 | 76.1 | 75.5 |
| HotpotQA (R@20) | 68.6 | 70.0 | 71.0 | 70.7 | 69.8 |
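The mixing step behind such a ratio sweep might look like the following sketch. It assumes that the ratio denotes the share of synthetic triplets in the final training set; if the ratio is instead defined relative to the original set size, the sample count changes accordingly. The function name, seed handling, and toy triplets are illustrative.

```python
import random


def mix_training_set(original, synthetic, ratio, seed=0):
    """Combine original and synthetic triplets so synthetic ones make up `ratio` of the mix."""
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must be in [0, 1)")
    # Solve n_syn / (n_orig + n_syn) = ratio for n_syn.
    n_syn = round(len(original) * ratio / (1.0 - ratio))
    n_syn = min(n_syn, len(synthetic))  # cannot sample more than is available
    rng = random.Random(seed)
    mixed = list(original) + rng.sample(list(synthetic), n_syn)
    rng.shuffle(mixed)
    return mixed


original = [("orig", i) for i in range(70)]
synthetic = [("syn", i) for i in range(100)]

mixed = mix_training_set(original, synthetic, ratio=0.3)
n_synthetic = sum(1 for source, _ in mixed if source == "syn")
print(len(mixed), n_synthetic)
```

With 70 original triplets and a ratio of 0.3, the sketch samples 30 synthetic triplets, giving a mixed set of 100 in which 30% of the instances are synthetic.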
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.