S-Gens: Structure-Aware Synthetic Data Generation for Enhancing Reasoning-Intensive Dense Retrieval

Lei, Zhou; Xu, Yanqi; Chen, Shengbo

doi:10.3390/info17050413

by Zhou Lei, Yanqi Xu and Shengbo Chen *

School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China

* Author to whom correspondence should be addressed.

Information 2026, 17(5), 413; https://doi.org/10.3390/info17050413

Submission received: 18 March 2026 / Revised: 23 April 2026 / Accepted: 24 April 2026 / Published: 26 April 2026

(This article belongs to the Special Issue Advanced Retrieval-Augmented Generation Systems Based on Large Language Models)

Abstract

Dense retrievers rely heavily on high-quality training triplets, yet existing data construction strategies remain inadequate for reasoning-intensive retrieval tasks involving multi-hop reasoning, entity relation tracing, and implicit evidence composition. Positive samples are often based on shallow semantic relevance and fail to capture explicit reasoning chains, while negative samples are typically sampled from lexical overlap or random candidates and therefore provide limited supervision for learning clear decision boundaries. To address these issues, we propose S-Gens, a structure-aware synthetic data generation framework for enhancing reasoning-intensive dense retrieval. S-Gens uses relation paths in an external knowledge graph to synthesize queries and structurally consistent positive samples, and further constructs semantically similar but structurally inconsistent hard negatives. To improve data reliability, we introduce a Siamese graph neural network-based consistency filtering mechanism. Because S-Gens operates entirely during offline supervision construction, it remains model-agnostic, preserves the original inference architecture, and is complementary to graph-guided retrieval or RAG pipelines that inject structure online. Experiments on five benchmark datasets show that S-Gens consistently improves multiple trainable retrievers, with the largest gains on multi-hop reasoning tasks such as WebQSP and HotpotQA. These results indicate that structure-aware synthetic supervision can effectively improve dense retrieval in reasoning-intensive settings.

Keywords: dense retrieval; synthetic data generation; knowledge graph; hard negatives; reasoning-enhanced retrieval

1. Introduction

The rapid development of large language models (LLMs) has made retrieval-augmented generation (RAG) an important paradigm for reducing hallucinations and improving factual grounding through access to external knowledge. In this paradigm, dense retrieval serves as a critical interface between large-scale knowledge sources and downstream generators, and its effectiveness often determines the upper bound of the overall system performance [1,2]. Although existing dense retrievers have achieved strong results on factoid retrieval and open-domain question answering, their performance remains limited on reasoning-intensive tasks that require multi-hop entity tracing, implicit evidence composition, and cross-document logical inference [3,4].

We argue that a major bottleneck of dense retrieval in reasoning-intensive settings is the lack of suitable training supervision. Existing training triplets are usually constructed from relevance labels, heuristic negatives, or semantically similar candidates. Although these signals are often sufficient for topical matching, they are inadequate for modeling structured reasoning behavior [5,6]. This limitation can be understood from two complementary perspectives.

First, there exists a logic gap between the training signal and the reasoning demands of downstream tasks. In many existing retrieval datasets, positive passages are associated with queries primarily through surface-level semantic relevance or lexical correspondence. However, reasoning-intensive retrieval often requires the retriever to identify evidence connected through latent relation chains, rather than merely matching isolated concepts. As a result, current supervision rarely teaches the model how to retrieve documents that are not only relevant in meaning but also consistent with an underlying multi-step reasoning path [3,7].

Second, there exists a decision boundary gap in negative supervision. Conventional hard negative mining strategies typically rely on lexical overlap, BM25 retrieval, or embedding similarity, and therefore mainly expose the model to semantically confusing but not necessarily structurally misleading examples [6,8,9]. Yet, in reasoning-intensive retrieval, the most harmful distractors are often documents that appear highly relevant because they share entities, topics, or local descriptions with the query, while failing to support the critical relation chain required for correct reasoning. Without such challenging negatives during training, retrievers tend to overfit shallow semantic cues and remain vulnerable to logically inconsistent but semantically attractive candidates.

Recent studies have explored LLM-based synthetic data generation for retrieval, showing that generated queries, pseudo-documents, and weak supervision can substantially improve retriever training when annotations are limited [10,11,12]. However, most of these methods target semantic plausibility rather than structural correctness. They diversify queries or documents, but offer limited control over whether a generated positive preserves a valid reasoning path or whether a negative contradicts that path in a way that sharpens the decision boundary. In parallel, graph-enhanced retrieval and knowledge-aware reasoning methods have demonstrated the value of explicit relational structure, but they usually inject structure during online retrieval, graph expansion, or downstream evidence organization, often at the cost of added architectural complexity or inference overhead. For instance, HybRAG combines semantic node retrieval with structure-aware path retrieval inside the online reasoning loop [13], while KG-guided RAG frameworks such as KG²RAG expand and organize retrieved chunks with KG signals after seed retrieval [14]. This raises an important question: Can explicit structural knowledge be injected into retriever training through data construction alone, without modifying the underlying retrieval architecture or increasing online retrieval latency?

To address this question, we propose S-Gens, a structure-aware synthetic data generation framework for reasoning-intensive dense retrieval. Instead of using knowledge graphs only during retrieval or reranking, S-Gens uses relation paths in an external knowledge graph as reasoning scaffolds for offline training data synthesis. Specifically, S-Gens first extracts multi-hop relation paths and uses them to guide an LLM in generating queries and structurally consistent positive samples. It then constructs semantic-decoy hard negatives, namely, documents that remain semantically close to the query while being structurally inconsistent with the target reasoning path. To further improve data reliability, S-Gens incorporates a Siamese graph neural network (GNN)-based consistency filtering module for automatic scoring and filtering of synthetic instances. Because the framework operates entirely at the data level, it is model-agnostic, preserves the original inference-time architecture, and serves as an upstream complement to downstream graph-guided reasoning or RAG modules.

We evaluate S-Gens on five benchmark datasets: NQ, TriviaQA, WebQSP, HotpotQA, and MS MARCO. Experimental results show that S-Gens consistently improves a wide range of trainable retrievers, with the most pronounced gains observed on reasoning-intensive benchmarks such as WebQSP and HotpotQA. These findings suggest that structure-aware synthetic supervision is an effective and practical way to alleviate the shortage of reasoning-oriented training signals in dense retrieval.

The main contributions of this work are summarized as follows:

We propose S-Gens, an offline structure-aware synthetic data generation framework that uses knowledge graph relation paths to construct reasoning-oriented supervision for dense retrieval while preserving the original retrieval architecture at inference time.
We introduce a semantic-decoy hard negative mining strategy that improves retriever robustness against semantically similar but logically inconsistent candidates.
We develop a Siamese-GNN-based consistency filtering mechanism for filtering low-quality synthetic training instances.
We demonstrate through extensive experiments that S-Gens is a plug-and-play and model-agnostic data augmentation framework that consistently improves diverse retrievers, especially on multi-hop reasoning tasks, and is complementary to graph-guided pipelines that use structure online.

2. Related Work

2.1. Dense Retrieval

Dense retrieval has become a fundamental paradigm for open-domain question answering and neural information retrieval, as it maps queries and documents into a shared embedding space and performs retrieval through vector similarity. Compared with sparse retrieval methods based on exact lexical matching, dense retrievers are more effective at capturing semantic relevance and have achieved strong performance across a wide range of retrieval benchmarks. Early representative approaches, such as Dense Passage Retrieval (DPR) [5], demonstrated the effectiveness of dual-encoder architectures trained with question–passage pairs for open-domain retrieval. Subsequent studies further improved retrieval quality through more informative hard negative mining and distillation-based training strategies, including Approximate Nearest Neighbor Negative Contrastive Learning (ANCE) [6], RocketQA [8] and Margin-MSE [15]. Despite these advances, existing dense retrievers are still primarily optimized for semantic matching and often struggle when retrieval requires latent reasoning over multiple pieces of evidence rather than direct topic alignment.

2.2. Reasoning-Intensive Retrieval and Retrieval-Augmented Generation

Recent developments in retrieval-augmented generation (RAG) [1] have highlighted the importance of retrieval quality for downstream reasoning and generation tasks. RAG-based frameworks improve factual grounding by retrieving external knowledge before generation, but their effectiveness depends heavily on whether the retriever can identify evidence that is not only relevant but also logically sufficient for answering complex questions. This challenge becomes particularly salient in multi-hop and reasoning-intensive settings, where supporting evidence may be distributed across multiple passages and connected through implicit relational chains. Methods such as Fusion-in-Decoder (FiD) [2] and Multi-hop Dense Retrieval (MDR) [3] have shown that complex question answering often requires iterative or reasoning-aware retrieval, while more recent frameworks such as Self-RAG [4] further emphasize the dynamic interaction between retrieval and reasoning. However, much of this line of work focuses on downstream reasoning architectures or retrieval–generation interaction, rather than on improving the training supervision of the retriever itself.

2.3. Knowledge-Graph-Enhanced Retrieval and Graph-Based Reasoning

Knowledge graphs (KGs) and graph-structured representations provide an explicit way to encode relational dependencies among entities, making them particularly useful for multi-hop reasoning and evidence traversal. Prior work has incorporated graph structure into retrieval and question answering through graph neural reasoning, iterative evidence expansion, or graph-guided generation. Representative examples include GRAFT-Net [16] and PullNet [17], which integrate graph-based reasoning into multi-hop question answering, as well as more recent graph-aware retrieval and generation frameworks such as GraphRAG [18], HippoRAG [19], HybRAG [13], and KG²RAG [14]. These approaches demonstrate that explicit relational structure can substantially improve reasoning over complex information sources. HybRAG is particularly relevant because it explicitly combines semantic node retrieval with structural path retrieval inside the retrieval process, whereas KG²RAG performs KG-guided chunk expansion and organization after semantic seed retrieval. Nevertheless, these graph-enhanced methods require specialized retrieval architectures, graph-aware inference modules, or additional online computation at serving time. In contrast, our work explores how structural knowledge can be injected into dense retrieval through offline data construction, without modifying the retrieval architecture itself. In this sense, S-Gens should be viewed as complementary to online graph-guided retrieval or RAG pipelines rather than as a replacement for them.

2.4. LLM-Based Synthetic Data Generation for Retrieval

Large language models have recently been used to generate synthetic training data for retrieval and ranking tasks, especially in scenarios where labeled supervision is scarce or expensive. Existing approaches have explored synthetic query generation, pseudo-document generation, and query expansion to improve retriever generalization. InPars [10] uses large language models to generate synthetic queries for documents, thereby enabling effective zero-shot retrieval training. Promptagator [20] further demonstrates that prompted synthetic query generation can serve as strong supervision for dense retrievers under limited annotation. Query2doc [11] and HyDE [12] similarly show that generated textual expansions or hypothetical documents can improve retrieval effectiveness by enriching the semantic representation of the query. Although these methods have proven useful for improving semantic coverage and retrieval robustness, they generally focus on semantic plausibility rather than structural faithfulness. As a result, they provide limited supervision for reasoning-intensive retrieval settings in which preserving valid relation paths is essential.

2.5. Hard Negative Mining and the Supervision Gap in Reasoning-Oriented Retrieval

Hard negative mining has long been recognized as a key factor in dense retriever training, as informative negatives help shape more discriminative representation spaces. Existing strategies typically obtain hard negatives from BM25 retrieval, nearest neighbor retrieval, cross-encoder filtering, or teacher–student distillation, as seen in ANCE [6], RocketQA [8], and Margin-MSE [15]. These approaches improve semantic discrimination by exposing retrievers to confusing but non-relevant candidates. However, in reasoning-intensive retrieval, the most challenging distractors are often not merely semantically similar passages, but structurally misleading candidates that share entities, topics, or local descriptions while breaking the underlying reasoning chain required by the query. This type of supervision remains underexplored in prior work. Our method addresses this gap by constructing semantic-decoy hard negatives that are semantically plausible yet structurally inconsistent, thereby explicitly training the retriever to distinguish logical support from superficial relevance.

Taken together, the literature reviewed above suggests a clear positioning for S-Gens. Unlike graph-guided retrieval and RAG methods, which inject relational structure during online retrieval, expansion, or evidence organization, S-Gens uses structure only during offline supervision construction. Unlike generic synthetic augmentation methods, S-Gens is designed to preserve reasoning paths and generate structurally informative negatives rather than merely semantically plausible text. Therefore, the role of S-Gens in this paper is not to replace online graph-aware pipelines, but to strengthen the first-stage dense retriever through training-time structural supervision.

Table 1 summarizes this positioning.

3. Methodology

In this section, we present S-Gens, a structure-aware synthetic data generation framework designed to improve dense retrieval for reasoning-intensive tasks. The core idea is to move structural reasoning signals from the inference stage to the training data construction stage. Instead of relying solely on semantically relevant query–document pairs, S-Gens uses an external knowledge graph (KG) to provide explicit relational scaffolds for synthetic supervision. This design targets the quality of first-stage dense retriever supervision and is therefore orthogonal to graph-guided pipelines that inject structure during online retrieval or downstream reasoning.

The framework consists of three main components. First, we construct structurally grounded positive samples by extracting multi-hop relation paths from the KG and using them to guide query generation. Second, we mine semantic-decoy hard negatives, which are semantically similar to the query but structurally inconsistent with the target reasoning path. Third, we apply a Siamese graph neural network (GNN)-based consistency filtering module to score and filter synthetic instances before integrating them into retriever training. Because all structure-aware operations are performed offline during data construction, S-Gens does not require any modification to the downstream retriever architecture and adds no online inference cost. As illustrated in Figure 1, the pipeline has three stages: path-based positive synthesis, structural hard negative construction, and Siamese-GNN-based consistency filtering.

In Figure 1, the final training block should be interpreted generically as the optimization of a target downstream dense retriever rather than as a student-specific architecture. Likewise, the candidate analysis stage in the figure corresponds to the structural inconsistency scoring and semantic-decoy selection process formalized in Section 3.4 and Section 3.5.

3.1. Problem Setup and Notation

Let $D_{orig}$ denote the original retrieval training set, where each instance consists of a query and its task-aligned supervision from the benchmark. Let $G$ denote an external knowledge graph used only during offline supervision construction. For a training instance, S-Gens identifies anchor entities from the query, the seed passage, or related task context, and extracts a multi-hop reasoning path P from $G$. We use $V(P)$ and $E(P)$ to denote the node set and relational edge set induced by path P, respectively. Conditioned on P and a contextual snippet c, a query generator $G_{q}$ produces a synthetic query $\tilde{q}$, which is paired with a candidate positive document $d^{+}$ and one or more candidate hard negatives $d^{-}$.

The output of the offline construction pipeline is a filtered synthetic triplet set $D_{syn} = \{(\tilde{q}, d^{+}, d^{-})\}$, where the positive is expected to align structurally with the target path and the negative is expected to remain semantically plausible while violating that path. Throughout the paper, we use structural hard negatives and semantic-decoy negatives interchangeably. Likewise, consistency filtering denotes the same offline verification stage that is occasionally described as quality control. The objective of S-Gens is not to alter the downstream retriever architecture, but to improve retriever supervision by augmenting $D_{orig}$ with structurally informed synthetic triplets derived from $G$.

3.2. Running Example

To make the data construction pipeline more concrete, we provide a representative example showing how path-guided positives and semantic-decoy negatives are formed. Consider the question intent “Which actor starred in the Christopher Nolan film about dream invasion?”. S-Gens identifies Christopher Nolan and Inception as anchor entities, and extracts the path

$$\text{Christopher Nolan} \overset{\text{directed}}{\to} \text{Inception} \overset{\text{starring}}{\to} \text{Leonardo DiCaprio}. \tag{1}$$

Conditioned on this path and a contextual snippet, the generator produces the synthetic query “Who led the cast of Nolan’s dream-heist blockbuster?”. A valid positive passage states that Inception is a 2010 science-fiction film written and directed by Christopher Nolan and starring Leonardo DiCaprio. In contrast, a semantic-decoy negative may describe Memento as a film directed by Christopher Nolan and starring Guy Pearce. The decoy remains semantically close because it shares the director entity and the film domain, but it breaks the target reasoning chain at the actor relation.

This example also illustrates the role of the different filtering signals used in S-Gens. For the accepted positive, the path coverage score reaches $C_{path} = 1.00$, semantic similarity remains high ($\text{sim} = 0.81$), and the graph consistency score is $Q = 0.84$. For the semantic-decoy negative, semantic similarity is still high ($\text{sim} = 0.79$), but the structural contribution and consistency scores are much lower ($S_{struct} = 0.18$, $Q = 0.29$), so it is retained as a hard negative rather than accepted as a positive. In this way, the retriever is trained to separate structurally faithful evidence from semantically attractive but logically mismatched passages.
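To make the notation concrete, the path in Equation (1) can be held as an ordered sequence of (head, relation, tail) triples, from which $V(P)$ and $E(P)$ follow directly. The sketch below is illustrative only: the class name `ReasoningPath` and its methods are hypothetical and are not taken from the S-Gens implementation.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


@dataclass(frozen=True)
class ReasoningPath:
    """Hypothetical container for a multi-hop path P (not the authors' code)."""
    triples: Tuple[Triple, ...]

    def nodes(self) -> List[str]:
        """V(P): entities in path order, without duplicates."""
        ordered: List[str] = []
        for head, _, tail in self.triples:
            for entity in (head, tail):
                if entity not in ordered:
                    ordered.append(entity)
        return ordered

    def edges(self) -> Set[Triple]:
        """E(P): the relational edges along the path."""
        return set(self.triples)

    def length(self) -> int:
        """L: the number of hops."""
        return len(self.triples)


# The path from the running example.
path = ReasoningPath((
    ("Christopher Nolan", "directed", "Inception"),
    ("Inception", "starring", "Leonardo DiCaprio"),
))
print(path.nodes())
print(path.length())
```

Keeping the triples ordered preserves the hop sequence, while `nodes()` and `edges()` give the two sets that the later scoring steps consume.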

3.3. Path-Guided Positive Construction

The first step of S-Gens is to construct positive training instances that explicitly encode reasoning structure. Unlike conventional positive sampling strategies, which typically rely on relevance labels or shallow semantic similarity, our approach aims to ensure that each synthetic query–document pair is supported by an identifiable multi-hop reasoning path in the knowledge graph.

3.3.1. Extraction of the Reasoning Backbone

Given an original query, seed passage, or task-specific entity anchor, we first identify a set of core entities denoted by $E$. For any entity pair $(e_{s}, e_{t})$ with $e_{s}, e_{t} \in E$, we perform a depth-first search with a bounded path length on the external knowledge graph $G$ to discover candidate reasoning paths connecting the source entity $e_{s}$ and the target entity $e_{t}$.

Formally, a reasoning path of length L is defined as

$$P = (e_{s} \overset{r_{1}}{\to} e_{1} \overset{r_{2}}{\to} e_{2} \dots \overset{r_{L}}{\to} e_{t}), \tag{2}$$

where $e_{1}, e_{2}, \dots, e_{L-1}$ denote intermediate bridging entities and $r_{1}, r_{2}, \dots, r_{L}$ denote predicate relations between adjacent entities.

To balance structural complexity and practical coverage, we focus on paths with $L \in \{2, 3, 4\}$, which are representative of common reasoning patterns in multi-hop retrieval benchmarks [21].

These paths serve as explicit reasoning backbones for subsequent query and positive sample generation.
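The bounded depth-first search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the knowledge graph is a toy in-memory adjacency list, the function name `find_paths` is hypothetical, and paths are restricted to simple paths (no repeated entities), which the paper does not state explicitly.

```python
from typing import Dict, List, Set, Tuple

# Toy adjacency list: head entity -> list of (relation, tail entity).
KG = Dict[str, List[Tuple[str, str]]]
Hop = Tuple[str, str, str]


def find_paths(kg: KG, source: str, target: str,
               min_len: int = 2, max_len: int = 4) -> List[List[Hop]]:
    """Depth-first search for simple paths with min_len <= L <= max_len hops."""
    results: List[List[Hop]] = []

    def dfs(node: str, path: List[Hop], visited: Set[str]) -> None:
        if node == target:
            if len(path) >= min_len:
                results.append(list(path))
            return  # never extend a path beyond the target entity
        if len(path) == max_len:
            return  # depth bound reached
        for relation, neighbor in kg.get(node, []):
            if neighbor in visited:
                continue  # assumption: keep paths simple
            visited.add(neighbor)
            path.append((node, relation, neighbor))
            dfs(neighbor, path, visited)
            path.pop()
            visited.remove(neighbor)

    dfs(source, [], {source})
    return results


toy_kg: KG = {
    "Christopher Nolan": [("directed", "Inception"), ("directed", "Memento")],
    "Inception": [("starring", "Leonardo DiCaprio")],
    "Memento": [("starring", "Guy Pearce")],
}
paths = find_paths(toy_kg, "Christopher Nolan", "Leonardo DiCaprio")
print(paths)
```

On the toy graph the search returns the single two-hop path from the running example; the branch through Memento is explored and discarded because it never reaches the target entity.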

3.3.2. Path-Guided Query Generation

After obtaining a reasoning path P, we use it as a structured logical scaffold rather than converting it into a rigid template-based question. Specifically, we feed the path into a query generation model $G_{q}$ together with a contextual snippet c, which is typically sampled from an encyclopedic summary or supporting passage containing the entities along the path. The synthetic query $\tilde{q}$ is generated as

$$\tilde{q} = G_{q}(P, c, \text{Prompt}_{inst}), \tag{3}$$

where $\text{Prompt}_{inst}$ denotes an instruction prompt that encourages linguistic diversity, such as implicit references, temporal constraints, compositional descriptions, or spatial reasoning patterns.

This design allows the generated query to remain natural in surface form while preserving the latent reasoning logic induced by the path. For example, given a path that connects a film director, a movie, and an actor through successive relations, the generator can produce a compositional question whose answer requires recovering the hidden bridge entities rather than matching a single explicit fact [10].
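One way to picture the inputs of Equation (3) is to serialize the path and combine it with the snippet and an instruction before calling the generator. The sketch below is purely illustrative: the function names and, in particular, the prompt wording are assumptions made for this example and do not reproduce the prompt used by S-Gens. The LLM call itself is left out.

```python
from typing import List, Tuple

Hop = Tuple[str, str, str]  # (head entity, relation, tail entity)


def verbalize_path(path: List[Hop]) -> str:
    """Render a path as 'e_s --r_1--> e_1 --r_2--> ... e_t'."""
    if not path:
        raise ValueError("path must contain at least one hop")
    parts = [path[0][0]]
    for _, relation, tail in path:
        parts.append(f"--{relation}--> {tail}")
    return " ".join(parts)


def build_generation_prompt(path: List[Hop], snippet: str, style_hint: str) -> str:
    """Assemble (P, c, Prompt_inst) into one string; the wording is hypothetical."""
    return (
        f"Reasoning path: {verbalize_path(path)}\n"
        f"Context: {snippet}\n"
        "Instruction: Write one natural question whose answer is the final "
        "entity of the path, without naming the intermediate entities. "
        f"Style: {style_hint}"
    )


example_path = [
    ("Christopher Nolan", "directed", "Inception"),
    ("Inception", "starring", "Leonardo DiCaprio"),
]
prompt = build_generation_prompt(
    example_path,
    snippet="Inception is a 2010 science-fiction film.",
    style_hint="implicit reference",
)
print(prompt)
```

Passing the path as structured text rather than filling a fixed question template leaves the surface form to the generator, which matches the stated goal of natural queries that still follow the path.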

3.3.3. Verification of Path Coverage

To ensure that a candidate positive document is at least broadly aligned with the entities involved in the target reasoning path, we introduce a path coverage score, denoted by $C_{path}$. For a candidate document d and a reasoning path P, the path coverage score is defined as

$$C_{path}(d, P) = \frac{1}{|V(P)|} \sum_{v \in V(P)} \mathbb{1}[v \in d], \tag{4}$$

where $V(P)$ denotes the set of all entities appearing on the path, including the source entity, the target entity, and all intermediate bridging entities, and $\mathbb{1}[\cdot]$ is an indicator function.

A document is accepted as a candidate structurally consistent positive sample only if it satisfies two conditions. First, it must achieve a sufficiently high path coverage score, i.e.,

$$C_{path}(d, P) \geq \tau_{p}, \tag{5}$$

where $\tau_{p}$ is a predefined threshold. In our experiments, we set $\tau_{p} = 0.75$. Second, the semantic similarity between the document and the synthetic query must exceed a predefined relevance threshold under the base dense encoder.
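The two acceptance conditions can be sketched in a few lines. This is a minimal illustration, assuming that the test $v \in d$ is approximated by case-insensitive substring matching (the paper relies on entity grounding, which is not reproduced here). The relevance threshold is not reported in this section, so `tau_rel` is a required, hypothetical parameter, and the value 0.70 in the usage lines is chosen only for the example.

```python
from typing import Iterable


def path_coverage(document: str, path_entities: Iterable[str]) -> float:
    """C_path(d, P): fraction of path entities mentioned in the document.

    Assumption: 'v in d' is approximated by case-insensitive substring match.
    """
    entities = list(path_entities)
    if not entities:
        raise ValueError("path must contain at least one entity")
    text = document.lower()
    hits = sum(1 for entity in entities if entity.lower() in text)
    return hits / len(entities)


def is_candidate_positive(document: str, path_entities: Iterable[str],
                          similarity: float, tau_rel: float,
                          tau_p: float = 0.75) -> bool:
    """Dual constraint: coverage >= tau_p and similarity above tau_rel."""
    return path_coverage(document, path_entities) >= tau_p and similarity > tau_rel


entities = ["Christopher Nolan", "Inception", "Leonardo DiCaprio"]
positive = ("Inception is a 2010 science-fiction film written and directed by "
            "Christopher Nolan and starring Leonardo DiCaprio.")
decoy = "Memento is a film directed by Christopher Nolan and starring Guy Pearce."

print(path_coverage(positive, entities))  # all three entities are mentioned
print(path_coverage(decoy, entities))     # only the director is mentioned
print(is_candidate_positive(positive, entities, similarity=0.81, tau_rel=0.70))
```

Note that with $\tau_{p} = 0.75$ a two-hop path, which has three entities, passes only when all three are mentioned, since $2/3 < 0.75$; a three-hop path with four entities tolerates one missing entity.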

Through this dual constraint, S-Gens uses node coverage as a preliminary structural prefilter rather than as a complete proof of relational support. This step removes clearly unsuitable candidates before the later graph-based consistency filtering stage provides a stronger structural verification signal. As a result, the synthesized positive pairs are better aligned with the supervision requirements of reasoning-intensive dense retrieval. Algorithm 1 summarizes the path-based positive synthesis procedure.

Algorithm 1. Path-based positive synthesis

Require: Original training set $D_{orig}$, knowledge graph $G$, query generator $G_{q}$, path coverage threshold $\tau_{p}$
Ensure: Synthetic query–positive pairs $D_{syn}^{+}$

Initialize $D_{syn}^{+} \leftarrow \emptyset$
for all anchor instances $x \in D_{orig}$ do
    Identify core entities from $x$
    Extract candidate reasoning paths from $G$
    for all paths P do
        Generate synthetic query $\tilde{q} \leftarrow G_{q}(P, c, \text{Prompt}_{inst})$
        Retrieve candidate documents
        Select documents $d^{+}$ satisfying $C_{path}(d^{+}, P) \geq \tau_{p}$
        Add $(\tilde{q}, d^{+}, P)$ to $D_{syn}^{+}$
    end for
end for
return $D_{syn}^{+}$
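The control flow of Algorithm 1 can be sketched in Python with the individual components passed in as callables. Everything here is a hypothetical stand-in: `extract_paths`, `generate_query`, `retrieve`, and `coverage` are placeholder names for the path extractor, the LLM generator, the base retriever, and the path coverage score, and the toy lambdas exist only to make the example runnable. Like the algorithm listing, the sketch applies only the coverage check and omits the semantic relevance check.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Hop = Tuple[str, str, str]  # (head entity, relation, tail entity)
Path = Sequence[Hop]


def entities_on(path: Path) -> List[str]:
    """V(P): entities on the path, in order, without duplicates."""
    ordered: List[str] = []
    for head, _, tail in path:
        for entity in (head, tail):
            if entity not in ordered:
                ordered.append(entity)
    return ordered


def synthesize_positives(
    anchors: Iterable[Dict[str, str]],
    extract_paths: Callable[[Dict[str, str]], Iterable[Path]],
    generate_query: Callable[[Path, str], str],
    retrieve: Callable[[str], Iterable[str]],
    coverage: Callable[[str, List[str]], float],
    tau_p: float = 0.75,
) -> List[Tuple[str, str, Path]]:
    """Collect (query, positive document, path) triples as in Algorithm 1."""
    synthetic: List[Tuple[str, str, Path]] = []
    for anchor in anchors:
        for path in extract_paths(anchor):
            query = generate_query(path, anchor.get("snippet", ""))
            for document in retrieve(query):
                if coverage(document, entities_on(path)) >= tau_p:
                    synthetic.append((query, document, path))
    return synthetic


# Toy stand-ins for the real components (all hypothetical).
toy_path = [("A", "r1", "B"), ("B", "r2", "C")]
result = synthesize_positives(
    anchors=[{"snippet": "A, B and C appear together."}],
    extract_paths=lambda anchor: [toy_path],
    generate_query=lambda path, snippet: "toy query",
    retrieve=lambda query: ["A relates to B and C", "only A here"],
    coverage=lambda doc, ents: sum(e in doc for e in ents) / len(ents),
)
print(result)
```

Injecting the components keeps the loop independent of any particular knowledge graph, generator, or retriever, which is in the spirit of the model-agnostic design described in the text.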

3.4. Structural Hard Negative Construction

While path-based positive synthesis provides structurally grounded supervision for relevant evidence, effective retriever training also requires challenging negative instances that can sharpen the decision boundary. In reasoning-intensive retrieval, however, conventional hard negatives are often insufficient. Negatives mined by BM25, nearest neighbor retrieval, or in-batch similarity are typically selected based on lexical or semantic proximity, but they do not necessarily violate the reasoning structure required by the query. As a consequence, such negatives may improve topical discrimination while providing limited supervision for distinguishing logically valid evidence from structurally misleading candidates.

To address this limitation, S-Gens constructs structural hard negatives, also referred to as semantic-decoy negatives. These negatives are designed to remain semantically plausible with respect to the synthetic query, while breaking the underlying reasoning chain encoded by the target path. In this way, the retriever is explicitly trained to separate true supporting evidence from documents that appear relevant on the surface but are structurally inconsistent.

3.4.1. Candidate Negative Pool Construction

Given a synthetic query $\tilde{q}$ and its associated reasoning path

$$P = (e_{s} \overset{r_{1}}{\to} e_{1} \overset{r_{2}}{\to} e_{2} \dots \overset{r_{L}}{\to} e_{t}), \tag{6}$$

we first construct a candidate pool of negative documents from two complementary sources. The first source consists of semantically similar documents retrieved by a base retriever, denoted by $D_{sem}(\tilde{q})$, which contains top-ranked passages that are highly related to the query in embedding space. The second source consists of entity-overlapping documents, denoted by $D_{ent}(P)$, which mention one or more entities appearing in the path but do not necessarily preserve the full relational structure.

The final candidate pool is defined as

$$D_{cand} = D_{sem}(\tilde{q}) \cup D_{ent}(P). \tag{7}$$

This design increases the probability of sampling difficult distractors that are topically close to the query or share key entities with the target reasoning path.

3.4.2. Structural Inconsistency Modeling

Not all semantically related documents are informative negatives. To identify truly challenging distractors, we measure whether a candidate document preserves or breaks the reasoning structure of the path. For a candidate document $d^{-} \in D_{cand}$, we define its structural contribution score with respect to the path P as

$$S_{struct}(d^{-}, P) = \frac{1}{|E(P)|} \sum_{(u, r, v) \in E(P)} \mathbb{1}[(u, r, v) \text{ is supported in } d^{-}], \tag{8}$$

where $E(P)$ denotes the set of relational edges in the reasoning path, and $\mathbb{1}[\cdot]$ is an indicator function that evaluates whether the corresponding relational fact is supported by the candidate document.

A document is considered structurally inconsistent if its structural contribution score is below a threshold $\tau_{n}$, i.e.,

$$S_{struct}(d^{-}, P) < \tau_{n}. \tag{9}$$

In practice, this means that the document may mention some entities or local facts related to the query, yet fails to preserve the complete relational dependencies required by the target reasoning chain.
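The structural contribution score of Equation (8) can be sketched once a support test for a single edge is available. The paper does not specify that test in this section, so the sketch below takes it as an injectable predicate and supplies a deliberately crude default, `naive_supports`, which is an assumption of this example: an edge counts as supported if the head entity, the relation word, and the tail entity co-occur in one sentence. The threshold $\tau_{n}$ is not reported here, so the value 0.5 in the usage lines is illustrative.

```python
import re
from typing import Callable, Iterable, Tuple

Hop = Tuple[str, str, str]  # (head entity, relation, tail entity)


def naive_supports(hop: Hop, document: str) -> bool:
    """Crude proxy: head, relation cue, and tail co-occur in one sentence."""
    head, relation, tail = (part.lower() for part in hop)
    for sentence in re.split(r"(?<=[.!?])\s+", document.lower()):
        if head in sentence and relation in sentence and tail in sentence:
            return True
    return False


def structural_contribution(
    document: str,
    path_edges: Iterable[Hop],
    supports: Callable[[Hop, str], bool] = naive_supports,
) -> float:
    """S_struct(d, P): fraction of path edges supported by the document."""
    edges = list(path_edges)
    if not edges:
        raise ValueError("path must contain at least one edge")
    return sum(1 for hop in edges if supports(hop, document)) / len(edges)


def is_structurally_inconsistent(document: str, path_edges: Iterable[Hop],
                                 tau_n: float) -> bool:
    """Equation (9): the document breaks the path if S_struct < tau_n."""
    return structural_contribution(document, path_edges) < tau_n


edges = [
    ("Christopher Nolan", "directed", "Inception"),
    ("Inception", "starring", "Leonardo DiCaprio"),
]
positive = ("Inception is a 2010 science-fiction film written and directed by "
            "Christopher Nolan and starring Leonardo DiCaprio.")
decoy = "Memento is a film directed by Christopher Nolan and starring Guy Pearce."

print(structural_contribution(positive, edges))
print(structural_contribution(decoy, edges))
print(is_structurally_inconsistent(decoy, edges, tau_n=0.5))
```

This binary proxy is only meant to show the shape of the computation; it is not intended to reproduce the scores reported in the running example.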

3.4.3. Semantic-Decoy Selection

To ensure that the selected negatives remain challenging, we further require semantic-decoy negatives to be sufficiently close to the query in semantic space. Let $\text{sim}(\tilde{q}, d^{-})$ denote the similarity score between the synthetic query and the candidate document under the base dense encoder. A candidate is retained as a semantic-decoy negative only if it satisfies

$$\text{sim}(\tilde{q}, d^{-}) \geq \tau_{s} \quad \text{and} \quad S_{struct}(d^{-}, P) < \tau_{n}, \tag{10}$$

where $\tau_{s}$ is a semantic similarity threshold.

This dual criterion enforces that the negative sample is simultaneously semantically attractive and structurally invalid. As a result, the retriever cannot solve the training objective through shallow keyword overlap or coarse semantic matching alone; instead, it must learn to assign higher scores to documents that better preserve the latent reasoning path.
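The pool construction of Equation (7) and the dual criterion of Equation (10) can be sketched together. The sketch assumes that similarity and structural contribution scores have already been computed and are available as dictionaries keyed by document id. The document ids, the function names, and the thresholds $\tau_{s} = 0.70$ and $\tau_{n} = 0.50$ are hypothetical; the two scores for the Memento passage are the ones reported in the running example, and the remaining scores are made up for illustration.

```python
from typing import Dict, Iterable, List


def build_candidate_pool(semantic_hits: Iterable[str],
                         entity_hits: Iterable[str]) -> List[str]:
    """D_cand = D_sem ∪ D_ent, de-duplicated, first-seen order kept."""
    return list(dict.fromkeys([*semantic_hits, *entity_hits]))


def select_semantic_decoys(pool: Iterable[str],
                           sim: Dict[str, float],
                           s_struct: Dict[str, float],
                           tau_s: float,
                           tau_n: float) -> List[str]:
    """Keep documents with sim >= tau_s and S_struct < tau_n."""
    return [doc for doc in pool
            if sim[doc] >= tau_s and s_struct[doc] < tau_n]


pool = build_candidate_pool(
    ["inception_passage", "memento_passage"],   # D_sem: top-ranked by similarity
    ["memento_passage", "nolan_biography"],     # D_ent: share a path entity
)
sim = {"inception_passage": 0.81, "memento_passage": 0.79, "nolan_biography": 0.40}
s_struct = {"inception_passage": 1.00, "memento_passage": 0.18, "nolan_biography": 0.00}

decoys = select_semantic_decoys(pool, sim, s_struct, tau_s=0.70, tau_n=0.50)
print(decoys)
```

The true positive is rejected by the structural condition and the biography by the similarity condition, so only the passage that is both close to the query and structurally inconsistent survives.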

3.4.4. Diversity-Aware Negative Sampling

To avoid overconcentration on a narrow set of distractors, we sample hard negatives from multiple structural conflict patterns. Specifically, we consider three common types of structural inconsistency:

1. Entity-substitution conflict, where one or more bridge entities in the target path are replaced by semantically related but incorrect entities.
2. Relation-break conflict, where the document mentions the relevant entities but omits or contradicts the key relation required by the reasoning chain.
3. Partial-path conflict, where the document supports only a local fragment of the path while failing to provide sufficient evidence for the complete reasoning process.

By combining negatives from these different conflict types, S-Gens exposes the retriever to a richer set of reasoning-oriented distractors. This improves the robustness of the learned representation space and reduces the tendency to overfit a single negative mining pattern.
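One simple way to combine the three conflict types is a round-robin draw, sketched below. The paper does not state the sampling scheme, so round-robin selection, the function name `sample_diverse_negatives`, and the type labels are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Optional

CONFLICT_TYPES = ("entity_substitution", "relation_break", "partial_path")


def sample_diverse_negatives(by_type: Dict[str, List[str]], k: int,
                             seed: Optional[int] = None) -> List[str]:
    """Round-robin over conflict types so no single pattern dominates."""
    rng = random.Random(seed)
    # Shuffle a copy of each pool so the caller's lists are left untouched.
    pools = {
        conflict: rng.sample(by_type.get(conflict, []),
                             len(by_type.get(conflict, [])))
        for conflict in CONFLICT_TYPES
    }
    chosen: List[str] = []
    while len(chosen) < k and any(pools.values()):
        for conflict in CONFLICT_TYPES:
            if pools[conflict] and len(chosen) < k:
                chosen.append(pools[conflict].pop())
    return chosen


candidates = {
    "entity_substitution": ["a1", "a2", "a3"],
    "relation_break": ["b1"],
    "partial_path": ["c1", "c2"],
}
picked = sample_diverse_negatives(candidates, k=4, seed=0)
print(picked)
```

Each pass takes at most one negative per conflict type, so a type with many candidates cannot crowd out a type with few until the smaller pools are exhausted.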

3.4.5. Training Role of Structural Hard Negatives

The resulting hard negatives complement the path-consistent positives introduced in Section 3.3. Together, they define a more informative supervision signal for reasoning-intensive dense retrieval: positive samples preserve the latent reasoning structure, whereas semantic-decoy negatives mimic semantic relevance while violating that structure. This positive–negative contrast is expected to produce a cleaner separation margin in representation space, especially for queries whose correct retrieval depends on multi-hop relational composition rather than direct semantic matching.

3.5. Siamese-GNN-Based Consistency Filtering

Although the path-based positive synthesis and structural hard negative construction described above provide richer supervision for reasoning-intensive retrieval, the quality of automatically generated instances may still vary due to noise in entity linking, path extraction, and language generation. In particular, some generated positives may only partially preserve the intended reasoning chain, while some candidate negatives may be either overly trivial or accidentally aligned with the target structure. To improve the reliability of synthetic supervision, S-Gens introduces a Siamese graph neural network (GNN)-based consistency filtering module. In the remainder of this section, consistency filtering is the primary term for this offline verification stage, while quality control is used only as a descriptive synonym when helpful.

The goal of this module is not to replace the downstream retriever, but to provide an additional structural verification signal during offline data construction. Given a synthetic query–document pair and its associated reasoning path, the filtering module evaluates whether the relational evidence contained in the document is sufficiently consistent with the target path. Only those instances that satisfy the required structural criteria are retained for final training.

3.5.1. Graph Views of Reasoning Paths and Candidate Documents

For each synthetic instance, we construct two graph views. The first graph, denoted by $G_{P}$, is derived directly from the target reasoning path

P = (e_{s} \overset{r_{1}}{\to} e_{1} \overset{r_{2}}{\to} e_{2} \dots \overset{r_{L}}{\to} e_{t}),

(11)

where nodes correspond to entities and edges correspond to relations along the path.

The second graph, denoted by $G_{d}$, is extracted from the candidate document d by grounding recognized entities and relation cues to the same external knowledge graph. Formally, we represent

G_{P} = (V_{P}, E_{P}), G_{d} = (V_{d}, E_{d}),

(12)

where $V_{P}$ and $V_{d}$ denote node sets, and $E_{P}$ and $E_{d}$ denote edge sets for the path graph and the document graph, respectively.

In this formulation, $G_{P}$ serves as the structural reference graph, while $G_{d}$ captures the relational evidence expressed in the candidate document. A high-quality positive sample is expected to preserve the major dependencies in $G_{P}$, whereas a valid structural hard negative should remain semantically plausible but exhibit clear graph-level inconsistency.

3.5.2. Shared Graph Encoding

To compare these two graph views, we adopt a Siamese GNN encoder with shared parameters. Let $f_{θ} (\cdot)$ denote the graph encoder. The graph representations of the path graph and the document graph are computed as

h_{P} = f_{θ} (G_{P}), h_{d} = f_{θ} (G_{d}),

(13)

where $h_{P}, h_{d} \in \mathbb{R}^{m}$ are graph-level embeddings.

The use of shared parameters ensures that both graphs are projected into the same structural representation space, making their similarity directly comparable. Intuitively, if the candidate document faithfully reflects the target reasoning structure, the two graph embeddings should be close; otherwise, the distance between them should increase.

3.5.3. Consistency Scoring

Based on the encoded graph representations, we define a structural consistency score for a synthetic instance as

Q (d, P) = cos (h_{P}, h_{d}),

(14)

where $\cos (\cdot, \cdot)$ denotes cosine similarity.

This score measures the extent to which the graph-derived evidence in the candidate document aligns with the target reasoning path. Compared with plain semantic similarity, the consistency score is more sensitive to whether the candidate preserves the key entity transitions and relational dependencies required by the intended reasoning process.

For positive candidates, a larger value of $Q (d, P)$ indicates stronger structural alignment with the target path. For candidate hard negatives, a lower score suggests that the document, despite possible semantic overlap with the query, fails to support the reasoning chain in a structurally faithful manner.
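The Siamese idea in Eqs. (13) and (14) can be illustrated with a toy sketch: the same encoder function is applied to both graphs, and the two outputs are compared by cosine similarity. The encoder below is a parameter-free bag of entity and typed-edge counts, used only as a stand-in for $f_{θ}$; it is not the R-GCN described in Section 4.1.4, and the function names are hypothetical.

```python
import math
from collections import Counter


def encode_graph(triples):
    """Toy shared encoder: sparse counts over entities and typed edges.

    triples: iterable of (head, relation, tail) tuples.
    """
    features = Counter()
    for head, relation, tail in triples:
        features[("entity", head)] += 1.0
        features[("entity", tail)] += 1.0
        features[("edge", head, relation, tail)] += 1.0
    return features


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dict-like counters."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def consistency_score(path_triples, doc_triples):
    """Q(d, P) = cos(h_P, h_d), with both graphs passed through the same encoder."""
    return cosine(encode_graph(path_triples), encode_graph(doc_triples))


# Example: an entity-substitution decoy scores lower than a faithful document.
path = [("A", "r1", "B"), ("B", "r2", "C")]
faithful = [("A", "r1", "B"), ("B", "r2", "C")]
decoy = [("A", "r1", "X"), ("X", "r2", "C")]  # bridge entity B replaced by X
assert consistency_score(path, faithful) > consistency_score(path, decoy)
```

A learned encoder would replace `encode_graph` while leaving the scoring step unchanged, which is the point of sharing parameters across the two views.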

3.5.4. Filtering Strategy

The consistency score is used as an offline filtering signal during synthetic data construction. For a candidate positive document $d^{+}$, we retain the instance only if

Q (d^{+}, P) \geq τ_{q}^{+},

(15)

where $τ_{q}^{+}$ is the acceptance threshold for structurally reliable positives.

For a candidate structural hard negative $d^{-}$, we retain the instance only if

Q (d^{-}, P) \leq τ_{q}^{-},

(16)

where $τ_{q}^{-}$ is the upper threshold for structurally inconsistent negatives.

This asymmetric filtering rule removes two major sources of noise. First, it excludes weak positives whose supporting evidence does not sufficiently cover the target reasoning structure. Second, it filters out uninformative negatives that accidentally preserve too much of the original path structure; negatives that are too unrelated to the query are screened out earlier by the semantic similarity threshold $τ_{s}$ of Section 3.4. In this way, the final synthetic triplets become more reliable and more discriminative for retriever training.
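The two retention rules in Eqs. (15) and (16) amount to a pair of threshold checks. The sketch below assumes the consistency scores have already been computed and uses the threshold values reported in Section 4.1.4; the dictionary layout and the name `filter_triplets` are illustrative assumptions.

```python
TAU_Q_POS = 0.72  # acceptance threshold for positives (Section 4.1.4)
TAU_Q_NEG = 0.48  # upper threshold for negatives (Section 4.1.4)


def filter_triplets(candidates, tau_pos=TAU_Q_POS, tau_neg=TAU_Q_NEG):
    """Keep (query, positive, negative) triplets that pass Eq. (15) and Eq. (16).

    candidates: iterable of dicts with keys
        "query", "positive", "q_pos" (Q(d+, P)),
        "negatives" (list of (doc, Q(d-, P)) pairs).
    """
    kept = []
    for item in candidates:
        if item["q_pos"] < tau_pos:
            continue  # weak positive: drop the whole instance
        for doc, q_neg in item["negatives"]:
            if q_neg <= tau_neg:  # structurally inconsistent enough to keep
                kept.append((item["query"], item["positive"], doc))
    return kept
```

Note that scores falling strictly between the two thresholds are rejected for both roles, which is what makes the rule asymmetric.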

3.5.5. Effect on Synthetic Supervision

The Siamese-GNN-based consistency filtering module serves as the final filtering step before training data integration. Combined with the path-consistent positives in Section 3.3 and the semantic-decoy negatives in Section 3.4, it helps ensure that S-Gens produces synthetic supervision that is both semantically informative and structurally coherent.

Importantly, this module is used only during offline data construction and does not participate in online retrieval. Therefore, it improves the quality of the training signal without introducing additional inference-time cost to the downstream dense retriever. Algorithm 2 summarizes the construction of structural hard negatives and the subsequent consistency filtering process.

Algorithm 2 Hard negative construction and consistency filtering

Require: Synthetic query–positive pairs $D_{syn}^{+}$, base retriever R, consistency encoder $f_{θ}$, thresholds $τ_{n}$, $τ_{s}$, $τ_{q}^{+}$, $τ_{q}^{-}$
Ensure: Filtered synthetic triplets $D_{syn}$

1: Initialize $D_{syn} \leftarrow \emptyset$
2: for all $(\tilde{q}, d^{+}, P) \in D_{syn}^{+}$ do
3:   Construct candidate negative pool $D_{cand}^{-}$
4:   Retain candidates satisfying structural inconsistency and semantic similarity constraints
5:   Compute consistency scores for $d^{+}$ and retained negatives using $f_{θ}$
6:   Keep the positive only if $Q (d^{+}, P) \geq τ_{q}^{+}$
7:   Keep each negative only if $Q (d^{-}, P) \leq τ_{q}^{-}$
8:   Add valid triplets $(\tilde{q}, d^{+}, d^{-})$ to $D_{syn}$
9: end for
10: return $D_{syn}$

3.6. Training Integration and Optimization

After path-based positive synthesis, structural hard negative construction, and Siamese-GNN-based consistency filtering, S-Gens produces a set of high-quality synthetic training triplets of the form $(\tilde{q}, d^{+}, d^{-})$. The final step is to integrate these synthetic instances into retriever training in a way that improves reasoning-oriented supervision while preserving the robustness of the original training distribution.

3.6.1. Synthetic–Original Data Mixture

Let $D_{orig}$ denote the original training set and $D_{syn}$ denote the filtered synthetic triplet set produced by S-Gens. Instead of replacing the original supervision, we augment it by constructing a mixed training set

D_{train} \sim Mix (D_{orig}, D_{syn}; η),

(17)

where $η \in [0, 1]$ controls the proportion of synthetic instances.

This design reflects the complementary roles of the two supervision sources. The original data provides stable task-aligned retrieval signals derived from human annotation or benchmark supervision, while the synthetic data introduces additional structural contrast specifically targeted at reasoning-intensive retrieval. By adjusting $η$, S-Gens can balance distributional stability and reasoning-oriented augmentation.
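One possible reading of the mixture in Eq. (17) is sketched below: all original instances are kept, and enough synthetic triplets are sampled so that they make up a fraction $η$ of the mixed set. This interpretation of Mix, the function name `mix_datasets`, and the cap at the available synthetic pool are assumptions, since the paper does not spell out the sampling procedure.

```python
import random


def mix_datasets(d_orig, d_syn, eta, seed=0):
    """Build a mixed training set in which roughly a fraction eta is synthetic.

    Assumes every original instance is kept, so eta must be below 1.
    """
    if not 0.0 <= eta < 1.0:
        raise ValueError("eta must be in [0, 1) when all original data is kept")
    rng = random.Random(seed)
    # n_syn / (n_orig + n_syn) = eta  =>  n_syn = eta / (1 - eta) * n_orig
    n_syn = min(len(d_syn), round(eta / (1.0 - eta) * len(d_orig)))
    mixed = list(d_orig) + rng.sample(list(d_syn), n_syn)
    rng.shuffle(mixed)
    return mixed
```

For example, with 70 original instances and $η = 0.3$, this sketch draws 30 synthetic triplets, giving a mixed set of 100 instances.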

3.6.2. Retriever-Agnostic Integration

A key property of S-Gens is that it operates purely at the data level. Once the filtered synthetic triplets are generated, they can be directly used to train a wide range of dense retrievers without modifying the underlying model architecture. For a dual-encoder retriever, the query encoder $f_{q} (\cdot)$ and the document encoder $f_{d} (\cdot)$ map a query q and a document d into a shared embedding space:

z_{q} = f_{q} (q), z_{d} = f_{d} (d) .

(18)

The relevance score between a query and a document is then computed by inner product, which coincides with cosine similarity when the embeddings are L2-normalized:

s (q, d) = z_{q}^{⊤} z_{d} .

(19)

Because S-Gens changes only the composition of the training triplets rather than the scorer or encoder structure, it can be seamlessly integrated with classical dual-encoder retrievers, distillation-based retrieval models, and recent embedding-based retrievers.

3.6.3. Contrastive Training Objective

Given a query q, its positive document $d^{+}$, and a set of negative documents $N (q)$, the retriever is trained with a contrastive objective:

L_{ret} = - log \frac{exp (s (q, d^{+}) / γ)}{exp (s (q, d^{+}) / γ) + \sum_{d^{-} \in N (q)} exp (s (q, d^{-}) / γ)},

(20)

where $γ$ is a temperature parameter.

When S-Gens is used, the negative set $N (q)$ may include both conventional hard negatives and the structural hard negatives introduced in Section 3.4. Compared with standard training, this leads to a more informative contrastive signal, as the model is required not only to separate relevant from irrelevant documents, but also to distinguish structurally valid evidence from semantically attractive distractors.
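To make Eq. (20) concrete, the following sketch evaluates the loss for a single query from precomputed scores. It is written in plain Python for readability rather than as a training-ready implementation; the function name `contrastive_loss` is an assumption, and the default temperature follows the value reported in Section 4.1.4.

```python
import math


def contrastive_loss(s_pos, s_negs, gamma=0.05):
    """Eq. (20): negative log-softmax of the positive score at temperature gamma.

    s_pos: score s(q, d+); s_negs: list of scores s(q, d-) for d- in N(q).
    """
    logits = [s_pos / gamma] + [s / gamma for s in s_negs]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denominator = max_logit + math.log(
        sum(math.exp(x - max_logit) for x in logits)
    )
    return log_denominator - logits[0]


# A decoy scoring close to the positive yields a larger loss than an easy negative.
assert contrastive_loss(0.8, [0.75]) > contrastive_loss(0.8, [0.2])
```

This is why semantically close structural negatives carry more gradient signal than random negatives: they contribute non-negligible terms to the denominator.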

3.6.4. Optional Weighted Supervision with Synthetic Reliability

To account for varying confidence in synthetic instances, we optionally assign each synthetic triplet a quality-aware weight derived from the consistency filtering score. This variant is treated as an auxiliary training option rather than the default setting for all reported results. Let $w_{i}$ denote the importance weight of the i-th training instance. The weighted retrieval loss is then written as

L_{train} = \sum_{i = 1}^{N} w_{i} L_{ret}^{(i)},

(21)

where N is the number of training instances in the mini-batch.

For original training instances, we set $w_{i} = 1$. For synthetic instances, the weight can be defined as a monotonic function of the consistency score produced in Section 3.5, so that more structurally reliable samples contribute more strongly to parameter updates. In practice, this strategy further stabilizes training when the synthetic set is large or contains samples of varying difficulty.
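The weighting in Eq. (21) can be sketched as follows. The paper only requires the synthetic weight to be monotonic in the consistency score, so the particular linear map, its `floor` parameter, and both function names are hypothetical choices made for illustration.

```python
def synthetic_weight(q_score, floor=0.5):
    """Hypothetical monotonic map from a consistency score to a weight in [floor, 1].

    Scores are clipped to [0, 1] before the linear map.
    """
    clipped = min(max(q_score, 0.0), 1.0)
    return floor + (1.0 - floor) * clipped


def weighted_batch_loss(losses, q_scores):
    """Eq. (21): sum of w_i * L_ret^(i) over a mini-batch.

    q_scores[i] is None for original instances (w_i = 1) and the
    consistency score Q(d+, P) for synthetic instances.
    """
    total = 0.0
    for loss, q_score in zip(losses, q_scores):
        weight = 1.0 if q_score is None else synthetic_weight(q_score)
        total += weight * loss
    return total
```

Any other non-decreasing function of the score would satisfy the description equally well; the floor simply keeps accepted synthetic samples from being weighted down to zero.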

3.6.5. Optimization Effect

The integration strategy of S-Gens improves retriever learning in two complementary ways. First, the path-consistent positives encourage the model to associate queries with documents that better preserve latent reasoning structure. Second, the semantic-decoy negatives sharpen the representation boundary by forcing the model to reject superficially relevant but structurally invalid evidence. As a result, the retriever learns a representation space that is more robust for multi-hop and reasoning-intensive retrieval scenarios.

As all synthetic generation and filtering procedures are performed offline, the final retriever preserves the same inference-time architecture and computational cost as its original backbone. Therefore, S-Gens offers a practical way to enhance reasoning-oriented retrieval performance without introducing additional online complexity. Algorithm 3 summarizes how the filtered synthetic triplets are integrated into retriever training.

Algorithm 3 Training integration with synthetic triplets

Require: Original training set $D_{orig}$, synthetic triplets $D_{syn}$, retriever R, synthetic ratio $η$
Ensure: Trained retriever R

1: Construct mixed training set $D_{train} \sim Mix (D_{orig}, D_{syn}; η)$ as in Equation (17)
2: for all mini-batch $B \subset D_{train}$ do
3:   Encode queries and documents with R
4:   Compute contrastive retrieval loss
5:   Update retriever parameters
6: end for
7: return R

3.7. Complexity Analysis and Discussion

In this subsection, we analyze the computational characteristics of S-Gens and discuss its practical deployment properties. As S-Gens is designed as an offline data augmentation framework, its additional cost is incurred primarily during synthetic data construction rather than during online retrieval. As a result, the final retriever maintains the same inference architecture and serving complexity as the original backbone model.

3.7.1. Offline Construction Cost

The offline cost of S-Gens mainly comes from three stages: reasoning path extraction, synthetic instance generation, and consistency filtering.

For path extraction, let $| E |$ denote the number of anchor entities involved in the data construction process, and let b and L denote the average branching factor and the maximum path length in the knowledge graph, respectively. Under bounded depth-first search, the worst-case complexity of path discovery is approximately

O (| E | \cdot b^{L}) .

(22)

In practice, however, this cost is substantially reduced by restricting the search depth, pruning low-frequency relations, and retaining only a small number of high-quality candidate paths for each anchor pair.
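The bounded depth-first search behind Eq. (22) can be sketched on a toy adjacency-list knowledge graph. The data layout and the name `bounded_dfs_paths` are assumptions, and the relation pruning and per-pair path cap mentioned above are omitted for brevity.

```python
def bounded_dfs_paths(kg, source, target, max_len):
    """Enumerate simple relation paths from source to target with at most max_len hops.

    kg: dict mapping an entity to a list of (relation, neighbor) pairs.
    Returns a list of paths, each a list of (head, relation, tail) triples.
    """
    paths = []

    def visit(entity, path, seen):
        if entity == target and path:
            paths.append(list(path))
            return
        if len(path) == max_len:
            return  # depth bound reached: this is what caps the cost at b^L
        for relation, neighbor in kg.get(entity, []):
            if neighbor in seen:
                continue  # keep paths simple (no repeated entities)
            path.append((entity, relation, neighbor))
            visit(neighbor, path, seen | {neighbor})
            path.pop()

    visit(source, [], {source})
    return paths


# Toy graph: two ways to reach C from A within two hops.
toy_kg = {"A": [("r1", "B"), ("r3", "C")], "B": [("r2", "C")], "C": [("r4", "D")]}
assert len(bounded_dfs_paths(toy_kg, "A", "C", max_len=2)) == 2
```

Each call expands at most b neighbors per level and never goes deeper than L levels, which gives the $b^{L}$ factor per anchor entity.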

For synthetic query and document construction, let $N_{p}$ denote the number of retained reasoning paths and let $C_{gen}$ denote the average generation cost per path under the language model. The total generation cost can be written as

O (N_{p} \cdot C_{gen}) .

(23)

Because this stage is performed offline and can be parallelized across paths, it does not affect retrieval-time latency.

For consistency filtering, let $N_{s}$ denote the number of synthetic instances and let $C_{gnn}$ denote the average cost of encoding one graph pair using the Siamese GNN. The overall filtering cost is

O (N_{s} \cdot C_{gnn}) .

(24)

Similar to generation, this stage is also fully offline and can be batched efficiently.

To complement the asymptotic analysis above, we further report a compact practical cost summary under the default matched-budget setting across the five benchmarks. In this setting, we use $L \in \{2, 3, 4\}$, $η = 30\%$, up to two anchor entities per instance, at most five candidate paths per anchor pair, Qwen2.5-32B-Instruct for synthetic query generation, and a two-layer R-GCN consistency filter. On average, the pipeline retains about 2.7 reasoning paths per instance and keeps approximately 180,000 filtered synthetic triplets in total for training across the five benchmarks. The offline query-generation stage takes about 19.6 h, consistency filtering takes about 4.3 h, and the total offline preprocessing time is about 27.4 h on a machine with four RTX 3090 GPUs and one 20-core CPU. The remaining preprocessing time (about 3.5 h) is mainly spent on entity linking, path extraction, and pruning. These figures should be understood as representative orders of magnitude rather than fixed constants, as the exact runtime varies with the benchmark and backbone. Importantly, all of these additional costs are incurred only during offline supervision construction, while the trained retriever preserves the same inference-time architecture and online complexity as the original backbone.

3.7.2. Training-Time Overhead

During retriever training, the main additional cost comes from the increased number of training triplets after synthetic augmentation. Let $η$ denote the synthetic data ratio introduced in Section 3.6. The training cost scales approximately linearly with the total number of instances in the mixed dataset. Therefore, compared with the original training procedure, the additional optimization cost is mainly determined by the value of $η$ and the number of structural hard negatives used per query.

Importantly, S-Gens does not require any modification to the retriever architecture, scoring function, or training objective beyond the use of augmented triplets. This means that existing training pipelines for dual-encoder or distillation-based retrievers can incorporate S-Gens with minimal engineering overhead.

3.7.3. Inference-Time Efficiency

A central design goal of S-Gens is to improve reasoning-intensive retrieval without introducing extra online complexity. After training is completed, the downstream retriever operates in exactly the same way as the original model. Given a query, the system still performs standard dense encoding and vector similarity search over the document corpus. No knowledge graph traversal, graph neural inference, or synthetic generation is required at serving time.

Therefore, if the original retriever has inference complexity

O (C_{enc} + C_{index}),

(25)

where $C_{enc}$ denotes the query encoding cost and $C_{index}$ denotes the vector index search cost, then the S-Gens-enhanced retriever preserves the same order of online complexity:

O (C_{enc} + C_{index}) .

(26)

This property distinguishes S-Gens from graph-enhanced retrieval pipelines that rely on online graph expansion, iterative reasoning, or additional reranking modules during inference.

3.7.4. Scalability and Practical Considerations

In practice, the scalability of S-Gens depends on several controllable factors, including the maximum path length, the number of candidate paths retained per anchor pair, the synthetic data ratio $η$, and the filtering thresholds used in Section 3.3, Section 3.4 and Section 3.5. These design choices allow the framework to be adjusted according to available computational resources and target application requirements.

A moderate synthetic ratio is typically sufficient to provide meaningful structural supervision while avoiding excessive expansion of the training set. Likewise, bounded path length and threshold-based filtering help control noise accumulation and keep the offline construction pipeline computationally manageable. As all additional costs are incurred before deployment, S-Gens is particularly suitable for applications where training-time augmentation is acceptable but inference-time efficiency must be preserved.

3.7.5. Discussion

The complexity profile of S-Gens reflects a deliberate trade-off: it shifts part of the burden of reasoning enhancement from online retrieval to offline supervision construction. This trade-off is attractive in many dense retrieval scenarios, especially when the retriever is expected to serve at scale and low latency. By investing additional computation during data construction, S-Gens improves the structural quality of supervision and enables the downstream retriever to internalize reasoning-oriented distinctions within its embedding space.

Overall, S-Gens provides a practical compromise between reasoning-aware retrieval quality and deployment efficiency. It enriches training supervision through structure-aware synthetic data, while preserving the simplicity and speed of standard dense retrieval at inference time.

4. Experiments

In this section, we conduct comprehensive experiments to evaluate the effectiveness of S-Gens in enhancing dense retrieval for reasoning-intensive tasks. Our evaluation is designed to answer the following research questions.

RQ1 (Overall Performance): Can the structure-aware synthetic data generated by S-Gens consistently improve retriever performance across different model architectures?
RQ2 (Reasoning Enhancement): Compared with conventional semantic supervision, to what extent does S-Gens improve multi-hop reasoning and evidence-chain retrieval?
RQ3 (Robustness and Ablation): How do structure-aware hard negatives, especially semantic decoys, contribute to decision boundary shaping and robustness against misleading evidence?

4.1. Experimental Setup

4.1.1. Evaluation Datasets

To comprehensively evaluate retriever performance under different levels of reasoning difficulty, we consider five widely used public benchmark datasets and group them into three categories.

General Retrieval

MS MARCO Passage Ranking [22] is one of the most established large-scale benchmarks for passage retrieval, containing approximately 8.8 million candidate passages. Following the standard evaluation protocol, we use the development set containing 6980 queries for evaluation.

Complex Reasoning Question Answering

HotpotQA [7] is a representative multi-hop question answering benchmark that requires retrieving and combining multiple supporting documents to answer a query correctly. WebQuestionsSP (WebQSP) [21] is a classical knowledge-intensive question answering dataset in which many questions exhibit limited direct lexical overlap with supporting evidence and therefore require more implicit structural reasoning.

General Factoid Question Answering

Natural Questions (NQ) [23] is a large-scale open-domain question answering dataset derived from real user search queries. TriviaQA [24] is a factoid question answering benchmark with relatively low lexical overlap between questions and answers, making it suitable for evaluating semantic generalization in retrieval.

4.1.2. Evaluation Protocol

For evaluation metrics, we adopt the official MRR@10 metric for MS MARCO. For the other four open-domain question answering datasets, we use Recall@20 (R@20) as the primary metric, as our focus is on the retrieval stage and the coverage of relevant evidence. In our setup, R@20 follows the benchmark-specific relevance annotations provided by the corresponding retrieval corpora: for NQ and TriviaQA, a query is counted as covered when the top-20 retrieved results contain at least one answer-bearing passage; for WebQSP and HotpotQA, a query is counted as covered when the retrieved set contains annotated supporting evidence sufficient for the benchmark answer under the retrieval setting. We do not treat superficial lexical overlap alone as successful retrieval.

Formally, for a query q and top-k retrieved results $Top_{k} (q)$, let $y_{k} (q) \in \{0, 1\}$ denote the benchmark-specific coverage indicator under the criterion described above. Recall@k is then computed as

R @ k = \frac{1}{| Q |} \sum_{q \in Q} y_{k} (q) .

(27)

In this work, we report $k = 20$ for the QA-style datasets. This formulation allows the relevance condition to remain dataset-specific: for single-evidence settings such as NQ and TriviaQA, $y_{k} (q) = 1$ when at least one answer-bearing passage is retrieved, whereas for WebQSP and HotpotQA it follows the benchmark-specific supporting-evidence coverage criterion used in our retrieval setup.
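A minimal sketch of Eq. (27) with a pluggable coverage indicator is given below. The `require_all` switch is an assumed simplification of the multi-hop criterion (all annotated supporting documents must appear in the top-k); the exact benchmark-specific rule is not spelled out in the text, and the function name is hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=20, require_all=False):
    """Eq. (27): mean of the coverage indicator y_k(q) over all queries.

    ranked_ids: dict query -> ranked list of document ids.
    relevant_ids: dict query -> set of annotated relevant document ids.
    require_all: False -> any relevant passage counts (single-evidence style);
                 True  -> every annotated supporting document must be retrieved
                          (an assumed reading of the multi-hop criterion).
    """
    covered = 0
    for query, ranking in ranked_ids.items():
        top_k = set(ranking[:k])
        gold = relevant_ids[query]
        hit = gold <= top_k if require_all else bool(gold & top_k)
        covered += int(hit and bool(gold))  # queries without gold never count
    return covered / len(ranked_ids)
```

Keeping the indicator as a switch mirrors the dataset-specific relevance condition while leaving the averaging step identical across benchmarks.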

In addition, the decoy-rejection metric reported later is defined with respect to the semantic-decoy pool constructed for each synthetic query. For a query q with decoy set $D_{decoy} (q)$, we define DR@10 as

DR @ 10 = \frac{1}{| Q |} \sum_{q \in Q} \mathbb{1} [Top_{10} (q) \cap D_{decoy} (q) = \emptyset] .

(28)

This metric therefore measures whether semantically attractive but structurally misleading decoys are successfully excluded from the top-ranked retrieval results.
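Eq. (28) translates directly into a set-intersection check per query. The sketch below assumes rankings and decoy pools are given as plain Python containers; the function name `decoy_rejection_at_k` is hypothetical.

```python
def decoy_rejection_at_k(ranked_ids, decoy_ids, k=10):
    """Eq. (28): fraction of queries whose top-k list contains no semantic decoy.

    ranked_ids: dict query -> ranked list of document ids.
    decoy_ids: dict query -> set of decoy document ids for that query.
    """
    clean = sum(
        1
        for query, ranking in ranked_ids.items()
        if not (set(ranking[:k]) & decoy_ids.get(query, set()))
    )
    return clean / len(ranked_ids)
```

Because a single decoy in the top-k sets the indicator to zero for that query, the metric is strict: it does not reward rankings that merely push decoys lower within the top 10.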

Unless otherwise stated, all reported means and standard deviations are computed over three runs with different random seeds.

Statistical significance was assessed using a two-sided paired t-test over three runs with matched random seeds, and results were considered statistically significant when $p < 0.01$. Because the ablation experiments were conducted with only three seeds, we interpret these significance results as supportive rather than exhaustive statistical evidence.

Because the present study focuses on retriever training rather than downstream reader optimization, these metrics should be interpreted as retrieval-stage measures of evidence coverage and decoy rejection. They do not by themselves constitute a full end-to-end QA evaluation, but they provide a direct estimate of whether S-Gens improves the retrieval signal delivered to downstream reasoning modules.

4.1.3. Baseline Models

To demonstrate the model-agnostic property of S-Gens and its effectiveness across diverse retrieval architectures, we compare against eight representative baselines covering four technical paradigms.

Sparse Retrieval Baseline

BM25 [25] is included as a classical term-matching baseline and serves as a non-trainable lower-bound reference.

Classical Dual-Encoder Retrievers and Hard-Negative Mining Models

DPR [5] represents the standard dual-encoder retrieval framework trained with in-batch negatives. ANCE [6] extends this framework by introducing global approximate nearest neighbor negative mining. RocketQA [8] further improves dual-encoder training through cross-batch negatives and denoising strategies.

Knowledge Distillation-Based Retriever

Margin-MSE [15] is included as a representative distillation-based retrieval model that transfers ranking knowledge by minimizing margin differences between teacher and student scores.

Recent LLM-Based Embedding Models

To verify that S-Gens remains effective in the era of large embedding models, we additionally evaluate three recent strong embedding backbones. BGE-M3 [26] is a versatile embedding model designed for multilingual, multi-granularity, and multi-function retrieval. E5-Mistral-7B-Instruct [27] is an instruction-tuned embedding model built on a large language model backbone. NV-Embed-v2 [28] is a recent high-performing general-purpose embedding model with strong representation capability across retrieval benchmarks.

For all trainable baselines (that is, every baseline except BM25), we fine-tune or continue training the retriever on the mixed corpus formed by the original training data and the synthetic data generated by S-Gens, in order to measure the relative performance gains brought by structure-aware augmentation.

4.1.4. Implementation Details

For classical dual-encoder retrievers such as DPR and ANCE, both query and document encoders are initialized with bert-base-uncased. For LLM-based embedding models such as E5-Mistral, we freeze most backbone parameters and apply parameter-efficient fine-tuning using LoRA to keep the training cost manageable. All experiments are conducted on a multi-GPU cluster. For large embedding models, we additionally use memory-saving strategies such as gradient checkpointing and parameter-efficient adaptation.

For S-Gens, the maximum reasoning path length is restricted to $L \in \{2, 3, 4\}$ during positive synthesis, and the synthetic data ratio is set to $30\%$ unless otherwise specified. We use AdamW as the optimizer. The learning rate is set to $2 \times 10^{-5}$ for classical dual-encoder retrievers and $1 \times 10^{-4}$ for LLM-based embedding models, with a linear warmup ratio of $10\%$. Unless otherwise specified, the main results use the standard contrastive objective in Section 3.6 with temperature parameter $γ = 0.05$. The quality-aware weighting strategy is treated as an optional training variant rather than a default component of all reported experiments.

The external KG is dataset-dependent: we use Freebase for WebQSP and Wikidata for HotpotQA, NQ, TriviaQA, and MS MARCO. Entity linking is performed with BLINK-base, using a confidence threshold of 0.85 by default and 0.80 for WebQSP. For each training instance, we retain up to two anchor entities; if a topic entity is explicitly available in WebQSP, it is always preserved as an anchor. Candidate reasoning paths are extracted with bounded-depth DFS, keeping at most five paths per anchor pair after pruning highly generic relations and hub-dominated paths according to relation specificity and corpus-support heuristics.

Synthetic query generation is performed with Qwen2.5-32B-Instruct. We use a matched-budget setup across all synthetic augmentation variants: the generator, instruction family, decoding settings, synthetic sample count, synthetic ratio, and retriever training budget are fixed, while only the presence or absence of path conditioning and structural negatives is changed. Specifically, path-guided variants use a prompt containing task instruction, path triples, and a contextual snippet; the semantic-only control removes the path triples while keeping the remaining prompt template unchanged. Decoding uses temperature 0.7, top-p 0.9, maximum generation length 64, and two returned candidates per prompt. The contextual snippet is selected from the original positive passage or an entity-linked Wikipedia summary and truncated to 120 tokens.
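The matched-budget prompt setup can be sketched as a single template with a switch for the path triples. The template wording, the field names in `DecodingConfig`, the function `build_prompt`, and the use of whitespace tokens for truncation are all illustrative assumptions; only the numeric settings come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodingConfig:
    """Decoding settings reported above; field names are hypothetical."""
    temperature: float = 0.7
    top_p: float = 0.9
    max_new_tokens: int = 64
    num_candidates: int = 2


def build_prompt(instruction, path_triples, snippet, use_path=True,
                 max_snippet_tokens=120):
    """Assemble a generation prompt; the semantic-only control drops the path triples."""
    # Whitespace tokens are a simplification of the real tokenizer.
    tokens = snippet.split()[:max_snippet_tokens]
    parts = [instruction]
    if use_path:
        triples = " ; ".join(f"({h}, {r}, {t})" for h, r, t in path_triples)
        parts.append("Path: " + triples)
    parts.append("Context: " + " ".join(tokens))
    return "\n".join(parts)
```

Calling `build_prompt(..., use_path=False)` yields the same prompt minus the path line, so the two variants differ only in path conditioning.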

For relation support detection, we combine entity alias matching, same-sentence co-occurrence, and relation verbalizer matching against a small synonym lexicon. The consistency filter uses a two-layer R-GCN with hidden size 128, relation embedding size 64, and mean pooling for graph-level aggregation. It is trained with binary pair classification plus a margin-ranking objective over matched and mismatched path–document graph pairs.

Throughout the method, the threshold symbols play distinct roles: $τ_{p}$ denotes the node-coverage prefilter threshold for candidate positives, $τ_{n}$ denotes the structural inconsistency threshold for candidate negatives, $τ_{s}$ denotes the minimum semantic similarity required for semantic-decoy selection, and $τ_{q}^{+}$ and $τ_{q}^{-}$ denote the acceptance and rejection thresholds used by the Siamese-GNN-based consistency filter. In our experiments, these thresholds are set to $τ_{p} = 0.75$, $τ_{n} = 0.35$, $τ_{s} = 0.80$, $τ_{q}^{+} = 0.72$, and $τ_{q}^{-} = 0.48$. The symbol $γ$ is reserved for the temperature parameter in the retriever contrastive objective.

4.1.5. Reproducibility and Data Leakage Prevention

To avoid leakage between supervision construction and evaluation, all synthetic generation procedures are restricted to the training split of each benchmark. In particular, anchor extraction, reasoning path discovery, synthetic query generation, semantic-decoy construction, and consistency filtering are performed only for training instances. Development and test queries are never used as seeds for synthetic augmentation, and no gold evidence from evaluation splits is used to construct additional training triplets. The retrieval benchmarks are then evaluated on their standard held-out splits under the protocol described in Section 4.1.2.

4.2. Main Results and Analysis

To answer RQ1 (overall performance improvement) and RQ2 (reasoning enhancement), we evaluate all baseline retrievers with and without S-Gens on five public benchmark datasets. Table 2 and Table 3 report the MRR@10 results on MS MARCO and the Recall@20 (R@20) results on NQ, TriviaQA, WebQSP, and HotpotQA. The values in parentheses indicate the absolute improvement obtained after joint training with the structure-aware synthetic data generated by S-Gens. Unless explicitly stated otherwise, the discussion in this subsection should be interpreted as evidence about retrieval-stage supervision quality and evidence coverage rather than as a direct end-to-end QA comparison.

4.2.1. Universal Improvement and Model-Agnosticity

As shown in Table 2 and Table 3, S-Gens brings consistent performance gains to all trainable baselines. The improvements are observed not only for classical dual-encoder retrievers such as DPR and ANCE, but also for stronger baselines with more advanced sampling or distillation strategies, including RocketQA and Margin-MSE. Similar gains are further observed on recent large embedding models such as BGE-M3, E5-Mistral-7B-Instruct, and NV-Embed-v2, indicating that the benefits of S-Gens generalize across both conventional retrievers and modern large-scale embedding models.

These results support the model-agnostic nature of S-Gens. As the framework operates entirely at the data level, it does not depend on modifications to the encoder architecture, retrieval scorer, or inference pipeline. Instead, it improves retriever learning by introducing higher-quality supervision that better reflects the structural requirements of reasoning-intensive retrieval.

4.2.2. Larger Gains on Reasoning-Intensive Benchmarks

A clear pattern in Table 2 and Table 3 is that the gains brought by S-Gens are more pronounced on reasoning-intensive datasets than on general retrieval benchmarks. On MS MARCO, NQ, and TriviaQA, the improvements are stable but moderate. For example, ANCE achieves a gain of 1.7 points on NQ and 1.6 points on TriviaQA. These tasks still benefit from stronger supervision, but they are less dependent on explicit multi-hop evidence composition.

In contrast, much larger gains are observed on WebQSP and HotpotQA. DPR improves by 3.6 points on WebQSP and 4.2 points on HotpotQA, while ANCE gains 3.4 and 3.7 points on the same benchmarks, respectively. Margin-MSE also shows clear improvements of 2.6 points on WebQSP and 3.2 points on HotpotQA. This asymmetric improvement pattern is consistent with our motivation: WebQSP relies heavily on implicit relational structure, whereas HotpotQA requires cross-document reasoning and bridge-entity composition. Conventional semantic supervision is often insufficient for these settings because it cannot effectively teach the retriever how to distinguish structurally valid evidence from semantically attractive but logically incomplete candidates.

Figure 2 further illustrates this trend. Across different retriever families, the relative improvements on reasoning-intensive benchmarks are consistently larger than those on general retrieval tasks, indicating that S-Gens is particularly effective when successful retrieval depends on preserving latent evidence chains rather than matching isolated semantic cues.
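
The asymmetry can also be read directly from the parenthesized gains in Table 2 and Table 3. The short calculation below is only an illustrative summary: it averages the absolute R@20 gains of the seven trainable retrievers per benchmark (MS MARCO is omitted because MRR@10 is on a different scale), and the variable names are ours.

```python
from statistics import mean

# Absolute R@20 gains copied from Table 2 and Table 3, in the order
# DPR, ANCE, RocketQA, Margin-MSE, BGE-M3, E5-Mistral-7B-Instruct, NV-Embed-v2.
gains = {
    "NQ": [2.1, 1.7, 1.5, 1.3, 1.1, 1.0, 0.9],
    "TriviaQA": [1.8, 1.6, 1.4, 1.5, 1.1, 1.0, 0.9],
    "WebQSP": [3.6, 3.4, 2.9, 2.6, 2.4, 2.2, 1.9],
    "HotpotQA": [4.2, 3.7, 3.3, 3.2, 2.9, 2.5, 2.2],
}

mean_gain = {name: round(mean(values), 2) for name, values in gains.items()}
print(mean_gain)
# {'NQ': 1.37, 'TriviaQA': 1.33, 'WebQSP': 2.71, 'HotpotQA': 3.14}
```

Averaged this way, the gains on WebQSP and HotpotQA are roughly twice those on NQ and TriviaQA.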

4.2.3. Complementarity with Strong LLM-Based Embeddings

Another notable observation is that S-Gens remains effective even when applied to strong large-scale embedding models. As shown in Table 2 and Table 3, BGE-M3, E5-Mistral-7B-Instruct, and NV-Embed-v2 already achieve strong retrieval performance due to large-scale pretraining and broad semantic coverage. Nevertheless, S-Gens still yields measurable and consistent gains for all of them.

For example, NV-Embed-v2 achieves an additional 2.2-point improvement on HotpotQA after incorporating S-Gens, reaching 75.1 in R@20. BGE-M3 and E5-Mistral-7B-Instruct also show gains of 2.9 and 2.5 points on the same benchmark, respectively. These results suggest that large parameter scale and generic pretraining alone do not automatically provide sufficiently strong reasoning-oriented discrimination. The structure-aware positives and semantic-decoy negatives introduced by S-Gens provide a complementary form of supervision, pushing even strong embedding models toward more precise structural judgment in complex retrieval scenarios.

4.3. Ablation Study

To further understand the contribution of each component in S-Gens, we conduct a series of ablation experiments on two challenging reasoning-intensive benchmarks, namely, WebQSP and HotpotQA. Unless otherwise specified, all experiments in this section use ANCE as the backbone retriever, which provides a strong and stable baseline for analyzing the effect of different structure-aware augmentation strategies.

4.3.1. Matched-Budget Comparison with Semantic-Only Augmentation

To isolate the contribution of structural supervision from the generic benefit of synthetic data augmentation, we compare S-Gens against matched-budget semantic-only controls. In all variants, we keep the generator, prompt family, decoding parameters, synthetic sample count, synthetic ratio, and retriever training budget fixed. The only differences are whether KG paths are provided to the generator and whether structural hard negatives are explicitly constructed. Table 4 reports the resulting performance.

The matched-budget results show that synthetic data alone cannot explain the full gains of S-Gens. On ANCE, semantic-only augmentation improves WebQSP from 73.1 to 74.0 and HotpotQA from 67.3 to 68.1, indicating that additional generated supervision is useful. However, path-guided positives yield larger gains under the same generation budget, and the full S-Gens framework performs best throughout. This gap is especially clear on HotpotQA DR@10, where structural negatives alone already raise the score from 71.8 to 76.8, showing that their main effect is to sharpen the decision boundary against semantically attractive but structurally invalid evidence.

The same pattern holds on the stronger BGE-M3 backbone. Although the absolute gains are smaller, semantic-only augmentation still underperforms path-guided supervision, and the full framework remains best on both WebQSP and HotpotQA. On ANCE, Full S-Gens is significantly better than both original and semantic-only augmentation under the paired t-test criterion (p < 0.01), and it also yields a significant improvement in HotpotQA DR@10 over path-guided positives. On BGE-M3, Full S-Gens is significantly better than original (p < 0.01), whereas the gains over semantic-only augmentation do not reach this threshold. These results support our central claim that the benefit of S-Gens comes from structure-aware supervision itself rather than from merely adding more synthetic queries.
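
As a reminder of what the paired t-test computes, the sketch below derives the test statistic from two aligned score lists. It is a generic illustration with hypothetical numbers: this section does not state whether pairs are formed over queries or over runs, so the pairing unit is left to the caller, and the p-value would be obtained by comparing the returned statistic against Student's t distribution with the returned degrees of freedom.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for the paired differences a[i] - b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical paired scores for two systems (not values from this paper).
system_a = [1.0, 2.0, 4.0]
system_b = [0.0, 0.0, 1.0]
t, df = paired_t_statistic(system_a, system_b)
print(round(t, 3), df)  # 3.464 2
```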

4.3.2. Effectiveness of Structural Hard Negatives

One of the key design choices in S-Gens is the introduction of semantic-decoy hard negatives, namely, documents that remain lexically or semantically similar to the query while being structurally inconsistent with the target reasoning path. To verify the contribution of this design, we compare the following four variants:

1. Raw ANCE, trained only on the original data;
2. + Path-Positives Only, which adds path-based synthetic positives while keeping the original negative setting;
3. + Path-Positives & BM25 Negatives, which further introduces conventional BM25-mined hard negatives;
4. Full S-Gens, which uses both path-consistent positives and semantic-decoy structural hard negatives.

In addition to Recall@20, we further report DR@10 (Decoy Rejection@10), as defined in Section 4.1.2, which measures the proportion of cases in which semantically attractive but misleading decoy passages are successfully excluded from the top-10 retrieved results. In this ablation, the decoy set for each query is instantiated from the retained semantic-decoy candidate pool produced by the construction and filtering procedure in Section 3.4 and Section 3.5.
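
One way to compute a decoy rejection score is sketched below. This is an interpretation rather than the formal definition from Section 4.1.2: it assumes that a query counts as a success when none of its decoy passages appears in the top-k, and that queries without any decoys are skipped. The function name and the toy IDs are illustrative.

```python
from typing import Dict, List, Set


def decoy_rejection_at_k(ranked: Dict[str, List[str]], decoys: Dict[str, Set[str]], k: int = 10) -> float:
    """Share of queries whose top-k contains none of that query's decoy passages."""
    evaluated = [qid for qid in ranked if decoys.get(qid)]  # skip queries without decoys
    if not evaluated:
        return 0.0
    clean = sum(1 for qid in evaluated if not set(ranked[qid][:k]) & decoys[qid])
    return clean / len(evaluated)


# Toy data: the decoy x1 slips into the top-10 for q1, while q2 stays clean.
ranked = {"q1": ["p1", "p2", "x1"], "q2": ["p5", "p6"]}
decoys = {"q1": {"x1"}, "q2": {"x9"}}
print(decoy_rejection_at_k(ranked, decoys))  # 0.5
```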

As shown in Table 5, introducing path-based positives alone already improves WebQSP from 73.1 to 74.8 and HotpotQA from 67.3 to 69.0 in R@20, indicating that structurally grounded positives help alleviate the logic gap in reasoning-oriented retrieval. However, when conventional BM25 negatives are added on top of the synthetic positives, the additional gain remains limited, and DR@10 improves only marginally from 74.2 to 74.5.

In contrast, the complete S-Gens framework achieves the best performance on all metrics, reaching 76.5 on WebQSP, 71.0 on HotpotQA, and 78.6 on HotpotQA DR@10. As further illustrated in Figure 3, the gain from the full framework is especially pronounced on the decoy rejection metric. This result suggests that semantic-decoy negatives are substantially more effective than ordinary lexical hard negatives for shaping a cleaner decision boundary and improving robustness to structurally misleading evidence.

4.3.3. Necessity of the Quality Control Module

Large-scale synthetic data generation inevitably introduces noise, including entity mismatches, factual inconsistencies, and structurally invalid query–document pairs. To evaluate the necessity of the consistency filtering module described in Section 3.5, we remove the Siamese-GNN-based consistency filtering step and directly mix all generated samples into the training set with a synthetic ratio of 30%.

To further justify the necessity of the filtering stage, Table 6 compares the proposed Siamese GNN filter against simpler alternatives. The heuristic filter keeps instances based on entity coverage and alias consistency only, while the semantic filter uses semantic similarity thresholds without explicit graph consistency modeling.
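
To make the weakest baseline concrete, the sketch below shows what an entity-coverage check with alias handling could look like. It is a hypothetical illustration, not the heuristic filter evaluated in Table 6: it assumes a generated passage is kept only if every entity on the reasoning path, or one of its known aliases, is mentioned, and it uses naive case-insensitive substring matching. The function name, the alias dictionary, and the example are ours.

```python
from typing import Dict, Iterable, List


def covers_path_entities(passage: str, path_entities: Iterable[str], aliases: Dict[str, List[str]]) -> bool:
    """True if every path entity, or one of its aliases, occurs in the passage."""
    text = passage.lower()
    for entity in path_entities:
        names = [entity] + aliases.get(entity, [])
        if not any(name.lower() in text for name in names):
            return False  # at least one hop of the path is not mentioned
    return True


passage = "Marie Curie was born in Warsaw, the capital of Poland."
aliases = {"Poland": ["Republic of Poland"]}
print(covers_path_entities(passage, ["Marie Curie", "Warsaw", "Poland"], aliases))  # True
print(covers_path_entities(passage, ["Marie Curie", "Paris"], aliases))             # False
```

A check of this kind cannot tell whether the mentioned entities are connected by the intended relations, which is consistent with the gap between the heuristic filter and the graph-based filter reported in Table 6.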

Without consistency filtering, the R@20 score on WebQSP drops from 76.5 to 75.2. Training also becomes noticeably less stable across random seeds, with the standard deviation increasing from 0.31 to 0.65. Heuristic and semantic filters recover part of the loss, but they remain clearly weaker than the graph-based consistency filter, especially on HotpotQA DR@10. This pattern is consistent with the intended role of the module: simple coverage or semantic thresholds can remove obvious noise, but they are less effective at rejecting semantically plausible yet structurally misleading decoys. Therefore, the Siamese-GNN-based consistency filtering module is important for suppressing noisy synthetic instances and stabilizing training.

4.3.4. Manual Data Quality Analysis

To directly inspect the quality of the generated supervision, we manually annotate a small sample of synthetic instances drawn from WebQSP and HotpotQA. The sample contains 100 generated positives and 100 generated negatives. For positives, we evaluate whether the passage faithfully preserves the intended reasoning structure. For negatives, we evaluate whether the passage is a valid semantic decoy, namely, semantically plausible yet structurally inconsistent with the target path. Table 7 summarizes the precision before and after filtering.

The manual inspection is consistent with the downstream retrieval results. Even before filtering, most generated instances are usable, but there is still non-trivial noise from entity mismatch, partial-path support, and weak decoys. The heuristic filter removes part of this noise, while the Siamese GNN filter yields the cleanest supervision, improving positive structural faithfulness from 82% to 92% and negative decoy validity from 78% to 89%. These gains help explain why graph-based consistency filtering delivers the strongest final retrieval performance.

4.3.5. Reasoning-Oriented Evidence Coverage

To complement the standard retrieval metrics, we additionally evaluate whether the retriever covers the complete multi-hop evidence chain required by HotpotQA. Specifically, we report both-support Recall@20, which counts a query as successful only when the top-20 retrieved results simultaneously contain both official supporting passages. Table 8 summarizes the results.
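
The both-support criterion can be sketched as follows, assuming that rankings and gold supporting passages are given as dictionaries keyed by query ID (the function name and toy IDs are illustrative). A query is counted only when the full set of supporting passages is contained in the top-k.

```python
from typing import Dict, List, Set


def both_support_recall_at_k(ranked: Dict[str, List[str]], supports: Dict[str, Set[str]], k: int = 20) -> float:
    """Share of queries whose top-k contains every gold supporting passage."""
    if not ranked:
        return 0.0
    hits = 0
    for qid, docs in ranked.items():
        gold = supports.get(qid, set())
        if gold and gold <= set(docs[:k]):  # subset test: all supports retrieved
            hits += 1
    return hits / len(ranked)


# Toy data: q1 retrieves both supports, q2 retrieves only one of them.
ranked = {"q1": ["a", "b", "c"], "q2": ["d", "e"]}
supports = {"q1": {"a", "c"}, "q2": {"d", "f"}}
print(both_support_recall_at_k(ranked, supports))  # 0.5
```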

This reasoning-oriented metric leads to the same conclusion as the standard retrieval results, but in a more direct form. The gain from S-Gens is not limited to retrieving a single answer-bearing passage; instead, the method more reliably retrieves the complete supporting evidence set required for multi-hop reasoning. In particular, the improvement from 49.6 to 53.9 on ANCE indicates that structure-aware supervision improves evidence-chain coverage beyond what semantic-only augmentation can provide.

4.3.6. Sensitivity to the Synthetic Data Ratio

Synthetic supervision should complement, rather than replace, the original training distribution. If the synthetic ratio is too low, the retriever may not receive sufficient reasoning-oriented supervision; if it is too high, the model may overfit the distributional patterns of generated queries and lose generalization on real user queries. To study this trade-off, we vary the synthetic data ratio η ∈ {10%, 20%, 30%, 40%, 50%} and evaluate the resulting performance on WebQSP and HotpotQA.

Table 9 and Figure 4 show a consistent trend. Performance improves steadily as η increases from 10% to 30%, suggesting that additional structure-aware supervision helps refine the decision boundary and strengthen evidence-chain retrieval. However, when the synthetic ratio exceeds 30%, the gain begins to saturate and eventually declines slightly, indicating that excessive reliance on generated data may dilute the original task distribution and reduce generalization.

Based on these results, we set η = 30% as the default configuration in the rest of our experiments. This choice provides a favorable balance between introducing sufficient structure-aware supervision and preserving the generalization benefits of the original training distribution.
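
A mixing step of this kind could be implemented as in the sketch below. It rests on an assumption that should be checked against the method section: η is read here as the share of synthetic examples in the final mixed training set, so that all original examples are kept and η / (1 − η) times as many synthetic examples are sampled. The function name `mix_training_set` and the placeholder examples are illustrative.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_training_set(original: Sequence[T], synthetic: Sequence[T], eta: float, seed: int = 0) -> List[T]:
    """Keep all original examples and add synthetic ones so they make up a fraction eta of the result."""
    if not 0.0 <= eta < 1.0:
        raise ValueError("eta must lie in [0, 1)")
    n_syn = round(eta / (1.0 - eta) * len(original))
    if n_syn > len(synthetic):
        raise ValueError("not enough synthetic examples for the requested ratio")
    rng = random.Random(seed)  # fixed seed for a reproducible mixture
    mixed = list(original) + rng.sample(list(synthetic), n_syn)
    rng.shuffle(mixed)
    return mixed


original = [("orig", i) for i in range(700)]
synthetic = [("syn", i) for i in range(1000)]
mixed = mix_training_set(original, synthetic, eta=0.30)
n_synthetic = sum(1 for source, _ in mixed if source == "syn")
print(len(mixed), n_synthetic)  # 1000 300
```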

5. Conclusions and Future Work

In this work, we proposed S-Gens, a structure-aware synthetic data generation framework for enhancing dense retrieval in reasoning-intensive settings. The central motivation is that existing dense retrievers are still largely trained with supervision signals centered on shallow semantic relevance, which are often insufficient for tasks requiring multi-hop reasoning, implicit evidence composition, and relational chain preservation. To address this limitation, S-Gens shifts structural reasoning signals from online inference to offline data construction.

Specifically, the proposed framework introduces three complementary components. First, it uses multi-hop relation paths extracted from an external knowledge graph to synthesize structurally grounded positive samples, thereby reducing the mismatch between training supervision and the reasoning requirements of downstream retrieval tasks. Second, it constructs semantic-decoy hard negatives that remain semantically plausible while being structurally inconsistent with the target reasoning path, enabling the retriever to learn cleaner and more robust decision boundaries. Third, it incorporates a Siamese-GNN-based consistency filtering module to assess the structural reliability of generated instances and suppress low-quality synthetic supervision.

Extensive experiments on five benchmark datasets demonstrate that S-Gens consistently improves a diverse range of trainable retrievers, including classical dual-encoder models, distillation-based retrievers, and recent large embedding models. In particular, the gains are more pronounced on reasoning-intensive benchmarks such as WebQSP and HotpotQA, indicating that structure-aware synthetic supervision is especially effective when successful retrieval depends on latent relational structure rather than direct semantic overlap alone. Additional ablation studies further verify the importance of semantic-decoy negatives, consistency filtering, and an appropriate synthetic data ratio at the retriever level.

Overall, our findings suggest that improving dense retrieval for complex reasoning tasks does not necessarily require modifying the inference-time architecture or introducing expensive online reasoning modules. Instead, carefully designed structure-aware supervision at the data level can already provide substantial and generalizable benefits for retrieval-stage evidence acquisition. In this sense, S-Gens offers a practical and model-agnostic way to bridge the gap between semantic retrieval training and reasoning-oriented retrieval demands, and it should be viewed as a training-time complement to downstream graph-guided retrieval or RAG pipelines rather than as a substitute for them.

Despite these encouraging results, several limitations remain. First, the quality of the generated supervision still depends on the coverage and reliability of the external knowledge graph. In domains where relational structure is sparse, noisy, or incomplete, the effectiveness of path-based synthesis may be constrained. Second, although the framework is inference-efficient, the offline data construction pipeline introduces additional computational cost due to path extraction, synthetic generation, and consistency filtering. Third, the current framework mainly focuses on text retrieval and does not explicitly model multimodal evidence or interactive retrieval scenarios. Fourth, while our results show consistent gains in retrieval-stage evidence coverage and decoy rejection, the full downstream impact on end-to-end question answering or generation quality still requires dedicated reader-side evaluation.

In future work, we plan to extend S-Gens in several directions. One promising direction is to incorporate richer sources of structural knowledge, such as domain-specific ontologies or dynamically induced graphs, in order to improve coverage beyond fixed knowledge graphs. Another direction is to explore adaptive synthetic data scheduling, where the ratio and difficulty of generated instances are adjusted according to retriever training dynamics. It is also worthwhile to investigate whether structure-aware synthetic supervision can benefit related tasks such as reranking, retrieval-augmented generation, and agentic multi-step information seeking, and to evaluate these gains under fixed downstream readers or generators. Finally, we believe that combining structure-aware supervision with stronger generator models and more reliable automatic verification mechanisms may further improve the scalability and generalization of reasoning-oriented retrieval systems.

Author Contributions

Conceptualization, Z.L. and Y.X.; methodology, Y.X.; software, Y.X.; validation, Y.X. and S.C.; formal analysis, Z.L.; investigation, Y.X.; resources, S.C.; data curation, Y.X.; writing—original draft preparation, Y.X.; writing—review and editing, Z.L. and S.C.; visualization, Y.X.; supervision, Z.L. and S.C.; project administration, S.C.; funding acquisition, Z.L. and S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the Ministry of Education industry–university cooperative education project grant number 231101418285337 and in part by Shanghai University under grant number 22H00324.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available from the corresponding author upon reasonable request.

Acknowledgments

During the preparation of this manuscript, the author(s) used AI-assisted technologies strictly for the purposes of language polishing and English grammar correction. All scientific reasoning, experimental design, and data analysis were conducted independently by the authors. The authors have carefully reviewed and validated all outputs and take full responsibility for the final content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ANCE	Approximate Nearest Neighbor Negative Contrastive Learning
BM25	Best Matching 25
DPR	Dense Passage Retrieval
DR@10	Decoy Rejection at 10
FiD	Fusion-in-Decoder
GNN	Graph Neural Network
KG	Knowledge Graph
LLM	Large Language Model
LoRA	Low-Rank Adaptation
MDR	Multi-hop Dense Retrieval
MRR@10	Mean Reciprocal Rank at 10
NQ	Natural Questions
R@20	Recall at 20
RAG	Retrieval-Augmented Generation
S-Gens	Structure-Aware Synthetic Data Generation
WebQSP	WebQuestionsSP

References

1. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.T.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 9459–9474.
2. Izacard, G.; Grave, E. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 19–23 April 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 874–880.
3. Xiong, W.; Li, X.L.; Iyer, S.; Du, J.; Lewis, P.; Wang, W.Y.; Mehdad, Y.; Yih, S.; Riedel, S.; Kiela, D.; et al. Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval. In Proceedings of the 9th International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021; OpenReview.net: Alameda, CA, USA, 2021; pp. 12489–12507.
4. Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; Hajishirzi, H. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. In Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024; OpenReview.net: Alameda, CA, USA, 2024; pp. 9112–9141.
5. Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.T. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Online, 16–20 November 2020; pp. 6769–6781.
6. Xiong, L.; Xiong, C.; Li, Y.; Tang, K.; Liu, J.; Bennett, P.N.; Ahmed, J.; Overwijk, A. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. In Proceedings of the 9th International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021; OpenReview.net: Alameda, CA, USA, 2021; pp. 12357–12372.
7. Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W.; Salakhutdinov, R.; Manning, C.D. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 2369–2380.
8. Qu, Y.; Ding, Y.; Liu, J.; Liu, K.; Ren, R.; Zhao, W.X.; Dong, D.; Wu, H.; Wang, H. RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 5835–5847.
9. Hofstätter, S.; Althammer, S.; Schröder, M.; Sertkan, M.; Hanbury, A. Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation. arXiv 2020, arXiv:2010.02666.
10. Bonifacio, L.; Abonizio, H.; Fadaee, M.; Nogueira, R. InPars: Unsupervised Dataset Generation for Information Retrieval. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 11–15 July 2022; pp. 2387–2392.
11. Wang, L.; Yang, N.; Wei, F. Query2doc: Query Expansion with Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 9414–9423.
12. Gao, L.; Ma, X.; Lin, J.; Callan, J. Precise Zero-Shot Dense Retrieval without Relevance Labels. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada, 9–14 July 2023; pp. 1762–1777.
13. Lee, H.; Lim, S. Hybrid Retrieval-Augmented Generation: Semantic and Structural Integration for Large Language Model Reasoning. Appl. Sci. 2026, 16, 2244.
14. Zhu, X.; Xie, Y.; Liu, Y.; Li, Y.; Hu, W. Knowledge Graph-Guided Retrieval Augmented Generation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Albuquerque, NM, USA, 29 April–4 May 2025; pp. 8912–8924.
15. Hofstätter, S.; Lin, S.C.; Yang, J.H.; Lin, J.; Hanbury, A. Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 11–15 July 2021; pp. 113–122.
16. Sun, H.; Dhingra, B.; Zaheer, M.; Mazaitis, K.; Salakhutdinov, R.; Cohen, W. Open Domain Question Answering Using Early Fusion of Knowledge Bases and Text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 4231–4242.
17. Sun, H.; Bedrax-Weiss, T.; Cohen, W. PullNet: Open Domain Question Answering with Iterative Retrieval on Knowledge Bases and Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, 3–7 November 2019; pp. 2380–2390.
18. Edge, D.; Trinh, H.; Cheng, N.; Bradley, J.; Chao, A.; Mody, A.; Truitt, S.; Larson, J. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv 2024, arXiv:2404.16130.
19. Gutiérrez, B.J.; Shu, Y.; Gu, Y.; Yasunaga, M.; Su, Y. HippoRAG: Neurobiologically inspired long-term memory for large language models. In Proceedings of the 38th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 10–15 December 2024; pp. 59532–59569.
20. Dai, Z.; Zhao, V.Y.; Ma, J.; Luan, Y.; Ni, J.; Lu, J.; Bakalov, A.; Guu, K.; Hall, K.B.; Chang, M. Promptagator: Few-shot Dense Retrieval from 8 Examples. In Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023; OpenReview.net: Alameda, CA, USA, 2023; pp. 31694–31715.
21. Berant, J.; Chou, A.; Frostig, R.; Liang, P. Semantic Parsing on Freebase from Question-Answer Pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; Yarowsky, D., Baldwin, T., Korhonen, A., Livescu, K., Bethard, S., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2013; pp. 1533–1544.
22. Nguyen, T.; Rosenberg, M.; Song, X.; Gao, J.; Tiwary, S.; Majumder, R.; Deng, L. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches 2016 Co-Located with the 30th Annual Conference on Neural Information Processing Systems, Barcelona, Spain, 9 December 2016; CEUR-WS.org: Bonn, Germany, 2016; Volume 1773, pp. 1–10.
23. Kwiatkowski, T.; Palomaki, J.; Redfield, O.; Collins, M.; Parikh, A.; Alberti, C.; Epstein, D.; Polosukhin, I.; Devlin, J.; Lee, K.; et al. Natural Questions: A Benchmark for Question Answering Research. Trans. Assoc. Comput. Linguist. 2019, 7, 452–466.
24. Joshi, M.; Choi, E.; Weld, D.; Zettlemoyer, L. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1601–1611.
25. Robertson, S.E.; Jones, K.S. Relevance weighting of search terms. J. Am. Soc. Inf. Sci. Technol. 1976, 27, 129–146.
26. Chen, J.; Xiao, S.; Zhang, P.; Luo, K.; Lian, D.; Liu, Z. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. In Proceedings of the Findings of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024; pp. 2318–2335.
27. Wang, L.; Yang, N.; Huang, X.; Yang, L.; Majumder, R.; Wei, F. Improving Text Embeddings with Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; pp. 11897–11916.
28. Lee, C.; Roy, R.; Xu, M.; Raiman, J.; Shoeybi, M.; Catanzaro, B.; Ping, W. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models. In Proceedings of the 13th International Conference on Learning Representations, Singapore, 24–28 April 2025; OpenReview.net: Alameda, CA, USA, 2025; pp. 54876–54899.

Figure 1. Overall framework of S-Gens. The framework first extracts multi-hop reasoning paths from an external knowledge graph to synthesize structurally consistent positive samples, then constructs semantically plausible but structurally inconsistent hard negatives, and, finally, applies a Siamese-GNN-based consistency filtering module before integrating the generated data into dense retriever training.

Figure 2. Performance comparison of representative retrievers before and after applying S-Gens on reasoning-intensive benchmarks. The gains are consistently larger on WebQSP and HotpotQA than on general retrieval datasets, highlighting the advantage of structure-aware synthetic supervision in multi-hop reasoning scenarios.

Figure 3. Ablation study of the core components in S-Gens. Path-based positives consistently improve retrieval performance over the raw ANCE baseline, while the full S-Gens framework yields the best results on WebQSP (R@20), HotpotQA (R@20), and HotpotQA (DR@10).

Figure 4. Sensitivity analysis of the synthetic data ratio η on WebQSP and HotpotQA. Both benchmarks achieve the best performance at η = 30%, indicating that a moderate amount of structure-aware synthetic supervision provides the best balance between reasoning enhancement and distributional stability.


Table 1. Positioning of S-Gens relative to representative graph-guided retrieval and RAG methods.

Method	Structure Used at	Online Graph Use	Inference Cost Increase	Relation
HybRAG [13]	Retrieval/reasoning	Yes	Yes	Complementary
KG²RAG [14]	Chunk expansion/organization	Yes	Yes	Complementary
GraphRAG/HippoRAG	Retrieval/evidence organization	Yes	Yes	Complementary
S-Gens (ours)	Training-time supervision construction	No	No	–

Table 2. Main experimental results of different retrieval models on MS MARCO, NQ, and TriviaQA. BM25 is a non-trainable sparse baseline and is therefore not fine-tuned. For trainable models, each score is the result after joint training with S-Gens, and the value in parentheses denotes the absolute improvement over the same model trained without S-Gens.

Model	MS MARCO (MRR@10)	NQ (R@20)	TriviaQA (R@20)
BM25	0.187	59.1	66.9
DPR	0.322 (+0.014)	78.4 (+2.1)	79.4 (+1.8)
ANCE	0.341 (+0.012)	81.9 (+1.7)	80.3 (+1.6)
RocketQA	0.370 (+0.010)	83.2 (+1.5)	82.1 (+1.4)
Margin-MSE	0.375 (+0.009)	83.8 (+1.3)	82.5 (+1.5)
BGE-M3	0.385 (+0.008)	84.9 (+1.1)	84.2 (+1.1)
E5-Mistral-7B-Instruct	0.392 (+0.007)	85.5 (+1.0)	85.0 (+1.0)
NV-Embed-v2	0.401 (+0.006)	86.4 (+0.9)	85.8 (+0.9)

Bold values indicate the best result in each metric column.

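Table 3. Main experimental results of different retrieval models on the reasoning-intensive benchmarks WebQSP and HotpotQA. For trainable models, each score is the result after joint training with S-Gens, and the value in parentheses denotes the absolute improvement over the same model trained without S-Gens.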

Model	WebQSP (R@20)	HotpotQA (R@20)
BM25	55.0	57.8
DPR	71.8 (+3.6)	61.2 (+4.2)
ANCE	76.5 (+3.4)	71.0 (+3.7)
RocketQA	74.5 (+2.9)	69.5 (+3.3)
Margin-MSE	75.2 (+2.6)	70.1 (+3.2)
BGE-M3	77.8 (+2.4)	72.5 (+2.9)
E5-Mistral-7B-Instruct	78.6 (+2.2)	73.8 (+2.5)
NV-Embed-v2	79.5 (+1.9)	75.1 (+2.2)

Bold values indicate the best result in each metric column.

Table 4. Matched-budget comparison between semantic-only and structure-aware synthetic augmentation. All results are averaged over three runs and reported as mean ± standard deviation under the same generation and training budget. “Struct.-Neg.” denotes structural negatives only.

Backbone	Setting	WebQSP (R@20)	HotpotQA (R@20)	HotpotQA (DR@10)
ANCE	Original	73.1 ± 0.17	67.3 ± 0.20	71.8 ± 0.40
ANCE	Semantic-only	74.0 ± 0.20	68.1 ± 0.20	73.1 ± 0.30
ANCE	Path-Guided Positives	74.8 ± 0.20	69.0 ± 0.20	74.2 ± 0.30
ANCE	Struct.-Neg. Only	74.4 ± 0.20	68.7 ± 0.20	76.8 ± 0.40
ANCE	Full S-Gens	76.5 ± 0.30	71.0 ± 0.30	78.6 ± 0.40
BGE-M3	Original	75.4 ± 0.14	69.6 ± 0.18	–
BGE-M3	Semantic-only	76.1 ± 0.17	70.4 ± 0.20	–
BGE-M3	Path-Guided Positives	76.7 ± 0.16	71.0 ± 0.21	–
BGE-M3	Struct.-Neg. Only	76.5 ± 0.18	70.8 ± 0.19	–
BGE-M3	Full S-Gens	77.8 ± 0.22	72.5 ± 0.24	–

Bold values indicate the best result in each metric column. “–” indicates not applicable (N/A).

Table 5. Ablation results of the core components in S-Gens using ANCE as the backbone retriever.

Variant	WebQSP (R@20)	HotpotQA (R@20)	HotpotQA (DR@10)
Raw ANCE	73.1	67.3	71.8
+ Path-Positives Only	74.8	69.0	74.2
+ Path-Positives & BM25 Negatives	75.1	69.3	74.5
Full S-Gens	76.5	71.0	78.6

Table 6. Comparison of different filtering strategies using ANCE as the backbone retriever.

Filter	WebQSP R@20	HotpotQA R@20	HotpotQA DR@10
No Filter	75.2	69.8	76.0
Heuristic Filter	75.7	70.2	76.8
Semantic Similarity Filter	75.9	70.4	77.1
Siamese GNN Filter	76.5	71.0	78.6

Table 7. Manual data quality analysis on a sampled subset of generated instances from WebQSP and HotpotQA.

Setting	Positive Structural Faithfulness	Negative Decoy Validity
Before Filtering	82%	78%
Heuristic Filter	87%	84%
Siamese GNN Filter	92%	89%

Table 8. Reasoning-oriented retrieval evaluation on HotpotQA using both-support Recall@20.

Backbone	Setting	HotpotQA Both-Support R@20
ANCE	Original	49.6
ANCE	Semantic-only	50.8
ANCE	Path-Guided Positives	52.1
ANCE	Struct.-Neg. Only	51.5
ANCE	Full S-Gens	53.9
BGE-M3	Original	52.8
BGE-M3	Semantic-only	53.6
BGE-M3	Path-Guided Positives	54.5
BGE-M3	Full S-Gens	56.1
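
The metric in Table 8 can be illustrated with a short sketch. It assumes that "both-support" means a query is counted as a hit only when all of its gold supporting passages (two per HotpotQA question) appear in the top-k retrieved list; the function name, argument names, and toy identifiers below are hypothetical and are not taken from the paper's evaluation code.

```python
from typing import Dict, List, Set

def both_support_recall_at_k(
    rankings: Dict[str, List[str]],
    gold_support: Dict[str, Set[str]],
    k: int = 20,
) -> float:
    """Percentage of queries whose gold supporting passages ALL appear in the top-k."""
    if not gold_support:
        raise ValueError("no queries to evaluate")
    hits = 0
    for qid, support in gold_support.items():
        top_k = set(rankings.get(qid, [])[:k])
        # Partial coverage earns no credit: every supporting passage must be retrieved.
        if support <= top_k:
            hits += 1
    return 100.0 * hits / len(gold_support)

# Toy data (hypothetical IDs, not taken from HotpotQA)
rankings = {"q1": ["d1", "d7", "d2"], "q2": ["d3", "d9", "d8"]}
gold = {"q1": {"d1", "d2"}, "q2": {"d3", "d4"}}
print(both_support_recall_at_k(rankings, gold, k=20))  # 50.0
```

Under this reading the metric is stricter than standard Recall@20, since retrieving only one of the two supporting passages counts as a miss. That would be consistent with the Table 8 scores being lower than the HotpotQA R@20 values in Table 4.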

Table 9. Performance under different synthetic data ratios η.

Metric	10%	20%	30%	40%	50%
WebQSP (R@20)	74.3	75.6	76.5	76.1	75.5
HotpotQA (R@20)	68.6	70.0	71.0	70.7	69.8

Bold values indicate the best result in each metric column.
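
The trend in Table 9 can be read off with a few lines of code. This is only a sketch over the values reported above; the variable and function names are illustrative.

```python
ratios = [10, 20, 30, 40, 50]  # synthetic data ratio η, in percent
table9 = {
    "WebQSP (R@20)": [74.3, 75.6, 76.5, 76.1, 75.5],
    "HotpotQA (R@20)": [68.6, 70.0, 71.0, 70.7, 69.8],
}

def best_ratio(scores):
    """Return (ratio, score) for the highest score in one row of Table 9."""
    idx = max(range(len(scores)), key=scores.__getitem__)
    return ratios[idx], scores[idx]

for name, scores in table9.items():
    ratio, score = best_ratio(scores)
    # Drop relative to the peak when η is pushed to the largest tested value.
    drop_at_max = round(score - scores[-1], 1)
    print(f"{name}: peak {score} at η = {ratio}%, {drop_at_max} lower at η = {ratios[-1]}%")
```

Both datasets peak at η = 30% and decline on either side, with R@20 at η = 50% falling 1.0 (WebQSP) and 1.2 (HotpotQA) points below the peak.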


Lei, Z.; Xu, Y.; Chen, S. S-Gens: Structure-Aware Synthetic Data Generation for Enhancing Reasoning-Intensive Dense Retrieval. Information 2026, 17, 413. https://doi.org/10.3390/info17050413
