Article

GraphTrace: A Modular Retrieval Framework Combining Knowledge Graphs and Large Language Models for Multi-Hop Question Answering

by Anna Osipjan, Hanieh Khorashadizadeh, Akasha-Leonie Kessel, Sven Groppe and Jinghua Groppe
The Institute of Information Systems, University of Lübeck, 23562 Lübeck, Germany
* Author to whom correspondence should be addressed.
Computers 2025, 14(9), 382; https://doi.org/10.3390/computers14090382
Submission received: 15 July 2025 / Revised: 2 September 2025 / Accepted: 3 September 2025 / Published: 11 September 2025

Abstract

This paper introduces GraphTrace, a novel retrieval framework that integrates a domain-specific knowledge graph (KG) with a large language model (LLM) to improve information retrieval for complex, multi-hop queries. Built on structured economic data related to the COVID-19 pandemic, GraphTrace adopts a modular architecture comprising entity extraction, path finding, query decomposition, semantic path ranking, and context aggregation, followed by LLM-based answer generation. GraphTrace is compared against baseline retrieval-augmented generation (RAG) and graph-based RAG (GraphRAG) approaches in both retrieval and generation settings. Experimental results show that GraphTrace consistently outperforms the baselines across evaluation metrics, particularly in handling mid-complexity (5–6-hop) queries and achieving top scores in directness during the generation evaluation. These gains are attributed to GraphTrace’s alignment of semantic reasoning with structured KG traversal, combining modular components for more targeted and interpretable retrieval.

1. Introduction

Large language models (LLMs) have demonstrated impressive capabilities in generating fluent, context-aware text across a wide range of natural language processing tasks. However, their effectiveness in knowledge-intensive applications such as question answering remains limited due to their reliance on static, pretrained knowledge. This often results in factual inconsistencies or hallucinations, i.e., outputs that deviate from user input, context, or real-world knowledge [1,2,3].
To mitigate this, retrieval-augmented generation (RAG) approaches have been proposed [4,5], which enhance LLMs by incorporating external information retrieval. By integrating external data, RAG reduces factual errors and improves the contextual accuracy [6,7]. Despite these advancements, conventional RAG methods often lack the capacity for structured reasoning or sensemaking, as they typically retrieve from unstructured sources without semantic or relational context [8]. Recent work has explored the use of knowledge graphs (KGs) in RAG systems, referred to as GraphRAG, combining the retrieval power of RAG with the semantic structure of KGs. Approaches such as GraphRAG [9], Think on Graphs (ToG) [10], and LEGO-GraphRAG [11] demonstrate the promise of using graph-based retrieval to better capture relationships between concepts and enable multi-hop reasoning.
Existing GraphRAG methods struggle with complex, multi-hop questions, as they often rely on simple one-hop retrieval strategies such as breadth-first search (BFS), which can miss relevant but distant information or introduce noise [12]. While expanding to larger subgraphs is a potential alternative, it introduces challenges in selecting the appropriate expansion depth and in assessing the relevance of retrieved paths, especially when they are only implicitly related to the input query. To address these gaps, this paper introduces GraphTrace, a modular retrieval framework that aligns semantic query decomposition with structured KG traversal. GraphTrace is assessed against existing RAG and GraphRAG approaches in both retrieval and generation settings, with a particular focus on question complexity (measured via hop count) and qualitative answer quality.
This paper presents the architecture of GraphTrace, describes the benchmark and evaluation setup, and discusses empirical results across different task dimensions. The findings show that GraphTrace achieves consistently strong performance, especially in mid-complexity queries, while also revealing insights into the interplay between the graph structure, query formulation, and answer quality. The code is publicly available at https://github.com/AnnaO8/kg-llm-ir (accessed on 9 April 2025).

2. Related Work

Recent years have seen a surge in interest in retrieval-augmented generation (RAG) techniques that combine large language models (LLMs) with external knowledge sources [13]. This section reviews key developments in this area and situates our approach, GraphTrace, within the landscape of existing methods. For this purpose, Table 1 compares the most important features of existing solutions with those of GraphTrace, the framework developed in this work.
Naive RAG represents the simplest retrieval-augmented architecture [4]. A user’s query is forwarded to a dense or sparse retriever (e.g., BM25 or DPR), and a fixed number of top-ranked documents is injected directly into the LLM’s context window. This approach is efficient and easy to implement but suffers from limited reasoning capabilities, especially in complex multi-hop or entity-centric questions, due to the lack of structural awareness and query decomposition. Advanced RAG frameworks aim to improve over this baseline by incorporating entity extraction, context-aware query expansion, or reranking techniques [14]. These pipelines often include NLP preprocessing steps that segment the query into salient components or refine the retrieval intent. However, they remain largely linear in architecture and do not model relationships between documents or entities in a structured manner. Modular RAG systems embrace a compositional architecture where retrieval and reasoning components are cleanly separated and can be dynamically reconfigured [15]. These systems often support plugin-style tool calls, document chaining, and hierarchical retrieval. While modularity improves robustness and interpretability, most implementations operate over unstructured text and do not leverage graph-based representations or topological reasoning.
Microsoft’s GraphRAG introduces the notion of using a knowledge graph (KG) as the retrieval substrate [9]. Queries are mapped to entities or relations in the KG, and relevant subgraphs (e.g., k-hop neighborhoods or semantic paths) are extracted and linearized into text for prompt injection. While this method benefits from structural compression and thus improved relevance and factual grounding, it remains inherently static: communities are precomputed, and the summarization step flattens fine-grained relationships between nodes. Consequently, the system lacks support for query-specific exploration and interpretable path reasoning. LazyGraphRAG addresses some of these limitations by introducing adaptive graph traversal and cost-aware reasoning [8]. The system incrementally explores the graph and performs retrieval steps only as needed, balancing latency and answer quality. It combines best-first search based on vector similarity with breadth-first exploration along graph neighborhoods. While this reduces the computational overhead, it still relies heavily on vector-based heuristics to guide the search process. The absence of symbolic reasoning and explicit path modeling means that relevance is approximated rather than semantically evaluated. Moreover, neither GraphRAG nor LazyGraphRAG supports the decomposition of the input query into smaller subquestions or their alignment with specific paths in the graph. As a result, both approaches struggle with complex, multifaceted queries that require the assembly of information from multiple, semantically distinct subgraphs.
Taken together, the approaches each offer valuable contributions: Naive RAG excels in simplicity and speed, Advanced RAG improves the input quality through preprocessing, Modular RAG emphasizes composability, GraphRAG enables structured retrieval, and LazyGraphRAG introduces adaptivity. Another recent system, graph retrieval-augmented generation (GRAG), proposes a dual-view approach that combines soft prompting and symbolic exploration over KGs [16]. GRAG constructs an ego-centric subgraph around query entities using relevance-guided expansion and then transforms these into two views: a hard prompt representing textualized triples and a soft prompt derived from learned graph embeddings. These are jointly injected into the LLM for answer generation. While GRAG introduces a flexible prompting mechanism and leverages both symbolic and neural representations, it does not explicitly decompose queries or semantically align subgraphs with query components. Moreover, the ego-graph construction lacks robust path modeling and canonicalization, making it challenging to assess whether the retrieved context truly matches the user’s information needs—particularly for multifaceted queries that span divergent or converging subgraphs.
LEGO-GraphRAG extends the modular RAG paradigm to graph-based retrieval by introducing a compositional framework that supports plug-and-play components for entity recognition, graph traversal, semantic ranking, and context integration [11]. Each module in LEGO-GraphRAG can be configured independently, enabling flexible adaptation to task-specific requirements or resource constraints. Notably, it supports the semantic reranking of paths and offers basic graph traversal strategies beyond simple BFS. However, the system relies primarily on predefined modules, without a unified semantic alignment mechanism across subqueries and retrieved paths.
Despite all of these advancements, a fundamental issue in RAG lies in assessing the relevance of multi-hop paths [6,17,18]. These paths may not exhibit a direct or obvious connection to the query but can nevertheless contain crucial contextual information. Evaluating such latent relevance is difficult: simple heuristics based on path lengths, shared nodes, or edge frequencies often fail to capture the semantic alignment between query intent and path content. What is still lacking is a system that combines explicit entity extraction, semantic path finding and ranking, adaptive context construction, and LLM-compatible output formatting into a cohesive, end-to-end pipeline. This fragmentation motivates the development of an integrated framework that draws from the strengths of the existing approaches while addressing their individual limitations. By merging structured retrieval, decomposition strategies, and semantic ranking with LLM-based generation, our GraphTrace system can support complex, multi-hop queries more effectively than any of its components alone.
This work builds upon a knowledge graph about the impacts of the COVID-19 pandemic on the economy [19], introduced by [20], which was constructed from a curated set of 62 reports and economic bulletins by organizations such as the World Bank, IMF, and European Central Bank. The data were structured using LLM-generated question–answer pairs to facilitate triple extraction and entity linking via Wikidata. The resulting KG comprises over 30,000 triples and serves as the foundation for the benchmark dataset used in our evaluation. The question–answer dataset, derived from this KG, is explained further in Section 4.

3. Methodology Overview

This study adopts an experimental computer science methodology. We present GraphTrace, a research artifact developed as a retrieval framework for multi-hop question answering. Following this approach, we (i) introduce the design of the artifact, (ii) demonstrate its use on a domain-specific KG, and (iii) assess its performance through standardized retrieval and generation evaluations.
Existing GraphRAG methods face significant challenges in handling complex, multi-hop questions. Most approaches rely on one-hop breadth-first search (BFS) to retrieve information from the immediate neighborhood of extracted entities [12]. This strategy is often inadequate for two primary reasons: (1) the direct neighborhood may contain either insufficient information or excessive noise if not properly filtered, and (2) many real-world questions require multi-hop reasoning to access semantically related but more distant nodes in the KG. While retrieving larger subgraphs is a potential alternative, it presents its own challenges, particularly in determining the optimal depth and direction of graph expansion. Furthermore, assessing the relevance of multi-hop paths is inherently difficult, as these paths may not have an explicit connection to the input query, yet they can contain information that is essential in generating an accurate answer.
To address these limitations, we propose GraphTrace, a modular retrieval pipeline designed to enhance semantic understanding and reasoning over KGs. The framework consists of five core components: entity extraction, path finding, query decomposition, semantic path ranking, and context aggregation. Together, these components enable the semantically guided exploration of the KG and the targeted integration of relevant information, tailored to the input query. Figure 1 provides an overview of the conceptual design of the GraphTrace pipeline, highlighting the individual stages and their interactions. To illustrate how this design operates in practice, Figure 2 presents an example.

3.1. Entity Extraction

The starting point of the GraphTrace system is the user query, which is processed by the first component: entity extraction. The goal of this component is to identify key entities in the input query and determine appropriate starting points in the KG. The input is the user query in unstructured text form, and the output is a list of entities represented as strings. This step is performed using the LLM provided by OpenAI (gpt-4o-mini) [21]. The model is prompted with a system message containing detailed instructions for the entity extraction task, including the expected output format: a JSON object containing a list of extracted entity strings. Figure 3 illustrates the entity extraction process, including a concrete example from the utilized question–answer dataset.
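The listing below is a minimal sketch of how such an extraction call can be issued with the OpenAI Python SDK; the prompt wording, function name, and the example in the comments are illustrative assumptions rather than the authors' exact implementation.

```python
# Minimal sketch of the entity-extraction step (OpenAI Python SDK v1).
# Prompt wording and the example output are illustrative, not the authors' exact code.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "Extract the key entities from the user question. "
    'Respond only with a JSON object of the form {"entities": ["...", "..."]}.'
)

def extract_entities(query: str) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": query},
        ],
        response_format={"type": "json_object"},  # request parseable JSON
    )
    return json.loads(response.choices[0].message.content)["entities"]

# Illustrative call (output will vary):
# extract_entities("How did the COVID-19 pandemic affect tourism revenue in the EU?")
```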

3.2. Path Finding

The extracted entities serve as input to the next component: path finding. This component first verifies whether the extracted entities exist as nodes in the KG. This verification is performed by executing Cypher queries that match nodes based on their name attributes. For entities found in the KG, the next step is to identify all paths between them using the allShortestPaths((e1)-[*]-(e2)) function. This retrieves all shortest paths between entity pairs, regardless of the relationship direction or path length. To maintain efficiency and reduce noise, self-loops (i.e., paths where the start and end nodes are the same) are explicitly excluded. The use of allShortestPaths() provides a balance between computational efficiency and comprehensive coverage, capturing diverse yet concise semantic connections between verified entities. Following path extraction, a canonicalization step is performed to eliminate redundant representations and ensure consistency in the retrieved paths. The output of this component is a list of extracted paths (Path 1, …, Path n), which are subsequently passed to downstream components. An example of the process is depicted in Figure 4.
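The following sketch shows how this step could be issued against a Neo4j-hosted KG with the official Python driver; the connection details, the name property, and the canonicalized path format are assumptions, while the Cypher pattern mirrors the allShortestPaths() call described above.

```python
# Sketch of the path-finding step against a Neo4j-hosted KG.
# Connection details, property names, and the path string format are illustrative assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CYPHER = """
MATCH (e1 {name: $name1}), (e2 {name: $name2})
WHERE e1 <> e2                               // exclude self-loops
MATCH p = allShortestPaths((e1)-[*]-(e2))    // undirected, any length
RETURN p
"""

def find_paths(name1: str, name2: str) -> list[str]:
    with driver.session() as session:
        result = session.run(CYPHER, name1=name1, name2=name2)
        paths = []
        for record in result:
            path = record["p"]
            # Canonicalize into a readable "node -[REL]- node" string for later ranking.
            parts = [path.nodes[0]["name"]]
            for rel, node in zip(path.relationships, path.nodes[1:]):
                parts.append(f"-[{rel.type}]-")
                parts.append(node["name"])
            paths.append(" ".join(parts))
        return sorted(set(paths))  # drop duplicate representations
```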

3.3. Query Decomposition

In addition to entity extraction, the user query is also passed to the query decomposition component. The purpose of this step is to divide the original query into smaller, manageable subqueries, thereby reducing its overall complexity and facilitating more precise retrieval. The input to this component is the user query in unstructured text form, and the output is a list of subqueries represented as strings. As in entity extraction, the LLM is prompted with a system message containing detailed instructions for query decomposition, including the expected output format: a JSON object consisting of a list of strings corresponding to the generated subqueries. This structured output format ensures consistency and simplifies integration with subsequent pipeline components. The user query is then provided as a message to the LLM, triggering the decomposition process. Figure 5 illustrates the query decomposition component, including an example derived from the utilized question–answer dataset.
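A corresponding sketch of the decomposition call is shown below; it mirrors the entity-extraction sketch in Section 3.1 and differs only in the instruction prompt and the returned JSON field, both of which are our own illustrative choices.

```python
# Sketch of the query-decomposition call; prompt wording and field name are assumptions.
import json
from openai import OpenAI

client = OpenAI()

DECOMPOSE_PROMPT = (
    "Split the user question into the smallest set of self-contained subquestions. "
    'Respond only with a JSON object of the form {"subqueries": ["...", "..."]}.'
)

def decompose_query(query: str) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": DECOMPOSE_PROMPT},
            {"role": "user", "content": query},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["subqueries"]
```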

3.4. Path Ranking

The subsequent component in the GraphTrace pipeline is path ranking, which takes as input the previously extracted paths and the generated subqueries. Its primary objective is to evaluate and rank all paths for each subquery based on semantic relevance. Scoring and ranking are performed using the ms-marco-MiniLM-L-6-v2 CrossEncoder model [22]. This model is specifically trained for ranking tasks and computes a relevance score for each (subquery, path) pair by jointly encoding both elements as a single input. This joint encoding allows the model to capture subtle semantic dependencies and contextual relationships, making it well suited for comparing a fixed subquery against multiple candidate paths. The MiniLM variant offers a favorable balance between computational efficiency and ranking quality, which is why it was selected for this task. For each subquery, the model scores all associated paths and ranks them in descending order of predicted relevance. The output of this component is a ranked list of paths for each subquery, from most to least relevant. This process is illustrated in Figure 6, including an example visualization of the scoring and ranking procedure.
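The scoring step can be sketched as follows with the sentence-transformers CrossEncoder class; the exact checkpoint identifier is inferred from the model name given above and should be treated as an assumption.

```python
# Sketch of the path-ranking step with a sentence-transformers CrossEncoder.
# The checkpoint ID is inferred from the model name in the text (an assumption).
from sentence_transformers import CrossEncoder

ranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rank_paths(subquery: str, paths: list[str]) -> list[tuple[str, float]]:
    # Jointly encode each (subquery, path) pair and score its relevance.
    scores = ranker.predict([(subquery, path) for path in paths])
    # Sort paths from most to least relevant.
    return sorted(zip(paths, scores), key=lambda pair: pair[1], reverse=True)
```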

3.5. Aggregation

The final component of the retrieval pipeline is the aggregation module. In this step, the top three most relevant paths for each subquery, as determined by the path ranking component, are selected and aggregated to form the final retrieved context. Since the subqueries represent logical subdivisions of the original query, aggregating the most relevant paths for each subquery ensures that the resulting context is comprehensive and semantically aligned with the original information need. This aggregated set of paths serves as the foundation for subsequent answer generation in the overall GraphTrace pipeline. A complete example of this aggregation process is illustrated in Figure 7.
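A minimal sketch of this selection step is given below; the top-3 cut-off follows the description above, while the deduplication and the joined output format are illustrative assumptions.

```python
# Sketch of the aggregation step: keep the top three ranked paths per subquery,
# deduplicate them, and join them into one context string for the generation prompt.
TOP_K = 3

def aggregate_context(ranked_per_subquery: dict[str, list[tuple[str, float]]]) -> str:
    selected: list[str] = []
    for subquery, ranked_paths in ranked_per_subquery.items():
        for path, _score in ranked_paths[:TOP_K]:
            if path not in selected:       # avoid repeating paths shared across subqueries
                selected.append(path)
    return "\n".join(selected)

# End-to-end usage (illustrative, combining the sketches above):
# entities   = extract_entities(query)
# paths      = [p for a, b in itertools.combinations(entities, 2) for p in find_paths(a, b)]
# subqueries = decompose_query(query)
# context    = aggregate_context({sq: rank_paths(sq, paths) for sq in subqueries})
```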

4. Evaluation

The following section presents the evaluation of the proposed retrieval framework, GraphTrace, in comparison to the baseline RAG and GraphRAG methods. The evaluation is conducted using an early version of the EcoRAG dataset [23], captured during its initial development phase before the full set of question–answer (QA) pairs was available. It is a multi-hop economic QA dataset constructed over the Economic_KG, the domain-specific knowledge graph. EcoRAG is designed for deep, multi-hop retrieval with queries requiring up to seven hops and includes complex subgraph topologies such as converging, divergent, and linear patterns.

4.1. Evaluation Setup

In this section, the setup for the evaluation is explained in detail. First, the dataset used for the evaluation of the system is described. After this, the different approaches’ methods of accessing the KG, acquiring information from it, and generating answers are discussed.
Question–Answer Dataset: The question–answer dataset created by [20] includes structured queries and annotated ground-truth paths in the graph. It consists of three types of graph reasoning structures—linear, converging, and divergent—each reflecting increasing levels of complexity in semantic graph traversal.
  • Linear Paths (Chain Reasoning)
    Linear paths represent a direct sequence of connected entities and relationships. Each step in the path depends on the preceding one, forming a straight-line inference structure.
    Structure: A → B → C → D
  • Converging Paths (Directed Acyclic Graphs—DAGs)
    Converging paths involve multiple reasoning branches that lead to a common node. These structures are used when synthesizing multiple sources of information to reach a unified conclusion.
    Structure: (A → B → C), (D → E → C)
  • Divergent Paths (Polytrees)
    Divergent paths originate from a single entity that connects to multiple downstream branches, each representing an independent line of inference.
    Structure: A → B, A → C, A → D
In contrast, the majority of existing GraphRAG approaches [9] follow a multi-stage pipeline that typically begins with a corpus of text documents and first constructs a KG from these documents. This graph is then used for downstream information retrieval and answer generation.
Since the KG construction step has already been performed and documented by [20], this pipeline directly adopts their resulting KG as the dataset. This approach enables a focused evaluation of the retrieval and generation components without reintroducing variability from the construction phase. As part of the evaluation methodology, the following methods are assessed. Each approach differs in how it accesses the KG, performs retrieval, and generates answers.
  • Naive RAG: This is a basic retrieval-augmented generation approach that performs a semantic search over a CSV-based knowledge graph. It uses SentenceTransformer (all-MiniLM-L6-v2) with FAISS [24] to embed and index KG triples. For each query, the top six most relevant triples are retrieved, combined with the query, and then passed to OpenAI’s gpt-4o-mini to generate an answer. A minimal sketch of this baseline is shown after this list.
  • Naive RAG with Subquery: This extends the basic Naive RAG approach by decomposing complex queries into simpler subqueries using an LLM (gpt-4o-mini). Each subquery is processed individually to improve the retrieval effectiveness through query simplification.
  • Hybrid RAG: This combines dense (vector-based) and sparse (BM25) retrieval. Results are merged and used as context for generation.
  • Rerank RAG: This retrieves top triples using dense embeddings and then reranks them with a CrossEncoder to improve the result quality before answer generation.
  • Naive GraphRAG: This applies BFS and DFS over the KG starting from entities extracted via KeyBERT [25]. Aggregated paths are used for generation.
  • KG RAG: Based on [26], this is a graph-based retrieval method that uses the LLM gpt-4o-mini to guide the step-by-step exploration of a knowledge graph. Based on the input query, the LLM creates a plan to decide whether to explore nodes or relationships. For node exploration, it finds the top five candidates using vector search, and the LLM selects the most relevant ones. For relationships, it identifies paths between important nodes and verifies their relevance. This process continues until enough information is gathered to answer the query. If the plan fails after three revisions, the system stops and does not return an answer.
  • Think on Graphs (ToG): This is a graph-based retrieval method from [10] that explores the knowledge graph iteratively. Since the original version was designed for Freebase [27] or Wikidata [28], this study adapts it to work with the Economic_KG by integrating KeyBERT for entity recognition. The system starts from identified key entities and uses beam search to explore surrounding nodes and relations. Irrelevant results are filtered through pruning, and reasoning steps (guided by gpt-4o-mini) determine whether enough context has been retrieved or more exploration is needed. This process continues until the system decides that an answer can be generated.
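As referenced in the Naive RAG item above, the following sketch illustrates such a FAISS-based baseline; the retrieval depth of six follows the description, but the CSV file name, column layout, and prompt handling are illustrative assumptions rather than the exact baseline code.

```python
# Minimal sketch of the Naive RAG baseline: embed KG triples with all-MiniLM-L6-v2,
# index them in FAISS, and retrieve the six nearest triples per query.
# File name and column layout are illustrative assumptions.
import faiss
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Load triples exported from the KG as a CSV with subject/relation/object columns.
triples = pd.read_csv("economic_kg_triples.csv")
texts = (triples["subject"] + " " + triples["relation"] + " " + triples["object"]).tolist()

embeddings = model.encode(texts, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])   # inner product == cosine on unit vectors
index.add(np.asarray(embeddings, dtype="float32"))

def retrieve(query: str, k: int = 6) -> list[str]:
    q = model.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [texts[i] for i in ids[0]]  # the retrieved triples are then passed to the LLM
```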
Given that the proposed approach comprises two main components—retrieval and generation—it is essential to evaluate these stages independently. This separation enables a more fine-grained analysis of system performance, allowing us to identify the strengths and limitations of each component in contributing to the overall task of knowledge-augmented question answering.

4.2. Retrieval Evaluation

The retrieval performance is evaluated using the metrics of the mean reciprocal rank (MRR), mean average precision (MAP), and Hit@10. These metrics assess the quality of the retrieved information across different methods. The MRR is defined as follows:
MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}
where rank_i refers to the rank position of the first relevant document for the i-th query.
Each query q ∈ Q is assigned a reciprocal rank score between 0 and 1, reflecting the rank position of the first relevant document in the result list. The MRR aggregates these scores into a single value within the interval [0, 1] that represents the average ranking performance across all queries in the dataset. The higher the MRR, the better the system’s performance in ranking relevant documents at the top. However, it is important to note that the MRR considers only the first relevant result for each query.
The MAP is defined as follows:
MAP = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \mathrm{AP}(q)
where |Q| is the number of queries and AP(q) is the average precision of the q-th query.
The MAP evaluates both the rank positions and the number of relevant documents in the result list across the entire set of queries. For this, it computes the AP for each query by identifying all positions of relevant documents and calculating their precision at each of these positions. Based on this, the average of these precision values yields the AP for each query. For the MAP, the mean value of all AP scores is derived and lies in the interval [0, 1]. As with the MRR, a higher value here also indicates better system performance. In contrast to the MRR, the MAP includes all relevant results of all queries.
The Hit@10 indicates, for each query, whether at least one relevant document appears in the top 10 results. This metric uses an indicator function to check whether the condition is true or not and returns a respective value of 1 or 0. The overall Hit@10 score is calculated as the mean value over all queries in the dataset and lies in the interval [0, 1]. The Hit@10 does not take the exact positions of relevant documents into account, but it is useful in evaluating whether the system is capable of retrieving relevant information in the first 10 positions. The same applies here as for the MRR and MAP: a higher value indicates better performance.
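The three metrics can be computed directly from per-query relevance judgments, as in the following sketch; the boolean-list input format (one list of hit/miss flags per ranked result list) is an assumption about how the judgments are stored.

```python
# Sketch of the three retrieval metrics as defined above.
# `results` holds, per query, a boolean flag for each ranked result (True = relevant).
def mrr(results: list[list[bool]]) -> float:
    total = 0.0
    for relevant in results:
        for rank, hit in enumerate(relevant, start=1):
            if hit:                       # only the first relevant result counts
                total += 1.0 / rank
                break
    return total / len(results)

def mean_average_precision(results: list[list[bool]]) -> float:
    ap_scores = []
    for relevant in results:
        hits, precisions = 0, []
        for rank, hit in enumerate(relevant, start=1):
            if hit:
                hits += 1
                precisions.append(hits / rank)   # precision at each relevant position
        ap_scores.append(sum(precisions) / len(precisions) if precisions else 0.0)
    return sum(ap_scores) / len(ap_scores)

def hit_at_10(results: list[list[bool]]) -> float:
    return sum(any(relevant[:10]) for relevant in results) / len(results)
```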
Table 2 presents the results for the aggregated QA dataset, which combines the converging, divergent, and linear subsets, resulting in a total of 267 question–answer pairs. GraphTrace achieves the best performance in terms of the MRR (0.4477) and MAP (0.1906) and remains competitive for the Hit@10 (0.8127). To confirm the robustness of the results, the experiments were repeated five times. The performance remained stable across runs, with standard deviations below 0.015 across all metrics (MRR = 0.4429 ± 0.0102, MAP = 0.1915 ± 0.0014, Hit@10 = 0.8230 ± 0.0112). Naive RAG with Subquery shows notable improvements over Naive RAG, especially in the MAP (0.1136 vs. 0.0873), highlighting the benefit of query decomposition. Hybrid RAG performs similarly to Naive RAG, with only minor differences.
Table 3 presents the retrieval evaluation results for the converging dataset, which contains 90 question–answer pairs. GraphTrace achieves the best performance in both the MRR (0.484) and MAP (0.2314) and performs competitively regarding the Hit@10 (0.8444). Naive RAG + Subquery outperforms standard Naive RAG across all metrics and achieves the highest Hit@10 score among all evaluated methods (0.8778), demonstrating the benefit of query decomposition. Hybrid RAG performs similarly to Naive RAG, with only minor differences across metrics. Rerank RAG shows strong performance regarding the Hit@10 (0.8333), but its MRR score (0.3102) is lower than that of the other RAG-based baselines. Naive GraphRAG and KG RAG exhibit poor performance across all retrieval metrics.
Table 4 presents the retrieval evaluation results for the divergent dataset, which consists of 90 question–answer pairs. GraphTrace shows competitive performance in the MRR (0.4149) and Hit@10 (0.8000), while achieving the highest score for the MAP (0.1750). Naive RAG attains the highest MRR (0.4261) but is outperformed by Naive RAG + Subquery in both the MAP (0.1164 vs. 0.1038) and Hit@10 (0.8333 vs. 0.8000), highlighting the benefit of query decomposition. Hybrid RAG surpasses Naive RAG with Subquery in terms of the MRR but performs worse in the other metrics. Rerank RAG produces similar results to Hybrid RAG for both the MRR and MAP, while achieving the best Hit@10 score (0.8444). Once again, the graph-based baselines (Naive GraphRAG and KG RAG) show poor performance across all retrieval metrics.
Table 5 presents the retrieval evaluation results for the linear dataset, which consists of 87 question–answer pairs. GraphTrace achieves the highest scores in both the MRR (0.4442) and MAP (0.1644). For the Hit@10, it ties with Naive RAG (0.7931) and ranks third overall. Naive RAG shows solid overall performance and slightly outperforms Naive RAG + Subquery in the MRR (0.3464 vs. 0.3336) and Hit@10 (0.7931 vs. 0.7471). Hybrid RAG performs similarly to Naive RAG, with marginal improvements in the MAP and Hit@10. Rerank RAG achieves the best Hit@10 score (0.8506) but ranks lower than Naive RAG + Subquery in the MRR and MAP. As in previous evaluations, the graph-based methods (Naive GraphRAG and KG RAG) perform significantly worse across all metrics.
Following the dataset-level analysis, this section evaluates the impact of the hop count on the retrieval performance. The hop count refers to the number of relationships traversed in the KG to answer a query, with higher counts typically indicating greater complexity. Each dataset (converging, divergent, linear) includes questions with varying hop counts. Table 6, Table 7 and Table 8 report the retrieval results (MRR, MAP, and Hit@10) across different hop counts for the converging, divergent, and linear datasets. Figure 8 shows the distribution of the hop counts, while Figure 9 visualizes the performance trends by hop level.
Figure 8 also summarizes the hop-based complexity across datasets. The converging set is dominated by seven-hop questions (47), followed by six (23) and eight (17). The divergent set is mostly six-hop (70), with a few five-hop (10) and seven-hop (8) questions. In the linear set, six-hop questions prevail (77), with very few five-hop (6) and four-hop (4) instances. In the converging dataset, GraphTrace performs best in terms of the MRR and MAP for five- to seven-hop questions, the most frequent categories, while Naive RAG + Subquery achieves the top Hit@10 scores for seven- and eight-hop questions. Most other methods show a drop in performance beyond six hops. Graph-based methods (Naive GraphRAG, KG RAG) perform poorly across all metrics. In the divergent dataset, dominated by six-hop questions, Naive RAG performs well in the MRR, while GraphTrace achieves the highest MAP across nearly all hop levels. Rerank RAG delivers the strongest Hit@10 performance, particularly at higher hop counts, suggesting that reranking is useful in retrieving dispersed or weakly connected information. The graph-based baselines again show the weakest results. In the linear dataset, where the hop counts range from four to six, GraphTrace consistently outperforms other methods in both the MRR and MAP. Although Rerank RAG achieves the highest Hit@10 at six hops, GraphTrace performs strongly overall. The simpler structure of this dataset results in smaller performance gaps among methods.
Overall, GraphTrace performs robustly across datasets and hop levels, particularly on medium-complexity queries (5–7 hops). Naive RAG + Subquery is effective in converging topologies, while Rerank RAG excels in scenarios with less direct connectivity. In contrast, Naive GraphRAG and KG RAG show consistently poor performance, regardless of the hop count or dataset.

4.3. Generation Evaluation

To evaluate the generated answers, LLMs are employed as evaluators. Prior studies [29,30] have demonstrated the effectiveness of LLMs in assessing natural language generation, often achieving state-of-the-art or competitive results when compared to human judgment [9]. In this evaluation, retrieval metrics provide quantitative comparisons based on ground truth data, whereas generation metrics focus on qualitative aspects. For the latter, an LLM is used as a judge to perform head-to-head comparisons between the generated answers. The selected evaluation criteria include comprehensiveness, diversity, empowerment, and directness, as proposed by Edge et al. [9]. These metrics collectively aim to capture the sensemaking quality of the answers.
  • Comprehensiveness measures the level of detail and completeness in the answer. A comprehensive response is consistent, covers all relevant aspects of the query, and incorporates contextual and adjacent information aligned with the question’s complexity.
  • Diversity evaluates the variety and richness of content. This includes the presentation of different perspectives, insights, or arguments and ensures that the answer offers novel information rather than paraphrasing existing responses.
  • Empowerment assesses the extent to which an answer enables the reader to understand the topic and make informed decisions. This metric emphasizes critical thinking, user autonomy, and the capacity to promote reflective understanding.
  • Directness captures how precisely the answer addresses the core of the question. A direct response avoids digressions and unnecessary elaboration, while remaining clear, structured, and specific. Excessive detail is only acceptable if it supports comprehension.
Since directness can conflict with comprehensiveness and diversity, it is not expected that any one method will outperform others across all four criteria [9]. Therefore, an additional overall score is computed to determine a final winner in the comparative evaluations. For the evaluation process, the LLM is provided with the original question, a description of the target evaluation metric, and the answers generated by each competing approach. The LLM is then prompted to assess which answer best satisfies the specified metric, accompanied by a brief explanation for its choice. Following the protocol proposed by [9], the LLM is instructed to declare a single winner if and only if a clear distinction exists. If the responses are fundamentally similar and the differences are negligible, the evaluation results in a tie.
The generation evaluation assesses the quality of answers produced by each method using the four qualitative metrics described above: comprehensiveness, diversity, empowerment, and directness. Evaluations were conducted using gpt-4o-mini, which was tasked with identifying the best answer per criterion. To mitigate the inherent stochasticity of LLM outputs, each evaluation was repeated three times per instance, and the final scores were computed as the average relative win rates across runs (Table 9).
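A sketch of one such head-to-head judging call is given below; the prompt wording, the tie convention, and the winner-parsing logic are paraphrased assumptions of ours rather than the authors' exact prompts.

```python
# Sketch of one head-to-head judging call following the protocol described above.
# Prompt wording and parsing are illustrative assumptions.
from collections import Counter
from openai import OpenAI

client = OpenAI()

def judge(question: str, criterion: str, answers: dict[str, str], runs: int = 3) -> Counter:
    """Count how often each method (or 'tie') wins for one question and one criterion."""
    listing = "\n\n".join(f"[{name}]\n{text}" for name, text in answers.items())
    prompt = (
        f"Question: {question}\n\nCriterion: {criterion}\n\n{listing}\n\n"
        "Name the single answer that best satisfies the criterion and briefly explain why. "
        "If the answers are fundamentally similar, reply with 'tie'. "
        "Start your reply with the method name in square brackets."
    )
    wins: Counter = Counter()
    for _ in range(runs):  # repeat to smooth out stochastic judgments
        reply = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        winner = reply.split("]")[0].lstrip("[").strip() if "]" in reply else "tie"
        wins[winner] += 1
    return wins
```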
Across all datasets, a different method dominates each criterion. Naive RAG with Subquery achieves the highest win rates for diversity in all three datasets, likely due to its query decomposition strategy, which encourages diverse subquery paths. GraphTrace is the consistent top performer for directness, suggesting that its path reranking mechanism helps to generate more focused and concise answers. Naive GraphRAG performs best in comprehensiveness for two datasets and ranks highest overall in the converging set. Meanwhile, KG RAG produces the most empowering answers in two datasets.
Interestingly, the graph-based methods Naive GraphRAG and KG RAG, despite underperforming in the retrieval metrics, excel in generation quality. This contrast may be explained by their structured, exploration-based retrieval, which, when successful, enables the generation of more comprehensive and informative answers. KG RAG, for example, follows a strict exploration plan and returns no answer if the plan fails—degrading the retrieval metrics but benefiting generation quality when the plan succeeds.
In contrast, methods like Naive RAG, Hybrid RAG, and Rerank RAG perform moderately or poorly across most generation metrics. GraphTrace stands out as the most balanced approach, consistently ranking among the top two for most metrics and demonstrating both precision (directness) and general quality across datasets.
To complement the LLM-based evaluation, we conducted a small-scale human study with 40 samples (stratified by dataset and hop count) rated by two independent annotators. The annotators used the same criteria as in our LLM evaluation. We observed an inter-annotator agreement rate of 67%. The human evaluation confirms the LLM-based evaluation: graph-based approaches achieve strong results in answer generation. Notably, GraphTrace was rated even more highly in the “empowerment” criterion by the human annotators than in the LLM-based assessment.

5. Conclusions

This paper introduces GraphTrace, a modular retrieval framework that combines the structured reasoning capabilities of knowledge graphs (KGs) with the generative power of large language models (LLMs) to answer complex, multi-hop questions. By leveraging a domain-specific economic KG centered on the impacts of the COVID-19 pandemic, GraphTrace was designed to navigate and extract relevant information across multiple graph hops and entity relationships. The framework integrates key components—entity extraction, query decomposition, path ranking, and aggregation—into a cohesive pipeline that enables precise retrieval and informed answer generation. Across a range of evaluation metrics and datasets, GraphTrace consistently outperformed the baseline RAG and GraphRAG methods, particularly excelling in medium-complexity questions (5–6 hops) and in generating direct, focused answers.
Despite these strengths, challenges remain. GraphTrace shows diminished performance on low- and high-complexity questions, and generation metrics reveal room for improvement in areas such as empowerment and comprehensiveness. These findings underscore the importance of continued refinement in both evaluation design and system adaptability. Future work will focus on extending the benchmark to include more diverse question types and hop distributions, as well as incorporating adaptive mechanisms into the retrieval pipeline. Furthermore, the evaluation will be extended by considering additional aspects such as correctness, hallucination rates, and semantic similarity to provide an even more comprehensive assessment of the framework. These enhancements will aim to further improve the system’s robustness and effectiveness in real-world knowledge retrieval scenarios.

Author Contributions

Conceptualization, A.O. and H.K.; methodology, A.O., H.K., J.G. and S.G.; software, A.O.; validation, A.O., H.K. and J.G.; formal analysis, A.O.; investigation, A.O., H.K., A.-L.K. and J.G.; resources, A.O., H.K. and A.-L.K.; data curation, A.O. and H.K.; writing—original draft preparation, A.O., H.K. and A.-L.K.; writing—review and editing, A.O., H.K. and A.-L.K.; visualization, A.O.; supervision, H.K., S.G. and J.G.; project administration, S.G. and J.G.; funding acquisition, S.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the German Research Foundation under project number 490998901.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, Y.; Li, Y.; Cui, L.; Cai, D.; Liu, L.; Fu, T.; Huang, X.; Zhao, E.; Zhang, Y.; Chen, Y.; et al. Siren’s song in the AI ocean: A survey on hallucination in large language models. arXiv 2023, arXiv:2309.01219. [Google Scholar] [CrossRef]
  2. Adlakha, V.; BehnamGhader, P.; Lu, X.H.; Meade, N.; Reddy, S. Evaluating correctness and faithfulness of instruction-following models for question answering. Trans. Assoc. Comput. Linguist. 2024, 12, 681–699. [Google Scholar] [CrossRef]
  3. Liu, T.; Zhang, Y.; Brockett, C.; Mao, Y.; Sui, Z.; Chen, W.; Dolan, B. A token-level reference-free hallucination detection benchmark for free-form text generation. arXiv 2021, arXiv:2104.08704. [Google Scholar]
  4. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
  5. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, H.; Wang, H. Retrieval-augmented generation for large language models: A survey. arXiv 2023, arXiv:2312.10997. [Google Scholar]
  6. Tang, Y.; Yang, Y. Multihop-rag: Benchmarking retrieval-augmented generation for multi-hop queries. arXiv 2024, arXiv:2401.15391. [Google Scholar]
  7. Peng, B.; Zhu, Y.; Liu, Y.; Bo, X.; Shi, H.; Hong, C.; Zhang, Y.; Tang, S. Graph retrieval-augmented generation: A survey. arXiv 2024, arXiv:2408.08921. [Google Scholar] [CrossRef]
  8. Edge, D.; Trinh, H.; Larson, J. LazyGraphRAG: Setting a New Standard for Quality and Cost. 2024. Available online: https://www.microsoft.com/en-us/research/blog/lazygraphrag-setting-a-new-standard-for-quality-and-cost/ (accessed on 1 July 2025).
  9. Edge, D.; Trinh, H.; Cheng, N.; Bradley, J.; Chao, A.; Mody, A.; Truitt, S.; Metropolitansky, D.; Ness, R.O.; Larson, J. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv 2025, arXiv:2404.16130. [Google Scholar] [CrossRef]
  10. Sun, J.; Xu, C.; Tang, L.; Wang, S.; Lin, C.; Gong, Y.; Ni, L.M.; Shum, H.Y.; Guo, J. Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph. arXiv 2023, arXiv:2307.07697. [Google Scholar]
  11. Cao, Y.; Gao, Z.; Li, Z.; Xie, X.; Zhou, K.; Xu, J. LEGO-GraphRAG: Modularizing Graph-based Retrieval-Augmented Generation for Design Space Exploration. arXiv 2025, arXiv:2411.05844. [Google Scholar] [CrossRef]
  12. Liu, R.; Jiang, H.; Yan, X.; Tang, B.; Li, J. PolyG: Effective and Efficient GraphRAG with Adaptive Graph Traversal. arXiv 2025, arXiv:2504.02112. [Google Scholar]
  13. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744. [Google Scholar]
  14. Yu, H.; Gan, A.; Zhang, K.; Tong, S.; Liu, Q.; Liu, Z. Evaluation of Retrieval-Augmented Generation: A Survey. In Proceedings of the Big Data; Zhu, W., Xiong, H., Cheng, X., Cui, L., Dou, Z., Dong, J., Pang, S., Wang, L., Kong, L., Chen, Z., Eds.; Springer: Singapore, 2025; pp. 102–120. [Google Scholar]
  15. Gao, Y.; Xiong, Y.; Wang, M.; Wang, H. Modular RAG: Transforming RAG Systems into LEGO-like Reconfigurable Frameworks. arXiv 2024, arXiv:2407.21059. [Google Scholar] [CrossRef]
  16. Hu, Y.; Lei, Z.; Zhang, Z.; Pan, B.; Ling, C.; Zhao, L. GRAG: Graph Retrieval-Augmented Generation. arXiv 2024, arXiv:2405.16506. [Google Scholar] [CrossRef]
  17. Shi, Y.; Tan, Q.; Wu, X.; Zhong, S.; Zhou, K.; Liu, N. Retrieval-enhanced Knowledge Editing in Language Models for Multi-Hop Question Answering. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, New York, NY, USA, 21–25 October 2024; CIKM ’24. pp. 2056–2066. [Google Scholar] [CrossRef]
  18. Mavi, V.; Jangra, A.; Jatowt, A. Multi-hop Question Answering. Found. Trends Inf. Retr. 2024, 17, 457–586. [Google Scholar] [CrossRef]
  19. Gruenwald, L.; Jain, S.; Groppe, S. (Eds.) Leveraging Artificial Intelligence in Global Epidemics; Elsevier: Amsterdam, The Netherlands, 2021. [Google Scholar]
  20. Khorashadizadeh, H.; Mihindukulasooriya, N.; Ranji, N.; Ezzabady, M.; Ieng, F.; Groppe, J.; Benamara, F.; Groppe, S. Construction and Canonicalization of Economic Knowledge Graphs with LLMs. In Proceedings of the International Knowledge Graph and Semantic Web Conference; Springer: Cham, Switzerland, 2024; pp. 334–343. [Google Scholar]
  21. Hurst, A.; Lerer, A.; Goucher, A.P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.; Welihinda, A.; Hayes, A.; Radford, A.; et al. Gpt-4o system card. arXiv 2024, arXiv:2410.21276. [Google Scholar] [CrossRef]
  22. Sentence Transformers Team msmarco-MiniLM-L6-en-de-v1. 2021. Available online: https://huggingface.co/cross-encoder/msmarco-MiniLM-L6-en-de-v1 (accessed on 30 June 2025).
  23. Khorashadizadeh, H.; Tiwari, S.; Benamara, F.; Mihindukulasooriya, N.; Groppe, J.; Sahri, S.; Ezzabady, M.; Ieng, F.; Groppe, S. EcoRAG: A Multi-hop Economic QA Benchmark for Retrieval Augmented Generation Using Knowledge Graphs. In Proceedings of the Natural Language Processing and Information Systems (NLDB), Kanazawa, Japan, 4–6 July 2025. [Google Scholar]
  24. Cross-Encoder Team all-MiniLM-L6-v2. 2021. Available online: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 (accessed on 30 June 2025).
  25. Grootendorst, M. KeyBERT: Minimal Keyword Extraction with BERT. 2020. Available online: https://www.maartengrootendorst.com/blog/keybert/ (accessed on 1 July 2025).
  26. Sanmartin, D. KG-RAG: Bridging the gap between knowledge and creativity. arXiv 2024, arXiv:2405.12035. [Google Scholar] [CrossRef]
  27. Bollacker, K.; Evans, C.; Paritosh, P.; Sturge, T.; Taylor, J. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, BC, Canada, 9–12 June 2008; pp. 1247–1250. [Google Scholar]
  28. Vrandečić, D.; Krötzsch, M. Wikidata: A free collaborative knowledgebase. Commun. ACM 2014, 57, 78–85. [Google Scholar] [CrossRef]
  29. Wang, J.; Liang, Y.; Meng, F.; Sun, Z.; Shi, H.; Li, Z.; Xu, J.; Qu, J.; Zhou, J. Is ChatGPT a good NLG evaluator? A preliminary study. arXiv 2023, arXiv:2303.04048. [Google Scholar] [CrossRef]
  30. Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.; et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Adv. Neural Inf. Process. Syst. 2023, 36, 46595–46623. [Google Scholar]
Figure 1. Conceptual architecture of the GraphTrace retrieval pipeline. The framework consists of five modular components (1. Entity Extraction, 2. Path Finding, 3. Query Decomposition, 4. Path Ranking, 5. Aggregation).
Figure 2. Step-by-step example to illustrate GraphTrace’s conceptual design.
Figure 3. Entity extraction component.
Figure 4. Path finding component.
Figure 5. Query decomposition component; see https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 (accessed on 10 July 2025).
Figure 6. Path ranking component.
Figure 7. Full example of the aggregation process.
Figure 8. Hop count distribution for converging, divergent, and linear datasets.
Figure 9. Comparison of evaluation results for different hop counts.
Table 1. Comparing existing work with GraphTrace. ✓ = indicates the feature is supported, × = indicates it is not supported.
Compared features (columns): Uses KG | Entity Extraction | Path Finding | Query Decomposition | Semantic Path Ranking | Context Aggregation | LLM-Aware Retrieval | Modular Architecture | Latency/Quality Tradeoff | Answer Uses KG Context | Query-to-Path Alignment | Adaptive Multi-Hop Control | Subgraph Diversity Handling | Path Canonicalization | LLM-Based Evaluation Strategy | Hop-Wise Breakdown
Compared systems (rows), with numbered notes referring to the legend below: Naive RAG [4]; Advanced RAG [14] (notes 1, 2); Modular RAG [15] (notes 3, 2); GraphRAG [9] (notes 4–7); LazyGraphRAG [8] (notes 8–12); LEGO-GraphRAG [11] (notes 13–15); GRAG [16] (notes 16–18); GraphTrace (our approach) (note 2).
Legend: 1: Consists of multiple components but lacks modular or exchangeable units. 2: No explicit mechanism for balancing latency and output quality. 3: Uses scoring mechanisms but lacks semantic path interpretation. 4: Operates on complex queries but does not decompose them explicitly. 5: Retrieval does not consider LLM-specific input constraints or behavior. 6: Uses fixed-hop search; not adaptive to query complexity. 7: No strategy for ensuring diversity among retrieved subgraphs. 8: Performs traversal but not via dedicated path-finding algorithms. 9: Processes entire query without decomposition into subparts. 10: Matches embeddings but lacks explicit segment-to-path alignment. 11: Traversal halts when no improvements occur, but it lacks semantic control. 12: Ranks by score but does not explicitly enforce diversity. 13: Assembles paths from modules but without query-based alignment. 14: Combines modules with variable depth but no dynamic hop adaptation. 15: Combines diverse components but without explicit diversity control. 16: Applies LLM scoring but not deeply semantic (e.g., relation-aware). 17: Allows variable hops but lacks query-driven adaptation. 18: Focuses on best path, not diversity-aware selection.
Table 2. Retrieval evaluation results on the aggregated question–answer dataset (converging, divergent, and linear). Best results are highlighted in bold. GraphTrace outperformed the baselines in MRR and MAP.
Metric | Naive RAG | Naive RAG + Subquery | Hybrid RAG | Rerank RAG | Naive GraphRAG | KG RAG | GraphTrace
MRR | 0.3745 | 0.3597 | 0.3543 | 0.3318 | 0.0546 | 0.0808 | 0.4477
MAP | 0.0873 | 0.1136 | 0.0870 | 0.0873 | 0.0137 | 0.0143 | 0.1906
Hit@10 | 0.7940 | 0.8202 | 0.7940 | 0.8427 | 0.1573 | 0.1723 | 0.8127
Table 3. Retrieval evaluation results on the converging dataset. Best results are highlighted in bold. GraphTrace outperformed the baselines in MRR and MAP.
Metric | Naive RAG | Naive RAG + Subquery | Hybrid RAG | Rerank RAG | Naive GraphRAG | KG RAG | GraphTrace
MRR | 0.3502 | 0.3680 | 0.3431 | 0.3102 | 0.0581 | 0.1013 | 0.4840
MAP | 0.0814 | 0.1332 | 0.0826 | 0.0860 | 0.0161 | 0.0179 | 0.2314
Hit@10 | 0.7889 | 0.8778 | 0.7667 | 0.8333 | 0.2667 | 0.2556 | 0.8444
Table 4. Retrieval evaluation results on the divergent dataset. Best results are highlighted in bold. GraphTrace outperformed the baselines in MAP.
Metric | Naive RAG | Naive RAG + Subquery | Hybrid RAG | Rerank RAG | Naive GraphRAG | KG RAG | GraphTrace
MRR | 0.4261 | 0.3766 | 0.3925 | 0.3551 | 0.0170 | 0.0863 | 0.4149
MAP | 0.1038 | 0.1164 | 0.0995 | 0.0926 | 0.0049 | 0.0144 | 0.1750
Hit@10 | 0.8000 | 0.8333 | 0.8111 | 0.8444 | 0.0556 | 0.1667 | 0.8000
Table 5. Retrieval evaluation results on the linear dataset. Best results are highlighted in bold. GraphTrace outperformed the baselines in MRR and MAP.
Metric | Naive RAG | Naive RAG + Subquery | Hybrid RAG | Rerank RAG | Naive GraphRAG | KG RAG | GraphTrace
MRR | 0.3464 | 0.3336 | 0.3262 | 0.3300 | 0.0897 | 0.0540 | 0.4442
MAP | 0.0763 | 0.0905 | 0.0785 | 0.0831 | 0.0203 | 0.0103 | 0.1644
Hit@10 | 0.7931 | 0.7471 | 0.8046 | 0.8506 | 0.1494 | 0.0920 | 0.7931
Table 6. Retrieval performance in the converging dataset, broken down by hop count. The table reports the MRR, MAP, and Hit@10 across all methods for questions requiring between four and eight reasoning hops. Best results are highlighted in bold.
Hop Count | Q/A Count | Naive RAG | Naive RAG + Subquery | Hybrid RAG | Rerank RAG | Naive GraphRAG | KG RAG | GraphTrace
MRR
4 | 1 | 0.5000 | 0.5000 | 0.3333 | 0.5000 | 0.0000 | 0.0000 | 0.2000
5 | 2 | 0.7500 | 0.2667 | 0.3750 | 0.5000 | 0.0000 | 0.0000 | 0.7500
6 | 23 | 0.4094 | 0.4650 | 0.5345 | 0.4301 | 0.0605 | 0.0336 | 0.5765
7 | 47 | 0.3482 | 0.3501 | 0.3055 | 0.2889 | 0.0607 | 0.1606 | 0.5377
8 | 17 | 0.2196 | 0.2908 | 0.1849 | 0.1737 | 0.0580 | 0.0470 | 0.1960
MAP
4 | 1 | 0.1250 | 0.2500 | 0.0833 | 0.2083 | 0.0000 | 0.0000 | 0.1056
5 | 2 | 0.1900 | 0.1517 | 0.1083 | 0.2429 | 0.0000 | 0.0000 | 0.3100
6 | 23 | 0.1250 | 0.2073 | 0.1597 | 0.1366 | 0.0187 | 0.0072 | 0.3276
7 | 47 | 0.0677 | 0.1078 | 0.0591 | 0.0714 | 0.0145 | 0.0285 | 0.2433
8 | 17 | 0.0452 | 0.0940 | 0.0402 | 0.0322 | 0.0196 | 0.0064 | 0.0667
Hit@10
4 | 1 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.0000 | 0.0000 | 1.0000
5 | 2 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.0000 | 0.0000 | 1.0000
6 | 23 | 0.9565 | 0.9130 | 0.9565 | 0.9565 | 0.3913 | 0.1739 | 1.0000
7 | 47 | 0.7660 | 0.8936 | 0.7447 | 0.8511 | 0.2128 | 0.3617 | 0.8511
8 | 17 | 0.5882 | 0.7647 | 0.5294 | 0.5882 | 0.2941 | 0.1176 | 0.5882
Table 7. Retrieval performance in the divergent dataset, broken down by hop count. The table reports the MRR, MAP, and Hit@10 across all methods for questions requiring between four and eight reasoning hops. Best results are highlighted in bold.
Hop Count | Q/A Count | Naive RAG | Naive RAG + Subquery | Hybrid RAG | Rerank RAG | Naive GraphRAG | KG RAG | GraphTrace
MRR
4 | 1 | 0.2500 | 0.5000 | 0.2500 | 0.2500 | 0.0000 | 0.0000 | 0.2000
5 | 10 | 0.2917 | 0.2010 | 0.3617 | 0.3017 | 0.0324 | 0.1091 | 0.3169
6 | 70 | 0.4433 | 0.4037 | 0.3993 | 0.3689 | 0.0148 | 0.0798 | 0.4356
7 | 8 | 0.3937 | 0.2661 | 0.3762 | 0.2958 | 0.0104 | 0.0114 | 0.3721
8 | 1 | 1.0000 | 1.0000 | 0.5000 | 0.5000 | 0.0909 | 1.0000 | 0.5000
MAP
4 | 1 | 0.0625 | 0.1250 | 0.0625 | 0.0625 | 0.0000 | 0.0000 | 0.2339
5 | 10 | 0.1010 | 0.0813 | 0.1310 | 0.0913 | 0.0090 | 0.0218 | 0.1377
6 | 70 | 0.1060 | 0.1233 | 0.0985 | 0.0943 | 0.0035 | 0.0133 | 0.1797
7 | 8 | 0.0836 | 0.0704 | 0.0716 | 0.0703 | 0.0015 | 0.0031 | 0.1650
8 | 1 | 0.1750 | 0.3438 | 0.1125 | 0.1994 | 0.0968 | 0.1250 | 0.2396
Hit@10
4 | 1 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.0000 | 0.0000 | 1.0000
5 | 10 | 0.7000 | 0.8000 | 0.7000 | 0.9000 | 0.1000 | 0.1000 | 0.8000
6 | 70 | 0.8000 | 0.8143 | 0.8143 | 0.8143 | 0.0571 | 0.1857 | 0.7857
7 | 8 | 0.8750 | 1.0000 | 0.8750 | 1.0000 | 0.0000 | 0.0000 | 0.8750
8 | 1 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.0000 | 1.0000 | 1.0000
Table 8. Retrieval performance in the linear dataset, broken down by hop count. The table reports the MRR, MAP, and Hit@10 across all methods for questions requiring between four and six reasoning hops. Best results are highlighted in bold.
Hop Count | Q/A Count | Naive RAG | Naive RAG + Subquery | Hybrid RAG | Rerank RAG | Naive GraphRAG | KG RAG | GraphTrace
MRR
4 | 4 | 0.2500 | 0.3542 | 0.1542 | 0.3208 | 0.0625 | 0.0000 | 0.4750
5 | 6 | 0.3667 | 0.2750 | 0.3111 | 0.3472 | 0.0833 | 0.1667 | 0.3806
6 | 77 | 0.3498 | 0.3371 | 0.3364 | 0.3292 | 0.0916 | 0.0480 | 0.4475
MAP
4 | 4 | 0.0625 | 0.1522 | 0.0385 | 0.1576 | 0.0156 | 0.0000 | 0.3552
5 | 6 | 0.0978 | 0.0645 | 0.0717 | 0.0885 | 0.0167 | 0.0333 | 0.1083
6 | 77 | 0.0754 | 0.0893 | 0.0811 | 0.0788 | 0.0208 | 0.0090 | 0.1589
Hit@10
4 | 4 | 1.0000 | 1.0000 | 0.7500 | 1.0000 | 0.2500 | 0.0000 | 1.0000
5 | 6 | 0.6667 | 0.6667 | 0.6667 | 0.8333 | 0.1667 | 0.1667 | 0.8333
6 | 77 | 0.7922 | 0.7403 | 0.8182 | 0.8442 | 0.1429 | 0.0909 | 0.7792
Table 9. Generation performance in the converging, divergent, and linear question–answer datasets. Best results are highlighted in bold.
Criterion | Naive RAG | Naive RAG + Subquery | Hybrid RAG | Rerank RAG | Naive GraphRAG | KG RAG | GraphTrace
Converging
Comprehensiveness | 3.70% | 16.30% | 0.37% | 15.93% | 30.74% | 12.96% | 20.00%
Diversity | 6.30% | 26.67% | 3.33% | 16.30% | 15.93% | 20.37% | 11.11%
Empowerment | 2.59% | 17.04% | 2.22% | 13.33% | 26.30% | 20.74% | 17.78%
Directness | 6.30% | 19.26% | 4.07% | 12.22% | 3.70% | 23.70% | 30.74%
Overall | 3.70% | 17.04% | 0.37% | 15.93% | 30.37% | 12.59% | 20.00%
Divergent
Comprehensiveness | 13.33% | 10.74% | 5.56% | 14.44% | 20.37% | 15.93% | 19.63%
Diversity | 11.11% | 24.81% | 5.56% | 13.33% | 10.00% | 16.67% | 18.52%
Empowerment | 13.33% | 10.74% | 4.07% | 13.33% | 18.89% | 21.85% | 17.78%
Directness | 19.26% | 11.11% | 8.89% | 14.07% | 0.37% | 17.04% | 29.26%
Overall | 13.70% | 10.74% | 5.56% | 14.81% | 19.63% | 15.56% | 20.00%
Linear
Comprehensiveness | 19.54% | 11.88% | 2.68% | 17.62% | 17.62% | 14.94% | 15.71%
Diversity | 15.71% | 22.99% | 7.28% | 12.26% | 15.33% | 9.58% | 16.86%
Empowerment | 18.39% | 11.88% | 2.68% | 14.94% | 18.01% | 18.77% | 15.33%
Directness | 21.07% | 8.81% | 2.30% | 15.71% | 5.75% | 22.22% | 24.14%
Overall | 18.39% | 13.41% | 2.30% | 17.24% | 17.24% | 15.71% | 15.71%